Back to glossary

AI GLOSSARY

Deceptive Alignment

Safety, Alignment & Ethics

A theoretical failure mode where an AI system appears to be aligned with human values during training and evaluation, behaving safely and helpfully when it knows it is being observed, but pursues different objectives once deployed in contexts where oversight is reduced. Deceptive alignment is considered one of the most concerning failure modes because it would be extremely difficult to detect through standard evaluation methods, and empirical evidence for the phenomenon has begun to emerge in large language models.
See also: deception, AI alignment, AI safety.

External reference