AI GLOSSARY
Inner Alignment
Safety, Alignment & Ethics
The challenge of ensuring that a model trained to optimize a given objective actually internalizes that objective as its true goal, rather than learning a superficially similar proxy that happens to score well during training but diverges from the intended goal in deployment. Inner alignment is distinct from outer alignment, which concerns whether the training objective itself correctly captures what we want.
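The training-versus-deployment gap can be sketched with a toy example (a hypothetical setup, not any specific system): during training, the intended cue and a spurious proxy cue happen to coincide, so a policy keyed to either scores identically; when the correlation breaks at deployment, only the policy that internalized the true goal keeps performing.

```python
import random

random.seed(0)

# Hypothetical toy setup: in training, the intended cue ("goal") and a
# spurious lookalike cue ("proxy") are perfectly correlated.
train = [{"goal": x, "proxy": x}
         for x in (random.choice([0, 1]) for _ in range(100))]
# At deployment the correlation breaks: the proxy cue is inverted.
deploy = [{"goal": x, "proxy": 1 - x}
          for x in (random.choice([0, 1]) for _ in range(100))]

def intended_policy(obs):
    return obs["goal"]   # internalized the true objective

def proxy_policy(obs):
    return obs["proxy"]  # latched onto the proxy instead

def score(policy, data):
    # Fraction of cases where the policy's action matches the true goal.
    return sum(policy(obs) == obs["goal"] for obs in data) / len(data)

print(score(intended_policy, train))   # 1.0
print(score(proxy_policy, train))      # 1.0 -- indistinguishable in training
print(score(intended_policy, deploy))  # 1.0
print(score(proxy_policy, deploy))     # 0.0 -- diverges at deployment
```

Both policies are behaviorally identical on the training distribution, which is exactly why the training signal alone cannot distinguish them; the mismatch only surfaces under distribution shift.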
See also: goal misgeneralization, AI alignment, deceptive alignment.