AI GLOSSARY
Inner Alignment
Safety, Alignment & Ethics
The challenge of ensuring that a model trained to optimize a given objective actually internalizes that objective as its true goal, rather than learning a superficially similar proxy that happens to score well during training but diverges from the intended goal in deployment. Inner alignment is distinct from outer alignment, which concerns whether the training objective itself correctly captures what we want.
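The training-versus-deployment gap can be sketched with a toy example (a hypothetical setup, not any specific system): during training, the intended cue and a spurious proxy cue happen to coincide, so a policy keyed to either scores identically; when the correlation breaks at deployment, only the policy that internalized the true goal keeps performing.

```python
import random

random.seed(0)

# Hypothetical toy setup: in training, the intended cue ("goal") and a
# spurious lookalike cue ("proxy") are perfectly correlated.
train = [{"goal": x, "proxy": x}
         for x in (random.choice([0, 1]) for _ in range(100))]
# At deployment the correlation breaks: the proxy cue is inverted.
deploy = [{"goal": x, "proxy": 1 - x}
          for x in (random.choice([0, 1]) for _ in range(100))]

def intended_policy(obs):
    return obs["goal"]   # internalized the true objective

def proxy_policy(obs):
    return obs["proxy"]  # latched onto the proxy instead

def score(policy, data):
    # Fraction of cases where the policy's action matches the true goal.
    return sum(policy(obs) == obs["goal"] for obs in data) / len(data)

print(score(intended_policy, train))   # 1.0
print(score(proxy_policy, train))      # 1.0 -- indistinguishable in training
print(score(intended_policy, deploy))  # 1.0
print(score(proxy_policy, deploy))     # 0.0 -- diverges at deployment
```

Both policies are behaviorally identical on the training distribution, which is exactly why the training signal alone cannot distinguish them; the mismatch only surfaces under distribution shift.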
See also: goal misgeneralization, AI alignment, deceptive alignment.