Back to glossaryExternal reference
AI GLOSSARY
Mesa-Optimizer
Safety, Alignment & Ethics
A learned model that itself performs optimization internally, a model within a model. The term was introduced by Hubinger et al. in 2019. When a training process produces a mesa-optimizer, there is a risk that the inner optimizer pursues goals that differ from those of the outer training process, a problem known as inner misalignment. Mesa-optimizers are a theoretical but important concept in AI alignment, representing a mechanism by which apparently aligned training could produce misaligned behavior.
See also: inner alignment, deceptive alignment.