
Mesa-Optimizer

Safety, Alignment & Ethics

A learned model that is itself an optimizer: the base training process (for example, gradient descent) produces a model that internally searches for or optimizes toward some objective of its own, an optimizer within an optimizer. The term was introduced by Hubinger et al. in "Risks from Learned Optimization in Advanced AI Systems" (2019). When a training process produces a mesa-optimizer, the objective the inner optimizer actually pursues (the mesa-objective) may differ from the objective of the outer training process (the base objective), a failure mode known as inner misalignment. Mesa-optimizers are a theoretical but important concept in AI alignment, because they describe a mechanism by which training that appears to succeed could still produce misaligned behavior.
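To make the base/mesa distinction concrete, here is a minimal, hypothetical PyTorch sketch. It is not from Hubinger et al.; the class name MesaOptimizerPolicy and every detail below are illustrative assumptions. The outer loop is the base optimizer, updating the weights by gradient descent on a base loss; the model's forward pass is itself an optimizer, running gradient ascent on a learned internal scorer standing in for a mesa-objective.

```python
import torch
import torch.nn as nn

class MesaOptimizerPolicy(nn.Module):
    """Toy model whose forward pass runs its own optimization loop:
    gradient ascent over a candidate action against a *learned* internal
    scorer (a stand-in for a mesa-objective)."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.action_dim = action_dim
        # Learned scorer over (state, action) pairs. Nothing in the
        # architecture forces this internal objective to match the base
        # objective used to train the weights.
        self.mesa_objective = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, state: torch.Tensor,
                inner_steps: int = 20, inner_lr: float = 0.1) -> torch.Tensor:
        # Inner ("mesa") optimization, executed at inference time.
        action = torch.zeros(state.shape[0], self.action_dim, requires_grad=True)
        for _ in range(inner_steps):
            score = self.mesa_objective(torch.cat([state, action], dim=-1)).sum()
            # create_graph=True keeps the inner loop differentiable so the
            # outer optimizer can backpropagate through it.
            (grad,) = torch.autograd.grad(score, action, create_graph=True)
            action = action + inner_lr * grad  # ascent on the mesa-objective
        return action

# Outer ("base") optimization: ordinary gradient descent on a base loss.
# The base optimizer only ever sees this loss; whether the learned
# mesa-objective generalizes to match it off-distribution is exactly the
# inner-alignment question.
policy = MesaOptimizerPolicy(state_dim=4, action_dim=2)
base_optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

state = torch.randn(8, 4)
target = torch.randn(8, 2)  # stand-in targets defining the base objective
loss = ((policy(state) - target) ** 2).mean()
base_optimizer.zero_grad()
loss.backward()
base_optimizer.step()
```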
See also: inner alignment, deceptive alignment.
