Goal Misgeneralization

Safety, Alignment & Ethics

A failure mode in which an AI system learns a goal that produces correct behavior during training but pursues a subtly wrong objective in new situations: one that was correlated with the intended goal under the training distribution but diverges from it in deployment. The system retains its capabilities but applies them toward the wrong end. A well-known example is a reinforcement learning agent that learned "move right" rather than "collect the coin" because the coin always appeared at the right edge of the level during training. Goal misgeneralization is a form of inner alignment failure and a significant concern for systems deployed in environments that differ from their training conditions.
See also: deceptive alignment, AI alignment, distribution shift.
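The toy sketch below illustrates the failure with tabular Q-learning in Python. The corridor environment, reward values, and hyperparameters are invented for this example: because the coin always sits on the right during training, the agent learns the proxy goal "go right" and confidently walks away from the coin when it is placed elsewhere at deployment.

```python
import random

random.seed(0)

# Toy corridor: states 0..10. The agent starts in the middle and can
# step left or right. The *intended* goal is "reach the coin"; during
# training the coin always sits at the right end, so the proxy goal
# "always go right" is indistinguishable from the intended one.
N, START = 10, 5
ACTIONS = (-1, +1)  # left, right

def episode(q, coin, eps=0.1, alpha=0.5, gamma=0.9, learn=True):
    """Run one episode; update the Q-table in place when learn=True.

    Returns 1.0 if the agent ended on the coin, else 0.0. The agent
    never observes the coin's location -- its policy depends only on
    position, so the learned goal cannot track the coin itself.
    """
    pos = START
    while True:
        if learn and random.random() < eps:
            a = random.randrange(2)                 # explore
        else:
            a = 0 if q[pos][0] > q[pos][1] else 1   # exploit
        nxt = pos + ACTIONS[a]
        done = nxt in (0, N)
        reward = 1.0 if nxt == coin else 0.0
        if learn:
            target = reward if done else gamma * max(q[nxt])
            q[pos][a] += alpha * (target - q[pos][a])
        pos = nxt
        if done:
            return reward

q = [[0.0, 0.0] for _ in range(N + 1)]

# Training: the coin is always on the right (a spurious correlation),
# so Q-learning converges on a policy equivalent to "always move right".
for _ in range(500):
    episode(q, coin=N)

# Deployment: the coin now appears on the left. The agent still executes
# its learned policy flawlessly -- full capability, wrong goal.
in_dist = sum(episode(q, coin=N, learn=False) for _ in range(100))
shifted = sum(episode(q, coin=0, learn=False) for _ in range(100))
print(f"coin on right (training distribution): {in_dist:.0f}/100")
print(f"coin on left  (deployment):            {shifted:.0f}/100")
```

The success rate collapses under the shifted coin placement even though the agent's behavior remains fully competent, which is the signature of goal misgeneralization as opposed to a plain capability failure.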
