AI GLOSSARY
Scalable Oversight
Safety, Alignment & Ethics
A set of techniques for maintaining meaningful human supervision of AI systems even as those systems become more capable than the humans overseeing them. The core problem: as AI tackles increasingly complex tasks, humans may lack the expertise or time to evaluate outputs directly. Proposed approaches — including debate, recursive reward modeling, and iterated amplification — aim to keep human values in the loop by structuring interactions so that a less capable evaluator can still catch errors or deception in a more capable system.
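The debate approach mentioned above can be sketched in a few lines. The sketch below is purely illustrative, with hypothetical stand-in functions (`debater_a`, `debater_b`, `judge` are not a real API): two capable debaters argue, and a weaker judge decides by checking the transcript rather than solving the task itself.

```python
# Minimal sketch of a debate-style oversight protocol.
# All functions here are hypothetical stand-ins, not a real library API.

def debate(question, debater_a, debater_b, judge, rounds=1):
    """A weak judge picks a winner after hearing arguments from two
    stronger debaters, each incentivized to expose the other's errors."""
    transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    # The judge only evaluates the transcript, which is an easier task
    # than answering the question directly.
    return judge(question, transcript)

# Toy example: verifying a factorization is easier than primality testing,
# so a weak judge can still identify the correct debater.
winner = debate(
    "Is 91 prime?",
    debater_a=lambda q, t: "Yes, it has no obvious small divisors.",
    debater_b=lambda q, t: "No: 91 = 7 * 13.",
    judge=lambda q, t: "B" if any("7 * 13" in msg for _, msg in t) else "A",
)
print(winner)  # "B" — the judge can check B's factorization claim
```

The key design point is the asymmetry: producing the answer may be beyond the judge, but verifying a concrete claim surfaced during debate is not.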
See also: reinforcement learning from human feedback, corrigibility.