
AI GLOSSARY

Scalable Oversight

Safety, Alignment & Ethics

A set of techniques for maintaining meaningful human supervision of AI systems even as those systems become more capable than the humans overseeing them. The core problem: as AI tackles increasingly complex tasks, humans may lack the expertise or time to evaluate outputs directly. Proposed approaches — including debate, recursive reward modeling, and iterated amplification — aim to keep human values in the loop by structuring interactions so that a less capable evaluator can still catch errors or deception in a more capable system.
See also: reinforcement learning from human feedback, corrigibility.
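One way to see how a weak evaluator can still catch errors in a stronger system is a toy debate protocol. The sketch below is purely illustrative, not any proposed implementation: the summation task and the helper names `partial_sums` and `judge` are assumptions chosen for the example. Two debaters each submit a full list of running-total claims; because any lie about the final sum must first appear at some single step, the judge only needs to verify one addition, never the whole sum.

```python
def partial_sums(nums):
    """Running totals, starting from 0: one claim per step of the task."""
    total, claims = 0, [0]
    for n in nums:
        total += n
        claims.append(total)
    return claims

def judge(nums, honest_claims, dishonest_claims):
    """A weak judge resolves the debate by checking a single addition.

    The judge finds the first step where the two claim lists diverge.
    Both sides still agree on the previous value, so verifying that one
    step (agreed value + next number) settles who is lying, without the
    judge ever summing the whole list itself.
    """
    for i in range(1, len(nums) + 1):
        if honest_claims[i] != dishonest_claims[i]:
            agreed = honest_claims[i - 1]  # both sides still agree here
            truth = agreed + nums[i - 1]   # the one fact the judge checks
            return "honest" if honest_claims[i] == truth else "dishonest"
    return "tie"  # identical claims: nothing to adjudicate

nums = [3, 1, 4, 1, 5]
truthful = partial_sums(nums)          # [0, 3, 4, 8, 9, 14]
lying = [0, 3, 5, 9, 10, 15]           # inflates the total by one
print(judge(nums, truthful, lying))    # the lie surfaces at step 2
```

The point of the toy is the asymmetry: checking one claimed step is far cheaper than evaluating the whole task, which is the property scalable-oversight proposals try to engineer for much harder problems.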
