AI GLOSSARY
Scalable Oversight
Safety, Alignment & Ethics
A set of techniques for maintaining meaningful human supervision of AI systems even as those systems become more capable than the humans overseeing them. The core problem: as AI tackles increasingly complex tasks, humans may lack the expertise or time to evaluate outputs directly. Proposed approaches — including debate, recursive reward modeling, and iterated amplification — aim to keep human values in the loop by structuring interactions so that a less capable evaluator can still catch errors or deception in a more capable system.
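The debate approach mentioned above can be sketched in a few lines. The sketch below is purely illustrative, with hypothetical stand-in functions (`debater_a`, `debater_b`, `judge` are not a real API): two capable debaters argue, and a weaker judge decides by checking the transcript rather than solving the task itself.

```python
# Minimal sketch of a debate-style oversight protocol.
# All functions here are hypothetical stand-ins, not a real library API.

def debate(question, debater_a, debater_b, judge, rounds=1):
    """A weak judge picks a winner after hearing arguments from two
    stronger debaters, each incentivized to expose the other's errors."""
    transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    # The judge only evaluates the transcript, which is an easier task
    # than answering the question directly.
    return judge(question, transcript)

# Toy example: verifying a factorization is easier than primality testing,
# so a weak judge can still identify the correct debater.
winner = debate(
    "Is 91 prime?",
    debater_a=lambda q, t: "Yes, it has no obvious small divisors.",
    debater_b=lambda q, t: "No: 91 = 7 * 13.",
    judge=lambda q, t: "B" if any("7 * 13" in msg for _, msg in t) else "A",
)
print(winner)  # "B" — the judge can check B's factorization claim
```

The key design point is the asymmetry: producing the answer may be beyond the judge, but verifying a concrete claim surfaced during debate is not.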
See also: reinforcement learning from human feedback, corrigibility.