AI GLOSSARY
Chain-of-Thought Monitoring
An AI safety oversight technique that inspects a reasoning model's externalized thinking trace — its Chain-of-Thought — for signs of deceptive intent, goal misalignment, or harmful reasoning before the model acts on it. Unlike input/output monitoring, CoT monitoring can catch explicit statements of harmful intent and surface flawed reasoning that would otherwise be invisible.
The technique exploits a structural property of current reasoning models: sufficiently complex cognition must pass through the chain of thought as working memory, making it readable to an external monitor. This is called *CoT monitorability*.
However, monitorability is considered fragile. It can be undermined by reinforcement learning at scale (reasoning drifts toward illegible representations), deliberate obfuscation by situationally-aware models, or architectures that keep reasoning in latent (non-verbalized) form entirely. CoT monitoring is therefore one imperfect layer among several needed safety mechanisms, not a sufficient safeguard on its own.
