AI GLOSSARY

Chain-of-Thought Monitoring

CoT MonitoringSafety, Alignment & Ethics

An AI safety oversight technique that inspects a reasoning model's externalized thinking trace (its Chain-of-Thought) for signs of deceptive intent, goal misalignment, or harmful reasoning before the model acts on it. Unlike input/output monitoring, CoT monitoring can catch explicit statements of harmful intent and surface flawed reasoning that would otherwise be invisible.

The technique exploits a structural property of current reasoning models: sufficiently complex cognition must pass through the chain of thought as working memory, making it readable to an external monitor. This is called *CoT monitorability*.

However, monitorability is considered fragile. It can be undermined by reinforcement learning at scale (reasoning drifts toward illegible representations), deliberate obfuscation by situationally-aware models, or architectures that keep reasoning in latent (non-verbalized) form entirely. CoT monitoring is therefore one imperfect layer among several needed safety mechanisms, not a sufficient safeguard on its own.

External reference