AI GLOSSARY
Mechanistic Interpretability
Research & Advanced Concepts
A research field that aims to reverse-engineer neural networks, understanding not just what they do but precisely how they do it, at the level of individual neurons, weights, and circuits. The term was coined by Chris Olah. The field seeks to open the black box of deep learning, with the goal of making AI systems more transparent, predictable, and safe, and it is a core area of AI safety research.
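To illustrate the idea of reading a network's mechanism directly off its neurons and weights, here is a minimal sketch using a hypothetical hand-built toy model (all weights and names are illustrative, not taken from any real system). Each hidden neuron's weights reveal the Boolean-like "circuit" it implements:

```python
import numpy as np

# Hypothetical toy model with one hidden layer whose weights we can
# inspect directly; a sketch of the spirit of mechanistic interpretability,
# not any real technique or model.

# Neuron 0 is wired as an AND-like detector over inputs (x0, x1);
# neuron 1 fires only when x1 is on and x0 is off.
W_hidden = np.array([[ 5.0,  5.0],
                     [-5.0,  5.0]])
b_hidden = np.array([-7.5, -2.5])

def relu(z):
    return np.maximum(z, 0.0)

def hidden_activations(x):
    return relu(W_hidden @ x + b_hidden)

# "Reverse-engineering": probe each neuron with all binary inputs and
# read its role off the weights and the resulting activation pattern.
for v in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(v, dtype=float)
    print(v, hidden_activations(x))
```

Running the probe shows neuron 0 activates only on input (1, 1) and neuron 1 only on (0, 1), so each neuron's causal role can be stated exactly; scaling this kind of analysis to large trained models is the field's central challenge.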
See also: interpretability, circuit.