AI GLOSSARY

Interpretability

Safety, Alignment & Ethics

The degree to which a human can understand the internal mechanisms of an AI model: not just its inputs and outputs, but the computations and representations that connect them. Interpretability research seeks to open the black box of deep learning, making it possible to verify that a model reasons as intended and to identify potential failure modes before they surface in deployment.
See also: explainability, mechanistic interpretability, explainable AI.
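One common interpretability technique is the probing classifier: a simple linear model trained on a network's internal activations to test whether a concept is linearly decodable at a given layer. The sketch below is illustrative only; the activation vectors are synthetic stand-ins (real probes would read hidden states from an actual model), and all names are hypothetical.

```python
import math
import random

random.seed(0)

def make_activation(label):
    # Hypothetical 4-dim hidden activations: the first coordinate
    # weakly encodes the concept of interest; the rest are noise.
    base = [random.gauss(0.0, 1.0) for _ in range(4)]
    base[0] += 2.0 if label == 1 else -2.0
    return base

# Toy dataset of (activation, concept label) pairs.
data = [(make_activation(y), y) for y in [0, 1] * 50]

# Train a logistic-regression probe with plain gradient descent.
w = [0.0] * 4
b = 0.0
lr = 0.1
for _ in range(200):
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        z = max(min(z, 30.0), -30.0)   # clamp for numerical safety
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y                      # gradient of log-loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

# High probe accuracy suggests the layer linearly encodes the concept.
accuracy = sum(
    ((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
    for x, y in data
) / len(data)
print(f"probe accuracy: {accuracy:.2f}")
```

In practice the probe's accuracy is compared against controls (e.g. probes on shuffled labels) before concluding that a representation is really present, since a sufficiently expressive probe can appear to "find" concepts in noise.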