AI GLOSSARY
Sparse Autoencoder
Research & Advanced Concepts
A type of autoencoder trained, typically with a sparsity penalty or constraint on its hidden layer, to produce representations in which most activations are zero and only a small number of features are active for any given input. Sparse autoencoders have become a central tool in mechanistic interpretability research, where they are used to decompose the dense, polysemantic activations of large language models into more interpretable features, many of which correspond to specific, human-understandable concepts.
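
To make the definition concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective in NumPy. It assumes one common formulation: a ReLU encoder, a linear decoder, and an L1 penalty on the feature activations. The names (`sae_forward`, `sae_loss`, `l1_coeff`), the dimensions, and the random inputs are all hypothetical and chosen only for illustration. Other variants, such as top-k activation constraints, are also in use.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode inputs into non-negative features, then reconstruct them."""
    # ReLU keeps only positive pre-activations; the rest become exactly zero.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    recon = features @ W_dec + b_dec
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * l1

# Illustrative sizes: the hidden layer is wider than the input (overcomplete).
rng = np.random.default_rng(0)
d_model, d_hidden, batch = 16, 64, 8

# Stand-in for model activations; real use would take these from an LLM layer.
x = rng.normal(size=(batch, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
# Assumption for this demo only: a negative bias silences most features, so
# the untrained network already looks sparse. In practice sparsity is learned.
b_enc = np.full(d_hidden, -0.5)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
sparsity = np.mean(features == 0.0)  # fraction of feature activations that are zero

print(f"loss={loss:.3f}, fraction of zero activations={sparsity:.2f}")
```

In a trained model, minimizing this loss by gradient descent trades off faithful reconstruction against the number and magnitude of active features. The coefficient `l1_coeff` controls that trade-off.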