AI GLOSSARY

Sparse Autoencoder

Research & Advanced Concepts

A type of autoencoder trained so that, for any given input, most hidden-layer activations are zero and only a small number of features are active. Sparse autoencoders have become a central tool in mechanistic interpretability research, where they are used to decompose the dense, polysemantic activations of large language models into more interpretable features, many of which correspond to specific, human-understandable concepts.
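
To make the definition concrete, here is a minimal sketch of the forward pass and training objective of a sparse autoencoder. It assumes a single-hidden-layer ReLU encoder and an L1 penalty on the hidden activations, which is one common way to encourage sparsity (alternatives such as keeping only the top-k activations exist). All names here (`encode`, `decode`, `sae_loss`, `l1_coeff`) and the tiny hand-picked weights are illustrative assumptions, not taken from any particular library or paper.

```python
# Minimal sparse autoencoder sketch in pure Python (no external libraries).
# Assumption: ReLU encoder + linear decoder, trained with
# loss = reconstruction error + l1_coeff * sum(|hidden activations|).

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def encode(x, W_enc, b_enc):
    """Map an input vector to non-negative hidden features via ReLU."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def decode(f, W_dec, b_dec):
    """Reconstruct the input from the hidden features."""
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Return (loss, features): squared reconstruction error plus L1 penalty."""
    f = encode(x, W_enc, b_enc)
    x_hat = decode(f, W_dec, b_dec)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)  # pushes most features toward zero
    return recon + l1_coeff * sparsity, f

# Toy example: 2-dimensional input, 3 hidden features (hypothetical weights).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # shape: hidden x input
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]    # shape: input x hidden
b_dec = [0.0, 0.0]

loss, features = sae_loss([2.0, -3.0], W_enc, b_enc, W_dec, b_dec, l1_coeff=0.5)
active = [i for i, a in enumerate(features) if a > 0.0]
```

In this toy case only one of the three hidden features is active for the input, which is the behavior the sparsity penalty is meant to encourage. In interpretability work the hidden layer is typically much wider than the input, and the input vectors are activations recorded from a language model rather than hand-written numbers.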
