AI GLOSSARY
Activation Steering
Research & Advanced Concepts
A technique for influencing a neural network's behavior by directly modifying its internal activations during inference, typically by adding or subtracting vectors that correspond to specific concepts or behaviors. Rather than changing the model's weights or rewording the prompt, activation steering intervenes on the model's intermediate representations, which makes it a valuable tool for interpretability and alignment research. Adding such a vector is often called activation addition; subtracting one is equivalent to adding it with a negative coefficient.
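As a rough illustration of the definition above, the following toy sketch steers a tiny hand-written network in plain Python. Everything in it is an assumption made for illustration: the weights `W` and `v`, the helper names, the scaling coefficient `alpha`, and the choice to build the steering vector as a difference of mean activations over two contrasting input sets (one common approach, not the only one). Real models are steered at a chosen layer of a much larger network, usually through a framework's hook mechanism.

```python
# Toy sketch of activation steering. All names and numbers are hypothetical.

def hidden_layer(x, W):
    """Linear layer followed by ReLU: returns the hidden activations."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def readout(h, v):
    """Linear readout from hidden activations to a scalar output."""
    return sum(vi * hi for vi, hi in zip(v, h))

def mean_activation(inputs, W):
    """Average hidden activation over a set of inputs."""
    acts = [hidden_layer(x, W) for x in inputs]
    return [sum(col) / len(acts) for col in zip(*acts)]

def forward(x, W, v, steering=None, alpha=0.0):
    """Run the model, optionally adding alpha * steering to the hidden layer."""
    h = hidden_layer(x, W)
    if steering is not None:
        # The intervention: edit the activations, not the weights.
        h = [hi + alpha * si for hi, si in zip(h, steering)]
    return readout(h, v)

# Fixed weights; they are never modified below.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [1.0, -1.0, 0.5]

# Two contrasting input sets standing in for "concept present" / "concept absent".
positive = [[1.0, 0.0], [2.0, 0.0]]
negative = [[0.0, 1.0], [0.0, 2.0]]

# Steering vector: difference of mean hidden activations (an assumed recipe).
steering = [p - n for p, n in zip(mean_activation(positive, W),
                                  mean_activation(negative, W))]

x = [1.0, 1.0]
baseline = forward(x, W, v)                             # no intervention
steered_up = forward(x, W, v, steering, alpha=1.0)      # add the vector
steered_down = forward(x, W, v, steering, alpha=-1.0)   # subtract the vector

print(steering, baseline, steered_up, steered_down)
```

In this sketch the same input produces different outputs depending only on the sign and size of `alpha`, while `W` and `v` stay untouched. That is the point of the technique: behavior changes at inference time without retraining or a different prompt.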