
AI GLOSSARY

Activation Steering

Research & Advanced Concepts

A technique for influencing a neural network's behavior by directly modifying its internal activations during inference, typically by adding or subtracting vectors that correspond to specific concepts or behaviors. Rather than changing the model's weights or its prompt, activation steering intervenes on the model's intermediate representations, which makes it a useful tool for interpretability and alignment research. Adding a scaled steering vector is commonly called activation addition; subtracting one is usually treated as the same operation with a negative coefficient rather than as a separately named technique.
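As a rough illustration of the idea, here is a minimal sketch on a toy two-layer network written with NumPy. Everything in it is an assumption made for the example: the weights `W1` and `W2`, the `hidden` and `forward` helpers, the `coeff` parameter, and the difference-of-means recipe used to build `steering_vector`. Real work on language models usually intervenes on a chosen layer of a transformer, which this sketch does not attempt to reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 4 inputs -> 8 hidden units -> 3 outputs.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))


def hidden(x):
    """Hidden-layer activations for input x."""
    return np.tanh(W1 @ x)


def forward(x, steering_vector=None, coeff=0.0):
    """Run the toy model, optionally steering the hidden activations."""
    h = hidden(x)
    if steering_vector is not None:
        # The intervention: shift the activations at inference time.
        # The weights W1 and W2 are never modified.
        h = h + coeff * steering_vector
    return W2 @ h


# Assumed recipe for a steering vector: the difference of mean hidden
# activations between inputs that do and do not show the concept.
with_concept = rng.normal(loc=1.0, size=(16, 4))
without_concept = rng.normal(loc=-1.0, size=(16, 4))
steering_vector = (
    np.mean([hidden(x) for x in with_concept], axis=0)
    - np.mean([hidden(x) for x in without_concept], axis=0)
)

x = rng.normal(size=4)
baseline = forward(x)
amplified = forward(x, steering_vector, coeff=2.0)    # push toward the concept
suppressed = forward(x, steering_vector, coeff=-2.0)  # push away from it
```

A positive `coeff` and a negative `coeff` are the same operation with opposite signs, and the prompt (here, the input `x`) and the weights stay fixed throughout. Because the output layer in this toy model is linear, the change in output is exactly `coeff * (W2 @ steering_vector)`; in a deep nonlinear model the effect of a steering vector is not this simple to predict.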
