AI GLOSSARY
Activation Subtraction
Research & Advanced Concepts
A form of activation steering in which a concept vector is subtracted from a model's internal activations during inference to suppress or remove a target behaviour. Subtracting the vector associated with a concept, such as deception or refusal, reduces the model's tendency to exhibit that concept, and it does so without retraining or fine-tuning.
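The arithmetic behind the definition can be illustrated with a minimal sketch on toy numbers. Everything named here is an assumption made for illustration: the function `subtract_concept`, the scaling factor `alpha`, and the four-dimensional vectors are hypothetical, and the concept vector is simply given rather than derived from a real model. In a real setting the same subtraction would presumably be applied to the activations of a chosen layer during the forward pass.

```python
from typing import List


def subtract_concept(activation: List[float],
                     concept: List[float],
                     alpha: float = 1.0) -> List[float]:
    """Return activation - alpha * concept, computed element-wise.

    `alpha` is a hypothetical strength parameter: 0.0 leaves the
    activation unchanged, larger values subtract more of the concept.
    """
    if len(activation) != len(concept):
        raise ValueError("activation and concept must have the same length")
    return [a - alpha * c for a, c in zip(activation, concept)]


def dot(u: List[float], v: List[float]) -> float:
    """Dot product, used here to measure how much of the concept is present."""
    return sum(x * y for x, y in zip(u, v))


# Toy values, not taken from any real model.
hidden = [2.0, 1.0, 0.0, 3.0]    # one activation vector
concept = [1.0, 0.0, 0.0, 0.0]   # assumed unit-length concept direction

# Subtract the concept direction with strength 2.0.
steered = subtract_concept(hidden, concept, alpha=2.0)

before = dot(hidden, concept)    # component along the concept before steering
after = dot(steered, concept)    # component along the concept after steering
```

In this toy case `alpha` is chosen to equal the original component along the concept direction, so the steered vector has no remaining component along it while the other coordinates are untouched. With a real model, the strength would need to be tuned empirically, since subtracting too much can degrade unrelated behaviour.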
See also: Activation Steering, Activation Addition, Representation Engineering.
