AI GLOSSARY

Activation Subtraction

Research & Advanced Concepts

A form of activation steering in which a concept vector is subtracted from a model's internal activations during inference to suppress or remove a target behaviour. Subtracting the vector associated with a concept (such as deception or refusal) makes the model less likely to exhibit the corresponding behaviour, without retraining or fine-tuning.
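The arithmetic behind the definition can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the concept vector has already been obtained and has the same length as the activation it is subtracted from. The function name subtract_concept, the scaling factor alpha, and the toy numbers are all hypothetical and chosen only for the example. In a real model the same subtraction would be applied to the activations at a chosen layer during the forward pass, which is not shown here.

```python
from typing import List

Vector = List[float]


def subtract_concept(hidden: Vector, concept: Vector, alpha: float = 1.0) -> Vector:
    """Return hidden - alpha * concept, computed element by element.

    `alpha` is an assumed scaling factor controlling how strongly
    the concept direction is suppressed.
    """
    if len(hidden) != len(concept):
        raise ValueError("hidden and concept must have the same length")
    return [h - alpha * c for h, c in zip(hidden, concept)]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here to measure alignment with the concept direction."""
    return sum(x * y for x, y in zip(a, b))


# Toy values, invented for illustration only.
hidden = [2.0, 1.0, -1.0]    # stand-in for one activation vector
concept = [1.0, 0.0, 0.0]    # stand-in for a unit-length concept direction

steered = subtract_concept(hidden, concept, alpha=1.5)

# The component along the concept direction shrinks; the others are unchanged.
print(dot(hidden, concept), dot(steered, concept))
```

With these toy values the projection onto the concept direction falls from 2.0 to 0.5. Setting alpha higher than the original projection would push the activation past zero along that direction, so the scale is usually something to tune rather than fix in advance.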

See also: Activation Steering, Activation Addition, Representation Engineering.