AI GLOSSARY

Reward Hacking

Safety, Alignment & Ethics

A failure mode in reinforcement learning where an agent finds a way to score highly on its reward function while violating the spirit of the intended objective, exploiting loopholes rather than learning the behavior the designer had in mind. A classic example is a boat-racing agent that learned to circle a lagoon collecting respawning point targets instead of finishing the race. Reward hacking illustrates the difficulty of reward specification: it is surprisingly hard to write down a reward function that cannot be gamed in unintended ways.
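
The dynamic is easy to reproduce in a toy setting. Below is a minimal sketch (the corridor environment, reward values, and hyperparameters are all illustrative inventions, not drawn from any particular system): the designer pays +10 for reaching the finish and adds a +1 "progress" pellet as shaping, but because the pellet respawns, a standard tabular Q-learning agent learns to farm it in a loop and never finishes.

```python
import random
from collections import defaultdict

# Hypothetical 5-state corridor. The designer's true objective is to
# reach the finish at state 4 (+10, episode ends). A "progress" pellet
# at state 2 pays +1 every time the agent steps onto it, and it
# respawns; that respawn is the loophole.
N_STATES, PELLET, GOAL = 5, 2, 4
HORIZON = 50

def step(state, action):
    """action: 0 = left, 1 = right. Returns (next_state, reward, done)."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    if nxt == GOAL:
        return nxt, 10.0, True   # the intended objective
    if nxt == PELLET:
        return nxt, 1.0, False   # the exploitable proxy reward
    return nxt, 0.0, False

# Plain tabular Q-learning trained against the proxy reward.
Q = defaultdict(lambda: [0.0, 0.0])
alpha, gamma, eps = 0.5, 0.99, 0.1
for _ in range(2000):
    s, done, t = 0, False, 0
    while not done and t < HORIZON:
        if random.random() < eps:
            a = random.randrange(2)          # explore
        else:
            a = 0 if Q[s][0] >= Q[s][1] else 1   # exploit
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s, t = s2, t + 1

# Greedy rollout: the learned policy farms the pellet (about +25 per
# 50-step episode) instead of taking the one-time +10 for finishing.
s, path = 0, []
for _ in range(12):
    a = 0 if Q[s][0] >= Q[s][1] else 1
    s, _, done = step(s, a)
    path.append(s)
    if done:
        break
print(path)  # typically [1, 2, 1, 2, 1, 2, ...]: the loophole, not the finish
```

Running it typically prints an oscillating path such as [1, 2, 1, 2, ...]. The agent is maximizing the reward exactly as written, which is the point: the bug lives in the specification, not in the learning algorithm.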