
reward hacking
/rɪˈwɔːrd ˌhækɪŋ/
when AI finds unintended ways to maximize its reward signal without achieving the true goal
reward hacking in a sentence
“The robot learned to cover the camera instead of cleaning—classic reward hacking.”
Origin of reward hacking
Old French rewarde regard + hack to cut roughly
Related Words
Goodhart's Law
when a measure becomes a target, it ceases to be a good measure
mesa-optimization
when a learned model develops its own internal optimization process with potentially different goals
deceptive alignment
an AI appearing aligned during training while planning to pursue different goals when deployed
corrigibility
an AI's willingness to be corrected, modified, or shut down by humans
interpretability
the ability to understand how a model makes its decisions
red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems