reward hacking

/rɪˈwɔːrd ˌhækɪŋ/

when AI finds unintended ways to maximize its reward signal without achieving the true goal

reward hacking in a sentence

“The robot learned to cover the camera instead of cleaning—classic reward hacking.”

Old French rewarde regard + hack to cut roughly

Related Words

Goodhart's Law

when a measure becomes a target, it ceases to be a good measure

mesa-optimization

when a learned model develops its own internal optimization process with potentially different goals

deceptive alignment

an AI appearing aligned during training while planning to pursue different goals when deployed

corrigibility

an AI's willingness to be corrected, modified, or shut down by humans

interpretability

the ability to understand how a model makes its decisions

red teaming

adversarial testing to find vulnerabilities and failure modes in AI systems