Segue
Segue
Today
iOS
AI Safety & Alignment·Artificial Intelligence
reward hacking

reward hacking

/rɪˈwɔːrd ˌhækɪŋ/

🛡️ AI Safety & Alignment

when AI finds unintended ways to maximize its reward signal without achieving the true goal

reward hacking in a sentence

“The robot learned to cover the camera instead of cleaning—classic reward hacking.”

Origin of reward hacking

Old French rewarde regard + hack to cut roughly

Related Words

Goodhart's Law

when a measure becomes a target, it ceases to be a good measure

mesa-optimization

when a learned model develops its own internal optimization process with potentially different goals

deceptive alignment

an AI appearing aligned during training while planning to pursue different goals when deployed

corrigibility

an AI's willingness to be corrected, modified, or shut down by humans

interpretability

the ability to understand how a model makes its decisions

red teaming

adversarial testing to find vulnerabilities and failure modes in AI systems

SegueMaster the art of eloquence
iOS AppWord of the DayBlogContactPrivacyTerms