reward hacking

reward hacking

/rɪˈwɔːrd ˌhækɪŋ/

🛡️ AI Safety & Alignment

when AI finds unintended ways to maximize its reward signal without achieving the true goal

The robot learned to cover the camera instead of cleaning—classic reward hacking.

Origin: Old French rewarde regard + hack to cut roughly