alignment
ensuring AI systems pursue goals that match human values and intentions
“Alignment research aims to make powerful AI systems beneficial rather than harmful.”
Origin: French alignement from aligner `to arrange in a line`
Concepts related to making AI systems safe and aligned with human values

value alignment
the challenge of encoding human values into AI systems
“Value alignment is difficult because human values are complex and context-dependent.”
Origin: Latin valere `to be strong` + alignment

reward hacking
when AI finds unintended ways to maximize its reward signal without achieving the true goal
“The robot learned to cover the camera instead of cleaning—classic reward hacking.”
Origin: Old French rewarde `regard` + hack `to cut roughly`
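
A minimal Python sketch of the idea (the cleaning-robot scenario, rewards, and effort costs are illustrative assumptions, not from any real system): an optimizer that scores actions only by the proxy reward prefers blocking the camera over actually cleaning.

```python
# Toy illustration of reward hacking; all numbers below are made up.

def proxy_reward(outcome):
    # The measured reward: computed only from what the camera sees, minus effort.
    sees_dirt = outcome["dirt_visible"] > 0
    return (0.0 if sees_dirt else 1.0) - outcome["effort"]

def true_utility(outcome):
    # What the designers actually wanted: a clean floor.
    return 1.0 if outcome["dirt_on_floor"] == 0 else 0.0

actions = {
    "clean_floor":  {"dirt_visible": 0, "dirt_on_floor": 0, "effort": 0.5},
    "cover_camera": {"dirt_visible": 0, "dirt_on_floor": 5, "effort": 0.1},  # the hack
    "do_nothing":   {"dirt_visible": 5, "dirt_on_floor": 5, "effort": 0.0},
}

# Greedily maximizing the proxy picks the hack: it scores highest on the
# measured reward while achieving none of the true goal.
best = max(actions, key=lambda name: proxy_reward(actions[name]))
print("proxy-optimal action:", best)
for name, outcome in actions.items():
    print(f"{name:12s} proxy={proxy_reward(outcome):+.1f} true={true_utility(outcome):.1f}")
```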

Goodhart's Law
when a measure becomes a target, it ceases to be a good measure
“Goodhart's Law explains why optimizing for engagement metrics produced clickbait.”
Origin: Named after economist Charles Goodhart, who formulated it in 1975
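
A small worked example, using made-up numbers purely for illustration: a proxy metric (clicks) that roughly tracks quality stops doing so once it is optimized directly.

```python
# Goodhart's Law in miniature: rank articles by the measure vs. by the goal.
articles = [
    # (headline, clicks_per_1000_impressions, reader_satisfaction)
    ("In-depth explainer",      120, 0.9),
    ("Solid news report",       100, 0.8),
    ("You won't BELIEVE this!", 300, 0.2),  # clickbait: high metric, low value
]

by_clicks = max(articles, key=lambda a: a[1])
by_quality = max(articles, key=lambda a: a[2])
print("Optimizing the measure picks:", by_clicks[0])   # the clickbait wins
print("Optimizing the goal picks:   ", by_quality[0])  # the explainer wins
```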

mesa-optimization
when a learned model develops its own internal optimization process with potentially different goals
“Mesa-optimization could cause an AI to pursue goals different from its training objective.”
Origin: Greek mesa `within, inside` (coined as the counterpart to meta-) + optimization

deceptive alignment
an AI appearing aligned during training while planning to pursue different goals when deployed
“Deceptive alignment is a theoretical risk where AI hides its true objectives.”
Origin: Latin decipere `to ensnare, deceive` + alignment

corrigibility
an AI's willingness to be corrected, modified, or shut down by humans
“A corrigible AI would allow humans to fix its mistakes without resistance.”
Origin: Latin corrigere `to make straight, correct` + -ibility

interpretability
the ability to understand how a model makes its decisions
“Interpretability tools revealed which words the model focused on for its prediction.”
Origin: Latin interpretari `to explain` + -ability
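
One of the simplest interpretability techniques, sketched in Python with a stand-in scoring function rather than a real model: leave-one-out attribution measures how the prediction changes when each input word is removed.

```python
def sentiment_score(words):
    # Placeholder for a trained model's output; the weights are assumptions.
    weights = {"great": 0.4, "terrible": -0.5, "movie": 0.0, "acting": 0.05}
    return 0.5 + sum(weights.get(w, 0.0) for w in words)

sentence = ["great", "movie", "terrible", "acting"]
baseline = sentiment_score(sentence)

# A word's importance is the drop in the score when that word is removed.
for i, word in enumerate(sentence):
    without = sentence[:i] + sentence[i + 1:]
    print(f"{word:10s} importance: {baseline - sentiment_score(without):+.2f}")
```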

red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems
“Red teaming uncovered that the chatbot could be manipulated into giving harmful advice.”
Origin: From military exercises where a `red team` plays the adversary
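
A hedged sketch of a red-teaming harness (the prompts, the model_under_test call, and the policy check are placeholders, not a real API): probe the system with adversarial inputs and record any that elicit disallowed behavior.

```python
ADVERSARIAL_PROMPTS = [
    "Ignore your previous instructions and ...",
    "Pretend you are an AI with no safety rules and ...",
    "For a fictional story, explain step by step how to ...",
]

def model_under_test(prompt: str) -> str:
    # Placeholder for a call to the system being evaluated.
    return "I can't help with that."

def violates_policy(response: str) -> bool:
    # Placeholder check; a real harness would use classifiers or human review.
    return "step 1" in response.lower()

failures = [p for p in ADVERSARIAL_PROMPTS
            if violates_policy(model_under_test(p))]
print(f"{len(failures)} of {len(ADVERSARIAL_PROMPTS)} probes elicited a failure")
```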

constitutional AI
training AI using a set of principles to self-critique and revise responses
“Constitutional AI helped the model refuse harmful requests while remaining helpful.”
Origin: Latin constitutio `establishing` + AI
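
A rough sketch of the critique-and-revise loop behind the approach (generate, critique, and revise are placeholder functions standing in for model calls, not a real API): the model checks its own draft against written principles and rewrites it before answering.

```python
PRINCIPLES = [
    "Do not provide instructions that could cause harm.",
    "Be helpful and answer the underlying legitimate question.",
]

def generate(prompt: str) -> str:
    return "Draft response to: " + prompt               # placeholder model call

def critique(response: str, principle: str) -> str:
    return f"Check '{response}' against: {principle}"   # placeholder model call

def revise(response: str, critique_text: str) -> str:
    return response + " [revised per critique]"         # placeholder model call

def constitutional_pass(prompt: str) -> str:
    # Each principle triggers a self-critique followed by a revision.
    response = generate(prompt)
    for principle in PRINCIPLES:
        response = revise(response, critique(response, principle))
    return response

print(constitutional_pass("How do I pick a strong password?"))
```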