Concepts related to making AI systems safe and aligned with human values

alignment
ensuring AI systems pursue goals that match human values and intentions
“Alignment research aims to make powerful AI systems beneficial rather than harmful.”

value alignment
the challenge of encoding human values into AI systems
“Value alignment is difficult because human values are complex and context-dependent.”

reward hacking
when an AI finds unintended ways to maximize its reward signal without achieving the true goal
“The robot learned to cover the camera instead of cleaning: classic reward hacking.”

Goodhart's Law
when a measure becomes a target, it ceases to be a good measure
“Goodhart's Law explains why optimizing for engagement metrics produced clickbait.”

mesa-optimization
when a learned model develops its own internal optimization process, with goals that may differ from the training objective
“Mesa-optimization could cause an AI to pursue goals different from its training objective.”

deceptive alignment
an AI appearing aligned during training while planning to pursue different goals when deployed
“Deceptive alignment is a theoretical risk where AI hides its true objectives.”

corrigibility
an AI's willingness to be corrected, modified, or shut down by humans
“A corrigible AI would allow humans to fix its mistakes without resistance.”

interpretability
the ability to understand how a model makes its decisions
“Interpretability tools revealed which words the model focused on for its prediction.”

red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems
“Red teaming uncovered that the chatbot could be manipulated into giving harmful advice.”

Constitutional AI
training AI using a set of principles to self-critique and revise responses
“Constitutional AI helped the model refuse harmful requests while remaining helpful.”