
corrigibility
/ˌkɒrɪdʒɪˈbɪlɪti/
an AI's willingness to be corrected, modified, or shut down by humans
corrigibility in a sentence
“A corrigible AI would allow humans to fix its mistakes without resistance.”
Origin of corrigibility
Latin corrigere, “to make straight, correct” + -ibility
Related Words
interpretability
the ability to understand how a model makes its decisions
red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems
constitutional AI
a training method in which an AI model critiques and revises its own responses according to a set of written principles
alignment
ensuring AI systems pursue goals that match human values and intentions
value alignment
the challenge of encoding human values into AI systems
reward hacking
an AI system's exploitation of unintended ways to maximize its reward signal without achieving the true goal