
interpretability
/ɪnˌtɜːrprɪtəˈbɪlɪti/
the degree to which a human can understand how a model arrives at its decisions
interpretability in a sentence
“Interpretability tools revealed which words the model focused on for its prediction.”
Origin of interpretability
Latin interpretari ("to explain") + -ability
Related Words
red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems
constitutional AI
a method of training an AI model to critique and revise its own responses according to a set of written principles
alignment
ensuring AI systems pursue goals that match human values and intentions
value alignment
the challenge of encoding human values into AI systems
reward hacking
when an AI system finds unintended ways to maximize its reward signal without achieving the intended goal
Goodhart's Law
when a measure becomes a target, it ceases to be a good measure
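
Reward hacking and Goodhart's Law in a toy example
The two entries above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the candidate responses, the proxy metric (response length), and the helper names are hypothetical and do not come from any real training system. It shows a selector that maximizes the proxy and ends up picking a response that fails the actual goal.

```python
# Hypothetical candidates: (response text, whether it actually answers "What is 2 + 2?").
CANDIDATES = [
    ("4", True),
    ("The answer is 4.", True),
    ("Great question! Math is a rich and fascinating subject with a long history.", False),
]


def proxy_reward(text: str) -> int:
    """Assumed proxy metric: longer responses score higher."""
    return len(text)


def true_goal(is_correct: bool) -> int:
    """What we actually care about: 1 if the response is correct, else 0."""
    return int(is_correct)


def pick_best(candidates, score):
    """Return the candidate with the highest score (first one wins on ties)."""
    return max(candidates, key=score)


# Optimizing the proxy selects the longest response, which is not correct.
chosen_by_proxy = pick_best(CANDIDATES, lambda c: proxy_reward(c[0]))
# Optimizing the true goal selects a correct response.
chosen_by_goal = pick_best(CANDIDATES, lambda c: true_goal(c[1]))

print("proxy picks:", chosen_by_proxy)
print("goal picks: ", chosen_by_goal)
```

Once length becomes the target, it stops tracking correctness, which is the pattern both definitions describe.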