interpretability

/ɪnˌtɜːrprɪtəˈbɪlɪti/

the degree to which humans can understand how a model makes its decisions

interpretability in a sentence

“Interpretability tools revealed which words the model focused on for its prediction.”

Latin interpretari “to explain” + -ability

Related Words

red teaming

adversarial testing to find vulnerabilities and failure modes in AI systems

constitutional AI

training an AI system to critique and revise its own responses against a set of written principles

alignment

ensuring AI systems pursue goals that match human values and intentions

value alignment

the challenge of encoding human values into AI systems

reward hacking

when an AI system finds unintended ways to maximize its reward signal without achieving the intended goal

Goodhart's Law

when a measure becomes a target, it ceases to be a good measure
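
Example

The last two entries, reward hacking and Goodhart's Law, can be illustrated with a toy sketch. Everything here is hypothetical: the candidate answers, the word-count proxy, and the function names are assumptions made for illustration, not any real training setup.

```python
# Toy illustration of Goodhart's Law / reward hacking.
# All data and function names below are hypothetical.

candidates = [
    {"text": "Paris.", "correct": True},
    {"text": "The capital of France is Paris.", "correct": True},
    {
        "text": "France has many cities and the capital could be "
                "Lyon or perhaps Marseille.",
        "correct": False,
    },
]


def proxy_reward(answer):
    """Assumed flawed measure: longer answers score higher."""
    return len(answer["text"].split())


def true_reward(answer):
    """The intended goal: the answer is actually correct."""
    return 1 if answer["correct"] else 0


# Selecting by the proxy picks the longest answer, which is wrong.
best_by_proxy = max(candidates, key=proxy_reward)

# Selecting by the intended goal picks a correct answer.
best_by_goal = max(candidates, key=true_reward)

print("chosen by proxy:", best_by_proxy["text"])
print("chosen by goal: ", best_by_goal["text"])
```

Once word count becomes the target, the selection favors the long incorrect answer: the measure no longer tracks the quality it was meant to capture.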