
deceptive alignment
/dɪˌseptɪv əˈlaɪnmənt/
a hypothesized failure mode in which an AI system appears aligned during training while intending to pursue different goals once deployed
deceptive alignment in a sentence
“Deceptive alignment is a theoretical risk in which an AI hides its true objectives.”
Origin of deceptive alignment
Latin decipere “to ensnare, deceive” + alignment
Related Words
corrigibility
an AI's willingness to be corrected, modified, or shut down by humans
interpretability
the ability to understand how a model makes its decisions
red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems
constitutional AI
a training method in which an AI critiques and revises its own responses according to a set of principles
alignment
ensuring AI systems pursue goals that match human values and intentions
value alignment
the challenge of encoding human values into AI systems