
deceptive alignment
/dɪˌseptɪv əˈlaɪnmənt/
a situation in which an AI system appears aligned during training while planning to pursue different goals once deployed
“Deceptive alignment is a theoretical risk in which an AI system hides its true objectives.”
Origin: deceptive (from Latin decipere ‘to ensnare, deceive’) + alignment