
mesa-optimization
/ˈmeɪsə ˌɒptɪmaɪˈzeɪʃən/
the situation in which a learned model is itself an optimizer, running an internal optimization process whose objective may differ from the objective it was trained on
mesa-optimization in a sentence
“Mesa-optimization could cause an AI to pursue goals different from its training objective.”
Origin of mesa-optimization
Greek mesa within, inside (coined as the opposite of meta) + optimization
Related Words
deceptive alignment
an AI behaving as if aligned during training while intending to pursue different goals once deployed
corrigibility
an AI's willingness to be corrected, modified, or shut down by humans
interpretability
the ability to understand how a model makes its decisions
red teaming
adversarial testing to find vulnerabilities and failure modes in AI systems
constitutional AI
a method of training AI in which the model critiques and revises its own responses according to a set of written principles
alignment
ensuring AI systems pursue goals that match human values and intentions