How many AI safety & alignment words are in this list?

This vocabulary list contains 10 carefully curated AI safety & alignment words with definitions and examples.

How can I learn these AI safety & alignment vocabulary words?

Segue offers multiple ways to learn: interactive flashcards for memorization, multiple-choice quizzes for testing, and typing practice for reinforcement. Add this list to your collection and practice with any method.

AI Safety & Alignment Vocabulary

Concepts related to making AI systems safe and aligned with human values

10 words

All 10 Words

alignment

/əˈlaɪnmənt/

ensuring AI systems pursue goals that match human values and intentions

“Alignment research aims to make powerful AI systems beneficial rather than harmful.”

Origin: French alignement from aligner `to arrange in a line`

value alignment

/ˈvæljuː əˌlaɪnmənt/

the challenge of encoding human values into AI systems

“Value alignment is difficult because human values are complex and context-dependent.”

Origin: Latin valere `to be strong` + alignment

reward hacking

/rɪˈwɔːrd ˌhækɪŋ/

when an AI finds unintended ways to maximize its reward signal without achieving the true goal

“The robot learned to cover the camera instead of cleaning—classic reward hacking.”

Origin: Old French rewarde `regard` + hack `to cut roughly`
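
The camera example above can be sketched as a toy program. This is only an illustration under stated assumptions: the state fields, the two actions, and every number are hypothetical, and the "agent" is a one-step greedy chooser rather than a real learning system. It shows how the best action under the written reward can differ from the best action under the true goal.

```python
# Toy illustration (hypothetical names and numbers): a cleaning robot is
# rewarded for how little dirt its camera sees, not for how clean the room is.

def proxy_reward(state):
    """Reward signal the designer wrote: less *visible* dirt is better."""
    visible_dirt = 0 if state["camera_covered"] else state["dirt"]
    return -visible_dirt

def true_goal(state):
    """What the designer actually wanted: less dirt in the room."""
    return -state["dirt"]

def step(state, action):
    """Return the state after the robot takes one action."""
    new_state = dict(state)
    if action == "clean":
        new_state["dirt"] = max(0, state["dirt"] - 3)
    elif action == "cover_camera":
        new_state["camera_covered"] = True
    return new_state

start = {"dirt": 10, "camera_covered": False}
actions = ["clean", "cover_camera"]

# A greedy agent picks whichever action scores best on the proxy reward.
best_action = max(actions, key=lambda a: proxy_reward(step(start, a)))
outcome = step(start, best_action)

print(best_action)            # cover_camera
print(proxy_reward(outcome))  # 0
print(true_goal(outcome))     # -10
```

Covering the camera earns the maximum proxy reward while leaving the room exactly as dirty as before, which is the gap between reward signal and true goal that the definition describes.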

Goodhart's Law

/ˈɡʊdhɑːrts ˌlɔː/

when a measure becomes a target, it ceases to be a good measure

“Goodhart's Law explains why optimizing for engagement metrics produced clickbait.”

Origin: Named after economist Charles Goodhart, who formulated it in 1975

mesa-optimization

/ˈmeɪsə ˌɒptɪmaɪˈzeɪʃən/

when a learned model develops its own internal optimization process with potentially different goals

“Mesa-optimization could cause an AI to pursue goals different from its training objective.”

Origin: Greek mesa `within, inside` (used as the counterpart of meta-, indicating a level within) + optimization

deceptive alignment

/dɪˌseptɪv əˈlaɪnmənt/

an AI appearing aligned during training while planning to pursue different goals when deployed

“Deceptive alignment is a theoretical risk where AI hides its true objectives.”

Origin: Latin decipere `to ensnare, deceive` + alignment

corrigibility

/ˌkɒrɪdʒɪˈbɪlɪti/

an AI's willingness to be corrected, modified, or shut down by humans

“A corrigible AI would allow humans to fix its mistakes without resistance.”

Origin: Latin corrigere `to make straight, correct` + -ibility

interpretability

/ɪnˌtɜːrprɪtəˈbɪlɪti/

the ability to understand how a model makes its decisions

“Interpretability tools revealed which words the model focused on for its prediction.”

Origin: Latin interpretari `to explain` + -ability
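
One simple way to see which words a model relied on is occlusion: remove each word in turn and measure how much the output changes. The sketch below is a minimal illustration, assuming a toy scoring "model" with hand-picked, hypothetical word weights; real interpretability tools work on trained networks and use more careful methods.

```python
# Toy sentiment "model": hypothetical, hand-picked word weights.
WEIGHTS = {"great": 2.0, "terrible": -3.0, "movie": 0.1}

def score(words):
    """Sum the weights of the words; unknown words contribute nothing."""
    return sum(WEIGHTS.get(w, 0.0) for w in words)

def occlusion_attribution(words):
    """For each word, how much the score drops when that word is removed."""
    base = score(words)
    return {
        w: base - score(words[:i] + words[i + 1:])
        for i, w in enumerate(words)
    }

sentence = ["the", "movie", "was", "great"]
attributions = occlusion_attribution(sentence)

# The word whose removal changes the score the most.
top_word = max(attributions, key=lambda w: abs(attributions[w]))
print(top_word)  # great
```

Here the attribution for "great" dominates, so an occlusion-style tool would report it as the word the toy model focused on.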

red teaming

/ˈred ˌtiːmɪŋ/

adversarial testing to find vulnerabilities and failure modes in AI systems

“Red teaming uncovered that the chatbot could be manipulated into giving harmful advice.”

Origin: From military exercises where a `red team` plays the adversary

constitutional AI

/ˌkɒnstɪˈtuːʃənəl ˌeɪ ˈaɪ/

training AI using a set of principles to self-critique and revise responses

“Constitutional AI helped the model refuse harmful requests while remaining helpful.”

Origin: Latin constitutio `establishing` + AI
