
red teaming
/ˈred ˌtiːmɪŋ/
adversarial testing that deliberately probes an AI system to find its vulnerabilities and failure modes
red teaming in a sentence
“Red teaming revealed that the chatbot could be manipulated into giving harmful advice.”
Origin of red teaming
From military exercises in which a designated “red team” plays the adversary
Related Words
constitutional AI
training an AI model to critique and revise its own responses according to a set of written principles
alignment
ensuring AI systems pursue goals that match human values and intentions
value alignment
the challenge of encoding human values into AI systems
reward hacking
when an AI system finds unintended ways to maximize its reward signal without achieving the intended goal
Goodhart's Law
when a measure becomes a target, it ceases to be a good measure
mesa-optimization
when a learned model develops its own internal optimization process, whose objective may differ from the one it was trained on