red teaming

/ˈred ˌtiːmɪŋ/

adversarial testing to find vulnerabilities and failure modes in AI systems

red teaming in a sentence

“Red teaming uncovered that the chatbot could be manipulated into giving harmful advice.”

From military exercises where a red team plays the adversary

Related Words

constitutional AI

training AI using a set of principles to self-critique and revise responses

alignment

ensuring AI systems pursue goals that match human values and intentions

value alignment

the challenge of encoding human values into AI systems

reward hacking

when AI finds unintended ways to maximize its reward signal without achieving the true goal

Goodhart's Law

when a measure becomes a target, it ceases to be a good measure

mesa-optimization

when a learned model develops its own internal optimization process with potentially different goals