Loading collection...
Loading collection...
Measuring and monitoring AI performance

systematic tests to measure model performance on specific tasks
“Running evals after every prompt change ensures no regressions.”

using a strong LLM to evaluate the outputs of another model
“LLM-as-a-Judge scales evaluation better than human review.”

the actual absolute truth or correct answer used for comparison
“Comparing model output against ground truth reveals accuracy.”

recording the flow of execution and data through a complex system
“Tracing showed exactly which retrieval step failed.”

the frequency with which a model generates incorrect information
“The goal is to minimize the hallucination rate.”

a standardized test used to compare performance
“MMLU is a popular benchmark for general knowledge.”
Explore other vocabulary categories in this collection.