evals
/ɪˈvælz/systematic tests to measure model performance on specific tasks
“Running evals after every prompt change ensures no regressions.”
Origin: Short for evaluations; French évaluer `find the value`
Loading collection...
Measuring and monitoring AI performance
systematic tests to measure model performance on specific tasks
“Running evals after every prompt change ensures no regressions.”
Origin: Short for evaluations; French évaluer `find the value`
using a strong LLM to evaluate the outputs of another model
“LLM-as-a-Judge scales evaluation better than human review.”
Origin: Industry term (Zheng et al., 2023)
the actual absolute truth or correct answer used for comparison
“Comparing model output against ground truth reveals accuracy.”
Origin: Originally from cartography/remote sensing
recording the flow of execution and data through a complex system
“Tracing showed exactly which retrieval step failed.”
Origin: Old French tracier `to look for`
the frequency with which a model generates incorrect information
“The goal is to minimize the hallucination rate.”
Origin: Latin alucinari + rate
a standardized test used to compare performance
“MMLU is a popular benchmark for general knowledge.”
Origin: Surveying term; a surveyor's mark on a stone
Explore other vocabulary categories in this collection.