benchmark

/ˈbentʃmɑːrk/

a standardized test or dataset used to compare the performance of different models or systems

benchmark in a sentence

“MMLU is a popular benchmark for general knowledge.”

Origin: a surveying term for a mark cut into stone as a fixed reference point

Related Words

evals

systematic tests to measure model performance on specific tasks

LLM-as-a-Judge

using a strong LLM to evaluate the outputs of another model

ground truth

the reference answer or label accepted as correct, against which model outputs are compared

tracing

recording the flow of execution and data through a complex system

hallucination rate

the proportion of a model's outputs that contain fabricated or unsupported information presented as fact
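
Several of the terms above can be tied together in a minimal sketch: a tiny eval that scores model answers against ground truth and reports a hallucination rate. Everything here is an illustrative assumption rather than a standard API: the `Example` record, the exact-match scoring rule, and the `hallucinated` flag, which is assumed to come from a human reviewer or an LLM-as-a-Judge step.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One eval item (hypothetical record layout)."""
    question: str
    ground_truth: str   # reference answer accepted as correct
    model_answer: str   # output of the model under test
    hallucinated: bool  # assumed label from a reviewer or judge model


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(examples: list[Example]) -> float:
    """Fraction of answers that match the ground truth after normalization."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        normalize(e.model_answer) == normalize(e.ground_truth) for e in examples
    )
    return correct / len(examples)


def hallucination_rate(examples: list[Example]) -> float:
    """Fraction of answers flagged as containing fabricated information."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(e.hallucinated for e in examples) / len(examples)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Example("Capital of France?", "Paris", "paris", False),
        Example("2 + 2?", "4", "5", True),
    ]
    print(f"accuracy: {accuracy(sample):.2f}")
    print(f"hallucination rate: {hallucination_rate(sample):.2f}")
```

Exact match is the simplest scoring rule; real benchmarks often use task-specific metrics or a judge model instead.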