How many evaluation & observability words are in this list?

This vocabulary list contains 15 carefully curated evaluation & observability words with definitions and examples.

How can I learn these evaluation & observability vocabulary words?

Segue offers multiple ways to learn: interactive flashcards for memorization, multiple-choice quizzes for testing, and typing practice for reinforcement. Add this list to your collection and practice with any method.

📏

Evaluation & Observability Vocabulary

Measuring and monitoring AI performance

All 15 Words

evals

/ɪˈvælz/

systematic tests to measure model performance on specific tasks

“Running evals after every prompt change ensures no regressions.”

LLM-as-a-Judge

/ˌel el ˈem æz ə ˈdʒʌdʒ/

using a strong LLM to evaluate the outputs of another model

“LLM-as-a-Judge scales evaluation better than human review.”

ground truth

/ˈɡraʊnd ˌtruːθ/

a trusted reference label or answer used for evaluation, which may itself contain uncertainty or annotation error

“The team compared model output with human-reviewed reference answers and audited disputed labels.”

tracing

/ˈtreɪsɪŋ/

recording the flow of execution and data through a complex system

“Tracing showed exactly which retrieval step failed.”

hallucination rate

/həˌluːsɪˈneɪʃən ˌreɪt/

the share of a model's outputs that present fabricated or unsupported information as fact

“The goal is to minimize the hallucination rate.”

benchmark

/ˈbentʃmɑːrk/

a standardized test used to compare performance

“MMLU is a popular benchmark for general knowledge.”

golden dataset

/ˈɡoʊldən ˈdeɪtəset/

a hand-verified set of examples used as the standard for judging model output

“Every release is scored against the golden dataset before it ships.”

rubric

/ˈruːbrɪk/

an explicit scoring guide that turns judgment into repeatable criteria

“The judge model grades each answer against a five-point rubric.”

pass@k

/ˌpæs ət ˈkeɪ/

the probability that at least one of k sampled attempts is correct

“The coding model hit 90% pass@10 but only 55% pass@1.”
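
When n attempts are sampled for a problem and c of them are correct, a commonly used estimate of pass@k is 1 − C(n − c, k) / C(n, k): one minus the chance that a random group of k attempts contains no correct one. The sketch below is illustrative only; the function name `pass_at_k` and its arguments are assumptions, not part of any particular library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n attempts
    # contains only incorrect attempts.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


# Example: 10 attempts sampled, 5 correct.
print(pass_at_k(10, 5, 1))
print(pass_at_k(10, 5, 10))
```

Averaging this value over every problem in a test set gives the overall pass@k score.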

drift

/drɪft/

gradual change in inputs or behavior that quietly erodes performance

“Monitoring caught the drift when user queries shifted to a new product line.”

error analysis

/ˈerər əˌnæləsɪs/

systematically reading failures to find the patterns worth fixing

“Error analysis showed half the misses came from date formats.”

perplexity

/pərˈpleksəti/

a measure of how surprised a model is by text; lower means better prediction

“Perplexity fell steadily as pre-training progressed.”
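
Perplexity is conventionally computed as the exponential of the average negative log-probability the model assigns to each token. The sketch below assumes the per-token natural-log probabilities are already available; the function name `perplexity` and the input list are hypothetical.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Return exp of the mean negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# A model that gives every token probability 0.25 is as uncertain
# as a uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 4))
```

A model that predicts every token with certainty has a perplexity of 1, the lowest possible value.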

benchmark saturation

/ˈbentʃmɑːrk ˌsætʃəˈreɪʃən/

when models max out a test so it can no longer tell them apart

“Benchmark saturation forced the lab to commission a harder eval.”

observability

/əbˌzɜːrvəˈbɪləti/

the degree to which a system's inner behavior can be inferred from what it emits

“Tracing gave the pipeline enough observability to find the slow retrieval step.”

verbosity bias

/vərˈbɑːsəti ˈbaɪəs/

a judge's tendency to score longer answers higher regardless of quality

“Controlling for verbosity bias flipped the leaderboard.”
