How language models generate responses at runtime

the process of using a trained model to generate predictions or outputs
“Inference latency determines how quickly the chatbot can respond.”

a parameter that scales the logits before softmax to control randomness in generation: higher values give more varied output, lower values give more deterministic output
“Setting temperature to 0.7 balances creativity with coherence.”
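
A minimal sketch of how temperature reshapes a distribution, assuming the common convention of dividing each logit by the temperature before applying softmax. The function name `softmax_with_temperature` and the example logits are illustrative, not taken from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Assumption: temperature divides the logits before normalization.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature
```

Here `sharp` puts more probability on the top token than `flat` does, which is why low temperatures behave more deterministically.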

randomly selecting the next token from the probability distribution rather than always choosing the most likely
“Top-p sampling only considers the smallest set of most likely tokens whose cumulative probability reaches a threshold.”
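
The contrast between sampling and always taking the most likely token can be sketched as follows. The token probabilities are invented for illustration, and `sample_token` and `greedy_token` are hypothetical helper names.

```python
import random
from collections import Counter

def sample_token(probs, rng=random):
    # Draw one token at random, weighted by its probability.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_token(probs):
    # Always return the single most likely token.
    return max(probs, key=probs.get)

probs = {"the": 0.6, "a": 0.3, "this": 0.1}  # made-up distribution
rng = random.Random(0)  # seeded only so the run is repeatable
counts = Counter(sample_token(probs, rng) for _ in range(1000))
```

Over many draws, `counts` roughly follows the distribution, while `greedy_token` returns the same token every time.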

a search algorithm that explores multiple candidate sequences simultaneously
“Beam search with width 5 tracks the five most promising response paths.”
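
A small sketch of beam search over a hypothetical toy model, where `TOY_MODEL` maps the last token to next-token probabilities (a real model would condition on the whole sequence). Scores are summed log probabilities, and the beam width of 2 is chosen only to keep the example small.

```python
import math

# Hypothetical toy "model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(model, width=2, start="<s>", end="</s>", max_steps=10):
    # Each beam is (log_probability, token_list).
    beams = [(0.0, [start])]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end:
                candidates.append((score, tokens))  # finished: carry forward
                continue
            for token, p in model[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Keep only the `width` highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
        if all(tokens[-1] == end for _, tokens in beams):
            break
    return beams

best_score, best_tokens = beam_search(TOY_MODEL, width=2)[0]
```

In this toy model the best sequence starts with the less likely first token ("a"), and a beam of width 2 keeps that path alive long enough to find it.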

always selecting the highest probability token at each step
“Greedy decoding is fast but may miss better overall sequences.”
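
A sketch of greedy decoding on a hypothetical toy model, where `TOY_MODEL` maps the last token to next-token probabilities. The numbers are invented so that the locally best choice leads to a worse overall sequence.

```python
# Hypothetical toy "model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", end="</s>", max_steps=10):
    tokens = [start]
    for _ in range(max_steps):
        dist = model[tokens[-1]]
        next_token = max(dist, key=dist.get)  # highest-probability token
        tokens.append(next_token)
        if next_token == end:
            break
    return tokens

greedy_path = greedy_decode(TOY_MODEL)
```

Greedy decoding picks "the" then "cat", for a sequence probability of 0.6 × 0.5 = 0.30, while the sequence starting with "a" then "dog" has probability 0.4 × 0.9 = 0.36.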

sampling only from the k most likely next tokens
“Top-k sampling with k=50 excludes every token outside the 50 most likely, so rare, nonsensical tokens are unlikely to be selected.”
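
One way to sketch top-k sampling, assuming the distribution is given as a token-to-probability dictionary. The function name `top_k_sample` and the example probabilities are illustrative.

```python
import random

def top_k_sample(probs, k, rng=random):
    # Keep only the k most likely tokens.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [p for _, p in top]
    # random.choices treats weights as relative, so no renormalization is needed.
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "zebra": 0.15, "qux": 0.05}  # made-up values
token = top_k_sample(probs, k=2)
```

With `k=2`, only "the" and "a" can ever be returned, regardless of how many samples are drawn.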

sampling from the smallest set of most likely tokens whose cumulative probability reaches a threshold p (also called top-p sampling)
“Nucleus sampling with p=0.95 adapts the size of the candidate set to how certain the model is in context.”
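
A sketch of nucleus (top-p) sampling that shows how the candidate set grows or shrinks with the shape of the distribution. The helper names and both example distributions are assumptions made for illustration.

```python
import random

def nucleus_candidates(probs, p):
    # Walk tokens from most to least likely until their mass reaches p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept

def top_p_sample(probs, p, rng=random):
    kept = nucleus_candidates(probs, p)
    tokens = [t for t, _ in kept]
    weights = [w for _, w in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

confident = {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01}            # peaked
uncertain = {"red": 0.3, "blue": 0.3, "green": 0.25, "pink": 0.15}  # flat

small_set = nucleus_candidates(confident, 0.95)
large_set = nucleus_candidates(uncertain, 0.95)
```

With p=0.95 the peaked distribution keeps a single candidate, while the flat one keeps all four.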

raw, unnormalized scores output by the model before conversion to probabilities
“Logits are converted to probabilities using the softmax function.”

a function that converts logits into a probability distribution summing to one
“Softmax exponentiates each logit and normalizes so all probabilities sum to 1.”
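
The exponentiate-then-normalize step can be sketched in a few lines. The example logits are made up, and subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up logits for three tokens
```

The resulting `probs` are all positive, sum to 1, and preserve the ordering of the logits.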

cached attention key and value tensors from previous tokens, reused to speed up autoregressive generation
“The KV cache avoids recomputing the keys and values of earlier tokens at each step.”
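
A simplified, single-head sketch of the idea in plain Python. `project` is a hypothetical stand-in for a model's learned query, key, and value projections, and the `projections` counter exists only to show how much work the cache saves.

```python
import math

def project(x):
    # Hypothetical stand-in for learned query/key/value projections.
    return list(x), [2.0 * v for v in x], [v + 1.0 for v in x]

def attend(query, keys, values):
    # Scaled dot-product attention of one query over all given positions.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def generate_with_cache(inputs):
    keys, values, outputs, projections = [], [], [], 0
    for x in inputs:
        q, k, v = project(x)  # only the newest token is projected
        projections += 1
        keys.append(k)
        values.append(v)
        outputs.append(attend(q, keys, values))
    return outputs, projections

def generate_without_cache(inputs):
    outputs, projections = [], 0
    for t in range(len(inputs)):
        projected = [project(x) for x in inputs[: t + 1]]  # redo every token
        projections += t + 1
        q = projected[-1][0]
        keys = [k for _, k, _ in projected]
        values = [v for _, _, v in projected]
        outputs.append(attend(q, keys, values))
    return outputs, projections

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # made-up embeddings
cached_out, cached_cost = generate_with_cache(inputs)
full_out, full_cost = generate_without_cache(inputs)
```

Both versions produce the same outputs, but for these four tokens the cached version performs 4 projections while the uncached version performs 10.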