
speculative decoding
/ˈspekjələtɪv diˈkoʊdɪŋ/
an inference technique in which a small draft model proposes several tokens ahead and a larger target model verifies them in parallel, keeping the ones it accepts
speculative decoding in a sentence
“Speculative decoding doubled the inference speed without losing quality.”
Origin of speculative decoding
speculative (from Latin speculari, “to spy out”) + decoding
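The draft-then-verify loop in the definition can be illustrated with a minimal sketch. It assumes greedy decoding and treats each model as a plain function from a token list to its next token; the names speculative_step, generate, toy_target, and toy_draft are hypothetical and not taken from any library. Production systems verify all drafted positions in one batched forward pass and usually accept or reject by comparing probabilities; the loop below only mimics the outcome.

```python
from typing import Callable, List

# Hypothetical model type: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the extended sequence."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The large target model checks each proposed position in order.
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            break
        accepted.append(guess)
    else:
        # Every draft token matched, so the target adds one bonus token.
        accepted.append(target(tokens + accepted))
    return tokens + accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Repeat speculative rounds until n_new tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        tokens = speculative_step(target, draft, tokens, k)
    return tokens[:len(prompt) + n_new]


# Toy stand-ins for real models (assumptions for illustration only).
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except right after token 2, where it guesses wrong.
    return 9 if tokens[-1] == 2 else (tokens[-1] + 1) % 10
```

Under greedy verification the output matches what the target model alone would produce; the draft model only changes how many target calls are needed, which is where the speedup comes from.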
Related Words
KV cache
storing the attention keys and values of earlier tokens so they are not recomputed at each generation step
context caching
saving the processed state of a prompt prefix to avoid recomputing it
quantization
reducing the precision of model weights (e.g., to 4-bit) to save memory
LoRA
Low-Rank Adaptation; fine-tuning by freezing the original weights and training small low-rank matrices added alongside them
distillation
training a smaller 'student' model to mimic a larger 'teacher' model
Mixture of Experts
using multiple specialized sub-models (experts) and a router that sends each token to only a few of them
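The routing idea behind Mixture of Experts can be sketched as top-k gating. This is a minimal illustration under stated assumptions: each expert is a plain function on a single number, the gate scores are passed in directly (a real router computes them from the token with learned weights), and the names softmax and moe_forward are hypothetical.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical expert type: maps one scalar feature to one scalar output.
Expert = Callable[[float], float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, gate_scores: Sequence[float],
                experts: Sequence[Expert], top_k: int = 2) -> float:
    """Send x to the top_k highest-scoring experts and mix their outputs."""
    # Pick the indices of the top_k experts by gate score.
    ranked = sorted(range(len(experts)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate scores over the chosen experts only.
    weights = softmax([gate_scores[i] for i in chosen])
    # Only the chosen experts run; the others are skipped entirely.
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))
```

Because only top_k experts run per token, the compute per token stays roughly constant even as the total number of experts grows.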