
Mixture of Experts
/ˈmɪkstʃər əv ˈekspɜːrts/
a neural network architecture that combines multiple specialized sub-models (experts) with a gating network that routes each token to a small subset of them
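The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `route_top_k` and `moe_forward`, the toy scalar experts, and the hard-coded gate logits are all hypothetical, and renormalizing the softmax over only the selected experts is one common choice among several.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are renormalized over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the rest are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))

# Hypothetical toy experts: scalar functions standing in for feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
# In a real model these logits would come from a learned gating network.
gate_logits = [0.1, 2.0, -1.0, 2.0]

routes = route_top_k(gate_logits, k=2)
y = moe_forward(3.0, gate_logits, experts, k=2)
```

Only two of the four experts run for this token, which is why per-token compute grows more slowly than the total parameter count.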
Mixture of Experts in a sentence
“Mixture of Experts (MoE) scales a model's parameter count without a proportional increase in per-token inference compute.”
Origin of Mixture of Experts
Machine Learning term (Jacobs et al., 1991)
Related Words
speculative decoding
using a small model to draft tokens for verification by a large model
KV cache
storing the attention keys and values of earlier tokens so they are not recomputed at each generation step
context caching
saving the processed state of a prompt prefix to avoid recomputing it
quantization
reducing the precision of model weights (e.g., to 4-bit) to save memory
LoRA
Low-Rank Adaptation; fine-tuning by training small low-rank matrices added to frozen pretrained weights
distillation
training a smaller 'student' model to mimic a larger 'teacher' model
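To make the LoRA entry concrete, the arithmetic below compares the trainable parameters of a full weight matrix with those of a rank-r update built from two factors of shape (d_out, r) and (r, d_in). The function name `lora_param_counts` and the 4096-by-4096, rank-8 sizes are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora): trainable parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the whole matrix
    lora = rank * (d_out + d_in)   # updating only the two low-rank factors
    return full, lora

# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = lora / full
```

Under these assumed sizes the low-rank factors hold under half a percent of the parameters in the full matrix.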