Techniques for making models faster and more efficient

Quantization: reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to save memory
“4-bit quantization allows running a 70B model on a single GPU.”
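
The memory claim can be checked with simple arithmetic. The sketch below is illustrative only: it counts weight storage alone (no activations or KV cache) and uses a naive symmetric round-to-nearest scheme, which is an assumption and not the method of any particular library. The function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_4bit(weights):
    """Naive symmetric quantization to integer codes in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


fp16_gb = weight_memory_gb(70e9, 16)  # 70B parameters at 16 bits each
int4_gb = weight_memory_gb(70e9, 4)   # the same model at 4 bits each

weights = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_4bit(weights)
approx = dequantize(codes, scale)     # close to, but not equal to, weights
```

Real 4-bit schemes usually quantize in small blocks with one scale per block, which keeps the rounding error lower than a single global scale would.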

LoRA (Low-Rank Adaptation): fine-tuning by freezing the original weights and training only small low-rank update matrices
“LoRA makes fine-tuning large models computationally affordable.”
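
A rough sketch of why this is cheaper, assuming a single 4096 x 4096 weight matrix and rank 8 (both numbers are illustrative, not taken from any specific model). The frozen weight W is left untouched, and only the two small matrices A and B would be trained. All names are hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA update."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    s = alpha / rank
    return [b + s * u for b, u in zip(base, update)]


full, lora = lora_param_counts(4096, 4096, 8)

# With B initialized to zero, the adapted layer starts out identical
# to the frozen base layer.
W = [[1, 2], [3, 4]]
A = [[1, 1]]      # rank 1, d_in = 2
B = [[0], [0]]    # d_out = 2, rank 1
y = lora_forward(W, A, B, [1, 1], alpha=2, rank=1)
```

For this one matrix the LoRA update has 256 times fewer trainable parameters than full fine-tuning.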

Distillation: training a smaller 'student' model to mimic a larger 'teacher' model
“Distillation produced a small model with near-GPT-4 performance on specific tasks.”
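
One common way to make the student "mimic" the teacher is to match their softened output distributions. The sketch below assumes a temperature-scaled KL divergence as the training signal; the logits are made up for illustration, and a real setup would typically combine this term with the ordinary label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
close_student = [3.9, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A higher temperature flattens both distributions, so the student also learns how the teacher ranks the less likely classes.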

Mixture of Experts (MoE): using multiple specialized sub-models (experts) and routing each token to only a few of them
“Mixture of Experts (MoE) scales capacity without a proportional increase in per-token compute.”
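
The routing step can be sketched as top-k gating: a router scores every expert, only the k best-scoring experts run, and their outputs are mixed by renormalized weights. The experts here are toy scalar functions and the router logits are hard-coded; both are assumptions made purely for illustration.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())


# Four toy experts; only two of them are evaluated per input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
gates = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_layer(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

All four experts must still be held in memory, which is why the capacity gain comes with a larger memory footprint even though per-token compute stays close to that of k experts.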

Speculative decoding: using a small model to draft several tokens that a large model then verifies in parallel
“Speculative decoding doubled the inference speed without losing quality.”
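
A minimal sketch of the draft-and-verify loop, assuming greedy decoding and toy "models" that map a token list to the next token. This is the simplified greedy-matching variant, not the rejection-sampling scheme used when sampling. In a real system the verification of all drafted tokens happens in one batched forward pass of the large model, which is what `target_calls` counts here. All function names are hypothetical.

```python
def greedy_decode(next_token, prompt, num_new):
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, num_new, k=4):
    tokens = list(prompt)
    goal = len(tokens) + num_new
    target_calls = 0  # simulated batched passes of the large model
    while len(tokens) < goal:
        # 1. The small model drafts k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The large model checks the draft (one pass in practice).
        target_calls += 1
        accepted = []
        for t in draft:
            expected = target_next(tokens + accepted)
            if t != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(t)
        else:
            # Whole draft accepted: the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_calls


# Toy models: the draft agrees with the target except after token 5.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

baseline = greedy_decode(target, [0], 12)
fast, target_calls = speculative_decode(target, draft, [0], 12, k=4)
```

Because every accepted token is one the large model would have produced anyway, the output matches the baseline exactly; the speedup depends on how often the draft is right.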

KV cache: storing the attention keys and values of previous tokens so they are not recomputed at every generation step
“Optimizing KV cache usage significantly reduced the memory footprint.”
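
The idea can be sketched with single-head attention over toy vectors. The `project` function below is a hypothetical stand-in for the learned query/key/value projections, and the call counter exists only to show how much work the cache avoids.

```python
import math

calls = {"n": 0}  # counts how many tokens get projected


def project(x):
    """Hypothetical stand-in for the learned Q/K/V projections."""
    calls["n"] += 1
    return list(x), [2 * xi for xi in x], [xi + 1 for xi in x]


def attend(query, keys, values):
    """Scaled dot-product attention for one query."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def decode_without_cache(xs):
    outputs = []
    for t in range(len(xs)):
        projected = [project(x) for x in xs[: t + 1]]  # redone every step
        q = projected[-1][0]
        outputs.append(attend(q, [p[1] for p in projected],
                              [p[2] for p in projected]))
    return outputs


def decode_with_cache(xs):
    keys, values, outputs = [], [], []
    for x in xs:
        q, k, v = project(x)  # only the newest token is projected
        keys.append(k)
        values.append(v)
        outputs.append(attend(q, keys, values))
    return outputs


xs = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
slow = decode_without_cache(xs)
slow_calls = calls["n"]
calls["n"] = 0
fast = decode_with_cache(xs)
fast_calls = calls["n"]
```

The cache trades memory for compute: projections drop from quadratic to linear in sequence length, but the stored keys and values grow with every token, which is why reducing the cache's memory footprint matters.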

Context caching: saving the processed state of a prompt prefix to avoid recomputing it
“Context caching is ideal for chatting with long documents.”
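
A toy sketch of the idea, assuming the expensive step is processing a long document prefix and that the cache is keyed by a hash of that prefix. The `process` callable and the `PrefixCache` class are hypothetical; a real system would store the model's internal state for the prefix (for example its keys and values) and not a token list.

```python
import hashlib


class PrefixCache:
    def __init__(self, process):
        self.process = process  # hypothetical expensive prefix processor
        self.store = {}
        self.misses = 0         # how many times the prefix was processed

    def get_state(self, prefix: str):
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.process(prefix)
        return self.store[key]


def ask(cache, document: str, question: str):
    """Reuse the cached document state; only the question is new work."""
    state = cache.get_state(document)
    return len(state), question


cache = PrefixCache(process=lambda text: text.split())
document = "a very long document " * 1000

first = ask(cache, document, "What is the main topic?")
second = ask(cache, document, "Who is the author?")
misses_after_two_questions = cache.misses
```

Both questions share one processed copy of the document, so the per-question cost no longer depends on the document's length.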