quantization
/ˌkwɒntɪˈzeɪʃən/
reducing the precision of model weights (e.g., to 4-bit) to save memory
“4-bit quantization allows running a 70B model on a single GPU.”
Origin: Latin quantus `how much` + -ization
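A toy sketch of the idea, using symmetric 4-bit quantization (the scheme and values are illustrative, not a production recipe):

```python
import numpy as np

# Symmetric 4-bit quantization sketch: map floats to signed 4-bit
# integers (-8..7) with one shared scale, then reconstruct.
def quantize_4bit(weights: np.ndarray):
    scale = np.abs(weights).max() / 7          # fit the largest weight into the 4-bit range
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale        # approximate reconstruction

w = np.array([0.12, -0.5, 0.33, 0.02], dtype=np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)                       # close to w, at 1/8 the bits
```

Each weight now needs 4 bits instead of 32, at the cost of a small rounding error bounded by half the scale.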
Techniques for making models faster and more efficient
LoRA
Low-Rank Adaptation; fine-tuning by training only small low-rank matrices added to frozen base weights
“LoRA makes fine-tuning large models computationally affordable.”
Origin: Acronym: Low-Rank Adaptation (Hu et al., 2021)
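A minimal sketch of the low-rank update, with illustrative shapes and init:

```python
import numpy as np

# LoRA sketch: the pretrained weight W stays frozen; only two small
# matrices A (r x d_in) and B (d_out x r) are trained, and the adapted
# layer computes W @ x + B @ (A @ x).
d_in, d_out, r = 16, 16, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small init
B = np.zeros((d_out, r))                 # trainable, zero init: no change at start

def adapted_forward(x):
    return W @ x + B @ (A @ x)           # low-rank correction on a side path

x = rng.normal(size=d_in)
```

Here only r * (d_in + d_out) = 64 parameters are trainable versus 256 for full fine-tuning, and at initialization the adapter contributes nothing.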
distillation
training a smaller 'student' model to mimic a larger 'teacher' model
“Distillation produced a small model with near-GPT-4 performance on specific tasks.”
Origin: Latin distillare `to drip down`
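A toy sketch of the soft-target loss at the heart of distillation (temperature and logits are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Distillation loss sketch: the student is trained to match the teacher's
# *softened* output distribution. Temperature T > 1 exposes the relative
# probabilities the teacher assigns to non-top classes.
def distill_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -np.sum(p_teacher * log_p_student)   # cross-entropy on soft targets
```

The loss is minimized (down to the teacher's entropy) exactly when the student reproduces the teacher's distribution.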
mixture of experts
using multiple specialized sub-models (experts) and routing tokens to them
“Mixture of Experts (MoE) scales capacity without increasing inference cost.”
Origin: Machine Learning term (Jacobs et al., 1991)
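A toy top-1 routing sketch (sizes and the linear "experts" are illustrative stand-ins for feed-forward sub-networks):

```python
import numpy as np

# MoE sketch: a router scores each token, and only the winning expert
# runs — so capacity grows with the number of experts while per-token
# compute stays roughly constant.
rng = np.random.default_rng(0)
n_experts, d = 4, 8
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(n_experts, d))

def moe_forward(x):                 # x: one token embedding of size d
    scores = router @ x             # router logit per expert
    k = int(np.argmax(scores))      # top-1 routing
    return experts[k] @ x, k        # only one expert's FLOPs are spent
```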
speculative decoding
using a small model to draft tokens for verification by a large model
“Speculative decoding doubled the inference speed without losing quality.”
Origin: Latin speculari `to spy out` + decoding
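A simplified greedy sketch of the control flow (the published algorithm uses probabilistic rejection sampling; the two toy "models" here are stand-ins, not real LMs):

```python
# Speculative decoding sketch: a cheap draft model proposes k tokens,
# the expensive target model checks them, and tokens are accepted up to
# the first disagreement, plus one corrected token from the target.

def draft_next(ctx):    # cheap draft model: toy counting rule
    return ctx[-1] + 1

def target_next(ctx):   # expensive target model: disagrees on multiples of 4
    n = ctx[-1] + 1
    return n if n % 4 else n + 1

def speculative_step(ctx, k=4):
    proposal, c = [], list(ctx)
    for _ in range(k):              # draft k tokens cheaply
        t = draft_next(c)
        proposal.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in proposal:              # verify drafts against the target
        if target_next(c) == t:
            accepted.append(t)
            c.append(t)
        else:
            break
    correction = target_next(c)     # the target always yields one token
    return accepted + [correction]
```

One verification pass can thus emit several tokens when the draft agrees with the target, which is where the speed-up comes from.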
KV cache
storing the attention keys and values computed for past tokens to speed up generation
“Optimizing the KV cache usage reduced memory footprint significantly.”
Origin: Key-Value + French cacher `to hide`
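A toy single-head sketch of cached attention during generation (sizes and projections are illustrative):

```python
import numpy as np

# KV-cache sketch: keys and values for past tokens are stored, so each
# generation step only projects the newest token and appends to the cache
# instead of recomputing K and V for the whole sequence.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []

def attend_step(x_new):                   # x_new: embedding of the new token
    K_cache.append(Wk @ x_new)            # append, don't recompute history
    V_cache.append(Wv @ x_new)
    q = Wq @ x_new
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over all cached positions
    return w @ V                          # attention output for the new token
```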
context caching
saving the processed state of a prompt prefix to avoid recomputing it
“Context caching is ideal for chatting with long documents.”
Origin: Latin contextus + caching
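A toy sketch of the mechanism; `encode_prefix` is a hypothetical stand-in for the expensive prompt "prefill" pass, and the hashed key is an illustrative choice:

```python
import hashlib

# Context-caching sketch: the processed state of a prompt prefix is stored
# under a hash of its text, so repeated requests sharing that prefix skip
# the expensive prefill step.
_cache = {}
calls = {"prefill": 0}

def encode_prefix(text):                   # stand-in for a forward pass over the prompt
    calls["prefill"] += 1
    return f"state({len(text)} chars)"     # stand-in for the cached KV state

def get_state(prefix):
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = encode_prefix(prefix)   # pay the prefill cost once
    return _cache[key]
```

Asking several questions about the same long document then pays the document's processing cost only on the first request.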