Quantization Explained: GGUF, AWQ, GPTQ, and the Q4 vs Q8 vs FP16 Decision
A model file's quantization decides how big it is, how fast it runs, and how good it sounds. Learn the formats, the trade-offs, and how to pick the right one.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. What quantization is doing
2. Quantization
3. GGUF
4. AWQ
Section 1
What quantization is doing
An LLM's weights are originally floating-point numbers — typically FP16 or BF16. Quantization replaces those with lower-precision integers (often 4 or 8 bits per weight). The model gets smaller and faster. Quality drops a little. The question is always: how little, and is the drop worth the savings?
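To make that concrete, here is a minimal sketch of the core operation: map a block of floating-point weights onto small integers with one scale factor, then reconstruct them at inference time. Real formats (GGUF k-quants, AWQ, GPTQ) layer smarter tricks on top of this, so treat it as an illustration, not any particular format's algorithm.

```python
import numpy as np

# Minimal sketch: symmetric round-to-nearest quantization of one block of weights.
# Real formats add per-block scales, activation-aware scaling, and error
# correction, but the core "float -> small integer -> approximate float" idea is this.

def quantize_block(weights: np.ndarray, bits: int = 4):
    """Map FP16/FP32 weights onto signed integers with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / qmax    # one scale shared by the whole block
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate weights at inference time."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=256).astype(np.float32)   # a fake block of weights

for bits in (8, 4):
    q, scale = quantize_block(w, bits)
    err = np.abs(w - dequantize_block(q, scale)).mean()
    print(f"{bits}-bit: mean abs reconstruction error {err:.6f}")
```

The thing to notice when you run it: the 4-bit reconstruction error is clearly larger than the 8-bit one, but both keep each weight close to its original value. That gap is the "quality drops a little" in practice.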
Compare the options
| Precision | Bits / weight | Size for a 7B model | Quality vs FP16 | Speed vs FP16 |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~14GB | Reference | 1.0x |
| Q8 | ~8 | ~7GB | Effectively identical | ~1.3x |
| Q5 | ~5 | ~5GB | Very close | ~1.6x |
| Q4 | ~4 | ~4GB | Slight degradation | ~2.0x |
| Q3 / Q2 | <4 | ~3GB or less | Noticeable degradation | ~2x+ |
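The size column is mostly arithmetic: parameters times bits per weight. A quick back-of-envelope check (real files come out somewhat larger because quantized formats also store per-block scale factors and keep a few tensors, like embeddings and norms, at higher precision):

```python
# Back-of-envelope file size for a 7B-parameter model at different bit widths.
# Real GGUF/AWQ/GPTQ files run a bit higher than these numbers because of
# per-block scales and higher-precision tensors.

params = 7_000_000_000

for name, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q3", 3)]:
    gb = params * bits / 8 / 1e9   # bytes -> GB
    print(f"{name:<5} ~{gb:.1f} GB")
```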
The format zoo: GGUF, AWQ, GPTQ
- GGUF: the format llama.cpp (and therefore Ollama, LM Studio) uses. CPU-and-GPU friendly, the local-model default
- AWQ: Activation-aware Weight Quantization. Common for GPU inference servers like vLLM. Good 4-bit quality
- GPTQ: an older but still common GPU-targeted quantization. Often on Hugging Face for Linux/CUDA workflows
- Native FP16 / BF16: the unquantized weights. Reference quality, large size, GPU only
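Here is roughly how the format choice shows up in code, assuming the llama-cpp-python bindings for a GGUF file and vLLM for an AWQ checkpoint. The model path and repo name are placeholders for whatever you actually downloaded, and the two halves would normally live in separate environments (one CPU-friendly, one CUDA-only).

```python
# Sketch of how the format choice shows up in code. Paths and repo names are
# placeholders -- substitute whatever you actually downloaded.

# GGUF: the llama.cpp ecosystem (here via the llama-cpp-python bindings).
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])

# AWQ / GPTQ: GPU inference servers such as vLLM load the quantized checkpoint directly.
from vllm import LLM, SamplingParams

engine = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
results = engine.generate(["Explain quantization in one sentence."],
                          SamplingParams(max_tokens=64))
print(results[0].outputs[0].text)
```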
What the suffixes like Q4_K_M actually mean
GGUF quants come with cryptic suffixes (Q4_K_M, Q5_K_S, Q6_K). The number is the bit width. 'K' means k-quants, a smarter quantization scheme than the original. 'S', 'M', 'L' are size variants — small, medium, large — that trade a tiny bit more space for a bit more quality. Q4_K_M is the most common 'good default' you will see in the wild.
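Because the suffix is embedded in the filename, a small helper can decode it. This is just an illustration of the naming convention described above; the filenames are made up.

```python
import re

# Decode the quant suffix from a GGUF filename, following the convention above:
# Q<bits>, optional _K for k-quants, optional _S/_M/_L size variant.

SIZE_VARIANTS = {"S": "small", "M": "medium", "L": "large"}

def describe_quant(filename: str) -> str:
    m = re.search(r"Q(\d)(_K)?(?:_([SML]))?", filename)
    if not m:
        return "no quant suffix found (possibly FP16/BF16 weights)"
    bits, k, size = m.group(1), m.group(2), m.group(3)
    parts = [f"{bits}-bit", "k-quant" if k else "legacy quant"]
    if size:
        parts.append(f"{SIZE_VARIANTS[size]} variant")
    return ", ".join(parts)

print(describe_quant("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))  # 4-bit, k-quant, medium variant
print(describe_quant("llama-2-7b.Q6_K.gguf"))                  # 6-bit, k-quant
print(describe_quant("llama-2-7b.Q8_0.gguf"))                  # 8-bit, legacy quant
```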
Measuring quality, not vibes
1. Run llama-perplexity (in llama.cpp) on a representative text sample at FP16 vs your candidate quant
2. Run a short eval set of your real prompts and compare answers manually
3. Plot tokens-per-second on your hardware at each quant; the speed gain may surprise you (a minimal sketch follows this list)
4. Pick the highest quant that fits your memory comfortably with your target context size
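For the speed part of that checklist, here is a minimal sketch using the llama-cpp-python bindings, assuming you have two quants of the same model on disk (the paths are placeholders). Perplexity itself is easiest to get from llama.cpp's llama-perplexity tool, which this sketch does not replace.

```python
import time
from llama_cpp import Llama

# Rough tokens-per-second comparison between two quants of the same model
# (step 3 above). File paths are placeholders for whatever you downloaded.

PROMPT = "Summarize the trade-offs of 4-bit versus 8-bit quantization."

def tokens_per_second(model_path: str, n_tokens: int = 128) -> float:
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    generated = out["usage"]["completion_tokens"]
    return generated / elapsed

for path in ("./models/model.Q4_K_M.gguf", "./models/model.Q8_0.gguf"):
    print(f"{path}: {tokens_per_second(path):.1f} tok/s")
```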
Apply this
- Pick a 7B and download both Q4_K_M and Q8 versions of the same model
- Compare answers on five representative prompts side by side
- Compare tokens-per-second on your hardware and decide which one wins for your workload
The big idea: quantization is not a tax — it is a slider. Find the highest setting your hardware allows, and only drop further when you have to.