A model file's quantization decides how big it is, how fast it runs, and how good its output is. Learn the formats, the trade-offs, and how to pick the right one.
An LLM's weights are originally floating-point numbers — typically FP16 or BF16. Quantization replaces those with lower-precision integers (often 4 or 8 bits per weight). The model gets smaller and faster. Quality drops a little. The question is always: how little, and is the drop worth the savings?
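To make the idea concrete, here is a minimal sketch of symmetric integer quantization in plain Python. It is a simplification: it uses one scale for the whole list, whereas real formats typically quantize in small blocks with a scale per block. The function names and sample weights are illustrative assumptions, not part of any library.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Hypothetical sample weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -0.07, 0.31]
err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
print(f"8-bit max error: {err8:.4f}")
print(f"4-bit max error: {err4:.4f}")
```

Fewer bits mean fewer representable values, so the round-trip error grows as the bit width shrinks. That error is the quality cost being traded against size.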
| Precision | Bits / weight | Size for a 7B model | Quality vs FP16 | Speed vs FP16 |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~14GB | Reference | 1.0x |
| Q8 | ~8 | ~7GB | Effectively identical | ~1.3x |
| Q5 | ~5 | ~5GB | Very close | ~1.6x |
| Q4 | ~4 | ~4GB | Slight degradation | ~2.0x |
| Q3 / Q2 | <4 | ~3GB or less | Noticeable degradation | ~2x+ |
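The size column follows from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below reproduces that estimate for a 7B model. It counts weights only, so real files run somewhat larger once scales and metadata are included; the helper name is an assumption for illustration.

```python
def estimated_size_gb(params_billions, bits_per_weight):
    """Weights-only estimate: parameters x bits / 8 bytes, in decimal GB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for name, bits in [("FP16 / BF16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4)]:
    print(f"{name:12s} ~{estimated_size_gb(7, bits):.1f} GB")
```

The raw Q5 and Q4 estimates come out a little under the table's rounded figures, which is consistent with the table rounding up to allow for that overhead.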
GGUF quants come with cryptic suffixes (Q4_K_M, Q5_K_S, Q6_K). The number is the nominal bit width per weight. 'K' means k-quants, a smarter quantization scheme than the original one. 'S', 'M', and 'L' are size variants (small, medium, large): each step up spends slightly more space to buy slightly more quality. Q4_K_M is the most common 'good default' you will see in the wild.
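The naming rule can be sketched as a small parser. This is a hypothetical helper that handles only the k-quant labels mentioned above (such as Q4_K_M, Q5_K_S, Q6_K); it is not part of any GGUF tooling, and it returns None for labels outside that pattern.

```python
import re

SIZE_NAMES = {"S": "small", "M": "medium", "L": "large"}

# Q<bits>_K with an optional _S / _M / _L size variant.
PATTERN = re.compile(r"^Q(\d+)_K(?:_([SML]))?$")


def parse_k_quant(label):
    """Split a k-quant label into its bit width and size variant."""
    match = PATTERN.match(label)
    if match is None:
        return None  # not a k-quant label this sketch understands
    bits, size = match.group(1), match.group(2)
    return {"bits": int(bits), "k_quant": True, "size": SIZE_NAMES.get(size)}


for label in ["Q4_K_M", "Q5_K_S", "Q6_K"]:
    print(label, parse_k_quant(label))
```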
The big idea: quantization is not a tax — it is a slider. Find the highest setting your hardware allows, and only drop further when you have to.
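The slider idea can be expressed as a short selection routine: walk the options from highest precision down and stop at the first one that fits. The sizes are the approximate 7B figures from the table above; the `headroom_gb` default is an assumed allowance for context and runtime overhead, not a measured value.

```python
# Approximate 7B sizes from the table, ordered highest precision first.
OPTIONS_7B = [
    ("FP16 / BF16", 14.0),
    ("Q8", 7.0),
    ("Q5", 5.0),
    ("Q4", 4.0),
    ("Q3 / Q2", 3.0),
]


def pick_quant(available_gb, headroom_gb=1.5, options=OPTIONS_7B):
    """Return the highest-precision option that fits, or None if none do."""
    budget = available_gb - headroom_gb  # headroom_gb is an assumption
    for name, size_gb in options:
        if size_gb <= budget:
            return name
    return None


for memory_gb in [24, 8, 6, 2]:
    print(f"{memory_gb} GB available -> {pick_quant(memory_gb)}")
```

A result of None means even the smallest option does not fit, which points to a smaller model rather than a lower quantization.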
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-local-quantization-explained-creators
What is the main idea of "Quantization Explained: GGUF, AWQ, GPTQ, and the Q4 vs Q8 vs FP16 Decision"?
Which concept is most central to "Quantization Explained: GGUF, AWQ, GPTQ, and the Q4 vs Q8 vs FP16 Decision"?
Which use of AI fits this topic best?
What should a careful learner remember about "How to decide"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about quantization be treated?
Name one way to verify an AI answer about quantization.
Which action would help you apply "Quantization Explained: GGUF, AWQ, GPTQ, and the Q4 vs Q8 vs FP16 Decision" responsibly?