A model file's quantization decides how big it is, how fast it runs, and how good its output is. Learn the formats, the trade-offs, and how to pick the right one.
An LLM's weights are originally floating-point numbers — typically FP16 or BF16. Quantization replaces those with lower-precision integers (often 4 or 8 bits per weight). The model gets smaller and faster. Quality drops a little. The question is always: how little, and is the drop worth the savings?
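The round trip from floating point to low-precision integers and back can be sketched in a few lines. This is a toy, whole-tensor symmetric int8 scheme; real formats (GGUF k-quants, GPTQ, AWQ) quantize per block with extra tricks, but the core idea of scale, round, and dequantize is the same. All names here are illustrative.

```python
import numpy as np

def quantize_int8(w):
    # One scale for the whole tensor; real schemes use one per small block.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # The model runs on these reconstructed values, not the originals.
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # reconstruction error, bounded by scale / 2
```

The error is bounded by half the scale, which is why 8 bits is "effectively identical" while 2-3 bits, with far coarser scales, is not.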
| Precision | Bits / weight | Size for a 7B model | Quality vs FP16 | Speed vs FP16 |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~14GB | Reference | 1.0x |
| Q8 | ~8 | ~7GB | Effectively identical | ~1.3x |
| Q5 | ~5 | ~5GB | Very close | ~1.6x |
| Q4 | ~4 | ~4GB | Slight degradation | ~2.0x |
| Q3 / Q2 | <4 | ~3GB or less | Noticeable degradation | ~2x+ |
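The size column above is just parameter count times bits per weight, divided by eight. A quick sanity check of that arithmetic (weights only; real files are slightly larger because they also store scales, metadata, and embeddings):

```python
def approx_size_gb(n_params_billions, bits_per_weight):
    # Weights only: params x bits / 8 bytes. Actual quant files run a
    # little larger because scales and metadata are stored alongside.
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 5, 4):
    print(f"{bits:>2}-bit 7B model: ~{approx_size_gb(7, bits):.1f} GB")
```

This reproduces the table: 16-bit gives 14 GB, 8-bit gives 7 GB, and the 4- and 5-bit rows land near their listed sizes once per-block scale overhead is added.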
GGUF quants come with cryptic suffixes (Q4_K_M, Q5_K_S, Q6_K). The number is the bit width. 'K' means k-quants, a smarter quantization scheme than the original. 'S', 'M', 'L' are size variants — small, medium, large — that trade a tiny bit more space for a bit more quality. Q4_K_M is the most common 'good default' you will see in the wild.
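The naming scheme above is a convention, not a formal spec, so a parser for it is only a sketch. This hypothetical helper unpacks the suffixes mentioned in this lesson (it does not cover older legacy names like Q4_0):

```python
import re

def parse_quant_name(name):
    # Handles names of the form Q<bits>[_K][_S|_M|_L], e.g. "Q4_K_M".
    # A sketch for the suffixes discussed here, not a complete GGUF parser.
    m = re.fullmatch(r"Q(\d+)(_K)?(_([SML]))?", name)
    if not m:
        raise ValueError(f"unrecognized quant name: {name}")
    return {
        "bits": int(m.group(1)),            # nominal bit width
        "k_quant": m.group(2) is not None,  # newer k-quant scheme?
        "variant": m.group(4),              # S/M/L size variant, or None
    }

print(parse_quant_name("Q4_K_M"))
# {'bits': 4, 'k_quant': True, 'variant': 'M'}
```

So Q4_K_M reads as: roughly 4 bits per weight, k-quant scheme, medium size variant.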
The big idea: quantization is not a tax — it is a slider. Find the highest setting your hardware allows, and only drop further when you have to.