Quantization Explained: GGUF, AWQ, GPTQ, and the Q4 vs Q8 vs FP16 Decision
A model file's quantization decides how big it is, how fast it runs, and how good it sounds. Learn the formats, the trade-offs, and how to pick the right one.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. What quantization is doing
2. Quantization
3. GGUF
4. AWQ
Section 1
What quantization is doing
An LLM's weights are originally floating-point numbers — typically FP16 or BF16. Quantization replaces those with lower-precision integers (often 4 or 8 bits per weight). The model gets smaller and faster. Quality drops a little. The question is always: how little, and is the drop worth the savings?
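To make that concrete, here is a minimal sketch of the core operation: map a block of floating-point weights onto small integers with one scale factor, then reconstruct them at inference time. Real formats (GGUF k-quants, AWQ, GPTQ) layer smarter tricks on top of this, so treat it as an illustration, not any particular format's algorithm.

```python
import numpy as np

# Minimal sketch: symmetric round-to-nearest quantization of one block of weights.
# Real formats add per-block scales, activation-aware scaling, and error
# correction, but the core "float -> small integer -> approximate float" idea is this.

def quantize_block(weights: np.ndarray, bits: int = 4):
    """Map FP16/FP32 weights onto signed integers with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / qmax    # one scale shared by the whole block
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate weights at inference time."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=256).astype(np.float32)   # a fake block of weights

for bits in (8, 4):
    q, scale = quantize_block(w, bits)
    err = np.abs(w - dequantize_block(q, scale)).mean()
    print(f"{bits}-bit: mean abs reconstruction error {err:.6f}")
```

The thing to notice when you run it: the 4-bit reconstruction error is clearly larger than the 8-bit one, but both keep each weight close to its original value. That gap is the "quality drops a little" in practice.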
Compare the options
| Precision | Bits / weight | Size for a 7B model | Quality vs FP16 | Speed vs FP16 |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~14GB | Reference | 1.0x |
| Q8 | ~8 | ~7GB | Effectively identical | ~1.3x |
| Q5 | ~5 | ~5GB | Very close | ~1.6x |
| Q4 | ~4 | ~4GB | Slight degradation | ~2.0x |
| Q3 / Q2 | <4 | ~3GB or less | Noticeable degradation | ~2x+ |
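The size column is mostly arithmetic: parameters times bits per weight. A quick back-of-envelope check (real files come out somewhat larger because quantized formats also store per-block scale factors and keep a few tensors, like embeddings and norms, at higher precision):

```python
# Back-of-envelope file size for a 7B-parameter model at different bit widths.
# Real GGUF/AWQ/GPTQ files run a bit higher than these numbers because of
# per-block scales and higher-precision tensors.

params = 7_000_000_000

for name, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q3", 3)]:
    gb = params * bits / 8 / 1e9   # bytes -> GB
    print(f"{name:<5} ~{gb:.1f} GB")
```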
The format zoo: GGUF, AWQ, GPTQ
- GGUF: the format llama.cpp (and therefore Ollama, LM Studio) uses. CPU-and-GPU friendly, the local-model default
- AWQ: Activation-aware Weight Quantization. Common for GPU inference servers like vLLM. Good 4-bit quality
- GPTQ: an older but still common GPU-targeted quantization. Often on Hugging Face for Linux/CUDA workflows
- Native FP16 / BF16: the unquantized weights. Reference quality, large size, GPU only
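Here is roughly how the format choice shows up in code, assuming the llama-cpp-python bindings for a GGUF file and vLLM for an AWQ checkpoint. The model path and repo name are placeholders for whatever you actually downloaded, and the two halves would normally live in separate environments (one CPU-friendly, one CUDA-only).

```python
# Sketch of how the format choice shows up in code. Paths and repo names are
# placeholders -- substitute whatever you actually downloaded.

# GGUF: the llama.cpp ecosystem (here via the llama-cpp-python bindings).
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])

# AWQ / GPTQ: GPU inference servers such as vLLM load the quantized checkpoint directly.
from vllm import LLM, SamplingParams

engine = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
results = engine.generate(["Explain quantization in one sentence."],
                          SamplingParams(max_tokens=64))
print(results[0].outputs[0].text)
```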
What the suffixes like Q4_K_M actually mean
GGUF quants come with cryptic suffixes (Q4_K_M, Q5_K_S, Q6_K). The number is the bit width. 'K' means k-quants, a smarter quantization scheme than the original. 'S', 'M', 'L' are size variants — small, medium, large — that trade a tiny bit more space for a bit more quality. Q4_K_M is the most common 'good default' you will see in the wild.
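Because the suffix is embedded in the filename, a small helper can decode it. This is just an illustration of the naming convention described above; the filenames are made up.

```python
import re

# Decode the quant suffix from a GGUF filename, following the convention above:
# Q<bits>, optional _K for k-quants, optional _S/_M/_L size variant.

SIZE_VARIANTS = {"S": "small", "M": "medium", "L": "large"}

def describe_quant(filename: str) -> str:
    m = re.search(r"Q(\d)(_K)?(?:_([SML]))?", filename)
    if not m:
        return "no quant suffix found (possibly FP16/BF16 weights)"
    bits, k, size = m.group(1), m.group(2), m.group(3)
    parts = [f"{bits}-bit", "k-quant" if k else "legacy quant"]
    if size:
        parts.append(f"{SIZE_VARIANTS[size]} variant")
    return ", ".join(parts)

print(describe_quant("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))  # 4-bit, k-quant, medium variant
print(describe_quant("llama-2-7b.Q6_K.gguf"))                  # 6-bit, k-quant
print(describe_quant("llama-2-7b.Q8_0.gguf"))                  # 8-bit, legacy quant
```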
Measuring quality, not vibes
1. Run llama-perplexity (in llama.cpp) on a representative text sample at FP16 vs your candidate quant
2. Run a short eval set of your real prompts and compare answers manually
3. Plot tokens-per-second on your hardware at each quant; the speed gain may surprise you (a minimal sketch follows this list)
4. Pick the highest quant that fits your memory comfortably with your target context size
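For the speed part of that checklist, here is a minimal sketch using the llama-cpp-python bindings, assuming you have two quants of the same model on disk (the paths are placeholders). Perplexity itself is easiest to get from llama.cpp's llama-perplexity tool, which this sketch does not replace.

```python
import time
from llama_cpp import Llama

# Rough tokens-per-second comparison between two quants of the same model
# (step 3 above). File paths are placeholders for whatever you downloaded.

PROMPT = "Summarize the trade-offs of 4-bit versus 8-bit quantization."

def tokens_per_second(model_path: str, n_tokens: int = 128) -> float:
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    generated = out["usage"]["completion_tokens"]
    return generated / elapsed

for path in ("./models/model.Q4_K_M.gguf", "./models/model.Q8_0.gguf"):
    print(f"{path}: {tokens_per_second(path):.1f} tok/s")
```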
Apply this
- Pick a 7B and download both Q4_K_M and Q8 versions of the same model
- Compare answers on five representative prompts side by side
- Compare tokens-per-second on your hardware and decide which one wins for your workload
The big idea: quantization is not a tax — it is a slider. Find the highest setting your hardware allows, and only drop further when you have to.