Hardware Sizing for Local Models: VRAM, Unified Memory, and CPU-Only Realities
Whether a model runs well, or at all, depends on the hardware you put under it. Here is a practical map of which class of model each tier of hardware can realistically run.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The first question is always memory
2. VRAM
3. Unified memory
4. Apple Silicon
Section 1
The first question is always memory
An LLM has to fit into memory before it can run. On a discrete GPU, that means VRAM. On Apple Silicon, that means unified memory shared between CPU and GPU. On a CPU-only machine, that means RAM and a lot of patience. Whatever runs is whatever fits. So the buying decision is really a memory-sizing decision.
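To make the sizing concrete before the comparison table, here is a minimal sketch of the weights-only footprint. The bits-per-weight values are assumptions: common 4-bit quant formats average somewhat above 4 bits per weight once per-block scales are counted, so treat the outputs as ballpark figures, not exact numbers for any specific format.

```python
# Rough weights-only footprint for a quantized model. The bits-per-weight
# figures are ballpark assumptions: practical Q4 formats average a bit
# above 4 bits because of per-block scales. Not exact for any one format.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5, "FP16": 16.0}

def model_gb(params_billion: float, quant: str = "Q4") -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for b in (7, 13, 30, 70):
    print(f"{b:>3}B at Q4: ~{model_gb(b):.1f} GB of weights")
# -> ~3.9, ~7.3, ~16.9, ~39.4 GB: compare against the "usable" column below
```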
Compare the options
| Hardware | Useful memory | Realistic model class | Vibe |
|---|---|---|---|
| 8GB integrated GPU laptop | ~6GB usable | Up to ~7B at Q4 | Toy projects, learning |
| 16GB Apple Silicon Mac | ~10-12GB usable | Up to ~13B at Q4 | Solid daily driver |
| 24GB consumer GPU (e.g. high-end RTX class) | ~22GB usable | Up to ~30B at Q4 or 13B at Q8 | Comfortable workhorse |
| 48GB+ Mac Studio class | ~40GB+ usable | Up to ~70B at Q4 | Power user / small team server |
| 80GB+ datacenter GPU | ~78GB+ usable | 70B at Q8 or 405B at low quant | Serious self-host |
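If you prefer the table as code, here is a sketch that just transcribes the "Useful memory" and "Realistic model class" columns above. The thresholds are the table's approximations, not hard limits.

```python
# The table above as a lookup: usable memory (GB) -> largest comfortable
# class at Q4. Thresholds transcribe the table; they are not hard limits.
TIERS = [(6, "~7B"), (10, "~13B"), (22, "~30B"), (40, "~70B")]

def largest_class_q4(usable_gb: float) -> str:
    best = "below the 7B-at-Q4 line"
    for threshold_gb, model_class in TIERS:
        if usable_gb >= threshold_gb:
            best = model_class
    return best

print(largest_class_q4(11))  # ~13B
print(largest_class_q4(22))  # ~30B
```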
Apple Silicon's unfair advantage
Apple's unified memory architecture means a 64GB Mac Studio can hold a 70B-class model that a 24GB consumer GPU simply cannot. Throughput is not as high as a top-end discrete GPU, but the ceiling on model size is dramatically higher per dollar. For local inference, M-series Macs punch far above their weight.
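A back-of-envelope check on that claim, reusing the ~4.5 bits/weight Q4 assumption from the first sketch:

```python
# Why 64GB of unified memory clears a bar that 24GB of VRAM cannot:
# a 70B model at Q4 (~4.5 bits/weight assumed) is ~39 GB of weights
# before any KV cache or runtime overhead is counted.
weights_gb = 70e9 * 4.5 / 8 / 1e9
print(f"70B at Q4: ~{weights_gb:.0f} GB of weights")      # ~39 GB
print(f"fits in 24 GB VRAM?       {weights_gb < 24}")     # False
# Assumes roughly three quarters of a 64GB Mac's unified memory is
# available to the GPU -- an approximation, not a macOS guarantee.
print(f"fits in ~48 GB GPU-usable? {weights_gb < 48}")    # True
```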
CPU-only is a thing — barely
- A modern desktop CPU can run a 7-8B model at a few tokens per second
- Useful for batch processing where latency does not matter (see the arithmetic after this list)
- Dramatically slower than even a modest GPU
- Good for running a coding assistant in the background — bad for chat UX
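To see why that trade cuts both ways, here is the arithmetic. The 4 tokens/second figure is an assumption for illustration; real CPU speeds vary widely with model, quantization, and core count.

```python
# CPU-only arithmetic. 4 tok/s is an assumed, illustrative rate for a
# 7-8B model at Q4 on a modern desktop CPU; measure your own machine.
tok_per_s = 4.0

# Chat: a ~300-token reply keeps the user waiting over a minute.
print(f"chat reply: ~{300 / tok_per_s:.0f} s")    # ~75 s: painful UX

# Batch: 1,000 documents x 200 summary tokens each finishes overnight.
hours = 1_000 * 200 / tok_per_s / 3600
print(f"batch job:  ~{hours:.1f} h")              # ~13.9 h: fine unattended
```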
How to size before you buy
1. Decide which model class you actually need (7B, 13B, 30B, 70B)
2. Pick the quantization you can tolerate quality-wise (Q4 is the sweet spot for most)
3. Add 25% for KV cache, the runtime, and the OS
4. Buy hardware whose usable memory comfortably exceeds that number, not just barely matches it (the sizing sketch after this list puts numbers on these steps)
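Steps 2 and 3 as arithmetic, a minimal sketch reusing the ~4.5 bits/weight Q4 assumption from earlier:

```python
# Steps 2-3 in code: weights at the chosen quant, plus 25% headroom for
# KV cache, runtime, and the OS. Compare the result to usable memory.
def memory_budget_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb * 1.25  # step 3: 25% headroom

for b in (7, 13, 30, 70):
    print(f"{b:>3}B at Q4: budget ~{memory_budget_gb(b):.0f} GB usable")
# -> ~5, ~9, ~21, ~49 GB: per step 4, buy above these numbers, not at them
```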
Apply this
- Look up the unified-memory or VRAM number on your current hardware
- Compute the largest model you can comfortably run at Q4 with 8k context (the KV-cache sketch after this list shows the math)
- Identify the smallest hardware upgrade that would unlock the next class up
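For the second item, the piece people forget to budget is the KV cache. Here is a sketch of the standard transformer KV-cache formula; the two configs are illustrative (roughly a no-GQA 7B and a GQA 8B), so plug in your model's actual layer and head counts from its config.

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context
#                  * bytes_per_element (2 for fp16).
# The configs below are illustrative, not tied to a specific release.
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per / 1e9

# A no-GQA 7B-style config (32 layers, 32 KV heads, head_dim 128) at 8k:
print(f"no-GQA 7B-style: ~{kv_cache_gb(32, 32, 128, 8192):.1f} GB")  # ~4.3 GB

# A GQA 8B-style config (8 KV heads) shrinks that by 4x:
print(f"GQA 8B-style:    ~{kv_cache_gb(32, 8, 128, 8192):.1f} GB")   # ~1.1 GB
```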
The big idea: pick the model first, then size the memory, then pick the hardware. Reversing that order is how teams end up with great GPUs that cannot run the model they actually want.
