Hardware Sizing for Local Models: VRAM, Unified Memory, and CPU-Only Realities
Whether a model runs well, or at all, depends on the hardware you put under it. Here is a practical map of which class of model each tier of hardware can realistically run.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The first question is always memory
2. VRAM
3. Unified memory
4. Apple Silicon
Section 1
The first question is always memory
An LLM has to fit into memory before it can run. On a discrete GPU, that means VRAM. On Apple Silicon, that means unified memory shared between CPU and GPU. On a CPU-only machine, that means RAM and a lot of patience. Whatever runs is whatever fits. So the buying decision is really a memory-sizing decision.
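To make the sizing concrete before the comparison table, here is a minimal sketch of the weights-only footprint. The bits-per-weight values are assumptions: common 4-bit quant formats average somewhat above 4 bits per weight once per-block scales are counted, so treat the outputs as ballpark figures, not exact numbers for any specific format.

```python
# Rough weights-only footprint for a quantized model. The bits-per-weight
# figures are ballpark assumptions: practical Q4 formats average a bit
# above 4 bits because of per-block scales. Not exact for any one format.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5, "FP16": 16.0}

def model_gb(params_billion: float, quant: str = "Q4") -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for b in (7, 13, 30, 70):
    print(f"{b:>3}B at Q4: ~{model_gb(b):.1f} GB of weights")
# -> ~3.9, ~7.3, ~16.9, ~39.4 GB: compare against the "usable" column below
```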
Compare the options
| Hardware | Useful memory | Realistic model class | Vibe |
|---|---|---|---|
| 8GB integrated GPU laptop | ~6GB usable | Up to ~7B at Q4 | Toy projects, learning |
| 16GB Apple Silicon Mac | ~10-12GB usable | Up to ~13B at Q4 | Solid daily driver |
| 24GB consumer GPU (e.g. high-end RTX class) | ~22GB usable | Up to ~30B at Q4 or 13B at Q8 | Comfortable workhorse |
| 48GB+ Mac Studio class | ~40GB+ usable | Up to ~70B at Q4 | Power user / small team server |
| 80GB+ datacenter GPU | ~78GB+ usable | 70B at Q8 or 405B at low quant | Serious self-host |
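If you prefer the table as code, here is a sketch that just transcribes the "Useful memory" and "Realistic model class" columns above. The thresholds are the table's approximations, not hard limits.

```python
# The table above as a lookup: usable memory (GB) -> largest comfortable
# class at Q4. Thresholds transcribe the table; they are not hard limits.
TIERS = [(6, "~7B"), (10, "~13B"), (22, "~30B"), (40, "~70B")]

def largest_class_q4(usable_gb: float) -> str:
    best = "below the 7B-at-Q4 line"
    for threshold_gb, model_class in TIERS:
        if usable_gb >= threshold_gb:
            best = model_class
    return best

print(largest_class_q4(11))  # ~13B
print(largest_class_q4(22))  # ~30B
```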
Apple Silicon's unfair advantage
Apple's unified memory architecture means a 64GB Mac Studio can hold a 70B-class model that a 24GB consumer GPU simply cannot. Throughput is not as high as a top-end discrete GPU, but the ceiling on model size is dramatically higher per dollar. For local inference, M-series Macs punch far above their weight.
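A back-of-envelope check on that claim, reusing the ~4.5 bits/weight Q4 assumption from the first sketch:

```python
# Why 64GB of unified memory clears a bar that 24GB of VRAM cannot:
# a 70B model at Q4 (~4.5 bits/weight assumed) is ~39 GB of weights
# before any KV cache or runtime overhead is counted.
weights_gb = 70e9 * 4.5 / 8 / 1e9
print(f"70B at Q4: ~{weights_gb:.0f} GB of weights")      # ~39 GB
print(f"fits in 24 GB VRAM?       {weights_gb < 24}")     # False
# Assumes roughly three quarters of a 64GB Mac's unified memory is
# available to the GPU -- an approximation, not a macOS guarantee.
print(f"fits in ~48 GB GPU-usable? {weights_gb < 48}")    # True
```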
CPU-only is a thing — barely
- A modern desktop CPU can run a 7-8B model at a few tokens per second
- Useful for batch processing where latency does not matter (see the arithmetic after this list)
- Dramatically slower than even a modest GPU
- Good for running a coding assistant in the background — bad for chat UX
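To see why that trade cuts both ways, here is the arithmetic. The 4 tokens/second figure is an assumption for illustration; real CPU speeds vary widely with model, quantization, and core count.

```python
# CPU-only arithmetic. 4 tok/s is an assumed, illustrative rate for a
# 7-8B model at Q4 on a modern desktop CPU; measure your own machine.
tok_per_s = 4.0

# Chat: a ~300-token reply keeps the user waiting over a minute.
print(f"chat reply: ~{300 / tok_per_s:.0f} s")    # ~75 s: painful UX

# Batch: 1,000 documents x 200 summary tokens each finishes overnight.
hours = 1_000 * 200 / tok_per_s / 3600
print(f"batch job:  ~{hours:.1f} h")              # ~13.9 h: fine unattended
```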
How to size before you buy
1. Decide which model class you actually need (7B, 13B, 30B, 70B)
2. Pick the quantization you can tolerate quality-wise (Q4 is the sweet spot for most)
3. Add 25% for KV cache, the runtime, and the OS
4. Buy hardware whose usable memory comfortably exceeds that number, not just barely matches it (the sizing sketch after this list puts numbers on these steps)
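Steps 2 and 3 as arithmetic, a minimal sketch reusing the ~4.5 bits/weight Q4 assumption from earlier:

```python
# Steps 2-3 in code: weights at the chosen quant, plus 25% headroom for
# KV cache, runtime, and the OS. Compare the result to usable memory.
def memory_budget_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb * 1.25  # step 3: 25% headroom

for b in (7, 13, 30, 70):
    print(f"{b:>3}B at Q4: budget ~{memory_budget_gb(b):.0f} GB usable")
# -> ~5, ~9, ~21, ~49 GB: per step 4, buy above these numbers, not at them
```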
Apply this
- Look up the unified-memory or VRAM number on your current hardware
- Compute the largest model you can comfortably run at Q4 with 8k context (the KV-cache sketch after this list shows the math)
- Identify the smallest hardware upgrade that would unlock the next class up
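For the second item, the piece people forget to budget is the KV cache. Here is a sketch of the standard transformer KV-cache formula; the two configs are illustrative (roughly a no-GQA 7B and a GQA 8B), so plug in your model's actual layer and head counts from its config.

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context
#                  * bytes_per_element (2 for fp16).
# The configs below are illustrative, not tied to a specific release.
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per / 1e9

# A no-GQA 7B-style config (32 layers, 32 KV heads, head_dim 128) at 8k:
print(f"no-GQA 7B-style: ~{kv_cache_gb(32, 32, 128, 8192):.1f} GB")  # ~4.3 GB

# A GQA 8B-style config (8 KV heads) shrinks that by 4x:
print(f"GQA 8B-style:    ~{kv_cache_gb(32, 8, 128, 8192):.1f} GB")   # ~1.1 GB
```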
The big idea: pick the model first, then size the memory, then pick the hardware. Reversing that order is how teams end up with great GPUs that cannot run the model they actually want.
