Whether a model runs well — or at all — depends on the hardware you put under it. Here is the practical map of what hardware can run which class of model.
An LLM has to fit into memory before it can run. On a discrete GPU, that means VRAM. On Apple Silicon, that means unified memory shared between CPU and GPU. On a CPU-only machine, that means RAM and a lot of patience. Whatever runs is whatever fits. So the buying decision is really a memory-sizing decision.
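The fit check above can be sketched as arithmetic: parameters times bits per weight, plus some slack for the KV cache and runtime buffers. This is a rough illustrative estimate, not a precise calculator; the 20% overhead figure is an assumption.

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough memory footprint of a quantized model.

    bits_per_weight: ~4 for Q4, ~8 for Q8, 16 for fp16.
    overhead: assumed ~20% slack for KV cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at Q4 comes out around 4.2 GB -- inside the ~6GB
# usable budget of an 8GB integrated-GPU laptop.
print(round(model_memory_gb(7, 4), 1))
```

Run the same estimate for a 70B model at Q4 and you land near 42 GB, which is why that class of model needs Mac Studio or datacenter territory.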
| Hardware | Useful memory | Realistic model class | Vibe |
|---|---|---|---|
| 8GB integrated GPU laptop | ~6GB usable | Up to ~7B at Q4 | Toy projects, learning |
| 16GB Apple Silicon Mac | ~10-12GB usable | Up to ~13B at Q4 | Solid daily driver |
| 24GB consumer GPU (e.g. high-end RTX class) | ~22GB usable | Up to ~30B at Q4 or 13B at Q8 | Comfortable workhorse |
| 48GB+ Mac Studio class | ~40GB+ usable | Up to ~70B at Q4 | Power user / small team server |
| 80GB+ datacenter GPU | ~78GB+ usable | 70B at Q8 or 405B at low quant | Serious self-host |
Apple's unified memory architecture means a 64GB Mac Studio can hold a 70B-class model that a 24GB consumer GPU simply cannot. Throughput is not as high as a top-end discrete GPU, but the ceiling on model size is dramatically higher per dollar. For local inference, M-series Macs punch far above their weight.
The big idea: pick the model first, then size the memory, then pick the hardware. Reversing that order is how teams end up with great GPUs that cannot run the model they actually want.