Lesson 422 of 1596
llama.cpp: The Engine Underneath Almost Everything
Ollama, LM Studio, and most local-model apps are wrappers around llama.cpp. Knowing what it actually does — and how to drop down to it — pays off when defaults are not enough.
Creators · Model Families · ~21 min read
What llama.cpp actually is
llama.cpp is an open-source C/C++ implementation of LLM inference, originally written to run Meta's LLaMA models on a MacBook with no special hardware. It has since become the de facto inference engine for the local-model world: efficient on CPUs, well-tuned on Apple Silicon, with optional GPU offload via CUDA, ROCm, Metal, and Vulkan. If you are running a GGUF file anywhere on the planet, llama.cpp is probably involved.
Why this is worth your attention
- It is the layer where performance is actually decided — wrappers inherit its tuning
- Knowing its flags lets you wring 2-5x more throughput out of the same hardware
- It compiles cleanly on almost every platform — including embedded devices
- Its tools (llama-bench, llama-perplexity) are how you objectively compare quantizations
The same engine that powers Ollama, exposed directly. -ngl 99 offloads all layers to GPU.
# Build and run llama.cpp directly git clone https://github.com/ggml-org/llama.cpp cd llama.cpp && make # Run a chat with a downloaded GGUF ./llama-cli -m models/llama-3.1-8b-instruct.Q5_K_M.gguf \ -ngl 99 \ -c 8192 \ -p "Hello." # Server mode — same OpenAI-compatible API ./llama-server -m models/llama-3.1-8b-instruct.Q5_K_M.gguf \ -ngl 99 -c 8192 --port 8080Compare the options
| Layer | What it does | When to drop down to it |
|---|---|---|
| Ollama / LM Studio | Friendly UX over llama.cpp | Most workflows |
| llama.cpp directly | Engine flags, custom builds, embedded targets | Performance tuning, weird hardware |
| Custom kernel work | Modify the C++ for research | Almost never — read the issues first |
The flags that actually matter
- 1-ngl N: number of layers to offload to GPU. More is faster until you run out of VRAM
- 2-c N: context size in tokens — must be set high enough for your prompts but not wastefully
- 3-b / -ub: batch and micro-batch sizes — affects throughput on long prompts
- 4--threads N: CPU thread count — usually no benefit beyond physical cores
- 5-fa: flash attention, when supported, often a free speedup
Apply this
- Build llama.cpp from source and run a GGUF you already have via Ollama
- Run llama-bench on the same model with two different -ngl values and compare tokens/second
- Read the README's Build section once — the optional features list is full of small wins
Key terms in this lesson
The big idea: every local-model tool you love is mostly llama.cpp underneath. Knowing the engine pays off the moment a wrapper's defaults stop being enough.
End-of-lesson quiz
Check what stuck
8 questions · Score saves to your progress.
Tutor
Curious about “llama.cpp: The Engine Underneath Almost Everything”?
Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.
Progress saved locally in this browser. Sign in to sync across devices.
Related lessons
Keep going
Builders · 40 min
AI model families: Meta's Llama (open source)
Understand why Llama matters as a free, open AI model anyone can run.
Creators · 9 min
Quantization Tradeoffs (Q4 Vs Q8) For Hermes
Quantization is the dial between model quality and what fits on your hardware. With Hermes, the right setting depends entirely on the task — there is no universal answer.
Creators · 11 min
Quantization Explained: GGUF, AWQ, GPTQ, and the Q4 vs Q8 vs FP16 Decision
A model file's quantization decides how big it is, how fast it runs, and how good it sounds. Learn the formats, the trade-offs, and how to pick the right one.
