llama.cpp: The Engine Underneath Almost Everything
Ollama, LM Studio, and most local-model apps are wrappers around llama.cpp. Knowing what it actually does — and how to drop down to it — pays off when defaults are not enough.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. What llama.cpp actually is
2. llama.cpp and GGUF: The Runtime Under Many Local Apps
3. The operational idea: llama.cpp and GGUF
Section 1
What llama.cpp actually is
llama.cpp is an open-source C/C++ implementation of LLM inference, originally written to run Meta's LLaMA models on a MacBook with no special hardware. It has since become the de facto inference engine for the local-model world: efficient on CPUs, well-tuned on Apple Silicon, with optional GPU offload via CUDA, ROCm, Metal, and Vulkan. If you are running a GGUF file anywhere on the planet, llama.cpp is probably involved.
Why this is worth your attention
- It is the layer where performance is actually decided — wrappers inherit its tuning
- Knowing its flags lets you wring 2-5x more throughput out of the same hardware
- It compiles cleanly on almost every platform — including embedded devices
- Its tools (llama-bench, llama-perplexity) are how you objectively compare quantizations
The same engine that powers Ollama, exposed directly. -ngl 99 offloads all layers to GPU.
# Build and run llama.cpp directly (the old Makefile build is deprecated; use CMake)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
# Run a chat with a downloaded GGUF
./build/bin/llama-cli -m models/llama-3.1-8b-instruct.Q5_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  -p "Hello."
# Server mode — same OpenAI-compatible API
./build/bin/llama-server -m models/llama-3.1-8b-instruct.Q5_K_M.gguf \
  -ngl 99 -c 8192 --port 8080
Compare the options
| Layer | What it does | When to drop down to it |
|---|---|---|
| Ollama / LM Studio | Friendly UX over llama.cpp | Most workflows |
| llama.cpp directly | Engine flags, custom builds, embedded targets | Performance tuning, weird hardware |
| Custom kernel work | Modify the C++ for research | Almost never — read the issues first |
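Once llama-server from the block above is running, any OpenAI-style client can talk to it. A minimal smoke test, assuming the default port 8080 and a chat-capable GGUF already loaded:
# Hypothetical smoke test: POST to the OpenAI-compatible endpoint llama-server exposes.
# The server answers with whatever model it has loaded.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'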
The flags that actually matter
1. -ngl N: number of layers to offload to the GPU. More is faster until you run out of VRAM.
2. -c N: context size in tokens — must be set high enough for your prompts, but not wastefully high, since the KV cache grows with it.
3. -b / -ub: batch and micro-batch sizes — affect throughput on long prompts.
4. --threads N: CPU thread count — usually no benefit beyond physical cores.
5. -fa: flash attention, when supported, often a free speedup. All five flags appear together in the sketch after this list.
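A minimal sketch combining the five flags on one llama-server invocation; the model path, batch sizes, and thread count are placeholders to adjust for your hardware.
# Hypothetical tuning run using the flags above:
#   -ngl 99      offload every layer that fits in VRAM
#   -c 8192      context window in tokens
#   -b / -ub     batch and micro-batch sizes for prompt processing
#   --threads 8  match your physical core count
#   -fa          flash attention, if the build and hardware support it
./build/bin/llama-server -m models/llama-3.1-8b-instruct.Q5_K_M.gguf \
  -ngl 99 -c 8192 -b 2048 -ub 512 --threads 8 -fa --port 8080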
Apply this
- Build llama.cpp from source and run a GGUF you already have via Ollama
- Run llama-bench on the same model with two different -ngl values and compare tokens/second (a sketch follows this list)
- Read the README's Build section once — the optional features list is full of small wins
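One way to run that llama-bench comparison; -ngl takes a comma-separated list, so both offload settings land in a single results table. The model path is a placeholder for any GGUF you have.
# Benchmark the same GGUF with no GPU offload and full offload:
# 512 prompt-processing tokens, 128 generated tokens per test.
./build/bin/llama-bench -m models/llama-3.1-8b-instruct.Q5_K_M.gguf \
  -ngl 0,99 -p 512 -n 128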
The big idea: every local-model tool you love is mostly llama.cpp underneath. Knowing the engine pays off the moment a wrapper's defaults stop being enough.
Section 2
llama.cpp and GGUF: The Runtime Under Many Local Apps
Section 3
The operational idea: llama.cpp and GGUF
llama.cpp and GGUF explain why one model file can run across many consumer machines and local AI apps. In local AI, the model family is only one part of the system. The runtime, file format, serving path, hardware budget, evaluation set, and safety policy decide whether the model becomes useful.
Compare the options
| Layer | What to decide | What can go wrong |
|---|---|---|
| Runtime | llama.cpp and GGUF | The model runs, but the workflow is slow or brittle |
| Evaluation | A small task-specific test set | A flashy demo hides routine failures |
| Safety and ops | Permissions, provenance, logging, and rollback | Downloading random GGUF files without checking source, license, quantization, or chat template. |
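A minimal provenance habit for the safety-and-ops row, assuming a models/ directory and a plain-text log; adapt it to whatever your team already uses for tracking artifacts.
# Record the hash and source of a downloaded GGUF before its first run,
# so a bad model can be traced and rolled back later.
# (On macOS, use `shasum -a 256` in place of `sha256sum`.)
sha256sum models/Qwen3-8B-Instruct-Q4_K_M.gguf | tee -a models/provenance.log
echo "source: <exact repo, revision, and license you downloaded from>" >> models/provenance.log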
Build the small version
Have students inspect a GGUF filename and decode family, size, quantization, and intended runtime before running it.
1. Define the user task in one sentence.
2. Choose the smallest model and runtime that might pass that task.
3. Run one happy-path prompt and one failure-path prompt (see the sketch after this list).
4. Record speed, memory pressure, output quality, and the exact reason for any failure.
5. Write the operating rule you would give a non-expert user.
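Steps 3 and 4 in script form, assuming the llama.cpp build from Section 1 and any instruct-tuned GGUF; llama-cli prints prompt-processing and generation speed at the end of each run, which covers most of what step 4 asks you to record. The prompts and model path are placeholders.
MODEL=models/llama-3.1-8b-instruct.Q5_K_M.gguf   # placeholder: any GGUF you already have
# -no-cnv runs a single non-interactive completion instead of an interactive chat.

# Happy path: a task the model should handle.
./build/bin/llama-cli -m "$MODEL" -ngl 99 -n 128 -no-cnv \
  -p "Summarize in one sentence: the printer jams whenever it prints double-sided."

# Failure path: a question that needs information the model cannot have.
./build/bin/llama-cli -m "$MODEL" -ngl 99 -n 128 -no-cnv \
  -p "Quote the exact refund policy from our internal handbook."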
A local-model operations sketch (a GGUF filename decoder) students can adapt.
filename_decoder:
  file: Qwen3-8B-Instruct-Q4_K_M.gguf
  family: Qwen3
  size: 8B
  type: instruct
  quantization: Q4_K_M
  format: GGUF
  question: who made this file and what template does it need?
The big idea: a local model app is not done when the model answers once; it is done when the whole workflow can be installed, measured, trusted, and recovered.
