Running Hermes Locally With Ollama / LM Studio
Open-weight models like Hermes are useful only if you can actually run them. Ollama and LM Studio are the two paths most people take, and the trade-offs are real.
Lesson map
The main moves in order:
1. The two on-ramps
2. Ollama
3. LM Studio
4. Local inference
The two on-ramps
Ollama is the CLI-first runtime — you type `ollama run hermes3:8b` and you have a model. LM Studio is the GUI-first runtime — you point and click, browse models, and chat in a familiar window. Both are built on the llama.cpp engine (LM Studio can also use an MLX backend on Apple Silicon). Choose based on whether your eventual goal is automation (Ollama) or exploration (LM Studio). Many users keep both.
Ollama in three commands
Ollama is opinionated about model naming — the exact tag depends on what is mirrored in its library at the time you check.
# Install (macOS via Homebrew)
brew install ollama
# Pull a Hermes variant — model name varies by maintainer; check Ollama's library
ollama pull nous-hermes2:latest
# Run it
ollama run nous-hermes2
LM Studio in three clicks
1. Download LM Studio for your platform.
2. Use the model browser to search for 'Hermes' and download a quantized GGUF file.
3. Open the chat window, select the loaded model, and start prompting.
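Once a model is loaded, LM Studio can also serve it over an OpenAI-compatible endpoint from its local server tab (labelled Developer or Local Server depending on the version). A minimal smoke test, assuming the default port 1234; the model name below is a placeholder, so substitute whatever identifier LM Studio shows for your download.
# Ask LM Studio's local server (port 1234 by default) for a completion
# "hermes-placeholder" is a stand-in; use the model identifier LM Studio displays
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "hermes-placeholder", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'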
Compare the options
| Need | Ollama | LM Studio |
|---|---|---|
| Scripting / automation | Best | OK with the local server feature |
| Try-before-you-buy on different quants | Workable | Best — easy to swap |
| Apple Silicon performance | Strong | Strong, sometimes faster on MLX backend |
| OpenAI-compatible API | Built in (localhost:11434) | Built in (configurable port) |
| Headless server | Best | Possible but not the default |
| Beginner UX | Terminal-shaped | Friendlier |
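The OpenAI-compatible API row is worth verifying early, because it is what lets any OpenAI-style client or script talk to a local Hermes. A quick sketch against Ollama's endpoint, assuming the nous-hermes2 tag pulled above:
# Ollama serves an OpenAI-compatible API at localhost:11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nous-hermes2", "messages": [{"role": "user", "content": "Give one use case for a local LLM."}]}'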
Sizing for your hardware
- 8B models in 4-bit quant fit in roughly 6 GB of unified memory or VRAM. Most modern laptops handle them.
- 13B-class models in 4-bit quant want ~10 GB. M-series Macs with 16 GB+ are comfortable.
- 70B models want a Mac Studio or a real GPU box. Plan around 40+ GB even at aggressive quantization.
- Always leave headroom for context — long prompts inflate memory use beyond the model's static footprint.
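A back-of-envelope check behind those numbers: a 4-bit quant stores roughly half a byte per parameter, and the KV cache plus runtime buffers add overhead on top. The multipliers below are assumptions for a rough sketch; real GGUF files vary by quant scheme and context length.
# Rough estimate: params (billions) x 0.5 bytes/param for 4-bit weights, x ~1.2 for cache and buffers
echo "8 * 0.5 * 1.2" | bc    # about 4.8 GB for an 8B model; budget ~6 GB with context headroom
echo "70 * 0.5 * 1.2" | bc   # about 42 GB for a 70B model, before long prompts push it higher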
Applied exercise
1. Install one of the two runtimes.
2. Pull a Hermes model that fits your hardware.
3. Send three prompts through it and time the responses.
4. Then point a script at the local OpenAI-compatible URL and run the same prompts (a sketch follows below). Note the latency.
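For step 4, here is one way to do it from the shell: replay the same prompts against the local OpenAI-compatible URL and time each round trip. It assumes Ollama on port 11434 and the nous-hermes2 tag from earlier; swap in LM Studio's port and model identifier if that is your setup.
# Time three prompts against a local OpenAI-compatible endpoint
URL="http://localhost:11434/v1/chat/completions"   # LM Studio's default would be port 1234
MODEL="nous-hermes2"                               # use whatever tag or identifier you actually pulled
for PROMPT in "What is a GGUF file?" "Explain 4-bit quantization in two sentences." "Name three uses for a local LLM."; do
  time curl -s "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"$PROMPT\"}]}" \
    > /dev/null
done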
The big idea: local Hermes is a one-evening setup. After that, the only real question is which size fits your hardware.
