Lesson 8 of 2116
Probabilistic Systems: Why LLMs Do Not Act Like Code
Writing software on top of an LLM is not like writing software on top of a database. Treat it as a stochastic system or it will bite you.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. LLMs Are Samplers
2. Probabilistic
3. Temperature
4. Determinism
Section 1
LLMs Are Samplers
An LLM does not return an answer. It returns a probability distribution over possible next tokens. Sampling from that distribution produces text. Even identical prompts will produce different text unless you control the randomness explicitly.
The sampling knobs
Compare the options
| Parameter | Effect |
|---|---|
| temperature | Scales logits. 0 = greedy, higher = more random |
| top_p (nucleus) | Keep the smallest set of tokens whose cumulative probability exceeds p |
| top_k | Keep only the top k tokens |
| repetition_penalty | Down-weight tokens already in context |
| seed | Pin the pseudo-random generator for reproducibility |
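Temperature is the easiest knob to build intuition for. A minimal, self-contained sketch (the logits are made up for illustration) shows how dividing logits by the temperature before the softmax sharpens or flattens the distribution being sampled from:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/T, then normalize. As T -> 0 the distribution
    # approaches greedy argmax; large T flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens
logits = [4.0, 2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
print(f"T=0.2 top-token mass: {cold[0]:.3f}")  # nearly all mass on the top token
print(f"T=2.0 top-token mass: {hot[0]:.3f}")   # mass spreads across tokens
```

At low temperature the top token absorbs almost all the probability mass, which is why temperature 0 behaves (mostly) deterministically; at high temperature even unlikely tokens get sampled.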
Why this breaks classical software instincts
- Unit tests that check exact output strings are brittle
- Caching by prompt hash works only at temperature 0 and even then not perfectly
- Error handling must account for plausible wrong answers, not just failures
- Rate limits and costs scale with token counts, which vary per call
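The caching caveat above can be sketched concretely. A minimal prompt-hash cache, with a hypothetical `call_model` stand-in for the real API call, looks like this; note it is only sound at temperature 0, and even then providers do not guarantee bit-identical outputs across model updates:

```python
import hashlib

cache = {}

def cached_generate(prompt, call_model):
    # Key the cache on a hash of the prompt; fall through to the model
    # only on a miss. Assumes temperature-0, fixed-model calls.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = call_model(prompt)
    return cache[key]

# Stand-in model that records how often it is actually invoked
calls = []
def fake_model(prompt):
    calls.append(prompt)
    return f"echo: {prompt}"

print(cached_generate("hi", fake_model))  # miss: invokes the model
print(cached_generate("hi", fake_model))  # hit: served from cache
print(f"model calls: {len(calls)}")       # the model ran only once
```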
Sample multiple times and reason statistically. One call is anecdote, ten calls are data.
```python
# Robust evaluation of a probabilistic system
import anthropic

client = anthropic.Anthropic()

def evaluate(prompt, n=10):
    outputs = []
    for _ in range(n):
        resp = client.messages.create(
            model="claude-opus-4-1",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}],
        )
        outputs.append(resp.content[0].text)
    # Measure pass@k, agreement, or similar
    return outputs

results = evaluate("Classify: 'great product'", n=10)
print(f"Unique outputs: {len(set(results))}")
```

Strategies for taming randomness
1. Constrain output format (JSON schema, regex, tool calls)
2. Use structured generation libraries like Outlines, Instructor, or native JSON mode
3. Ensemble multiple samples and majority-vote
4. Use a cheaper verifier model to grade an expensive generator
5. Cache known-good outputs and fall through to the LLM only on a cache miss
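Strategy 3 above needs no extra infrastructure. A minimal majority-vote sketch over hypothetical classification samples (the labels are illustrative, not real model output):

```python
from collections import Counter

def majority_vote(samples):
    # Keep the most common answer across n samples; Counter.most_common
    # breaks ties by first occurrence. Also report the agreement ratio,
    # a cheap confidence signal.
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

# Hypothetical outputs from 5 calls at nonzero temperature
samples = ["positive", "positive", "neutral", "positive", "positive"]
label, agreement = majority_vote(samples)
print(label, agreement)  # positive 0.8
```

Low agreement is itself a useful signal: it flags inputs where the model's distribution is genuinely uncertain and a human or a stronger model should take over.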
Testing probabilistic systems
- Behavioral tests: semantic properties (answer contains X, matches regex, passes a grader)
- Distributional tests: 95 percent of samples must score above threshold
- A/B tests in production with guardrails
- Canary prompts that track drift between model versions
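A distributional test from the list above can be sketched in a few lines. The grader here is a hypothetical behavioral check (the real one might be a regex, a schema validator, or a grading model):

```python
def distributional_test(outputs, grader, threshold=0.95):
    # Pass if at least `threshold` of the sampled outputs satisfy the
    # grader, instead of demanding an exact string match on every call.
    passed = sum(1 for o in outputs if grader(o))
    return passed / len(outputs) >= threshold

# Hypothetical grader: the answer must be a bare sentiment label
grader = lambda text: text.strip().lower() in {"positive", "negative", "neutral"}

outputs = ["positive"] * 19 + ["I think it's great!"]
print(distributional_test(outputs, grader))  # 19/20 = 0.95 -> True
```

The exact-match unit test would fail on the one chatty response; the distributional test tolerates it as long as the overall pass rate stays above the bar.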
“An LLM in production is a distribution, not a function. Design accordingly.”
The big idea: LLMs are stochastic systems. The tooling, testing, and architecture patterns that assume deterministic code need to be re-thought from the ground up when you put one in the loop.
