Probabilistic Systems: Why LLMs Do Not Act Like Code
Writing software on top of an LLM is not like writing software on top of a database. Treat it as a stochastic system or it will bite you.
Creators · AI Foundations · ~27 min read
LLMs Are Samplers
An LLM does not return an answer. At each step it produces a probability distribution over possible next tokens, and the text you see is the result of sampling from that distribution one token at a time. Identical prompts can therefore produce different text unless you control the randomness explicitly.
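To make the sampling step concrete, here is a minimal sketch in plain Python. The logits are made-up numbers for an imaginary four-token vocabulary, not output from any real model, and `softmax` and `sample_token` are illustrative helper names.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_token(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for the token after "The movie was" (assumed values)
toy_logits = {"great": 2.0, "fine": 1.5, "terrible": 1.0, "purple": -2.0}
probs = softmax(toy_logits)

# Unseeded generator: the five draws can differ from run to run
unseeded = random.Random()
print([sample_token(probs, unseeded) for _ in range(5)])

# Two generators with the same seed produce the same draws
seeded_a = random.Random(42)
seeded_b = random.Random(42)
run_a = [sample_token(probs, seeded_a) for _ in range(5)]
run_b = [sample_token(probs, seeded_b) for _ in range(5)]
print(run_a == run_b)  # True
```

A real model repeats this draw once per generated token, so small differences early in the output can compound into very different text.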
The sampling knobs
| Parameter | Effect |
|---|---|
| temperature | Divides the logits before softmax. 0 = greedy (always pick the most likely token), higher = more random |
| top_p (nucleus) | Keep the smallest set of tokens whose cumulative probability reaches p |
| top_k | Keep only the k most probable tokens |
| repetition_penalty | Down-weight tokens already in context |
| seed | Pin the pseudo-random generator for reproducibility (where supported, often best-effort) |
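The first three knobs in the table can be sketched as small functions over a token-to-score dictionary. This is an illustrative toy, not how any particular inference engine implements them; the function names and the logits are assumptions, and real implementations work on tensors and may differ in edge cases such as ties or whether the threshold is inclusive.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then softmax. Requires temperature > 0."""
    scaled = {t: s / temperature for t, s in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(s - peak) for t, s in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {t: p / total for t, p in probs.items()}

def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return renormalize(dict(kept))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

# Hypothetical logits (assumed values)
toy_logits = {"great": 2.0, "fine": 1.5, "terrible": 1.0, "purple": -2.0}
cold = apply_temperature(toy_logits, 0.2)
hot = apply_temperature(toy_logits, 2.0)
print(cold["great"] > hot["great"])   # True: low temperature sharpens the distribution
print(sorted(top_k_filter(hot, 2)))   # ['fine', 'great']
print("purple" in top_p_filter(hot, 0.9))  # False: the unlikely tail is cut off
```

Temperature 0 is handled as a special case in practice (pick the arg-max token), since dividing by zero is undefined.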
Why this breaks classical software instincts
- Unit tests that check exact output strings are brittle
- Caching by prompt hash is only sound at temperature 0, and even then not perfectly: the same request can still return different text across calls or model updates
- Error handling must account for plausible wrong answers, not just failures
- Rate limits and costs scale with token counts, which vary per call
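The caching point above can be sketched as follows. The idea is that a cache key should cover everything that influences the output, not just the prompt, and that responses should only be reused when sampling is greedy. `cache_key` and `is_cacheable` are hypothetical helpers, and the model name is a placeholder.

```python
import hashlib
import json

def cache_key(model, prompt, temperature, seed=None):
    """Hash the model and sampling settings together with the prompt."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "temperature": float(temperature),  # so 0 and 0.0 hash the same
            "seed": seed,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_cacheable(temperature):
    """Only reuse responses when decoding is greedy (and accept it is still imperfect)."""
    return float(temperature) == 0.0

greedy_key = cache_key("example-model", "Classify: 'great product'", 0)
sampled_key = cache_key("example-model", "Classify: 'great product'", 0.7)
print(greedy_key == sampled_key)  # False: different settings, different entry
print(is_cacheable(0), is_cacheable(0.7))  # True False
```

Including the model identifier in the key means a model upgrade naturally invalidates old entries instead of serving stale answers.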
Sample multiple times and reason statistically. One call is an anecdote; ten calls are data.
```python
# Robust evaluation of a probabilistic system
import statistics
import anthropic

client = anthropic.Anthropic()

def evaluate(prompt, n=10):
    outputs = []
    for _ in range(n):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}],
        )
        outputs.append(resp.content[0].text)
    # Measure pass@k, agreement, or similar
    return outputs

results = evaluate("Classify: 'great product'", n=10)
print(f"Unique outputs: {len(set(results))}")
```
Strategies for taming randomness
1. Constrain output format (JSON schema, regex, tool calls)
2. Use structured generation libraries like Outlines, Instructor, or native JSON mode
3. Ensemble multiple samples and majority-vote
4. Use a cheaper verifier model to grade an expensive generator
5. Cache known-good outputs and fall through to LLM only on cache miss
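As a rough illustration of the first and third strategies together, the sketch below enforces a format with a regex and then majority-votes over the samples that pass. The label set, the helper names, and the canned sample strings are all assumptions standing in for real model outputs.

```python
import re
from collections import Counter

ALLOWED = {"positive", "negative", "neutral"}  # hypothetical label set

def parse_label(raw):
    """Return a normalized label, or None if the output breaks the format."""
    match = re.fullmatch(r"\s*(\w+)[\s.]*", raw)
    if match is None:
        return None
    label = match.group(1).lower()
    return label if label in ALLOWED else None

def majority_vote(raw_outputs):
    """Return (winning label, share of valid votes), or (None, 0.0) if none parse."""
    labels = [lab for lab in map(parse_label, raw_outputs) if lab is not None]
    if not labels:
        return None, 0.0
    winner, count = Counter(labels).most_common(1)[0]
    return winner, count / len(labels)

# Canned stand-ins for five samples of the same prompt
samples = ["Positive", "positive.", "neutral", "I think it's positive!", "POSITIVE\n"]
print(majority_vote(samples))  # ('positive', 0.75)
```

The fourth sample is discarded because it is not a bare label, which is the point of constraining the format first: the vote only counts outputs you can interpret unambiguously. The agreement share doubles as a cheap confidence signal.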
Testing probabilistic systems
- Behavioral tests: semantic properties (answer contains X, matches regex, passes a grader)
- Distributional tests: assert on the pass rate across many samples (for example, at least 95 percent must score above a threshold)
- A/B tests in production with guardrails
- Canary prompts that track drift between model versions
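A behavioral check and a distributional test can be combined in a few lines. This is a sketch under stated assumptions: `pass_rate`, `assert_distribution`, and `mentions_refund` are hypothetical helpers, and the sample strings are canned stand-ins for outputs you would collect by calling the model repeatedly.

```python
def pass_rate(outputs, check):
    """Fraction of outputs for which check(output) is true."""
    if not outputs:
        raise ValueError("need at least one sample")
    return sum(1 for out in outputs if check(out)) / len(outputs)

def assert_distribution(outputs, check, min_rate=0.95):
    """Fail if too few samples satisfy the behavioral check."""
    rate = pass_rate(outputs, check)
    assert rate >= min_rate, f"pass rate {rate:.2f} below {min_rate:.2f}"
    return rate

def mentions_refund(text):
    """Behavioral check: a semantic property rather than an exact string."""
    return "refund" in text.lower()

# Canned stand-ins for 20 sampled model outputs
samples = ["You can request a refund within 30 days."] * 19 + ["Please contact support."]
print(assert_distribution(samples, mentions_refund))  # 0.95
```

With only 20 samples the measured rate is itself noisy, so in practice the sample size and threshold should be chosen together; a test that sits right at the threshold will flake.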
“An LLM in production is a distribution, not a function. Design accordingly.”
The big idea: LLMs are stochastic systems. The tooling, testing, and architecture patterns that assume deterministic code need to be re-thought from the ground up when you put one in the loop.
