Calibration
A calibrated model's 70 percent means it is right 70 percent of the time. Most LLMs are not calibrated. Here is what that costs you.
Creators · AI Foundations · ~24 min read
Probabilities That Mean Something
A calibrated classifier is one whose probability estimates match real frequencies. If it says 70 percent across a batch of predictions, 70 percent of those should be correct. That is an uncommon property in modern LLMs.
The reliability diagram
To check calibration, bin predictions by confidence (0-10 percent, 10-20 percent, and so on). For each bin, plot average confidence against empirical accuracy. A perfectly calibrated model hugs the diagonal. Overconfidence bends the line below the diagonal; underconfidence bends it above.
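The binning step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `reliability_bins`, the choice of ten equal-width bins, and the toy data are all assumptions made for the example.

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    confidences: predicted probability of the chosen answer, in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Interior bin edges 0.1, 0.2, ..., 0.9 for the default of ten bins.
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    # digitize maps each confidence to a bin index 0..n_bins-1;
    # a confidence of exactly 1.0 lands in the last bin.
    idx = np.digitize(confidences, interior_edges)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(confidences[mask].mean()),
                         float(correct[mask].mean()),
                         int(mask.sum())))
    return rows

# Hypothetical toy data: four high-confidence and two mid-confidence predictions.
toy_conf = [0.95, 0.92, 0.91, 0.97, 0.55, 0.58]
toy_correct = [1, 0, 1, 0, 1, 0]
for mean_conf, accuracy, count in reliability_bins(toy_conf, toy_correct):
    print(f"confidence {mean_conf:.3f}  accuracy {accuracy:.3f}  n={count}")
```

Each returned row is one point on the reliability diagram: plot mean confidence on the x-axis and accuracy on the y-axis, then compare against the diagonal.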
Reliability diagram: where the model thinks it is vs where it actually is
Perfect calibration:

Accuracy
1.0 |        *
0.8 |      *
0.6 |    *
0.4 |  *
0.0 *---------
    Confidence
    0.0   0.6   1.0

Real models often look like:

Accuracy
1.0 |          *
0.8 |      * *   (overconfident)
0.6 |    *
0.4 |  *
0.0 *---------

Confidence too high relative to truth.
Expected Calibration Error (ECE)
ECE sums the absolute gap between average confidence and accuracy in each bin, weighting each bin by its share of the predictions. Lower is better. As a rough guide, a well-calibrated model has ECE below 0.05, while a raw out-of-the-box LLM often sits at 0.15 or higher.
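As a worked illustration of that definition, the sketch below computes ECE as the sum over bins of (bin share) × |accuracy − mean confidence|. The function name `expected_calibration_error`, the ten equal-width bins, and the toy inputs are assumptions for the example; other binning schemes give somewhat different numbers.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum_b (n_b / N) * |acc_b - conf_b|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    idx = np.digitize(confidences, interior_edges)  # bin index per prediction
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share n_b / N
    return float(ece)

# Hypothetical toy data: mostly high confidence, only half of it correct.
toy_conf = [0.95, 0.92, 0.91, 0.97, 0.55, 0.58]
toy_correct = [1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(toy_conf, toy_correct), 4))
```

On the toy data, the high-confidence bin contributes most of the error: it holds four of the six predictions, with a mean confidence of 0.9375 against an accuracy of 0.5.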
Why LLMs miscalibrate
- RLHF training rewards confident-sounding responses, because hedging sounds weak
- Training text rarely contains explicit probability labels to learn from
- The softmax output is sensitive to fine-tuning choices
- In-context examples can push confidence up or down arbitrarily
Fixes that help
1. Temperature scaling: post hoc, rescale the logits by one fitted temperature to sharpen or flatten the probability distribution
2. Prompting for probabilities, then calibrating them offline
3. Ensembling over samples (semantic entropy)
4. Few-shot examples that demonstrate appropriate uncertainty
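The first fix in the list, temperature scaling, can be sketched as follows. This is a simplified illustration under stated assumptions: it presumes you have access to the model's raw logits on a held-out labeled set (often not the case with hosted LLM APIs), it chooses the temperature by minimizing negative log-likelihood, and it uses a plain grid search in place of the gradient-based fitting that is more usual in practice. The names `softmax`, `nll`, and `fit_temperature` and the toy data are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at this temperature."""
    labels = np.asarray(labels, dtype=int)
    probs = softmax(logits, temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # assumed search range, step 0.05
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Hypothetical held-out set: the model is very sure of class 0 every time,
# but class 0 is right only three times out of four.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
print("confidence before:", softmax(val_logits)[0, 0])
print("confidence after: ", softmax(val_logits, T)[0, 0])
```

A fitted temperature above 1 flattens the distribution (the overconfident case), while a value below 1 sharpens it. Because every logit in a row is divided by the same positive number, the top-ranked answer does not change; only the confidence attached to it does.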
“Modern neural networks are not calibrated — a phenomenon that has worsened as accuracy improved.”
The big idea: a confident answer is not a correct answer. Calibration is the bridge between the two.