Calibration
A calibrated model's 70 percent means it is right 70 percent of the time. Most LLMs are not calibrated. Here is what that costs you.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Probabilities That Mean Something
2. Calibration
3. Reliability diagram
4. ECE
Section 1
Probabilities That Mean Something
A calibrated classifier is one whose probability estimates match real frequencies. If it says 70 percent across a batch of predictions, 70 percent of those should be correct. That is an uncommon property in modern LLMs.
The reliability diagram
To check calibration, bin predictions by confidence (0-10 percent, 10-20 percent, ...). For each bin, plot average confidence vs empirical accuracy. A perfectly calibrated model hugs the diagonal. Overconfidence bends the line below; underconfidence bends it above.
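The binning procedure above can be sketched in plain Python. This is a minimal illustration, not a library API; the function name and the 10-bin choice are assumptions.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (avg_confidence, accuracy, count) for each confidence bin.

    Empty bins are reported as (None, None, 0).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    out = []
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            out.append((avg_conf, acc, len(b)))
        else:
            out.append((None, None, 0))
    return out
```

Plotting `avg_confidence` against `accuracy` for the non-empty bins gives the reliability diagram: a calibrated model's points sit on the diagonal.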
Reliability diagram: where the model thinks it is vs where it actually is

Perfect calibration (points hug the diagonal):

    Accuracy
    1.0 |                *
    0.8 |            *
    0.6 |        *
    0.4 |    *
    0.0 *------------------- Confidence
        0.0      0.6     1.0

Real models often look like this (points fall below the diagonal):

    Accuracy
    1.0 |
    0.8 |                *   (overconfident)
    0.6 |            *
    0.4 |        *
    0.0 *-------------------
        0.0      0.6     1.0

Confidence too high relative to truth.

Expected Calibration Error (ECE)
ECE sums the gap between confidence and accuracy across bins, weighted by bin size. Lower is better. A well-calibrated model has ECE below 0.05. A raw out-of-the-box LLM often sits at 0.15 or higher.
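The bin-gap sum can be written out directly. A minimal sketch, assuming the same equal-width bins as the reliability diagram; the function name is illustrative:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| across bins, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins [lo, hi); confidence 1.0 goes in the last bin.
        in_bin = [(c, ok) for c, ok in zip(confidences, correct)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        acc = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(acc - avg_conf)
    return ece
```

For example, a model that answers at 90 percent confidence but is right only 75 percent of the time has ECE 0.15, the "raw LLM" territory mentioned above.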
Why LLMs miscalibrate
- RLHF training rewards confident responses — hedging sounds weak
- Text never contains explicit probability labels to learn from
- The softmax output probabilities are sensitive to fine-tuning choices
- In-context examples can push confidence up or down arbitrarily
Fixes that help
1. Temperature scaling: post-hoc sharpen or flatten the probability curve
2. Prompting for probabilities, then calibrating offline
3. Ensemble over samples (semantic entropy)
4. Few-shot examples that demonstrate appropriate uncertainty
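Temperature scaling, the first fix above, is a one-parameter transform: divide the logits by a scalar T before the softmax. T > 1 flattens the distribution (less confident), T < 1 sharpens it. A minimal sketch; the logits and the T value here are made-up, and in practice T is fit on held-out data:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling: divide logits by T first."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]          # made-up logits for one prediction
raw = softmax(logits)              # overconfident top probability
cooled = softmax(logits, temperature=2.0)  # flattened toward uniform
```

Note that temperature scaling never changes which class wins, only how confident the model claims to be, which is why it can reduce ECE without touching accuracy.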
“Modern neural networks are not calibrated — a phenomenon that has worsened as accuracy improved.”
The big idea: a confident answer is not a correct answer. Calibration is the bridge between the two.
