Tendril

Calibration

A calibrated model's 70 percent means it is right 70 percent of the time. Most LLMs are not calibrated. Here is what that costs you.

Creators · AI Foundations · ~24 min read · Advanced · Professional · BI3 Learning · BI5 Societal Impact

Section 1

Probabilities That Mean Something

A calibrated classifier is one whose probability estimates match real-world frequencies. If it assigns 70 percent confidence to each prediction in a batch, about 70 percent of those predictions should turn out correct. Few modern LLMs have this property out of the box.
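To make the definition concrete, here is a minimal sketch that checks one confidence level against observed accuracy. The function name `accuracy_near`, the tolerance `tol`, and the toy data are all hypothetical, chosen only for illustration.

```python
def accuracy_near(confidences, correct, level, tol=0.05):
    """Empirical accuracy among predictions whose confidence is within tol of level.

    Returns None when no prediction falls in that window.
    """
    picked = [c for p, c in zip(confidences, correct) if abs(p - level) <= tol]
    if not picked:
        return None
    return sum(picked) / len(picked)


# Hypothetical toy data: ten predictions, each made at roughly 70 percent confidence.
conf = [0.70, 0.72, 0.68, 0.71, 0.69, 0.70, 0.73, 0.67, 0.70, 0.70]
hit = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 1 = prediction was correct

acc = accuracy_near(conf, hit, level=0.70)
print(f"stated confidence ~0.70, observed accuracy {acc:.2f}")
```

If the observed accuracy sits well below the stated level, the model is overconfident at that level; well above, it is underconfident. A real check needs far more than ten predictions per level.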

The reliability diagram

To check calibration, bin predictions by confidence (0-10 percent, 10-20 percent, and so on). For each bin, plot average confidence on the x-axis against empirical accuracy on the y-axis. A perfectly calibrated model hugs the diagonal. Overconfidence bends the curve below the diagonal (accuracy lower than confidence); underconfidence bends it above.

Reliability diagram: where the model thinks it is vs where it actually is

text

Perfect calibration:

Accuracy
 1.0 |                   *
 0.8 |               *
 0.6 |           *
 0.4 |       *
 0.2 |   *
 0.0 *--------------------- Confidence
    0.0 0.2 0.4 0.6 0.8 1.0

Real models often look like:

Accuracy
 1.0 |
 0.8 |                   *  (overconfident: says 1.0, right 0.8 of the time)
 0.6 |               *
 0.4 |           *
 0.2 |     *
 0.0 *--------------------- Confidence
    0.0 0.2 0.4 0.6 0.8 1.0
     Confidence too high relative to truth.
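The binning step behind a reliability diagram can be sketched in a few lines. This is an illustrative version, not a reference implementation: `reliability_bins` is a hypothetical helper, it assumes confidences lie in [0, 1], it uses equal-width bins closed on the left, and the data is made up.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one tuple per non-empty bin:
    (bin_low, bin_high, average_confidence, accuracy, count).
    """
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for p, c in zip(confidences, correct):
        b = min(int(p * n_bins), n_bins - 1)  # a confidence of exactly 1.0 goes in the last bin
        conf_sum[b] += p
        hit_sum[b] += c
        count[b] += 1
    rows = []
    for b in range(n_bins):
        if count[b]:
            rows.append((b / n_bins, (b + 1) / n_bins,
                         conf_sum[b] / count[b], hit_sum[b] / count[b], count[b]))
    return rows


# Hypothetical toy data: a model that is right half the time at every confidence level.
conf = [0.95, 0.92, 0.97, 0.91, 0.55, 0.58, 0.52, 0.51]
hit = [1, 1, 0, 0, 1, 0, 1, 0]

rows = reliability_bins(conf, hit)
for low, high, avg_conf, acc, n in rows:
    print(f"{low:.1f}-{high:.1f}  confidence={avg_conf:.3f}  accuracy={acc:.3f}  "
          f"gap={avg_conf - acc:+.3f}  n={n}")
```

Plotting `average_confidence` against `accuracy` for each row gives the diagram; a positive gap means that bin falls below the diagonal.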

Expected Calibration Error (ECE)

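ECE sums the absolute gap between average confidence and accuracy across bins, weighting each bin by the fraction of predictions that fall in it. Lower is better, and 0 means perfect calibration at the chosen bin resolution. As a rough rule of thumb, a well-calibrated model has ECE below 0.05, while a raw out-of-the-box LLM often sits at 0.15 or higher.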
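As a sketch of the computation just described: for bins b with n_b predictions out of N, ECE is the sum over bins of (n_b / N) * |accuracy_b - confidence_b|. The function name, the equal-width left-closed binning, and the toy data below are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - average confidence| over equal-width bins."""
    n_total = len(confidences)
    if n_total == 0:
        raise ValueError("need at least one prediction")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for p, c in zip(confidences, correct):
        b = min(int(p * n_bins), n_bins - 1)  # a confidence of exactly 1.0 goes in the last bin
        conf_sum[b] += p
        hit_sum[b] += c
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += (count[b] / n_total) * gap  # weight by the share of predictions in this bin
    return ece


# Hypothetical toy data: high stated confidence, only half the answers correct.
conf = [0.95, 0.92, 0.97, 0.91, 0.55, 0.58, 0.52, 0.51]
hit = [1, 1, 0, 0, 1, 0, 1, 0]

ece = expected_calibration_error(conf, hit)
print(f"ECE = {ece:.3f}")
```

The result depends on the number of bins and on how many predictions land in each, so ECE values are only comparable when computed with the same binning on similarly sized evaluation sets.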

Why LLMs miscalibrate

RLHF training tends to reward confident-sounding responses, because hedging reads as weak
Training text rarely carries explicit probability labels to learn from
Softmax output probabilities are sensitive to fine-tuning choices
In-context examples can push confidence up or down arbitrarily

Fixes that help

1. Temperature scaling: a post-hoc step that divides the logits by a single fitted temperature to sharpen or flatten the probability distribution
2. Prompting for probabilities, then calibrating them offline
3. Ensembling over samples (semantic entropy)
4. Few-shot examples that demonstrate appropriate uncertainty
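The first fix, temperature scaling, can be sketched as follows. This assumes you have access to the model's logits on a labeled validation set, which is not always the case with hosted LLMs. The helper names, the grid search (used here in place of a gradient-based optimizer), and the logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens, < 1 sharpens."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick the temperature on a fixed grid (0.25 to 5.0) that minimizes validation NLL."""
    grid = [0.25 + 0.05 * i for i in range(96)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical validation set: the model puts about 0.88 on its top class
# every time, but is right on only 4 of 6 examples.
val_logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]
val_labels = [0, 0, 1, 1, 0, 0]

T = fit_temperature(val_logits, val_labels)
before = softmax([2.0, 0.0])[0]
after = softmax([2.0, 0.0], T)[0]
print(f"T = {T:.2f}, top-class probability {before:.2f} -> {after:.2f}")
```

Dividing by one positive temperature never changes which class scores highest, so accuracy stays the same; only the stated confidence moves toward the observed hit rate.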

“Modern neural networks are not calibrated — a phenomenon that has worsened as accuracy improved.”
Guo et al., On Calibration of Modern Neural Networks (2017)

The big idea: a confident answer is not a correct answer. Calibration is the bridge between the two.
