Calibration

A calibrated model's 70 percent means it is right 70 percent of the time. Most LLMs are not calibrated. Here is what that costs you.

40 min · Reviewed 2026

Probabilities That Mean Something

A calibrated classifier is one whose probability estimates match real frequencies. If it says 70 percent across a batch of predictions, 70 percent of those should be correct. That is an uncommon property in modern LLMs.

The reliability diagram

To check calibration, bin predictions by confidence (0-10 percent, 10-20 percent, ...). For each bin, plot average confidence against empirical accuracy. A perfectly calibrated model hugs the diagonal. Overconfidence bends the curve below the diagonal; underconfidence bends it above.

Perfect calibration:

Accuracy
 1.0 |                   *
 0.8 |               *
 0.6 |           *
 0.4 |       *
 0.2 |   *
 0.0 *---+---+---+---+---+  Confidence
    0.0 0.2 0.4 0.6 0.8 1.0

Real models often look like:

Accuracy
 1.0 |
 0.8 |                   *
 0.6 |               *
 0.4 |         *
 0.2 |   *
 0.0 *---+---+---+---+---+  Confidence
    0.0 0.2 0.4 0.6 0.8 1.0
     (overconfident: points fall below the diagonal)

Confidence too high relative to truth.

Reliability diagram: where the model thinks it is vs where it actually is
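
The binning step behind a reliability diagram can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `reliability_bins`, the choice of ten equal-width bins, and the toy confidence and correctness values are all assumptions made up for the example.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return one (mean_confidence, accuracy, count) tuple per non-empty bin."""
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    counts = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sums[idx] += conf
        hit_sums[idx] += int(hit)
        counts[idx] += 1
    return [
        (conf_sums[i] / counts[i], hit_sums[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Toy data, invented for illustration only.
confidences = [0.95, 0.92, 0.91, 0.97, 0.65, 0.62, 0.68, 0.35]
correct = [True, False, True, False, True, False, True, False]

for mean_conf, acc, n in reliability_bins(confidences, correct):
    side = "below diagonal" if acc < mean_conf else "on or above diagonal"
    print(f"confidence {mean_conf:.2f}  accuracy {acc:.2f}  n={n}  {side}")
```

Each tuple is one point on the diagram: mean confidence on the horizontal axis, accuracy on the vertical axis. In the toy data, the highest-confidence bin averages about 0.94 confidence but is right only half the time, so it lands well below the diagonal.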

Expected Calibration Error (ECE)

ECE sums the absolute gap between average confidence and accuracy across bins, weighting each bin by its share of the predictions. Lower is better. As a rough rule of thumb, a well-calibrated model has ECE below 0.05, while a raw out-of-the-box LLM often sits at 0.15 or higher.
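
A minimal sketch of that computation, assuming equal-width bins and a list of per-prediction confidences with matching right/wrong flags. The function name `expected_calibration_error` and the toy inputs are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin_count / total) * |accuracy - mean confidence|."""
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if hit else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy case 1: ten answers at 0.9 confidence, only six correct.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
# Toy case 2: ten answers at 0.7 confidence, seven correct.
calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)
print(f"overconfident: {overconfident:.2f}")
print(f"calibrated:    {calibrated:.2f}")
```

In the first toy case all predictions share one bin, so ECE is simply the gap between 0.9 and 0.6, about 0.30. In the second the gap is essentially zero. The result depends on the number of bins, so compare ECE values only when they were computed with the same binning.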

Why LLMs miscalibrate

RLHF training rewards confident responses — hedging sounds weak
Text never contains explicit probability labels to learn from
The softmax output is sensitive to fine-tuning choices
In-context examples can push confidence up or down arbitrarily

Fixes that help

Temperature scaling: sharpen or flatten the probability distribution post hoc by dividing the logits by a fitted temperature
Prompting for probabilities then calibrating offline
Ensemble over samples (semantic entropy)
Few-shot examples that demonstrate appropriate uncertainty
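
As an illustration of the first fix in the list, temperature scaling can be sketched as dividing the logits by a single number T before the softmax and choosing T on held-out data by minimizing negative log-likelihood. The sketch assumes you have access to raw logits, which is often not the case with hosted LLM APIs. The helper names, the candidate grid, and the toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature. T > 1 flattens, T < 1 sharpens."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    """Mean negative log-probability assigned to the true labels."""
    nll = 0.0
    for logits, label in zip(logit_rows, labels):
        nll -= math.log(softmax(logits, temperature)[label])
    return nll / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate T with the lowest held-out NLL (plain grid search)."""
    if candidates is None:
        candidates = [0.25 + 0.05 * i for i in range(96)]  # 0.25 up to 5.0
    return min(
        candidates,
        key=lambda t: negative_log_likelihood(logit_rows, labels, t),
    )


# Toy held-out set: the model is about 98 percent sure of class 0 every time,
# but class 0 is correct in only three of four cases.
logit_rows = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]

t = fit_temperature(logit_rows, labels)
before = softmax(logit_rows[0])[0]
after = softmax(logit_rows[0], t)[0]
print(f"T = {t:.2f}, confidence before = {before:.2f}, after = {after:.2f}")
```

Because dividing every logit by the same positive T never changes which class ranks highest, accuracy stays the same; only the reported confidence moves. Here the fitted T is greater than 1, which flattens the distribution and pulls the stated confidence down toward the observed 75 percent hit rate.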

Modern neural networks are not calibrated — a phenomenon that has worsened as accuracy improved.
— Guo et al., On Calibration of Modern Neural Networks (2017)

The big idea: a confident answer is not a correct answer. Calibration is the bridge between the two.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-calibration

What is the core idea behind "Calibration"?
1. A calibrated model's 70 percent means it is right 70 percent of the time. Most LLMs are not calibrated. Here is what that costs you.
2. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
3. You wrote your prediction BEFORE running — that keeps you honest
4. Show pairs blindly; ask which is better and why
Which term best describes a foundational idea in "Calibration"?
1. reliability diagram
2. calibration
3. ECE
4. temperature scaling
A learner studying Calibration would need to understand which concept?
1. calibration
2. ECE
3. reliability diagram
4. temperature scaling
Which of these is directly relevant to Calibration?
1. calibration
2. reliability diagram
3. temperature scaling
4. ECE
Which of the following is a key point about Calibration?
1. RLHF training rewards confident responses — hedging sounds weak
2. Text never contains explicit probability labels to learn from
3. The Softmax output is sensitive to fine-tuning choices
4. In-context examples can push confidence up or down arbitrarily
Which of these does NOT belong in a discussion of Calibration?
1. Text never contains explicit probability labels to learn from
2. RLHF training rewards confident responses — hedging sounds weak
3. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
4. The Softmax output is sensitive to fine-tuning choices
Which statement is accurate regarding Calibration?
1. Prompting for probabilities then calibrating offline
2. Ensemble over samples (semantic entropy)
3. Temperature scaling: post-hoc sharpen or flatten the probability curve
4. Few-shot examples that demonstrate appropriate uncertainty
Which of these does NOT belong in a discussion of Calibration?
1. Temperature scaling: post-hoc sharpen or flatten the probability curve
2. Prompting for probabilities then calibrating offline
3. Ensemble over samples (semantic entropy)
4. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
What is the key insight about "Overconfidence is the default" in the context of Calibration?
1. In the absence of deliberate calibration work, modern LLMs are systematically overconfident.
2. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
3. You wrote your prediction BEFORE running — that keeps you honest
4. Show pairs blindly; ask which is better and why
What is the recommended tip about "Ground your practice in fundamentals" in the context of Calibration?
1. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
2. Every AI capability has an underlying mechanism. Understanding that mechanism tells you where it'll fail — which is more…
3. You wrote your prediction BEFORE running — that keeps you honest
4. Show pairs blindly; ask which is better and why
Which statement accurately describes an aspect of Calibration?
1. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
2. You wrote your prediction BEFORE running — that keeps you honest
3. A calibrated classifier is one whose probability estimates match real frequencies.
4. Show pairs blindly; ask which is better and why
What does working with Calibration typically involve?
1. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
2. You wrote your prediction BEFORE running — that keeps you honest
3. Show pairs blindly; ask which is better and why
4. To check calibration, bin predictions by confidence (0-10 percent, 10-20 percent, ...).
Which of the following is true about Calibration?
1. ECE sums the gap between confidence and accuracy across bins, weighted by bin size. Lower is better. A well-calibrated model has ECE below 0.05.
2. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
3. You wrote your prediction BEFORE running — that keeps you honest
4. Show pairs blindly; ask which is better and why
Which best describes the scope of "Calibration"?
1. It is unrelated to foundations workflows
2. It focuses on whether a model's 70 percent means it is right 70 percent of the time, and on why most LLMs are not calibrated
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Calibration?
1. Overconfidence: verbose reasoning does not prevent wrong answers — it can make t…
2. You wrote your prediction BEFORE running — that keeps you honest
3. The reliability diagram
4. Show pairs blindly; ask which is better and why


Tendril · Creators · AI Foundations
