A calibrated model's 70 percent means it is right 70 percent of the time. Most LLMs are not calibrated. Here is what that costs you.
A calibrated classifier is one whose probability estimates match real frequencies. If it says 70 percent across a batch of predictions, 70 percent of those should be correct. That is an uncommon property in modern LLMs.
To check calibration, bin predictions by confidence (0-10 percent, 10-20 percent, and so on). For each bin, plot average confidence against empirical accuracy. A perfectly calibrated model hugs the diagonal. Overconfidence bends the line below the diagonal; underconfidence bends it above.
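The binning step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `reliability_bins`, the equal-width bins, and the toy data are all assumptions made for the example.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (avg_confidence, accuracy, count) tuple per non-empty bin.
    These are the points you would plot on a reliability diagram.
    """
    # Per bin: [sum of confidences, number correct, number of predictions]
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += conf
        sums[idx][1] += int(hit)
        sums[idx][2] += 1
    return [(c / n, h / n, n) for c, h, n in sums if n > 0]


# Hypothetical toy data: five answers given at 0.9 confidence, three correct.
confs = [0.9, 0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False, False]
points = reliability_bins(confs, hits)

for avg_conf, acc, count in points:
    side = "below" if acc < avg_conf else "on or above"
    print(f"confidence {avg_conf:.2f}, accuracy {acc:.2f}, n={count}: {side} the diagonal")
```

In this toy batch the single point sits below the diagonal (accuracy under confidence), which is the overconfident pattern.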
[Figure: Reliability diagram: where the model thinks it is vs where it actually is. Both panels plot accuracy against confidence. Perfect calibration: the points lie on the diagonal. Real models often look overconfident: the points fall below the diagonal, with confidence too high relative to truth.]

Expected calibration error (ECE) sums the absolute gap between confidence and accuracy across bins, weighted by each bin's share of the predictions. Lower is better. A well-calibrated model has ECE below 0.05. A raw out-of-the-box LLM often sits at 0.15 or higher.
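The bin-weighted sum behind ECE can be sketched as follows. This is one common way to compute it, assuming equal-width bins; the function name `expected_calibration_error` and the toy data are hypothetical, chosen only to illustrate the arithmetic.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sum over bins of (bin share) * |avg confidence - accuracy|."""
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")
    # Per bin: [sum of confidences, number correct, number of predictions]
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins[idx][0] += conf
        bins[idx][1] += int(hit)
        bins[idx][2] += 1
    ece = 0.0
    for conf_sum, n_hits, n in bins:
        if n == 0:
            continue  # empty bins contribute nothing
        gap = abs(conf_sum / n - n_hits / n)
        ece += (n / total) * gap
    return ece


# Hypothetical toy data:
#   five answers at 0.9 confidence, three correct -> gap 0.3
#   five answers at 0.6 confidence, three correct -> gap 0.0
confs = [0.9] * 5 + [0.6] * 5
hits = [True, True, True, False, False] * 2
ece = expected_calibration_error(confs, hits)
print(f"ECE = {ece:.2f}")
```

Each bin holds half of the predictions here, so the result is 0.5 × 0.3 + 0.5 × 0.0 = 0.15. The overconfident bin alone is enough to push this toy model well above the 0.05 mark.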
Modern neural networks are not calibrated — a phenomenon that has worsened as accuracy improved.
— Guo et al., On Calibration of Modern Neural Networks (2017)
The big idea: a confident answer is not a correct answer. Calibration is the bridge between the two.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-calibration
What is the main idea of "Calibration"?
Which concept is most central to "Calibration"?
Which use of AI fits this topic best?
What should a careful learner remember about "Overconfidence is the default"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about calibration be treated?
Name one way to verify an AI answer about calibration.
Which action would help you apply "Calibration" responsibly?