BLEU, ROUGE, F1 — Automatic Metrics and Their Limits
Before LLMs-as-judges, researchers had hand-made metrics. They still matter — and still mislead.
Lesson map
1. The Old Guard
2. BLEU
3. ROUGE
4. F1
Section 1
The Old Guard
Before LLM-as-judge, NLP researchers invented clever string-matching metrics to approximate 'correctness.' They still live in many papers and pipelines. Knowing them — and their weaknesses — is part of AI literacy.
Compare the options
| Metric | Used for | What it measures |
|---|---|---|
| BLEU | Machine translation | Clipped n-gram precision against references, with a brevity penalty |
| ROUGE | Summarization | N-gram or longest-common-subsequence overlap with references (recall-oriented) |
| F1 | Classification, QA | Harmonic mean of precision and recall |
| Exact match | Short-answer QA | Whether the answer string equals the reference, usually after normalization |
| BERTScore | Any text | Token-level similarity of contextual embeddings (requires a pretrained model) |
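To make the table concrete, here is a minimal sketch of the string-matching idea behind the first four rows. It is a simplification, not any library's implementation: it assumes whitespace tokenization and a single reference, and the helper names (`ngrams`, `overlap_scores`, `exact_match`) are made up for this illustration. Real BLEU additionally combines several n-gram orders and applies a brevity penalty, and real ROUGE comes in several variants.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Clipped n-gram precision, recall, and F1 for one candidate/reference pair."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping").
    matched = sum((cand & ref).values())
    precision = matched / sum(cand.values()) if cand else 0.0  # BLEU-style view
    recall = matched / sum(ref.values()) if ref else 0.0       # ROUGE-style view
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


def exact_match(candidate, reference):
    """True when the strings are equal after trimming and lowercasing."""
    return candidate.strip().lower() == reference.strip().lower()


# A short candidate that is entirely contained in a longer reference:
p, r, f1 = overlap_scores("the cat sat", "the cat sat on the mat")
print(p, r, round(f1, 3))  # precision 1.0, recall 0.5, F1 0.667
```

The example shows why precision and recall are reported separately: a candidate that says little but says it correctly gets perfect precision and poor recall, and F1 sits between the two.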
Why they still exist
- Cheap — compute in milliseconds
- Reproducible — same inputs give same outputs
- No external model dependency (BERTScore, which needs a pretrained encoder, is the exception)
- Familiar across decades of literature
Why they mislead
- A great paraphrase may share zero n-grams with the reference and score zero
- A copy of the reference with one word changed gets a near-perfect score, even if that word reverses the meaning
- No sense of factuality — a fluent lie scores well
- Blind to word order beyond the n-gram window and to coherence at the document level
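The first two failure modes in the list above are easy to reproduce. The sketch below uses a simplified unigram F1 (whitespace tokens, one reference) and invented example sentences; `unigram_f1` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over shared word counts between a candidate and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum((cand & ref).values())  # shared words, clipped by count
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the medication reduces fever quickly in adults"

# Same meaning, different words: no overlap at all.
paraphrase = "this drug rapidly lowers high temperatures for grown-ups"

# One word changed, meaning reversed: six of seven words still match.
near_copy = "the medication increases fever quickly in adults"

print(unigram_f1(paraphrase, reference))           # 0.0
print(round(unigram_f1(near_copy, reference), 3))  # 0.857
```

The metric ranks the factually wrong sentence far above the correct paraphrase, which is exactly the behavior the list warns about.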
When to still use them
1. Tight feedback loops where LLM-as-judge would be too slow
2. Regression tests where you want bit-stable scores
3. Research comparability with older papers
4. As one signal in a larger battery, never as the only one
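As a rough illustration of the regression-test use case, the sketch below gates on the mean of a simplified unigram F1 over a small fixed evaluation set. Everything here is an assumption made for the example: the helper names, the toy sentence pairs, the baseline of 0.95, and the tolerance. In practice the pairs would be stored model outputs and references, and this check would sit alongside other signals rather than replace them.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over shared word counts between a candidate and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum((cand & ref).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def check_regression(pairs, baseline, tolerance=0.01):
    """Return (passed, score); fail if the mean score falls below baseline - tolerance."""
    score = sum(unigram_f1(c, r) for c, r in pairs) / len(pairs)
    return score >= baseline - tolerance, score


# Hypothetical toy eval set of (model output, reference) pairs.
EVAL_PAIRS = [
    ("paris is the capital of france", "paris is the capital of france"),
    ("water boils at 100 degrees", "water boils at 100 degrees celsius"),
]

passed, score = check_regression(EVAL_PAIRS, baseline=0.95)
print(passed, round(score, 4))  # True 0.9545
```

Because the score is a pure function of the strings, rerunning the check on the same outputs always gives the same number, which is what makes it usable as a stable gate.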
“BLEU correlates reasonably with human judgment at the system level, but barely at the sentence level.”
The big idea: automatic metrics are fast, cheap, and blunt. Use them as seismographs, not scales.