Tendril

BLEU, ROUGE, F1 — Automatic Metrics and Their Limits

Before LLMs-as-judges, researchers had hand-made metrics. They still matter — and still mislead.

Builders · AI Foundations · ~17 min read

The Old Guard

Before LLM-as-judge evaluation, NLP researchers scored model output with automatic metrics, most of them based on matching strings against a reference answer, as a proxy for correctness. These metrics still appear in many papers and pipelines. Knowing them, and their weaknesses, is part of AI literacy.

Compare the options

Metric	Used for	What it measures
BLEU	Machine translation	Clipped n-gram precision against references, with a brevity penalty
ROUGE	Summarization	N-gram or longest-common-subsequence overlap with references (recall-oriented)
F1	Classification, QA	Harmonic mean of precision and recall
Exact match	Short-answer QA	Whether the answer string matches the reference exactly
BERTScore	Any generated text	Token-level semantic similarity via contextual embeddings
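
To make the table concrete, here is a minimal sketch of the overlap counting behind these metrics, assuming lowercased, whitespace-split tokens and a single reference. The function names are illustrative, not taken from any library, and this is not a full BLEU or ROUGE implementation: real BLEU combines several n-gram orders and adds a brevity penalty, and real ROUGE has several variants.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n=1):
    """Return (matched, candidate_total, reference_total) n-gram counts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the smaller count per n-gram ("clipping"),
    # so repeating a word cannot earn more credit than the reference allows.
    matched = sum((cand & ref).values())
    return matched, sum(cand.values()), sum(ref.values())

def precision(candidate, reference, n=1):
    """BLEU-style view: how much of the candidate appears in the reference?"""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0

def recall(candidate, reference, n=1):
    """ROUGE-style view: how much of the reference did the candidate cover?"""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0

def f1(candidate, reference, n=1):
    """Harmonic mean of the precision and recall defined above."""
    p, r = precision(candidate, reference, n), recall(candidate, reference, n)
    return 2 * p * r / (p + r) if p + r else 0.0

def exact_match(candidate, reference):
    """True only if the strings are identical after trimming and lowercasing."""
    return candidate.strip().lower() == reference.strip().lower()

reference = "the cat sat on the mat"
candidate = "the cat sat"
print(precision(candidate, reference))  # 1.0: every candidate word is in the reference
print(recall(candidate, reference))     # 0.5: only half of the reference is covered
```

The short candidate gets perfect precision while covering half the reference, which is why precision-oriented BLEU needs a brevity penalty and why summarization work tends to prefer recall-oriented ROUGE.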

Why they still exist

Cheap — compute in milliseconds
Reproducible — same inputs give same outputs
No external model dependency (BERTScore is the exception: it needs a pretrained embedding model)
Familiar across decades of literature

Why they mislead

A great paraphrase may share zero n-grams with the reference and score zero
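A copy-paste with one word changed scores nearly perfect, even when that word reverses the meaning
No sense of factuality: a fluent lie that reuses the reference's wording scores well
Largely blind to ordering and coherence at the document level: shuffled sentences keep most of their n-gram matches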
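
The first two failure cases above can be reproduced in a few lines. This is a sketch under stated assumptions: the sentences are invented for illustration, and `unigram_precision` is a hypothetical helper that uses plain whitespace tokenization rather than any library's tokenizer.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference (clipped)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0

# Invented example sentences, not drawn from any dataset.
reference = "the drug is safe for children under five"
paraphrase = "kids younger than 5 can take this medication safely"
near_copy = "the drug is unsafe for children under five"

print(unigram_precision(paraphrase, reference))  # 0.0
print(unigram_precision(near_copy, reference))   # 0.875
```

The faithful paraphrase shares no words with the reference and scores zero, while the near copy that reverses the meaning matches seven of its eight words.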

When to still use them

1. Tight feedback loops where LLM-as-judge would be too slow
2. Regression tests where you want bit-stable scores
3. Research comparability with older papers
4. As one signal in a larger battery, never as the only one
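
As an illustration of the regression-test use, here is a minimal sketch that scores a small fixed evaluation set with token-level F1 and compares the result to a stored baseline. The example pairs, the `BASELINE` value, and the function names are all hypothetical. Because the metric is deterministic, rerunning it on the same outputs gives exactly the same score.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    hits = sum((Counter(pred) & Counter(ref)).values())
    if hits == 0:
        return 0.0
    p, r = hits / len(pred), hits / len(ref)
    return 2 * p * r / (p + r)

def suite_score(pairs):
    """Mean token F1 over (model output, reference) pairs."""
    return sum(token_f1(pred, ref) for pred, ref in pairs) / len(pairs)

# Hypothetical outputs from a known-good model run, paired with references.
GOLDEN_RUN = [
    ("Paris", "paris"),
    ("new york", "new jersey"),
    ("London", "Paris"),
    ("blue whale", "sperm whale"),
]
BASELINE = 0.5  # hypothetical score recorded from that known-good run

def passes_regression(pairs, baseline=BASELINE):
    """True if the current outputs score at least as well as the baseline."""
    return suite_score(pairs) >= baseline

print(passes_regression(GOLDEN_RUN))                 # True
print(suite_score(GOLDEN_RUN) == suite_score(GOLDEN_RUN))  # True: no run-to-run noise
```

A check like this only flags a drop in overlap with the references; per the last item in the list, it is best paired with other signals.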

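BLEU correlates reasonably with human judgment at the system level, but barely at the sentence level.
A common summary of BLEU's behavior. The metric comes from Papineni et al. (2002), who designed it to score whole test corpora rather than single sentences.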

The big idea: automatic metrics are fast, cheap, and blunt. Use them as seismographs, not scales.
