Before LLM-as-judge evaluation, researchers relied on hand-built metrics. They still matter, and they still mislead.
Before LLM-as-judge, NLP researchers invented clever string-matching metrics to approximate 'correctness.' These metrics still appear in many papers and pipelines. Knowing them, and their weaknesses, is part of AI literacy.
| Metric | Used for | What it measures |
|---|---|---|
| BLEU | Machine translation | N-gram precision against reference translations, with a brevity penalty |
| ROUGE | Summarization | N-gram overlap with reference summaries (recall-oriented) |
| F1 | Classification, QA | Harmonic mean of precision and recall |
| Exact match | Short-answer QA | Whether the answer string matches the reference exactly |
| BERTScore | Any text | Semantic similarity via contextual token embeddings |
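To make the table concrete, here is a minimal sketch of the core ideas behind exact match, BLEU-style precision, ROUGE-style recall, and F1, using only the Python standard library. The function names and the whitespace tokenizer are illustrative assumptions; real BLEU and ROUGE implementations add details such as multiple n-gram orders, a brevity penalty, and stemming that are left out here.

```python
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def ngrams(toks, n):
    """Count every contiguous window of n tokens."""
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokens(prediction) == tokens(reference))


def ngram_precision(prediction, reference, n=1):
    """Share of predicted n-grams that appear in the reference (BLEU's core idea)."""
    pred, ref = ngrams(tokens(prediction), n), ngrams(tokens(reference), n)
    total = sum(pred.values())
    # Counter intersection clips each n-gram to its count in the reference.
    return sum((pred & ref).values()) / total if total else 0.0


def ngram_recall(prediction, reference, n=1):
    """Share of reference n-grams recovered by the prediction (ROUGE-N's core idea)."""
    pred, ref = ngrams(tokens(prediction), n), ngrams(tokens(reference), n)
    total = sum(ref.values())
    return sum((pred & ref).values()) / total if total else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat sat"

p = ngram_precision(prediction, reference)  # every predicted word is in the reference
r = ngram_recall(prediction, reference)     # but only half of the reference is covered
print(exact_match(prediction, reference), p, r, f1(p, r))
```

The truncated prediction gets perfect precision but only half the recall, which is why precision-oriented BLEU needs a brevity penalty and why ROUGE leans on recall for summaries.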
BLEU correlates reasonably with human judgment at the system level, but barely at the sentence level.
(A paraphrase of later evaluation findings. Papineni et al. introduced BLEU in 2002 as a corpus-level metric.)
The big idea: automatic metrics are fast, cheap, and blunt. Use them as seismographs, not scales.
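A small sketch of why these metrics are blunt: word-overlap scores reward shared wording, not shared meaning. The sentences and the `unigram_f1` helper below are made up for illustration and are not taken from any benchmark or library.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Word-overlap F1 between two strings, using whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the drug is safe for children"
wrong = "the drug is not safe for children"        # opposite meaning, nearly identical words
paraphrase = "kids can take this medicine safely"  # same meaning, no shared words

score_wrong = unigram_f1(wrong, reference)
score_paraphrase = unigram_f1(paraphrase, reference)
print(f"wrong answer: {score_wrong:.2f}, paraphrase: {score_paraphrase:.2f}")
```

The answer that reverses the meaning scores far higher than the faithful paraphrase. A shift in an overlap score across a whole test set is a signal worth investigating, but a single sentence-level score says little about whether an answer is correct.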
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-automatic-metrics
What is the main idea of "BLEU, ROUGE, F1 — Automatic Metrics and Their Limits"?
Which concept is most central to "BLEU, ROUGE, F1 — Automatic Metrics and Their Limits"?
Which use of AI fits this topic best?
What should a careful learner remember about "BLEU is a rough guide, not a verdict"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about BLEU be treated?
Name one way to verify an AI answer about BLEU.
Which action would help you apply "BLEU, ROUGE, F1 — Automatic Metrics and Their Limits" responsibly?