Loading lesson…
Before LLMs-as-judges, researchers had hand-made metrics. They still matter — and still mislead.
Before LLM-as-judge, NLP researchers invented clever string-matching metrics to approximate 'correctness.' They still live in many papers and pipelines. Knowing them — and their weaknesses — is part of AI literacy.
| Metric | Used for | What it measures |
|---|---|---|
| BLEU | Machine translation | N-gram overlap with references (precision) |
| ROUGE | Summarization | N-gram overlap (recall-oriented) |
| F1 | Classification, QA | Harmonic mean of precision and recall |
| Exact match | Short-answer QA | Did the answer string match? |
| BERTScore | Any text | Semantic similarity via embeddings |
BLEU correlates reasonably with human judgment at the system level, but barely at the sentence level.
— Papineni et al., BLEU paper (2002)
The big idea: automatic metrics are fast, cheap, and blunt. Use them as seismographs, not scales.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-automatic-metrics
What is the core idea behind "BLEU, ROUGE, F1 — Automatic Metrics and Their Limits"?
Which term best describes a foundational idea in "BLEU, ROUGE, F1 — Automatic Metrics and Their Limits"?
A learner studying BLEU, ROUGE, F1 — Automatic Metrics and Their Limits would need to understand which concept?
Which of these is directly relevant to BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
Which of the following is a key point about BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
Which of these does NOT belong in a discussion of BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
Which statement is accurate regarding BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
Which of these does NOT belong in a discussion of BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
What is the key insight about "BLEU is a rough guide, not a verdict" in the context of BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
What is the recommended tip about "Build your mental model" in the context of BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
Which statement accurately describes an aspect of BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
What does working with BLEU, ROUGE, F1 — Automatic Metrics and Their Limits typically involve?
Which best describes the scope of "BLEU, ROUGE, F1 — Automatic Metrics and Their Limits"?
Which section heading best belongs in a lesson about BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?
Which section heading best belongs in a lesson about BLEU, ROUGE, F1 — Automatic Metrics and Their Limits?