A trend that appears in every subgroup can reverse when you combine the groups. This is Simpson's Paradox, and it hides in plain sight.
Consider a study of two treatments for kidney stones. Treatment A beats Treatment B for small stones. Treatment A also beats Treatment B for large stones. But when you combine all patients, Treatment B looks better overall. This pattern appeared in real medical data.
| Subgroup | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% cured (81/87) | 87% cured (234/270) |
| Large stones | 73% cured (192/263) | 69% cured (55/80) |
| Overall | 78% cured (273/350) | 83% cured (289/350) |
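The reversal can be checked directly from the counts in the table. This is a minimal sketch in plain Python; the dictionary layout and the helper names `rate` and `overall` are illustrative assumptions, not part of the lesson's dataset.

```python
# (cured, treated) counts copied from the table above.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(cured, treated):
    """Cure rate as a fraction between 0 and 1."""
    return cured / treated


def overall(treatment):
    """Pool both stone sizes for one treatment: (total cured, total treated)."""
    cured = sum(c for c, _ in counts[treatment].values())
    treated = sum(n for _, n in counts[treatment].values())
    return cured, treated


# Within each stone size, A has the higher cure rate.
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:>7}: A={a:.1%}  B={b:.1%}")

# Pooled over both sizes, the ordering flips and B looks better.
a_all = rate(*overall("A"))
b_all = rate(*overall("B"))
print(f"overall: A={a_all:.1%}  B={b_all:.1%}")
```

Working from raw counts rather than rounded percentages avoids rounding errors when the subgroups are pooled.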
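Simpson's Paradox happens when a confounding variable, a factor that affects both the input and the outcome, is spread unevenly across the groups being compared. The confounder does not have to be unmeasured; it only has to be left out of the comparison. In the kidney stone case, stone size is the confounder. It affects both treatment choice (doctors pick A for harder cases: 263 of A's 350 patients had large stones, versus 80 of B's 350) and cure rate (large stones are harder to treat). Treatment B's overall rate looks better because it was used mostly on the easier cases, not because it works better.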
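One way to see the mechanism: each treatment's overall cure rate is a weighted average of its subgroup rates, where the weights are that treatment's share of small-stone and large-stone patients. The sketch below uses the counts from the table; the names `data`, `case_mix`, and `weighted_rate` are hypothetical helpers introduced for illustration.

```python
# (cured, treated) per stone size, from the lesson's table.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def case_mix(treatment):
    """Share of this treatment's patients in each stone-size group."""
    total = sum(treated for _, treated in data[treatment].values())
    return {size: treated / total for size, (_, treated) in data[treatment].items()}


def weighted_rate(treatment):
    """Overall cure rate rebuilt as sum(share of group * cure rate in group)."""
    mix = case_mix(treatment)
    return sum(
        mix[size] * (cured / treated)
        for size, (cured, treated) in data[treatment].items()
    )


for treatment in ("A", "B"):
    mix = case_mix(treatment)
    print(
        f"Treatment {treatment}: "
        f"{mix['small']:.0%} small, {mix['large']:.0%} large, "
        f"overall cure rate {weighted_rate(treatment):.1%}"
    )
```

Treatment A's average is weighted toward the hard large-stone group and Treatment B's toward the easy small-stone group, which is enough to flip the overall ordering even though A wins inside each group.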

    import pandas as pd

    df = pd.read_csv('treatment_data.csv')

    # Overall rates (deceptive)
    print(df.groupby('treatment')['cured'].mean())

    # Disaggregated by severity (honest)
    print(df.groupby(['severity', 'treatment'])['cured'].mean())

    # Same analysis as a pivot table
    print(pd.pivot_table(df, index='treatment', columns='severity',
                         values='cured', aggfunc='mean', margins=True))

Always check disaggregated rates

The big idea: the total is not always the truth. Always slice your data by relevant subgroups before drawing conclusions. Aggregation can reverse reality.
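To try the groupby comparison without the CSV file, whose contents are not shown in this lesson, one option is to rebuild patient-level rows from the counts in the table. This is a sketch under assumptions: the column names `treatment`, `severity`, and `cured` mirror the snippet above, `cured` is coded as 0/1, and the final reversal check is an illustrative addition rather than part of the original analysis.

```python
import pandas as pd

# (treatment, severity, cured, treated) taken from the lesson's table.
cells = [
    ("A", "small", 81, 87),
    ("A", "large", 192, 263),
    ("B", "small", 234, 270),
    ("B", "large", 55, 80),
]

# Expand each cell into one row per patient, with cured coded as 1 or 0.
rows = []
for treatment, severity, cured, treated in cells:
    rows += [{"treatment": treatment, "severity": severity, "cured": 1}] * cured
    rows += [{"treatment": treatment, "severity": severity, "cured": 0}] * (treated - cured)
df = pd.DataFrame(rows)

# Aggregate view and disaggregated view of the same data.
overall = df.groupby("treatment")["cured"].mean()
by_severity = (
    df.groupby(["severity", "treatment"])["cured"].mean().unstack("treatment")
)

# Compare the sign of the A-minus-B gap in each subgroup with the overall gap.
overall_gap = overall["A"] - overall["B"]
subgroup_gaps = by_severity["A"] - by_severity["B"]
reversed_everywhere = bool(((subgroup_gaps > 0) != (overall_gap > 0)).all())

print(by_severity.round(3))
print(overall.round(3))
print("Every subgroup disagrees with the aggregate:", reversed_everywhere)
```

A check like this only flags the reversal. Deciding which view to trust still depends on knowing why the groups differ, as with stone size here.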
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-simpsons-paradox
What is the main idea of "Simpson's Paradox: When Aggregated Data Lies"?
Which concept is most central to "Simpson's Paradox: When Aggregated Data Lies"?
Which use of AI fits this topic best?
What should a careful learner remember about "How can this happen?"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about Simpson's paradox be treated?
Name one way to verify an AI answer about Simpson's paradox.
Which action would help you apply "Simpson's Paradox: When Aggregated Data Lies" responsibly?