A trend that appears in every subgroup can reverse when you combine the groups. This is Simpson's Paradox, and it hides in plain sight.
Imagine a study of two treatments for kidney stones. Treatment A beats Treatment B for small stones. Treatment A also beats Treatment B for large stones. But when you combine all patients, Treatment B looks better overall. This pattern showed up in real medical data, and it is a textbook case of Simpson's Paradox.
| Subgroup | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% cured (81/87) | 87% cured (234/270) |
| Large stones | 73% cured (192/263) | 69% cured (55/80) |
| Overall | 78% cured (273/350) | 83% cured (289/350) |
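The table's percentages can be recomputed from its raw counts, which makes the reversal easy to verify. The sketch below is a minimal illustration: the counts come from the table, while the DataFrame layout and column names (`stone_size`, `total`, and so on) are assumptions chosen for this example.

```python
import pandas as pd

# Counts copied from the table above: cured patients and total patients
# for each stone size and treatment. Column names are illustrative.
counts = pd.DataFrame(
    [
        ("small", "A", 81, 87),
        ("small", "B", 234, 270),
        ("large", "A", 192, 263),
        ("large", "B", 55, 80),
    ],
    columns=["stone_size", "treatment", "cured", "total"],
)

# Cure rate within each subgroup: A is ahead for both stone sizes.
by_size = counts.assign(rate=counts["cured"] / counts["total"])
subgroup_rates = by_size.pivot(index="stone_size", columns="treatment", values="rate")
print(subgroup_rates.round(2))

# Pool the counts across stone sizes, then divide: now B is ahead.
pooled = counts.groupby("treatment")[["cured", "total"]].sum()
overall_rates = pooled["cured"] / pooled["total"]
print(overall_rates.round(2))
```

Note that the overall rate comes from pooled counts (273/350 and 289/350), not from averaging the two subgroup percentages.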
Simpson's Paradox happens when there is a confounding variable: a factor, often left out of the aggregate comparison, that affects both the input and the outcome. In the kidney stone case, stone size is the confounder. It affects treatment choice (doctors picked A for harder cases, so 263 of A's 350 patients had large stones, against 80 of B's 350) and cure rate (large stones are harder to treat).
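One way to see the mechanism is to treat each treatment's overall cure rate as a weighted average of its subgroup rates, where the weights are that treatment's own mix of stone sizes. The following is a rough sketch using the counts from the table; the dictionary layout and variable names are made up for illustration.

```python
# (cured, total) counts from the kidney stone table.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

overall = {}
large_share = {}
for treatment, groups in counts.items():
    n_patients = sum(total for _, total in groups.values())
    # Share of this treatment's patients who had large (harder) stones.
    large_share[treatment] = groups["large"][1] / n_patients
    # Weighted average: subgroup cure rate times subgroup share, summed.
    overall[treatment] = sum(
        (cured / total) * (total / n_patients) for cured, total in groups.values()
    )
    print(f"{treatment}: large-stone share {large_share[treatment]:.0%}, "
          f"overall cure rate {overall[treatment]:.0%}")
```

About three quarters of A's patients had large stones, compared with under a quarter of B's, so A's overall rate is pulled toward its lower large-stone rate even though A wins inside each subgroup.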
```python
import pandas as pd

# Assumes one row per patient with 'treatment', 'severity', and a 0/1 'cured' column
df = pd.read_csv('treatment_data.csv')

# Overall rates (deceptive)
print(df.groupby('treatment')['cured'].mean())

# Disaggregated by severity (honest)
print(df.groupby(['severity', 'treatment'])['cured'].mean())

# Same analysis as a pivot table; margins=True adds an 'All' row and column with the pooled rates
print(pd.pivot_table(df,
                     index='treatment',
                     columns='severity',
                     values='cured',
                     aggfunc='mean',
                     margins=True))
```

Always check disaggregated rates

The big idea: the total is not always the truth. Always slice your data by relevant subgroups before drawing conclusions. Aggregation can reverse the apparent direction of an effect.
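The habit of checking disaggregated rates can be wrapped in a small helper that compares the overall winner with the winner inside each subgroup. This is only a sketch: `reversal_check` is a hypothetical function written for this example, the patient-level rows are expanded from the kidney stone table's counts, and mapping stone size onto a `severity` column is an assumption made to match the column names used earlier.

```python
import pandas as pd

def reversal_check(df, treatment="treatment", subgroup="severity", outcome="cured"):
    """Return (overall_winner, subgroup_winners) by mean outcome.

    Hypothetical helper for illustration; ties are not handled.
    """
    overall_winner = df.groupby(treatment)[outcome].mean().idxmax()
    rates = df.groupby([subgroup, treatment])[outcome].mean().unstack(treatment)
    subgroup_winners = rates.idxmax(axis=1)
    return overall_winner, subgroup_winners

# Patient-level rows expanded from the table's counts (1 = cured, 0 = not cured).
table = [("small", "A", 81, 87), ("small", "B", 234, 270),
         ("large", "A", 192, 263), ("large", "B", 55, 80)]
rows = []
for severity, treatment, cured, total in table:
    rows += [{"severity": severity, "treatment": treatment, "cured": 1}] * cured
    rows += [{"severity": severity, "treatment": treatment, "cured": 0}] * (total - cured)
df = pd.DataFrame(rows)

overall_winner, subgroup_winners = reversal_check(df)
# True when every subgroup prefers a different treatment than the total does.
paradox = bool((subgroup_winners != overall_winner).all())
print("overall winner:", overall_winner)
print(subgroup_winners)
print("every subgroup disagrees with the total:", paradox)
```

A check like this only flags a disagreement. Deciding which view to trust still depends on whether the subgroup variable is a genuine confounder, as stone size is here.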
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-simpsons-paradox
What is the core idea behind "Simpson's Paradox: When Aggregated Data Lies"?
Which term best describes a foundational idea in "Simpson's Paradox: When Aggregated Data Lies"?
A learner studying Simpson's Paradox: When Aggregated Data Lies would need to understand which concept?
Which of these is directly relevant to Simpson's Paradox: When Aggregated Data Lies?
Which of the following is a key point about Simpson's Paradox: When Aggregated Data Lies?
Which of these does NOT belong in a discussion of Simpson's Paradox: When Aggregated Data Lies?
What is the key insight about "How can this happen?" in the context of Simpson's Paradox: When Aggregated Data Lies?
What is the key insight about "The cure: stratify" in the context of Simpson's Paradox: When Aggregated Data Lies?
Which statement accurately describes an aspect of Simpson's Paradox: When Aggregated Data Lies?
What does working with Simpson's Paradox: When Aggregated Data Lies typically involve?
Which of the following is true about Simpson's Paradox: When Aggregated Data Lies?
Which best describes the scope of "Simpson's Paradox: When Aggregated Data Lies"?
Which section heading best belongs in a lesson about Simpson's Paradox: When Aggregated Data Lies?
Which section heading best belongs in a lesson about Simpson's Paradox: When Aggregated Data Lies?
Which section heading best belongs in a lesson about Simpson's Paradox: When Aggregated Data Lies?