Simpson's Paradox: When Aggregated Data Lies

A trend that appears in every subgroup can reverse when you combine the groups. This is Simpson's Paradox, and it hides in plain sight.

Creators · AI Foundations · ~18 min read

A Famous Medical Case

Imagine a study of two treatments for kidney stones. Treatment A beats Treatment B for small stones. Treatment A also beats Treatment B for large stones. But when you combine all patients, Treatment B looks better overall. This actually happened in real medical data. It is Simpson's Paradox.

The numbers behind the reversal

Subgroup	Treatment A	Treatment B
Small stones	93% cured (81/87)	87% cured (234/270)
Large stones	73% cured (192/263)	69% cured (55/80)
Overall	78% cured (273/350)	83% cured (289/350)
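
The reversal in the table can be checked with a few lines of arithmetic. The sketch below uses only the counts shown above; the dictionary layout and the helper names `rate` and `overall` are illustrative choices, not part of the original study.

```python
# Counts copied from the table above: (cured, treated) per treatment and stone size.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(cured, total):
    """Cure rate as a fraction between 0 and 1."""
    return cured / total

def overall(treatment):
    """Pool both stone sizes for one treatment and return (cured, treated)."""
    cured = sum(c for c, _ in counts[treatment].values())
    total = sum(t for _, t in counts[treatment].values())
    return cured, total

# Within each stone size, compare A against B.
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:>7}: A={a:.1%}  B={b:.1%}  A better: {a > b}")

# Pooled over both sizes, the comparison flips.
a_all = rate(*overall("A"))
b_all = rate(*overall("B"))
print(f"overall: A={a_all:.1%}  B={b_all:.1%}  A better: {a_all > b_all}")
```

Treatment A comes out ahead in both subgroup rows and behind in the pooled row, which is exactly the pattern the table shows.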

Where Simpson's Paradox appears

Berkeley admissions 1973: the overall acceptance rate was lower for women, yet women had higher rates in most of the largest departments (women applied disproportionately to more competitive departments)
COVID-19 case fatality: overall rates can flip when you stratify by age
A/B test results where a minority group reverses the majority trend
School rankings: combined scores can mislead when student populations differ

The confounder concept

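Simpson's Paradox arises when a confounding variable, a factor that affects both the input and the outcome, is left out of the comparison. In the kidney stone case, stone size is the confounder. It affects both treatment choice (doctors pick A for harder cases) and cure rate (large stones are harder to treat). Because most Treatment A patients had large stones (263 of 350) while most Treatment B patients had small ones (270 of 350), the pooled rates largely compare hard cases against easy ones.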
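
One way to see the confounder at work is to rebuild each overall rate as a weighted average of the subgroup rates, where the weights are the share of patients in each stone-size group. The following is a minimal sketch using the table's counts; `mix_and_rates` and `weighted_overall` are hypothetical helper names introduced for illustration.

```python
# Counts from the kidney stone table: (cured, treated).
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def mix_and_rates(treatment):
    """Return {size: (share of this treatment's patients, cure rate)}."""
    total = sum(t for _, t in counts[treatment].values())
    return {size: (t / total, c / t) for size, (c, t) in counts[treatment].items()}

def weighted_overall(treatment):
    """Overall cure rate rebuilt as the sum of share * subgroup rate."""
    return sum(share * r for share, r in mix_and_rates(treatment).values())

for treatment in ("A", "B"):
    large_share = mix_and_rates(treatment)["large"][0]
    print(f"{treatment}: {large_share:.0%} large stones, "
          f"overall {weighted_overall(treatment):.1%}")
```

Treatment A's overall rate is dominated by its large-stone rate, and Treatment B's by its small-stone rate, so the pooled comparison reflects the case mix more than the treatments themselves.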

Always check disaggregated rates

```python
import pandas as pd

df = pd.read_csv('treatment_data.csv')

# Overall rates (deceptive)
print(df.groupby('treatment')['cured'].mean())

# Disaggregated by severity (honest)
print(df.groupby(['severity', 'treatment'])['cured'].mean())

# Same analysis as a pivot table
print(pd.pivot_table(df, index='treatment', columns='severity',
                     values='cured', aggfunc='mean', margins=True))
```
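
The pandas snippet above reads `treatment_data.csv`, which the lesson does not supply. As a sketch, the file could be generated from the kidney stone table with the standard library. The layout is an assumption: one row per patient, with `cured` stored as 1 or 0 and `severity` taking the values `small` and `large`. Only the column names come from the snippet itself.

```python
import csv

# (cured, treated) counts from the kidney stone table.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def build_rows():
    """Expand the counts into one dict per patient (assumed layout)."""
    rows = []
    for (treatment, severity), (cured, total) in counts.items():
        rows += [{"treatment": treatment, "severity": severity, "cured": 1}] * cured
        rows += [{"treatment": treatment, "severity": severity, "cured": 0}] * (total - cured)
    return rows

def write_csv(path="treatment_data.csv"):
    """Write the per-patient rows to the path the pandas snippet reads."""
    rows = build_rows()
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["treatment", "severity", "cured"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

if __name__ == "__main__":
    print(write_csv(), "rows written")
```

With `cured` stored as 1 or 0, the `mean()` calls in the pandas snippet return cure rates directly.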

The big idea: the total is not always the truth. Always slice your data by relevant subgroups before drawing conclusions. Aggregation can reverse the apparent direction of an effect.