Lesson 297 of 2116
Outliers: Keep Them, Remove Them, or Investigate?
A single weird value can distort your entire analysis. But outliers are also where the most interesting stories live. Knowing when to remove them is an art.
Section 1
The Value That Does Not Fit
You are analyzing student test scores. 98 percent of the scores fall between 60 and 100, and one score is 3,400,000. That is obviously a data-entry error: correct it if the true value can be recovered, and remove it otherwise. But sometimes outliers are the real story: the one billionaire in an income dataset, the one fraudulent transaction, the one patient whose recovery changed medicine.
Three kinds of outliers
| Type | Example | What to do |
|---|---|---|
| Error | Height of 2,340 cm | Remove or fix |
| Extreme but valid | CEO earning $500M | Keep, but note it |
| Anomaly of interest | Fraudulent transaction | That IS the signal |
Detection methods
- IQR rule: flag anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
- Z-score: flag anything with |z| > 3
- Isolation Forest: ML-based anomaly detection, works in high dimensions
- Domain knowledge: the best detector, often ignored
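As a rough illustration of the z-score rule in the list above, here is a minimal sketch using NumPy. The function name `zscore_outliers` and the sample scores are made up for this example; the threshold of 3 comes from the rule itself.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values with |z| > threshold."""
    x = np.asarray(values, dtype=float)
    std = x.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so nothing can be flagged.
        return np.zeros(x.shape, dtype=bool)
    z = (x - x.mean()) / std
    return np.abs(z) > threshold

# Hypothetical test scores with one data-entry error at the end.
scores = np.array([72, 85, 91, 64, 78, 88, 95, 70, 81, 76,
                   83, 67, 90, 74, 86, 79, 93, 69, 82, 3_400_000])
mask = zscore_outliers(scores)
flagged = scores[mask]
print(flagged)
```

One caveat worth knowing: the mean and standard deviation are themselves pulled toward the outlier. With the population standard deviation used here, no single value can have |z| larger than sqrt(n - 1), so in a sample of 10 or fewer points the |z| > 3 rule never flags anything. On small samples the IQR rule or domain knowledge is usually a safer first check.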
IQR-based outlier detection
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
low = Q1 - 1.5 * IQR
high = Q3 + 1.5 * IQR
outliers = df[(df['value'] < low) | (df['value'] > high)]
print(f'Found {len(outliers)} outliers out of {len(df)} rows')
# Inspect before removing
print(outliers.head(20))
Decision framework
1. Plot your data first; outliers often jump out visually
2. Check whether the outlier is likely an error (units, typos)
3. If it is an error, either fix or remove it and document why
4. If it is real, ask whether your analysis goal can tolerate extreme values
5. Consider using robust statistics (median, MAD) instead of removing
6. Report sensitivity: how much do results change with/without outliers?
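The last step, reporting sensitivity, can be sketched as follows. This is one possible approach rather than a standard recipe: the helper names `iqr_mask` and `sensitivity_report` and the income figures are invented for illustration, and the flagging uses the same 1.5 * IQR rule described earlier.

```python
import numpy as np

def iqr_mask(x, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def sensitivity_report(values):
    """Compare summary statistics with and without IQR-flagged values."""
    x = np.asarray(values, dtype=float)
    kept = x[~iqr_mask(x)]
    return {
        "n_all": int(x.size),
        "n_flagged": int(x.size - kept.size),
        "mean_all": float(x.mean()),
        "mean_without": float(kept.mean()),
        "median_all": float(np.median(x)),
        "median_without": float(np.median(kept)),
    }

# Hypothetical incomes in thousands; the last one is extreme but valid.
incomes = [42, 48, 51, 55, 58, 61, 64, 70, 75, 5000]
report = sensitivity_report(incomes)
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

In this made-up dataset the mean drops from about 552 to about 58 when the flagged value is excluded, while the median barely moves (59.5 to 58). A gap that large is a sign the conclusion depends heavily on one row and should be reported both ways.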
Robust alternatives
Instead of removing outliers, use statistics that are less sensitive to them. The median is more robust than the mean. Median Absolute Deviation (MAD) is more robust than standard deviation. Robust regression methods, such as regression with Huber loss, can keep outliers in the data while limiting how much they pull the fit.
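To make the contrast concrete, here is a small sketch comparing the mean and standard deviation with the median and MAD on the same data with and without one bad value. The `mad` helper and both sample lists are assumptions made for this example; the MAD here is the plain, unscaled version, median(|x - median(x)|).

```python
import numpy as np

def mad(values):
    """Median absolute deviation: median(|x - median(x)|), unscaled."""
    x = np.asarray(values, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

# Hypothetical scores; 'dirty' swaps the top score for a data-entry error.
clean = [70, 75, 80, 85, 90]
dirty = [70, 75, 80, 85, 3_400_000]

summary = {}
for name, data in [("clean", clean), ("dirty", dirty)]:
    summary[name] = {
        "mean": float(np.mean(data)),
        "std": float(np.std(data)),
        "median": float(np.median(data)),
        "mad": mad(data),
    }
    print(name, summary[name])
```

The median stays at 80 and the MAD stays at 5 in both lists, while the mean jumps from 80 to 680,062. Note that some libraries multiply the MAD by a constant (about 1.4826) so it is comparable to a standard deviation under normality; this sketch leaves it unscaled.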
The big idea: outliers are questions, not answers. Investigate each one and decide deliberately, rather than scrubbing them reflexively. Sometimes the anomaly is the discovery.
