Lesson 210 of 1570
Missing Data and How to Spot It
Real datasets have holes: blank cells, NaN, NULL, sentinel values like -999, and the dreaded empty string. Learning to see them is a core skill.
Section 1
Data Has Holes
In a perfect world, every row would have every column filled in. In reality, datasets are full of gaps. A survey respondent skipped a question. A sensor cut out for three seconds. A database migration dropped a field. All of this creates missing data.
Three flavors of missingness
- MCAR (Missing Completely At Random): a sensor glitched. The gap has nothing to do with the value or with any other column.
- MAR (Missing At Random): men are less likely to answer a health survey question. The missingness depends on another observed column (gender), but once you account for that column it does not depend on the answer itself.
- MNAR (Missing Not At Random): people with very high incomes refuse to report their income. The value itself drives the missingness. This is the dangerous one, because the observed data alone cannot reveal or correct the bias.
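To make the three flavors concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the synthetic population, the `is_male` and `income` variables, and the missingness rates are invented, not taken from any real survey. It compares the mean of the values that remain observed against the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: a binary group column and an income column
# that is somewhat higher, on average, for one group.
is_male = rng.random(n) < 0.5
income = rng.lognormal(mean=10.5, sigma=0.6, size=n) * np.where(is_male, 1.3, 1.0)

# MCAR: every value has the same 30% chance of going missing.
mcar_missing = rng.random(n) < 0.3

# MAR: the chance of going missing depends only on the observed group column.
mar_missing = rng.random(n) < np.where(is_male, 0.5, 0.1)

# MNAR: the chance of going missing depends on the income value itself.
high_income = income > np.quantile(income, 0.8)
mnar_missing = rng.random(n) < np.where(high_income, 0.7, 0.1)

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()
mar_mean = income[~mar_missing].mean()
mnar_mean = income[~mnar_missing].mean()

# Under MAR, conditioning on the observed column removes the bias:
# the mean among observed men is close to the mean among all men.
true_male_mean = income[is_male].mean()
mar_male_mean = income[is_male & ~mar_missing].mean()

print(f"true mean:          {true_mean:,.0f}")
print(f"MCAR observed mean: {mcar_mean:,.0f}")
print(f"MAR observed mean:  {mar_mean:,.0f}")
print(f"MNAR observed mean: {mnar_mean:,.0f}")
print(f"men, true vs MAR-observed: {true_male_mean:,.0f} vs {mar_male_mean:,.0f}")
```

In this setup the MCAR mean should land close to the truth, the MAR mean drifts because one group is under-represented (but the per-group means stay accurate), and the MNAR mean is pulled down with no observed column available to correct it.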
Common ways to handle missing data
1. Drop rows with missing values (simple but throws away data)
2. Fill with the mean or median of the column (imputation)
3. Fill with a predicted value from other columns
4. Flag missingness as its own feature (is_missing = true)
5. Leave it for the model to handle (some models tolerate NaN)
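The first four options above can be sketched side by side on a toy table. The data and the column names (`region`, `income`) are hypothetical, and the per-region median stands in for a "predicted value from other columns"; a real project might use a regression or a dedicated imputer instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "income": [40.0, np.nan, 44.0, 70.0, 74.0, np.nan],
})

# 1. Drop rows that contain any missing value.
dropped = df.dropna()

# 2. Fill with the overall column median.
median_filled = df["income"].fillna(df["income"].median())

# 3. Fill with a value predicted from another column.
#    Here the "prediction" is simply the median income of the same region.
region_median = df.groupby("region")["income"].transform("median")
group_filled = df["income"].fillna(region_median)

# 4. Record which rows were missing before any filling happens.
flagged = df.assign(income_was_missing=df["income"].isna())

print(dropped)
print(median_filled.tolist())
print(group_filled.tolist())
print(flagged)
```

The overall median ignores the regional split and fills both gaps with the same number, while the per-region fill keeps each gap close to its neighbors. Which is preferable depends on why the values are missing.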
Detecting and handling missing data in pandas
import pandas as pd
import numpy as np
df = pd.read_csv('survey.csv', na_values=['-999', 'N/A', 'unknown'])
# How much is missing in each column?
print(df.isna().sum())
print(df.isna().mean()) # as a fraction
# Fill age with the median
df['age'] = df['age'].fillna(df['age'].median())
# Flag missingness before filling
df['income_was_missing'] = df['income'].isna()
df['income'] = df['income'].fillna(df['income'].median())
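`na_values` only helps at load time. When a DataFrame already exists in memory, sentinel values and empty strings are not counted by `isna()` until they are converted. The sketch below assumes a hypothetical table in which `-999` marks a missing age and blank or whitespace-only strings mark a missing city; those conventions are invented here and should be checked against the documentation of any real dataset.

```python
import pandas as pd

# Hypothetical raw data with disguised missing values.
raw = pd.DataFrame({
    "age": [34, -999, 51, 28],
    "city": ["Lagos", "", "  ", "Oslo"],
})

# Nothing looks missing yet: sentinels and blank strings are ordinary values.
print(raw.isna().sum())

cleaned = raw.copy()

# Turn the numeric sentinel into NaN (the column becomes float).
cleaned["age"] = cleaned["age"].mask(cleaned["age"] == -999)

# Turn empty and whitespace-only strings into NaN.
cleaned["city"] = cleaned["city"].mask(cleaned["city"].str.strip() == "")

# Now the gaps are visible to isna().
print(cleaned.isna().sum())
```

Before converting a sentinel, it is worth confirming that the value cannot occur legitimately in that column.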
The big idea: missing data is not just absence; it is often information. Treat every gap as a question, not an error.
