Lesson 213 of 1570
Data Cleaning: The Unglamorous 80 Percent
Surveys consistently find data scientists spend 60 to 80 percent of their time cleaning data. Here is what that actually looks like.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. The Glamorous Picture vs. Reality
- 2. Data cleaning
- 3. Preprocessing
- 4. ETL
Section 1
The Glamorous Picture vs. Reality
Social media shows data scientists training models and making slick dashboards. Reality shows them wrestling with encoding errors, removing duplicate rows, fixing dates that are stored five different ways, and hunting down why column Z has three million NaNs.
What cleaning actually involves
- 1. Fixing encoding (UTF-8 vs Latin-1 vs Windows-1252)
- 2. Standardizing date formats (2026-04-23 vs 4/23/26 vs April 23rd)
- 3. Normalizing text (lowercase, strip whitespace, unicode normalization)
- 4. Removing duplicate rows (exact and near-duplicates)
- 5. Handling missing values intentionally
- 6. Detecting and removing outliers
- 7. Joining data from multiple tables
- 8. Reconciling different vocabularies (USA vs. US vs. United States)
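Two of the steps above, handling missing values and removing outliers, can be sketched in a few lines of pandas. The column names, sample values, and the 1.5×IQR threshold here are illustrative assumptions, not part of the lesson:

```python
import pandas as pd

# Hypothetical sales data with a gap and one wild value
df = pd.DataFrame({
    'region': ['north', 'south', None, 'east'],
    'revenue': [120.0, 95.0, 110.0, 9_000_000.0],
})

# Handle missing values intentionally: here we label rather than drop,
# so downstream code can see that the region was genuinely unknown
df['region'] = df['region'].fillna('unknown')

# Flag outliers with the common IQR rule: anything more than
# 1.5 * IQR beyond the quartiles is treated as an outlier
q1, q3 = df['revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['revenue'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
```

Whether to label, impute, or drop missing values, and whether the IQR rule is appropriate at all, depends on the dataset; the point is that each choice should be deliberate, not a side effect of a default.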
A concrete pandas example
A small but real cleaning pipeline
import pandas as pd
df = pd.read_csv('raw.csv', encoding='utf-8')
# Standardize text columns
df['country'] = df['country'].str.strip().str.lower()
df['country'] = df['country'].replace({
    'usa': 'united states',
    'us': 'united states',
    'u.s.a.': 'united states'
})
# Parse dates
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
# Remove duplicate rows
before = len(df)
df = df.drop_duplicates()
print(f'Dropped {before - len(df)} duplicates')
# Save cleaned version
df.to_parquet('clean.parquet')
For LLM training specifically
- Strip HTML and boilerplate (nav bars, cookie banners, ads)
- Remove adult content and hate speech
- Filter out pages that are mostly machine-generated spam
- Remove near-duplicate pages (MinHash and LSH are standard)
- Balance languages, domains, and topics
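To see why MinHash works for near-duplicate removal, here is a tiny self-contained sketch. Production pipelines use tuned libraries (datasketch is a common choice) plus LSH for scale; the shingle size and permutation count below are arbitrary illustrative values:

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of a whitespace-normalized, lowercased string."""
    text = ' '.join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(sh, num_perm=64):
    """Simulate num_perm hash permutations by salting with the seed;
    keep the minimum hash per permutation as the signature slot."""
    return [
        min(int(hashlib.md5(f'{seed}:{s}'.encode()).hexdigest(), 16)
            for s in sh)
        for seed in range(num_perm)
    ]

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles('The quick brown fox jumps over the lazy dog'))
b = minhash(shingles('The quick brown fox jumps over the lazy dog!'))
c = minhash(shingles('An entirely different sentence about data'))
# Near-duplicates score high; unrelated text scores low
```

The trick is that comparing two short fixed-size signatures approximates comparing the full shingle sets, which is what makes deduplicating billions of pages tractable.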
The big idea: the unglamorous work is where the quality lives. The best models are not just the biggest; they are the ones whose data was cleaned most carefully.
