Data Cleaning: The Unglamorous 80 Percent
Surveys consistently find data scientists spend 60 to 80 percent of their time cleaning data. Here is what that actually looks like.
Builders · AI Foundations · ~18 min read
The Glamorous Picture vs. Reality
Social media shows data scientists training models and making slick dashboards. Reality shows them wrestling with encoding errors, removing duplicate rows, fixing dates that are stored five different ways, and hunting down why column Z has three million NaNs.
What cleaning actually involves
- Fixing encoding (UTF-8 vs Latin-1 vs Windows-1252)
- Standardizing date formats (2026-04-23 vs 4/23/26 vs April 23rd)
- Normalizing text (lowercase, strip whitespace, unicode normalization)
- Removing duplicate rows (exact and near-duplicates)
- Handling missing values intentionally
- Detecting and removing outliers
- Joining data from multiple tables
- Reconciling different vocabularies (USA vs. US vs. United States)
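The first three items in the list (encoding, date formats, text normalization) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names `decode_bytes`, `normalize_text`, and `parse_date`, the order of candidate encodings, and the list of date formats are all hypothetical choices made for illustration.

```python
import unicodedata
from datetime import date, datetime
from typing import Optional

# Assumed order: UTF-8 first because it rejects invalid bytes, then
# Windows-1252, then Latin-1 last because it accepts any byte sequence.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")

# Assumed date layouts; a real dataset needs its own list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y")


def decode_bytes(raw: bytes) -> str:
    """Decode bytes by trying each candidate encoding in order."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not reached while latin-1 is in the list, since it accepts any byte.
    raise ValueError("no candidate encoding could decode the input")


def normalize_text(value: str) -> str:
    """Apply NFKC normalization, collapse whitespace, and lowercase."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.split()).lower()


def parse_date(value: str) -> Optional[date]:
    """Return a date for the first matching format, or None if none match."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    print(decode_bytes(b"caf\xe9"))
    print(normalize_text("  United\u00a0States "))
    print(parse_date("4/23/26"))
    print(parse_date("April 23rd"))
```

Reading `4/23/26` as month-first is itself an assumption; a value such as `4/5/26` is ambiguous without knowing the source. `April 23rd` comes back as `None` because it carries no year, which leaves the decision about what to do with it to the caller instead of guessing.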
A concrete pandas example
A small but real cleaning pipeline
    import pandas as pd

    df = pd.read_csv('raw.csv', encoding='utf-8')

    # Standardize text columns
    df['country'] = df['country'].str.strip().str.lower()
    df['country'] = df['country'].replace({
        'usa': 'united states',
        'us': 'united states',
        'u.s.a.': 'united states'
    })

    # Parse dates
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')

    # Remove duplicate rows
    before = len(df)
    df = df.drop_duplicates()
    print(f'Dropped {before - len(df)} duplicates')

    # Save cleaned version
    df.to_parquet('clean.parquet')

For LLM training specifically
- Strip HTML and boilerplate (nav bars, cookie banners, ads)
- Remove adult content and hate speech
- Filter out pages that are mostly machine-generated spam
- Remove near-duplicate pages (MinHash and LSH are standard)
- Balance languages, domains, and topics
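The near-duplicate step in the list above names MinHash, which can be sketched in a few lines of standard-library Python. This is a simplified illustration, not a production pipeline: the function names, the 3-word shingle size, the 64-hash signature length, and the use of salted `hashlib.blake2b` as the hash family are all assumptions made for this example. The LSH bucketing step is omitted, so signatures here are compared pairwise.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 64    # signature length (illustrative choice)
SHINGLE_SIZE = 3   # words per shingle (illustrative choice)


def shingles(text: str, size: int = SHINGLE_SIZE) -> Set[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def seeded_hash(shingle: str, seed: int) -> int:
    """Hash a shingle to a 64-bit integer, using the seed as a salt."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """Keep the minimum hash value per seed across all shingles."""
    items = shingles(text)
    if not items:
        return []
    return [min(seeded_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    if not sig_a or len(sig_a) != len(sig_b):
        return 0.0
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog near the river bank today"
    page_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
    sig_a = minhash_signature(page_a)
    sig_b = minhash_signature(page_b)
    print(estimated_jaccard(sig_a, sig_b))
```

Pages whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold is a tuning decision that depends on the corpus. At web scale, comparing every pair is too slow, which is why LSH is normally used to group signatures into candidate buckets first.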
The big idea: the unglamorous work is where the quality lives. The best models are not just the biggest; they are the ones whose data was cleaned most carefully.
