Lesson 213 of 1570
Data Cleaning: The Unglamorous 80 Percent
Surveys consistently find data scientists spend 60 to 80 percent of their time cleaning data. Here is what that actually looks like.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. The Glamorous Picture vs. Reality
- 2. Data cleaning
- 3. Preprocessing
- 4. ETL
Section 1
The Glamorous Picture vs. Reality
Social media shows data scientists training models and making slick dashboards. Reality shows them wrestling with encoding errors, removing duplicate rows, fixing dates that are stored five different ways, and hunting down why column Z has three million NaNs.
What cleaning actually involves
- 1. Fixing encoding (UTF-8 vs Latin-1 vs Windows-1252)
- 2. Standardizing date formats (2026-04-23 vs 4/23/26 vs April 23rd)
- 3. Normalizing text (lowercase, strip whitespace, unicode normalization)
- 4. Removing duplicate rows (exact and near-duplicates)
- 5. Handling missing values intentionally
- 6. Detecting and removing outliers
- 7. Joining data from multiple tables
- 8. Reconciling different vocabularies (USA vs. US vs. United States)
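Two of the steps above, handling missing values and removing outliers, can be sketched in a few lines of pandas. The column names, sample values, and the 1.5×IQR threshold here are illustrative assumptions, not part of the lesson:

```python
import pandas as pd

# Hypothetical sales data with a gap and one wild value
df = pd.DataFrame({
    'region': ['north', 'south', None, 'east'],
    'revenue': [120.0, 95.0, 110.0, 9_000_000.0],
})

# Handle missing values intentionally: here we label rather than drop,
# so downstream code can see that the region was genuinely unknown
df['region'] = df['region'].fillna('unknown')

# Flag outliers with the common IQR rule: anything more than
# 1.5 * IQR beyond the quartiles is treated as an outlier
q1, q3 = df['revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['revenue'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
```

Whether to label, impute, or drop missing values, and whether the IQR rule is appropriate at all, depends on the dataset; the point is that each choice should be deliberate, not a side effect of a default.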
A concrete pandas example
A small but real cleaning pipeline
import pandas as pd
df = pd.read_csv('raw.csv', encoding='utf-8')
# Standardize text columns
df['country'] = df['country'].str.strip().str.lower()
df['country'] = df['country'].replace({
    'usa': 'united states',
    'us': 'united states',
    'u.s.a.': 'united states'
})
# Parse dates
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
# Remove duplicate rows
before = len(df)
df = df.drop_duplicates()
print(f'Dropped {before - len(df)} duplicates')
# Save cleaned version
df.to_parquet('clean.parquet')
For LLM training specifically
- Strip HTML and boilerplate (nav bars, cookie banners, ads)
- Remove adult content and hate speech
- Filter out pages that are mostly machine-generated spam
- Remove near-duplicate pages (MinHash and LSH are standard)
- Balance languages, domains, and topics
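To see why MinHash works for near-duplicate removal, here is a tiny self-contained sketch. Production pipelines use tuned libraries (datasketch is a common choice) plus LSH for scale; the shingle size and permutation count below are arbitrary illustrative values:

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of a whitespace-normalized, lowercased string."""
    text = ' '.join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(sh, num_perm=64):
    """Simulate num_perm hash permutations by salting with the seed;
    keep the minimum hash per permutation as the signature slot."""
    return [
        min(int(hashlib.md5(f'{seed}:{s}'.encode()).hexdigest(), 16)
            for s in sh)
        for seed in range(num_perm)
    ]

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles('The quick brown fox jumps over the lazy dog'))
b = minhash(shingles('The quick brown fox jumps over the lazy dog!'))
c = minhash(shingles('An entirely different sentence about data'))
# Near-duplicates score high; unrelated text scores low
```

The trick is that comparing two short fixed-size signatures approximates comparing the full shingle sets, which is what makes deduplicating billions of pages tractable.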
The big idea: the unglamorous work is where the quality lives. The best models are not just the biggest; they are the ones whose data was cleaned most carefully.
