Data Cleaning: The Unglamorous 80 Percent

Surveys consistently find data scientists spend 60 to 80 percent of their time cleaning data. Here is what that actually looks like.

30 min · Reviewed 2026

The Glamorous Picture vs. Reality

Social media shows data scientists training models and making slick dashboards. Reality shows them wrestling with encoding errors, removing duplicate rows, fixing dates that are stored five different ways, and hunting down why column Z has three million NaNs.

What cleaning actually involves

Fixing encoding (UTF-8 vs Latin-1 vs Windows-1252)
Standardizing date formats (2026-04-23 vs 4/23/26 vs April 23rd)
Normalizing text (lowercase, strip whitespace, unicode normalization)
Removing duplicate rows (exact and near-duplicates)
Handling missing values intentionally
Detecting outliers and deciding whether to remove, cap, or keep them
Joining data from multiple tables
Reconciling different vocabularies (USA vs. US vs. United States)
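
The encoding and text-normalization items above can be sketched with the Python standard library alone. This is a minimal illustration rather than a definitive recipe: the function names decode_with_fallback and normalize_text are hypothetical, and the order of candidate encodings is an assumption that tends to suit Western European text.

```python
import unicodedata

def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used).

    latin-1 maps every byte to a character, so it never fails.
    Keep it last, as a last resort rather than a first guess.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")

def normalize_text(value: str) -> str:
    """Apply NFKC normalization, collapse whitespace runs, and casefold."""
    value = unicodedata.normalize("NFKC", value)
    value = " ".join(value.split())  # also trims leading/trailing whitespace
    return value.casefold()

# Bytes written by a Windows-1252 tool are not valid UTF-8,
# so the second candidate encoding is the one that succeeds.
text, used = decode_with_fallback("café".encode("cp1252"))
print(text, used)

# A full-width "U" and a non-breaking space both normalize away.
print(normalize_text("  \uff35nited\u00a0 States "))
```

A fallback chain like this only guesses. When the source system's encoding is documented, passing it explicitly is more reliable than trying candidates.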

A concrete pandas example

    import pandas as pd

    df = pd.read_csv('raw.csv', encoding='utf-8')

    # Standardize text columns
    df['country'] = df['country'].str.strip().str.lower()
    df['country'] = df['country'].replace({
        'usa': 'united states',
        'us': 'united states',
        'u.s.a.': 'united states'
    })

    # Parse dates; values that cannot be parsed become NaT instead of raising
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')

    # Remove duplicate rows
    before = len(df)
    df = df.drop_duplicates()
    print(f'Dropped {before - len(df)} duplicates')

    # Save cleaned version (requires a parquet engine such as pyarrow)
    df.to_parquet('clean.parquet')

A small but real cleaning pipeline
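
The pipeline above does not touch two items from the earlier list: missing values and outliers. The sketch below shows one possible way to handle both, under stated assumptions. The monthly_spend column and its values are hypothetical, median imputation is only one of several reasonable choices, and the 1.5 × IQR rule is a common convention rather than a universal threshold.

```python
import pandas as pd

# Hypothetical data: `monthly_spend` is an illustrative column, not from raw.csv.
df = pd.DataFrame(
    {"monthly_spend": [20.0, 22.0, 25.0, 21.0, 23.0, 24.0, None, 4000.0]}
)

# Missing values: record that the value was absent before filling it,
# so the imputation stays visible to whoever uses the data later.
df["spend_was_missing"] = df["monthly_spend"].isna()
median_spend = df["monthly_spend"].median()  # NaN is skipped by default
df["spend_filled"] = df["monthly_spend"].fillna(median_spend)

# Outliers: flag values outside 1.5 * IQR of the quartiles rather than
# deleting them, so the decision to drop, cap, or keep is made explicitly.
q1 = df["monthly_spend"].quantile(0.25)
q3 = df["monthly_spend"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["spend_is_outlier"] = (df["monthly_spend"] < lower) | (df["monthly_spend"] > upper)

print(df)
```

Flagging instead of deleting keeps the original values available: a spend of 4000 may be a data-entry error or a real large customer, and only domain knowledge can tell which.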

For LLM training specifically

Strip HTML and boilerplate (nav bars, cookie banners, ads)
Remove adult content and hate speech
Filter out pages that are mostly machine-generated spam
Remove near-duplicate pages (MinHash and LSH are standard)
Balance languages, domains, and topics
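
To make the near-duplicate step more concrete, here is a minimal MinHash sketch in pure Python. It is an illustration under assumptions, not a production deduplicator: the helper names, the 3-word shingle size, the 128 hash functions, and the sample sentences are all hypothetical choices. The LSH step, which buckets signatures so that not every pair of pages has to be compared, is left out.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash; each seed acts as a separate hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each hash function, keep the smallest hash value over all items."""
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

page_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
page_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
page_c = "completely different words about parsing csv files with odd encodings and dates"

sh_a, sh_b, sh_c = shingles(page_a), shingles(page_b), shingles(page_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (sh_a, sh_b, sh_c))

print("a vs b exact:    ", round(exact_jaccard(sh_a, sh_b), 3))
print("a vs b estimated:", round(estimated_jaccard(sig_a, sig_b), 3))
print("a vs c estimated:", round(estimated_jaccard(sig_a, sig_c), 3))
```

The signature has a fixed length regardless of page size, which is what makes comparing very large numbers of pages feasible. The estimate gets closer to the exact Jaccard value as the number of hash functions grows.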

The big idea: the unglamorous work is where the quality lives. The best models are not just the biggest; they are the ones whose data was cleaned most carefully.

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-cleaning-80-percent

What is the main idea of "Data Cleaning: The Unglamorous 80 Percent"?
1. Surveys consistently find data scientists spend 60 to 80 percent of their time cleaning data. Here is what that actually looks like.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Data Cleaning: The Unglamorous 80 Percent"?
1. preprocessing
2. data cleaning
3. ETL
4. deduplication

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. Fixing encoding (UTF-8 vs Latin-1 vs Windows-1252)
4. Use the first answer without checking it

What should a careful learner remember about "The Anaconda survey"?
1. Use AI to draft or organize ideas about data cleaning, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. Use the AI answer as a draft, then check it against a reliable source.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about data cleaning be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about data cleaning.

Which action would help you apply "Data Cleaning: The Unglamorous 80 Percent" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Use the first answer without checking it
4. Standardizing date formats (2026-04-23 vs 4/23/26 vs April 23rd)
