Widely cited surveys report that data scientists spend 60 to 80 percent of their time cleaning and preparing data. Here is what that actually looks like.
Social media shows data scientists training models and making slick dashboards. Reality shows them wrestling with encoding errors, removing duplicate rows, fixing dates that are stored five different ways, and hunting down why column Z has three million NaNs.
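A reasonable first step in that hunt is simply counting the gaps per column. The sketch below is illustrative only: `missing_report` and the `sample` frame are hypothetical names invented for this example, not part of the lesson's pipeline.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and rank missing values per column (hypothetical helper)."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Worst offenders first, so the column with the most NaNs is on top.
    return report.sort_values("n_missing", ascending=False)

# Tiny made-up frame standing in for a real raw file.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["us", None, "usa", None],
    "z": [np.nan, np.nan, np.nan, 1.0],
})
report = missing_report(sample)
print(report)
```

Running a report like this before and after each cleaning step makes it easier to see whether a step introduced new gaps, for example dates that failed to parse.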
import pandas as pd
df = pd.read_csv('raw.csv', encoding='utf-8')
# Standardize text columns
df['country'] = df['country'].str.strip().str.lower()
df['country'] = df['country'].replace({
    'usa': 'united states',
    'us': 'united states',
    'u.s.a.': 'united states',
})
# Parse dates
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
# Remove duplicate rows
before = len(df)
df = df.drop_duplicates()
print(f'Dropped {before - len(df)} duplicates')
# Save cleaned version
df.to_parquet('clean.parquet')
A small but real cleaning pipeline
The big idea: the unglamorous work is where the quality lives. The best models are not just the biggest; they are the ones whose data was cleaned most carefully.
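When dates really are stored several different ways, a single `pd.to_datetime(..., errors='coerce')` call can quietly turn the formats it does not recognize into missing values. One possible approach, sketched here under assumptions, is to try an explicit list of formats in order and keep the first successful parse per row. The helper `parse_mixed_dates`, the `KNOWN_FORMATS` list, and the sample values are all hypothetical; the formats in a real file would need to be discovered by inspecting it.

```python
import pandas as pd

# Assumed formats for illustration; replace with those actually found in the data.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]

def parse_mixed_dates(values: pd.Series, formats=KNOWN_FORMATS) -> pd.Series:
    """Try each format in order and keep the first successful parse per row."""
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        # Only fill rows that every earlier format failed to parse.
        parsed = parsed.fillna(attempt)
    return parsed

raw = pd.Series(["2023-01-15", "15/02/2023", "03.03.2023", "not a date"])
dates = parse_mixed_dates(raw)

# Rows that matched no format stay NaT and can be reviewed by hand.
unparsed = raw[dates.isna()]
print(dates)
print(f"{len(unparsed)} value(s) could not be parsed")
```

The order of the list matters when formats are ambiguous: a value such as 03/04/2023 matches both day-first and month-first patterns, so only one of those two should be included unless the source of each row is known.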
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-cleaning-80-percent
What is the core idea behind "Data Cleaning: The Unglamorous 80 Percent"?
Which term best describes a foundational idea in "Data Cleaning: The Unglamorous 80 Percent"?
A learner studying Data Cleaning: The Unglamorous 80 Percent would need to understand which concept?
Which of these is directly relevant to Data Cleaning: The Unglamorous 80 Percent?
Which of the following is a key point about Data Cleaning: The Unglamorous 80 Percent?
Which of these does NOT belong in a discussion of Data Cleaning: The Unglamorous 80 Percent?
Which statement is accurate regarding Data Cleaning: The Unglamorous 80 Percent?
Which of these does NOT belong in a discussion of Data Cleaning: The Unglamorous 80 Percent?
What is the key insight about "The Anaconda survey" in the context of Data Cleaning: The Unglamorous 80 Percent?
What is the recommended tip about "Build your mental model" in the context of Data Cleaning: The Unglamorous 80 Percent?
Which statement accurately describes an aspect of Data Cleaning: The Unglamorous 80 Percent?
What does working with Data Cleaning: The Unglamorous 80 Percent typically involve?
Which best describes the scope of "Data Cleaning: The Unglamorous 80 Percent"?
Which section heading best belongs in a lesson about Data Cleaning: The Unglamorous 80 Percent?
Which section heading best belongs in a lesson about Data Cleaning: The Unglamorous 80 Percent?