Your First Dataset Project, End to End
A complete walkthrough from question to shareable dataset. The first project is the hardest; this lesson gets you to the other side.
Creators · AI Foundations · ~27 min read
The Project Arc
Every dataset project moves through the same phases: ask a question, find or collect data, explore and clean, analyze, communicate, and share. Miss any phase and the project feels broken. Let's walk through a concrete example end to end.
The example: Spotify top songs
Imagine you want to answer: do songs with higher danceability get more streams? The Kaggle dataset Top Spotify Songs 2023 has exactly what you need. We will take it from download to a polished notebook.
Phase 1: scope your question
- Question: does danceability correlate with streams?
- Success criterion: a single statistic (correlation coefficient) plus a plot
- Time budget: 3 hours total
- Out of scope: causation (a correlation alone cannot establish it; that typically requires an experiment)
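For concreteness, the correlation coefficient named in the success criterion can be computed by hand. The sketch below uses made-up toy numbers (not values from the Spotify dataset) and a hypothetical helper, `pearson_by_hand`, then checks the result against NumPy's `np.corrcoef`.

```python
import numpy as np

# Toy values for illustration only; these are NOT from the Spotify dataset.
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])


def pearson_by_hand(a, b):
    """Pearson r: co-variation of a and b, scaled by the spread of each."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator


r_manual = pearson_by_hand(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"manual r = {r_manual:.3f}, numpy r = {r_numpy:.3f}")
```

The coefficient always falls between -1 and 1; values near 0 mean little linear relationship, and the sign gives the direction.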
Phase 2: get the data
Load and inspect the Kaggle dataset
    # Download from Kaggle (requires kaggle CLI setup)
    # kaggle datasets download -d nelgiriyewithana/top-spotify-songs-2023
    import pandas as pd

    df = pd.read_csv('spotify-2023.csv', encoding='latin-1')
    print(df.shape)
    print(df.columns.tolist())
    print(df.head())
Phase 3: explore and clean
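As a small illustration of why type-fixing matters in this phase, the sketch below shows how `pd.to_numeric` with `errors='coerce'` treats strings that contain thousands separators or stray text. The sample values are invented for the example, not taken from the dataset.

```python
import pandas as pd

# Invented sample values: a plain number, a comma-formatted number, and junk text.
raw = pd.Series(["141381703", "1,021", "not a number"])

# Without cleaning, anything that is not a plain number becomes NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# Stripping commas first rescues the comma-formatted value; junk text still becomes NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(coerced.isna().tolist())
print(cleaned.isna().tolist())
```

Counting NaN values before and after a conversion like this is a quick way to see how many rows the cleaning step is about to drop.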
Type-fix the streams column
    # Check types and missing values
    print(df.dtypes)
    print(df.isna().sum())

    # Streams came in as strings, some with commas or stray text.
    # Strip the commas, then coerce anything still unparseable to NaN.
    df['streams'] = pd.to_numeric(
        df['streams'].astype(str).str.replace(',', '', regex=False),
        errors='coerce',
    )
    df = df.dropna(subset=['streams', 'danceability_%'])

    # Quick summary
    print(df[['streams', 'danceability_%']].describe())
Phase 4: analyze
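Before computing anything on the real data, it can help to see how Pearson and Spearman differ. The sketch below uses synthetic data (an assumption made purely for illustration) in which `y` always rises with `x`, but along a curve rather than a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, perfectly monotonic but strongly curved relationship (illustration only).
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

# Spearman only looks at rank order, so a perfectly monotonic curve scores 1.
# Pearson measures straight-line fit, so the curvature pulls it below 1.
print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Stream counts tend to be heavily skewed, which is one reason to report both coefficients rather than Pearson alone.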
Compute correlation and plot
    import matplotlib.pyplot as plt
    from scipy.stats import pearsonr, spearmanr

    # Pearson assumes linear; Spearman is rank-based
    pearson_r, p = pearsonr(df['danceability_%'], df['streams'])
    spearman_r, sp = spearmanr(df['danceability_%'], df['streams'])
    print(f'Pearson r: {pearson_r:.3f} (p={p:.3f})')
    print(f'Spearman rho: {spearman_r:.3f} (p={sp:.3f})')

    # Log-scale the streams for a readable plot
    plt.scatter(df['danceability_%'], df['streams'], alpha=0.3)
    plt.yscale('log')
    plt.xlabel('Danceability (%)')
    plt.ylabel('Streams (log scale)')
    plt.title('Danceability vs. Streams')
    plt.show()
Phase 5: communicate
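One way to approach this phase is to turn the statistics into a single plain-language sentence. The sketch below is an assumption about how that might look, not part of the lesson's own method: `describe_strength` and `summarize_finding` are hypothetical helpers, the cutoffs are a rough rule of thumb, and the numbers passed in are placeholders rather than real results.

```python
def describe_strength(r):
    """Map |r| to a rough verbal label. The cutoffs are a rule of thumb, not a standard."""
    size = abs(r)
    if size < 0.1:
        return "negligible"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"


def summarize_finding(r, p_value, n):
    """Build a one-sentence summary that states the association without implying causation."""
    direction = "positive" if r >= 0 else "negative"
    return (
        f"Across {n} songs, danceability and streams show a "
        f"{describe_strength(r)} {direction} correlation "
        f"(r = {r:.2f}, p = {p_value:.3f}). "
        "This is an association, not evidence of causation."
    )


# Placeholder numbers purely to show the output format; substitute your own results.
summary = summarize_finding(r=-0.25, p_value=0.012, n=100)
print(summary)
```

Pairing a sentence like this with the scatter plot keeps the write-up inside the scope set in Phase 1.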
Phase 6: share
1. Push notebook to GitHub or Kaggle
2. Write a one-paragraph README with the question, method, finding
3. Include a data card noting the dataset source and license
4. Tag your notebook with readable headings so others can follow
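The README and data card steps can be sketched as a small script. `build_readme` is a hypothetical helper and the layout is only one possible format. The finding and license fields are left as TODO placeholders on purpose: fill them in from your own analysis and from the dataset's Kaggle page.

```python
from pathlib import Path


def build_readme(question, method, finding, source, license_note):
    """Assemble a short plain-text README with a minimal data card."""
    lines = [
        "Danceability vs. Streams",
        "",
        f"Question: {question}",
        f"Method: {method}",
        f"Finding: {finding}",
        "",
        "Data card",
        f"- Source: {source}",
        f"- License: {license_note}",
    ]
    return "\n".join(lines) + "\n"


readme = build_readme(
    question="Does danceability correlate with streams?",
    method="Pearson and Spearman correlation plus a log-scale scatter plot.",
    finding="TODO: fill in after running the analysis.",
    source="Kaggle, Top Spotify Songs 2023 (nelgiriyewithana/top-spotify-songs-2023)",
    license_note="TODO: copy from the dataset's Kaggle page.",
)

# Write the file next to the notebook so it is committed together with it.
Path("README.md").write_text(readme, encoding="utf-8")
print(readme)
```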
The big idea: a dataset project is a small disciplined arc. Following the six phases turns chaos into a portfolio piece. Do one, then another, then ten. That is how data intuition develops.