Your First Dataset Project, End to End

Load and inspect the Kaggle dataset

python

# Download from Kaggle (requires kaggle CLI setup)
# kaggle datasets download -d nelgiriyewithana/top-spotify-songs-2023
import pandas as pd

df = pd.read_csv('spotify-2023.csv', encoding='latin-1')
print(df.shape)
print(df.columns.tolist())
print(df.head())

Type-fix the streams column

python

# Check types and missing values
print(df.dtypes)
print(df.isna().sum())

# Streams is read as strings; strip any thousands separators, then convert.
# Values that still cannot be parsed become NaN (errors='coerce').
df['streams'] = pd.to_numeric(
    df['streams'].astype(str).str.replace(',', '', regex=False),
    errors='coerce',
)
df = df.dropna(subset=['streams', 'danceability_%'])

# Quick summary
print(df[['streams', 'danceability_%']].describe())
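To see what errors='coerce' does before trusting it on the real column, it can help to run the same conversion on a few made-up values. The sketch below uses a hypothetical four-value Series (not taken from the Spotify file) and counts how many entries were present but could not be parsed, which is the number of rows the later dropna would silently discard.

```python
import pandas as pd

# Hypothetical values standing in for a messy 'streams' column
raw = pd.Series(["1,234", "5678", "not a number", None])

# Strip thousands separators, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Count values that were present in the input but failed to parse
n_unparseable = int((cleaned.isna() & raw.notna()).sum())

print(cleaned.tolist())
print(f"Unparseable values: {n_unparseable}")
```

Running the same count on df['streams'] before dropping rows shows how much data the cleaning step removes.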

Compute correlation and plot

python

import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Pearson assumes linear; Spearman is rank-based
pearson_r, p = pearsonr(df['danceability_%'], df['streams'])
spearman_r, sp = spearmanr(df['danceability_%'], df['streams'])
print(f'Pearson r: {pearson_r:.3f} (p={p:.3f})')
print(f'Spearman rho: {spearman_r:.3f} (p={sp:.3f})')

# Log-scale the streams for a readable plot
plt.scatter(df['danceability_%'], df['streams'], alpha=0.3)
plt.yscale('log')
plt.xlabel('Danceability (%)')
plt.ylabel('Streams (log scale)')
plt.title('Danceability vs. Streams')
plt.show()
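The comment that Pearson assumes a linear relationship while Spearman is rank-based is easier to see on synthetic data. The sketch below is an illustration only: the arrays x and y are invented, not drawn from the Spotify dataset. Because y grows exponentially with x, the relationship is perfectly monotonic but far from linear, so Spearman reports a perfect rank correlation while Pearson does not. Taking the log of y makes the relationship linear again, which is one reason a log scale is a natural choice for a skewed variable like stream counts.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data: y increases with x, but exponentially rather than linearly
x = np.arange(1, 21)
y = np.exp(x / 2)

r, _ = pearsonr(x, y)               # sensitive to the curve's shape
rho, _ = spearmanr(x, y)            # depends only on the ordering of values
r_log, _ = pearsonr(x, np.log(y))   # log(y) is exactly linear in x

print(f"Pearson r on raw y:    {r:.3f}")
print(f"Spearman rho on raw y: {rho:.3f}")
print(f"Pearson r on log(y):   {r_log:.3f}")
```

When the two coefficients disagree noticeably on real data, that is a hint to look at the scatter plot for curvature or outliers rather than to pick whichever number looks better.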

The Project Arc

The example: Spotify top songs

Phase 1: scope your question

Phase 2: get the data

Phase 3: explore and clean

Phase 4: analyze

Phase 5: communicate

Phase 6: share
