Pandas is the library that made Python a mainstream tool for data analysis. Ten core operations, or "verbs", get you through 90 percent of day-to-day data work.
Pandas was created in 2008 by Wes McKinney at a hedge fund. Today it is the default Python library for tabular data, downloaded over 100 million times per month. Its two main types are Series (a single column) and DataFrame (a table).
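To make the Series and DataFrame distinction concrete, here is a minimal sketch. The sample values are invented for illustration; only the column names come from the lesson.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "age": [25, 40, 17],
    "income": [50000, 82000, 0],
})

ages = df["age"]        # a single column name returns a Series
subset = df[["age"]]    # a list of column names returns a DataFrame

print(type(ages).__name__, ages.shape)      # Series (3,)
print(type(subset).__name__, subset.shape)  # DataFrame (3, 1)
```

The double brackets matter: even with one column in the list, the result stays a table, which is usually what functions that expect a DataFrame need.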
import pandas as pd
# 1. Load
df = pd.read_csv('data.csv')
# 2. Peek
df.head()
df.info()
df.describe()
# 3. Select columns
df['age'] # one column (Series)
df[['age', 'income']] # multiple columns (DataFrame)
# 4. Filter rows
df[df['age'] > 18]
df[(df['age'] > 18) & (df['country'] == 'US')]
# 5. Sort
df.sort_values('income', ascending=False)
# 6. Create columns
df['income_per_age'] = df['income'] / df['age']
# 7. Group and aggregate
df.groupby('country')['income'].mean()
df.groupby(['country', 'gender']).agg({
'income': ['mean', 'median'],
'age': 'mean'
})
# 8. Join tables
merged = pd.merge(df, other_df, on='user_id', how='left')
# 9. Pivot
pd.pivot_table(df, index='country', columns='year', values='income')
# 10. Save
df.to_csv('clean.csv', index=False)
df.to_parquet('clean.parquet')
The ten most important pandas operations
# .loc uses labels
df.loc[5] # row with index label 5
df.loc[df['age'] > 18, 'name'] # name column, filtered rows
# .iloc uses positions
df.iloc[5] # 6th row regardless of index label
df.iloc[:10, :3] # first 10 rows, first 3 cols
# Chained assignment is a trap
# df[df.age > 18]['score'] = 100 # DO NOT DO THIS
df.loc[df.age > 18, 'score'] = 100 # CORRECT
Correct indexing patterns
# Top N per group
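# Sort once, then keep the first 3 rows of each group. This avoids
# groupby().apply(), whose handling of the grouping column differs across pandas versions.
top3 = (df.sort_values('income', ascending=False)
.groupby('country').head(3).reset_index(drop=True))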
# Rolling stats
df['7d_avg'] = df['sales'].rolling(window=7).mean()
# Replace based on mapping
df['country'] = df['country'].replace({'USA': 'US', 'U.S.A.': 'US'})
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['color'])
# Handle dates
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.day_name()
Patterns you will use every week
The big idea: pandas rewards the ten verbs you use 90 percent of the time. Master those before chasing fancier features, and the other 10 percent will come naturally when you need it.
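Several of the verbs above can be tried together without any files on disk. The sketch below is one possible walk-through, not a canonical recipe. The two small tables and the `plan` column are hypothetical, invented so the example runs on its own. The other column names follow the lesson.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "country": ["US", "US", "USA", "DE", "DE"],
    "age": [34, 17, 52, 29, 41],
    "income": [60000, 0, 90000, 45000, 70000],
})
other_df = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["pro", "free", "pro"]})

# Replace based on a mapping so 'USA' and 'US' land in the same group.
df["country"] = df["country"].replace({"USA": "US"})

# Filter rows, then create a column on an explicit copy of the result.
adults = df[df["age"] > 18].copy()
adults["income_per_age"] = adults["income"] / adults["age"]

# Group and aggregate.
mean_income = adults.groupby("country")["income"].mean()

# Top N per group: sort, then take the first row of each group.
top1 = adults.sort_values("income", ascending=False).groupby("country").head(1)

# A left join keeps every row of `adults`; unmatched rows get NaN in 'plan'.
merged = pd.merge(adults, other_df, on="user_id", how="left")

# Assign through .loc instead of chained indexing.
df.loc[df["age"] > 18, "score"] = 100

print(mean_income)
print(top1[["country", "income"]])
print(merged[["user_id", "plan"]])
```

The `.copy()` after filtering is a deliberate choice: it makes clear that `adults` is its own table, so adding a column to it does not trigger the SettingWithCopy warning.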
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-pandas-fundamentals
What pandas object type represents a single column of data?
Who created the pandas library and in what year?
What does the SettingWithCopy warning typically indicate?
How can you avoid the SettingWithCopy warning when modifying DataFrame data?
In what programming language is the Polars library written?
Compared to pandas, Polars typically runs how much faster on large datasets (over 1 GB)?
What concept does Polars make more explicit than pandas through its API design?
Which pandas method is used to group data by one or more columns for aggregation?
According to the teaching approach in this material, what percentage of data work can be accomplished with ten core verbs?
What is the primary pandas data structure for representing tabular data with rows and columns?
What aspect of pandas is described as the most confusing part for users?
How many times per month is pandas downloaded according to the material?
Which library was created more recently and is often used as a high-performance alternative to pandas?
What is the recommended learning strategy before exploring advanced pandas features?
What does the .loc[] accessor in pandas use for selection?