Resampling: Making Data Work Harder
Resampling techniques draw new samples from the data you already have to estimate uncertainty, balance classes, or validate models. They are among the most underused superpowers in statistics.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Squeeze More From Your Data
2. Resampling
3. Cross-validation
4. SMOTE
Section 1
Squeeze More From Your Data
You have 1,000 data points. A single train/test split gives you one estimate of model accuracy. But what if you happen to get a lucky or unlucky split? Resampling lets you run the experiment many times, getting more reliable answers from the same data.
The main techniques
Compare the options
| Technique | Purpose | Key idea |
|---|---|---|
| K-fold CV | Model evaluation | Split data into k parts, train on k-1, test on 1, rotate |
| Leave-one-out | Model evaluation, tiny datasets | Train on n-1, test on 1, repeat n times |
| Stratified sampling | Preserve class balance | Sample within each class separately |
| Bootstrap | Estimate uncertainty | Sample with replacement, many times |
| Permutation | Hypothesis testing | Shuffle labels, recompute the statistic |
| SMOTE | Class imbalance | Generate synthetic minority examples |
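Of these, only K-fold cross-validation and SMOTE get their own demo below. As a quick taste of the bootstrap row, here is a minimal sketch that estimates a 95 percent confidence interval for a mean; the generated `data` array is just a stand-in for your own sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # stand-in for your real sample

# Resample with replacement many times and record the statistic each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means approximates the uncertainty of the sample mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f'Mean: {data.mean():.2f}, 95% CI: [{low:.2f}, {high:.2f}]')
```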
K-fold cross-validation
5-fold cross-validation
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_data()  # placeholder: swap in your own features and labels
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(f'Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
# Each fold is trained and tested once, giving 5 accuracy numbers
# The mean is the expected accuracy; the std is the uncertainty
```

SMOTE for imbalanced classes
If 99 percent of your data is class A and 1 percent is class B (fraud detection, rare disease), a naive model just predicts A every time and hits 99 percent accuracy while being useless. SMOTE (Synthetic Minority Oversampling Technique) generates realistic new minority examples by interpolating between existing ones.
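The interpolation itself is simple. Here is a hand-rolled sketch of a single synthetic point; the real library additionally handles nearest-neighbor search and picks neighbors at random, so treat the values as purely illustrative.

```python
import numpy as np

x_i = np.array([2.0, 3.0])          # an existing minority-class point
x_neighbor = np.array([4.0, 5.0])   # one of its nearest minority-class neighbors

# SMOTE drops a synthetic point somewhere on the segment between the two
lam = np.random.default_rng(0).uniform()
x_synthetic = x_i + lam * (x_neighbor - x_i)
print(x_synthetic)
```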
SMOTE for class balancing
```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
print('Before:', dict(zip(*np.unique(y_train, return_counts=True))))
print('After:', dict(zip(*np.unique(y_resampled, return_counts=True))))
```
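One caveat the snippet above glosses over: resample only the training data, never the evaluation data, or your scores will be inflated. If you also want cross-validated scores, one way to keep SMOTE inside each training fold is imbalanced-learn's Pipeline; a sketch, reusing X_train and y_train from above:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE is re-fit inside each training fold; every held-out fold keeps its real imbalance
pipeline = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f'F1: {scores.mean():.3f} +/- {scores.std():.3f}')
```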
The big idea: a single train/test split is rarely enough. Resampling turns one experiment into many, giving you honest uncertainty estimates and squeezing more learning from limited data.
