Resampling: Making Data Work Harder
Resampling techniques draw new samples from the data you already have to estimate uncertainty, balance classes, or validate models. They are among the most underused superpowers in statistics.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Squeeze More From Your Data
2. Resampling
3. Cross-validation
4. SMOTE
Section 1
Squeeze More From Your Data
You have 1,000 data points. A single train/test split gives you one estimate of model accuracy. But what if you happen to get a lucky or unlucky split? Resampling lets you run the experiment many times, getting more reliable answers from the same data.
The main techniques
Compare the options
| Technique | Purpose | Key idea |
|---|---|---|
| K-fold CV | Model evaluation | Split data into k parts, train on k-1, test on 1, rotate |
| Leave-one-out | Model evaluation, tiny datasets | Train on n-1, test on 1, repeat n times |
| Stratified sampling | Preserve class balance | Sample within each class separately |
| Bootstrap | Estimate uncertainty | Sample with replacement, many times |
| Permutation | Hypothesis testing | Shuffle labels, recompute the statistic |
| SMOTE | Class imbalance | Generate synthetic minority examples |
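Of these, only K-fold cross-validation and SMOTE get their own demo below. As a quick taste of the bootstrap row, here is a minimal sketch that estimates a 95 percent confidence interval for a mean; the generated `data` array is just a stand-in for your own sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # stand-in for your real sample

# Resample with replacement many times and record the statistic each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means approximates the uncertainty of the sample mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f'Mean: {data.mean():.2f}, 95% CI: [{low:.2f}, {high:.2f}]')
```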
K-fold cross-validation
5-fold cross-validation
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_data()  # placeholder: swap in your own features and labels
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(f'Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
# Each fold is trained and tested once, giving 5 accuracy numbers
# The mean is the expected accuracy; the std is the uncertainty
```

SMOTE for imbalanced classes
If 99 percent of your data is class A and 1 percent is class B (fraud detection, rare disease), a naive model just predicts A every time and hits 99 percent accuracy while being useless. SMOTE (Synthetic Minority Oversampling Technique) generates realistic new minority examples by interpolating between existing ones.
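The interpolation itself is simple. Here is a hand-rolled sketch of a single synthetic point; the real library additionally handles nearest-neighbor search and picks neighbors at random, so treat the values as purely illustrative.

```python
import numpy as np

x_i = np.array([2.0, 3.0])          # an existing minority-class point
x_neighbor = np.array([4.0, 5.0])   # one of its nearest minority-class neighbors

# SMOTE drops a synthetic point somewhere on the segment between the two
lam = np.random.default_rng(0).uniform()
x_synthetic = x_i + lam * (x_neighbor - x_i)
print(x_synthetic)
```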
SMOTE for class balancing
```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
print('Before:', dict(zip(*np.unique(y_train, return_counts=True))))
print('After:', dict(zip(*np.unique(y_resampled, return_counts=True))))
```
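One caveat the snippet above glosses over: resample only the training data, never the evaluation data, or your scores will be inflated. If you also want cross-validated scores, one way to keep SMOTE inside each training fold is imbalanced-learn's Pipeline; a sketch, reusing X_train and y_train from above:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE is re-fit inside each training fold; every held-out fold keeps its real imbalance
pipeline = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f'F1: {scores.mean():.3f} +/- {scores.std():.3f}')
```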
The big idea: a single train/test split is rarely enough. Resampling turns one experiment into many, giving you honest uncertainty estimates and squeezing more learning from limited data.
