Creating Your First Small Labeled Dataset
Creating a dataset from scratch teaches you more than using someone else's. Here is how to build a high-quality small labeled dataset for a real task.
Lesson map
The main moves, in order:
1. Small, Careful, and Documented
2. Labeling
3. Annotation
4. Guidelines
Section 1
Small, Careful, and Documented
A 500-example dataset you built yourself often teaches more than a 50,000-example dataset someone else assembled. Because you struggled with every label, you will notice bias, ambiguity, and quality issues that get papered over at scale. Here is how to do one well.
Project: classify tweets as complaint vs. praise
1. Define the task precisely
2. Collect 300-500 raw examples
3. Write a labeling guideline document
4. Label the examples (ideally with at least 2 annotators)
5. Measure inter-annotator agreement
6. Clean and release as a versioned dataset
Step 1: precise definition
Before looking at any data, write down exactly what counts as a complaint, as praise, and as neither. A definition you cannot apply consistently to real tweets is not precise enough.
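One way to make the definition testable is to pin the label set down in code before any labeling starts. This is a minimal sketch; the names are assumptions, not files from the lesson:

```python
# A minimal sketch: fix the label schema in one place so every later
# script validates against the same definition.
LABELS = {'complaint', 'praise', 'neither'}

def validate_label(label: str) -> str:
    """Normalize a raw annotation and check it against the schema."""
    label = label.strip().lower()
    if label not in LABELS:
        raise ValueError(f'Unknown label: {label!r} (expected one of {sorted(LABELS)})')
    return label
```

Every downstream script (labeling, agreement, release) can then import this one schema instead of re-typing the label strings.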
Step 2: collect raw examples
Sourcing 500 raw examples
import pandas as pd
from datasets import load_dataset

# Option A: use an existing Twitter-style dataset from Hugging Face
raw = load_dataset('cardiffnlp/tweet_eval', 'sentiment', split='train[:500]')
df = raw.to_pandas()[['text']]

# Option B: collect your own via API (with the platform's consent)
# e.g., tweepy for Twitter/X, praw for Reddit

df.to_csv('to_label.csv', index=False)

Step 3: the labeling guideline
A short but complete labeling guide
LABELING GUIDE v1.0
=============================================
Label one of: complaint | praise | neither
COMPLAINT: The post expresses negative feelings
about a product, service, or experience.
e.g., My phone just died for the third time today.
PRAISE: The post expresses positive feelings about
a product, service, or experience.
e.g., This new update is amazing, everything
is so much faster now!
NEITHER: Anything else — questions, statements of
fact, off-topic, jokes without sentiment.
e.g., Does anyone know if iOS 17 supports this?
EDGE CASES:
- Sarcastic praise that is really a complaint -> complaint
- A complaint about a product phrased politely -> complaint
- Mixed (some good, some bad) -> choose the dominant sentiment; break ties with neither
- Non-English -> neither (we are only labeling English for now)

Step 4 & 5: label with two annotators
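Labeling itself can be as simple as a terminal loop that each annotator runs independently. This is a sketch under assumptions: the input file `to_label.csv` matches step 2, and the output filenames are made up for illustration:

```python
import pandas as pd

# A minimal annotation loop (a sketch, not a polished tool).
# Assumes 'to_label.csv' from step 2 with a 'text' column.
VALID = {'c': 'complaint', 'p': 'praise', 'n': 'neither'}

def label_file(in_path: str, out_path: str) -> None:
    df = pd.read_csv(in_path)
    labels = []
    for text in df['text']:
        choice = ''
        while choice not in VALID:
            # Re-prompt until the annotator gives a valid key
            choice = input(f'{text}\n[c]omplaint / [p]raise / [n]either: ').strip().lower()
        labels.append(VALID[choice])
    df['label'] = labels
    df.to_csv(out_path, index=False)

# Each annotator runs it separately, e.g.:
# label_file('to_label.csv', 'labels_anno_a.csv')
```

Keeping one output file per annotator (rather than a shared sheet) makes the agreement check in the next step trivial.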
Two annotators + agreement check
# After labeling, measure agreement
import pandas as pd
from sklearn.metrics import cohen_kappa_score
df = pd.read_csv('labeled.csv') # has columns: text, anno_a, anno_b
kappa = cohen_kappa_score(df['anno_a'], df['anno_b'])
print(f'Cohen kappa: {kappa:.3f}')
# Rule of thumb: kappa above ~0.6 is usually read as substantial
# agreement; below ~0.4, revisit the guideline before labeling more.
# Resolve disagreements by discussion, not by averaging
disagreements = df[df['anno_a'] != df['anno_b']]
print(f'{len(disagreements)} disagreements to resolve')

Step 6: release with a data card
- Write a README with purpose, size, limitations
- Choose a license (CC-BY-4.0 is a safe default for most uses)
- Include labeling guidelines as a file in the repo
- Report inter-annotator agreement
- Version the dataset (v1.0 is better than undated)
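The checklist above can be sketched as a small release script. The filenames and data-card fields here are assumptions for illustration, not a fixed standard:

```python
# A minimal sketch of a versioned release with a data card.
VERSION = 'v1.0'

card = f"""# Tweet Complaint/Praise Dataset ({VERSION})

Purpose: classify tweets as complaint, praise, or neither.
Size: ~500 examples, 2 annotators, disagreements resolved by discussion.
Agreement: report Cohen's kappa from step 5 here.
License: CC-BY-4.0
Limitations: English only; short social-media text; small, non-random sample.
"""

# Versioned README doubles as the data card; ship the labeling
# guideline alongside it in the same repo.
with open(f'README_{VERSION}.md', 'w') as f:
    f.write(card)
print(f'Wrote README_{VERSION}.md')
```

Versioning the README and data files together means anyone citing the dataset can say exactly which labels they used.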
The big idea: the dataset you build teaches you the problem. Every edge case surfaces. Every definition gets stress-tested. Shipping a small, careful, documented dataset is a credential in its own right.
Related lessons
- Labeling at Scale: The Hidden Human Layer (35 min). Behind every supervised model is an army of human labelers. Understanding how labeling works is understanding who really builds AI.
- Inter-Annotator Agreement: Measuring Reality (28 min). If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you whether a task is even well-defined.
- Open vs. Closed Models: Philosophy and Strategy (45 min). Open-source AI is both a technical movement and a political one. Understand the arguments so you can pick a stack and defend it.