Creating a dataset from scratch teaches you more than using someone else's. A 500-example dataset you built yourself often teaches more than a 50,000-example dataset someone else assembled: because you struggled with every label, you will notice the bias, ambiguity, and quality issues that get papered over at scale. Here is how to do it well.
import pandas as pd
# Option A: use an existing Twitter-style dataset from Hugging Face
from datasets import load_dataset
raw = load_dataset('cardiffnlp/tweet_eval', 'sentiment', split='train[:500]')
df = raw.to_pandas()[['text']]
# Option B: collect your own via API (with the platform's consent)
# ... tweepy, praw for reddit, etc.
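# (Sketch) Clean before exporting: drop duplicate and blank rows so no
# annotator labels the same text twice. clean_for_labeling is an
# illustrative helper, not part of the lesson; call it before to_csv:
#   df = clean_for_labeling(df)
def clean_for_labeling(frame):
    frame = frame.drop_duplicates(subset='text')
    frame = frame[frame['text'].str.strip() != '']
    return frame.reset_index(drop=True)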
df.to_csv('to_label.csv', index=False)

Sourcing 500 raw examples

LABELING GUIDE v1.0
=============================================
Label one of: complaint | praise | neither
COMPLAINT: The post expresses negative feelings
about a product, service, or experience.
e.g., My phone just died for the third time today.
PRAISE: The post expresses positive feelings about
a product, service, or experience.
e.g., This new update is amazing, everything
is so much faster now!
NEITHER: Anything else — questions, statements of
fact, off-topic, jokes without sentiment.
e.g., Does anyone know if iOS 17 supports this?
EDGE CASES:
- Sarcastic praise that is really complaint -> complaint
- Complaint about a product phrased politely -> complaint
- Mixed (some good some bad) -> choose dominant; break tie with neither
- Non-English -> neither (we are only labeling English for now)

A short but complete labeling guide

# After labeling, measure agreement
import pandas as pd
from sklearn.metrics import cohen_kappa_score
df = pd.read_csv('labeled.csv') # has columns: text, anno_a, anno_b
kappa = cohen_kappa_score(df['anno_a'], df['anno_b'])
print(f'Cohen kappa: {kappa:.3f}')
# Resolve disagreements by discussion, not by averaging
disagreements = df[df['anno_a'] != df['anno_b']]
print(f'{len(disagreements)} disagreements to resolve')
# Rule of thumb: a kappa below roughly 0.6 usually means the guideline
# itself needs revision before you label more data

Two annotators + agreement check

The big idea: the dataset you build teaches you the problem. Every edge case surfaces. Every definition gets stress-tested. Shipping a small, careful, documented dataset is a credential in its own right.
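Before training on the adjudicated file, it is worth verifying that every final label is exactly one of the three values the guide allows; typos like 'Praise ' slip in easily during manual labeling. A minimal sketch, assuming the resolved labels live in a `label` column (`invalid_rows` and the column name are illustrative, not from the lesson):

```python
import pandas as pd

ALLOWED = {'complaint', 'praise', 'neither'}  # the label set from the guide

def invalid_rows(df, column='label'):
    """Return rows whose label is not exactly one of the allowed values."""
    return df[~df[column].isin(ALLOWED)]

# demo: a stray capital letter and trailing space both get flagged
demo = pd.DataFrame({'text': ['t1', 't2', 't3'],
                     'label': ['praise', 'Praise ', 'neither']})
print(invalid_rows(demo))  # flags the 'Praise ' row
```

In practice you would run this check over `labeled.csv` after every labeling session and fix the flagged rows before computing agreement.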