Creating a dataset from scratch teaches you more than using someone else's. A 500-example dataset you built yourself often teaches more than a 50,000-example dataset someone else assembled: because you struggled with every label, you will notice the bias, ambiguity, and quality issues that get papered over at scale. Here is how to do it well.
import pandas as pd
# Option A: use an existing Twitter-style dataset from Hugging Face
from datasets import load_dataset
raw = load_dataset('cardiffnlp/tweet_eval', 'sentiment', split='train[:500]')
df = raw.to_pandas()[['text']]
# Option B: collect your own via API (with the platform's consent)
# ... tweepy, praw for reddit, etc.
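# (Sketch) Clean before exporting: drop duplicate and blank rows so no
# annotator labels the same text twice. clean_for_labeling is an
# illustrative helper, not part of the lesson; call it before to_csv:
#   df = clean_for_labeling(df)
def clean_for_labeling(frame):
    frame = frame.drop_duplicates(subset='text')
    frame = frame[frame['text'].str.strip() != '']
    return frame.reset_index(drop=True)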
df.to_csv('to_label.csv', index=False)

Sourcing 500 raw examples

LABELING GUIDE v1.0
=============================================
Label one of: complaint | praise | neither
COMPLAINT: The post expresses negative feelings
about a product, service, or experience.
e.g., My phone just died for the third time today.
PRAISE: The post expresses positive feelings about
a product, service, or experience.
e.g., This new update is amazing, everything
is so much faster now!
NEITHER: Anything else — questions, statements of
fact, off-topic, jokes without sentiment.
e.g., Does anyone know if iOS 17 supports this?
EDGE CASES:
- Sarcastic praise that is really complaint -> complaint
- Complaint about a product phrased politely -> complaint
- Mixed (some good some bad) -> choose dominant; break tie with neither
- Non-English -> neither (we are only labeling English for now)

A short but complete labeling guide

# After labeling, measure agreement
import pandas as pd
from sklearn.metrics import cohen_kappa_score
df = pd.read_csv('labeled.csv') # has columns: text, anno_a, anno_b
kappa = cohen_kappa_score(df['anno_a'], df['anno_b'])
print(f'Cohen kappa: {kappa:.3f}')
# Resolve disagreements by discussion, not by averaging
disagreements = df[df['anno_a'] != df['anno_b']]
print(f'{len(disagreements)} disagreements to resolve')
# Rule of thumb: a kappa below roughly 0.6 usually means the guideline
# itself needs revision before you label more data

Two annotators + agreement check

The big idea: the dataset you build teaches you the problem. Every edge case surfaces. Every definition gets stress-tested. Shipping a small, careful, documented dataset is a credential in its own right.
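Before training on the adjudicated file, it is worth verifying that every final label is exactly one of the three values the guide allows; typos like 'Praise ' slip in easily during manual labeling. A minimal sketch, assuming the resolved labels live in a `label` column (`invalid_rows` and the column name are illustrative, not from the lesson):

```python
import pandas as pd

ALLOWED = {'complaint', 'praise', 'neither'}  # the label set from the guide

def invalid_rows(df, column='label'):
    """Return rows whose label is not exactly one of the allowed values."""
    return df[~df[column].isin(ALLOWED)]

# demo: a stray capital letter and trailing space both get flagged
demo = pd.DataFrame({'text': ['t1', 't2', 't3'],
                     'label': ['praise', 'Praise ', 'neither']})
print(invalid_rows(demo))  # flags the 'Praise ' row
```

In practice you would run this check over `labeled.csv` after every labeling session and fix the flagged rows before computing agreement.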