Creating a dataset from scratch teaches you more than using someone else's. Here is how to build a high-quality small labeled dataset for a real task.
A 500-example dataset you built yourself often teaches more than a 50,000-example dataset someone else assembled. Because you wrestle with every label yourself, you notice the bias, ambiguity, and quality issues that get papered over at scale. Here is how to build one well.
```python
import pandas as pd
from datasets import load_dataset

# Option A: use an existing Twitter-style dataset from Hugging Face
raw = load_dataset('cardiffnlp/tweet_eval', 'sentiment', split='train[:500]')
df = raw.to_pandas()[['text']]

# Option B: collect your own via API (with the platform's consent)
# tweepy, praw for reddit, etc.

df.to_csv('to_label.csv', index=False)
```
Sourcing 500 raw examples

```
LABELING GUIDE v1.0
=============================================
Label one of: complaint | praise | neither

COMPLAINT: The post expresses negative feelings about a product,
  service, or experience.
  e.g., My phone just died for the third time today.

PRAISE: The post expresses positive feelings about a product,
  service, or experience.
  e.g., This new update is amazing, everything is so much faster now!

NEITHER: Anything else: questions, statements of fact, off-topic,
  jokes without sentiment.
  e.g., Does anyone know if iOS 17 supports this?

EDGE CASES:
- Sarcastic praise that is really complaint -> complaint
- Complaint about a product phrased politely -> complaint
- Mixed (some good some bad) -> choose dominant; break tie with neither
- Non-English -> neither (we are only labeling English for now)
```
A short but complete labeling guide

```python
# After labeling, measure agreement
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv('labeled.csv')  # has columns: text, anno_a, anno_b

kappa = cohen_kappa_score(df['anno_a'], df['anno_b'])
print(f'Cohen kappa: {kappa:.3f}')

# Resolve disagreements by discussion, not by averaging
disagreements = df[df['anno_a'] != df['anno_b']]
print(f'{len(disagreements)} disagreements to resolve')
```
Two annotators + agreement check

The big idea: the dataset you build teaches you the problem. Every edge case surfaces. Every definition gets stress-tested. Shipping a small, careful, documented dataset is a credential in its own right.
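Before sitting down to resolve disagreements, it can help to see where they cluster and to confirm that every label actually comes from the guide. The sketch below is one possible way to do that, not part of the lesson's own code. It assumes the same `anno_a` / `anno_b` columns as the agreement check above; the helper names (`invalid_labels`, `disagreement_table`, `cohen_kappa`) and the six-row toy table are hypothetical. It also computes Cohen's kappa by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance, to show what the score measures.

```python
import pandas as pd

# Label set taken from the labeling guide.
ALLOWED = {"complaint", "praise", "neither"}


def invalid_labels(df, columns=("anno_a", "anno_b")):
    """Return rows where any annotator column holds a label outside ALLOWED."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        mask |= ~df[col].isin(ALLOWED)
    return df[mask]


def disagreement_table(df):
    """Cross-tabulate annotator A's labels (rows) against annotator B's (columns)."""
    return pd.crosstab(df["anno_a"], df["anno_b"])


def cohen_kappa(df):
    """Cohen's kappa computed by hand: (p_o - p_e) / (1 - p_e)."""
    # Observed agreement: share of rows where both annotators chose the same label.
    p_o = (df["anno_a"] == df["anno_b"]).mean()
    # Chance agreement: for each label, the product of the two annotators'
    # label frequencies, summed over all labels.
    freq_a = df["anno_a"].value_counts(normalize=True)
    freq_b = df["anno_b"].value_counts(normalize=True)
    p_e = sum(freq_a.get(label, 0.0) * freq_b.get(label, 0.0) for label in ALLOWED)
    if p_e == 1:
        # Both annotators used a single identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data standing in for labeled.csv.
df = pd.DataFrame({
    "text": [
        "My phone just died for the third time today.",
        "This new update is amazing, everything is so much faster now!",
        "Does anyone know if iOS 17 supports this?",
        "Great, another update that moved every button.",
        "Battery life on this thing is fantastic.",
        "Support told me to restart it, so I did.",
    ],
    "anno_a": ["complaint", "praise", "neither", "complaint", "praise", "neither"],
    "anno_b": ["complaint", "praise", "neither", "neither", "praise", "complaint"],
})

print(f"{len(invalid_labels(df))} rows with labels outside the guide")
print(disagreement_table(df))
print(f"Cohen kappa (by hand): {cohen_kappa(df):.3f}")
```

Off-diagonal cells in the cross-tabulation show which pairs of labels the annotators confuse. If most disagreements fall in one cell, such as complaint versus neither, that usually points to an edge case the guide should spell out before the next labeling round.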
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-create-labeled-dataset
What is the main idea of "Creating Your First Small Labeled Dataset"?
Which concept is most central to "Creating Your First Small Labeled Dataset"?
Which use of AI fits this topic best?
What should a careful learner remember about "Write the guidelines first"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about labeling be treated?
Name one way to verify an AI answer about labeling.
Which action would help you apply "Creating Your First Small Labeled Dataset" responsibly?