Sharing Datasets on Hugging Face Hub
Hugging Face Hub is the GitHub of AI data and models. Uploading a dataset there makes it instantly accessible to millions of practitioners.
Section 1
The Default Home of AI Data
Hugging Face Hub hosts over 200,000 datasets and over 1 million models as of 2024. Uploading your dataset there gives it a citable home, versioning, a built-in dataset viewer, and instant programmatic access from any project that uses the datasets library. Hosting is free for public datasets.
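To see what "instant programmatic access" means in practice, the snippet below loads a public Hub dataset by its repo id; the id here is only a placeholder, not a real repository.
from datasets import load_dataset
# Any public dataset on the Hub loads by its repo id
# ('username/dataset-name' is a placeholder).
ds = load_dataset('username/dataset-name')
print(ds)  # shows the available splits and column features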
Step 1: install and authenticate
One-time setup
pip install huggingface_hub datasets
# Log in (grab a token from https://huggingface.co/settings/tokens)
huggingface-cli login
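If you are working in a notebook where the CLI is awkward, the same authentication can be done from Python. A minimal sketch, assuming huggingface_hub is installed:
from huggingface_hub import login
# Prompts for the access token from https://huggingface.co/settings/tokens
login()
Step 2: prepare your data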
Convert pandas to a Hugging Face Dataset
import pandas as pd
from datasets import Dataset
df = pd.read_csv('labeled_complaints.csv')
# Convert to a Hugging Face Dataset
ds = Dataset.from_pandas(df)
print(ds)
# Create a train/validation/test split
ds = ds.train_test_split(test_size=0.2, seed=42)
print(ds)
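Optionally, encode the labels as a ClassLabel feature so the Hub viewer shows class names instead of raw integers. A sketch, assuming the CSV has an integer column named label (a hypothetical column name; adjust to your actual schema):
from datasets import ClassLabel
# Cast the hypothetical 'label' column so 0/1/2 map to readable names
ds = ds.cast_column('label', ClassLabel(names=['complaint', 'praise', 'neither']))
Step 3: write a README / data card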
A Hugging Face dataset card
---
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- sentiment-classification
size_categories:
- n<1K
pretty_name: Tweet Complaints vs Praise
---
# Tweet Complaints vs. Praise
## Description
500 English tweets labeled as complaint, praise, or neither,
collected from public data in 2026.
## Sources
Sampled from cardiffnlp/tweet_eval; relabeled by two annotators.
## Labels
- 0 = complaint
- 1 = praise
- 2 = neither
## Agreement
Cohen's kappa between annotators: 0.78 (substantial)
## Limitations
- English only
- Skewed toward consumer tech topics
- Labels reflect US cultural context; may not transfer
## License
CC-BY-4.0. Please cite Tendril Content Team, 2026.
Step 4: push it
One-line publish
# Push to your Hugging Face account
ds.push_to_hub('your-username/tweet-complaints-praise')
# Or save locally first, then upload via git
# ds.save_to_disk('./tweet-complaints-praise')
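If the data card is not ready yet, push_to_hub can create the repo as private first (a sketch, assuming a recent datasets version); you can flip it to public later in the repo settings.
# Create the repo as private and attach a commit message
ds.push_to_hub(
    'your-username/tweet-complaints-praise',
    private=True,
    commit_message='Initial labeled release',
)
Step 5: verify and share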
- Visit https://huggingface.co/datasets/your-username/tweet-complaints-praise
- Confirm the viewer loads and splits look right
- Ensure README renders; fix any YAML errors
- Add tags so others can find it
- Share the link on relevant communities
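You can also verify programmatically by loading the freshly published dataset back from the Hub (same repo id as in Step 4):
from datasets import load_dataset
# Reload the published dataset straight from the Hub
ds = load_dataset('your-username/tweet-complaints-praise')
print(ds)              # should list the train and test splits
print(ds['train'][0])  # spot-check a single example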
Good practices for Hub releases
1. Use Parquet format (faster than CSV for the viewer)
2. Keep individual files under 5 GB
3. Include train/validation/test splits
4. Version your dataset (v1.0, v2.0) rather than overwriting; see the tagging sketch after this list
5. Respond to issues and discussions in the community tab
6. If you discover a problem later, release a corrected version with a changelog
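On versioning (point 4 above): push_to_hub already stores the data as Parquet under the hood, and every push is a Git commit, so one lightweight way to cut a v1.0 is to tag the repo. A sketch using huggingface_hub's HfApi; the repo id is the example from this lesson:
from huggingface_hub import HfApi
api = HfApi()
# Tag the current state of the dataset repo as v1.0 so users can pin to it
api.create_tag(
    'your-username/tweet-complaints-praise',
    tag='v1.0',
    repo_type='dataset',
)
Downstream users can then pin to that exact version with load_dataset('your-username/tweet-complaints-praise', revision='v1.0').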
The big idea: publishing a dataset on Hugging Face is the 21st-century equivalent of publishing a paper. It is permanent, searchable, usable, and attributable. If you build a dataset, ship it. The community learns when you share.