Sharing Datasets on Hugging Face Hub
Hugging Face Hub is the GitHub of AI data and models. Uploading a dataset there makes it instantly accessible to millions of practitioners.
Creators · AI Foundations · ~24 min read
The Default Home of AI Data
Hugging Face Hub hosts over 200,000 datasets and over 1 million models as of 2024. Uploading your dataset there gives it a citable home, version history, a built-in viewer, and instant programmatic access from any project that uses the datasets library. Hosting is free for public datasets.
Step 1: install and authenticate
One-time setup
    pip install huggingface_hub datasets

    # Log in (grab a token from https://huggingface.co/settings/tokens)
    huggingface-cli login

Step 2: prepare your data
Convert pandas to a Hugging Face Dataset
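    import pandas as pd
    from datasets import Dataset, DatasetDict

    df = pd.read_csv('labeled_complaints.csv')

    # Convert to a Hugging Face Dataset
    ds = Dataset.from_pandas(df)
    print(ds)

    # Create a train/validation/test split (80/10/10):
    # hold out 20%, then halve the holdout into validation and test
    split = ds.train_test_split(test_size=0.2, seed=42)
    holdout = split['test'].train_test_split(test_size=0.5, seed=42)
    ds = DatasetDict({
        'train': split['train'],
        'validation': holdout['train'],
        'test': holdout['test'],
    })
    print(ds)

Step 3: write a README / data card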
A Hugging Face dataset card
    ---
    language:
    - en
    license: cc-by-4.0
    task_categories:
    - text-classification
    task_ids:
    - sentiment-classification
    size_categories:
    - n<1K
    pretty_name: Tweet Complaints vs Praise
    ---

    # Tweet Complaints vs. Praise

    ## Description
    500 English tweets labeled as complaint, praise, or neither, collected from public data in 2026.

    ## Sources
    Sampled from cardiffnlp/tweet_eval; relabeled by two annotators.

    ## Labels
    - 0 = complaint
    - 1 = praise
    - 2 = neither

    ## Agreement
    Cohen's kappa between annotators: 0.78 (substantial)

    ## Limitations
    - English only
    - Skewed toward consumer tech topics
    - Labels reflect US cultural context; may not transfer

    ## License
    CC-BY-4.0. Please cite Tendril Content Team, 2026.

Step 4: push it
One-line publish
    # `ds` is the dataset object built in Step 2

    # Push to your Hugging Face account
    ds.push_to_hub('your-username/tweet-complaints-praise')

    # Or save locally first, then upload via git
    # ds.save_to_disk('./tweet-complaints-praise')

Step 5: verify and share
- Visit https://huggingface.co/datasets/your-username/tweet-complaints-praise
- Confirm the viewer loads and splits look right
- Ensure README renders; fix any YAML errors
- Add tags so others can find it
- Share the link in relevant communities
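Two of the checks above (splits look right, README front matter is intact) can be partly scripted. The following is a minimal sketch, not an official Hub validator: the function names, the `REQUIRED_CARD_KEYS` tuple, and the idea of a "required" key set are assumptions made for illustration, and the front-matter scan is a rough line-based pass, not a YAML parser.

```python
from typing import List, Mapping, Sequence, Sized

# Hypothetical minimum set of card keys; the Hub does not mandate these.
REQUIRED_CARD_KEYS = ("language", "license", "task_categories")


def front_matter_keys(readme_text: str) -> List[str]:
    """Return top-level keys found between the opening and closing '---' lines.

    Returns an empty list when the front matter is missing or never closed.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys: List[str] = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter found
        # Top-level keys start in column 0; skip list items and comments.
        is_top_level = line and not line[0].isspace() and line[0] not in "-#"
        if is_top_level and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return []  # no closing '---'


def check_card(readme_text: str,
               required: Sequence[str] = REQUIRED_CARD_KEYS) -> List[str]:
    """List problems with the card's front matter (empty list means none found)."""
    found = front_matter_keys(readme_text)
    if not found:
        return ["front matter missing or not closed with '---'"]
    return [f"missing key: {key}" for key in required if key not in found]


def check_splits(splits: Mapping[str, Sized],
                 expected: Sequence[str] = ("train", "validation", "test")) -> List[str]:
    """List missing or empty splits in a mapping of split name to rows."""
    problems = [f"missing split: {name}" for name in expected if name not in splits]
    problems += [f"empty split: {name}" for name, rows in splits.items() if len(rows) == 0]
    return problems
```

After a push, `check_splits` can be given the result of `load_dataset('your-username/tweet-complaints-praise')` from the datasets library, since that call returns a dict-like object keyed by split name. If you published only train and test, pass `expected=("train", "test")`. `check_card` takes the text of your local README.md. Neither function replaces opening the dataset page and looking at the viewer.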
Good practices for Hub releases
1. Use Parquet format (faster than CSV for the viewer)
2. Keep individual files under 5 GB
3. Include train/validation/test splits
4. Version your dataset (v1.0, v2.0) rather than overwriting
5. Respond to issues and discussions in the community tab
6. If you discover a problem later, release a corrected version with a changelog
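The file-size guideline above is easy to check before uploading. Here is a small sketch that scans a local folder for files over the limit. The function name `oversized_files` and the reading of "5 GB" as 5 × 1024³ bytes are assumptions for illustration; the lesson does not say which definition of gigabyte it means.

```python
from pathlib import Path
from typing import List, Tuple

# The lesson's 5 GB guideline, interpreted here as 5 GiB (an assumption).
FIVE_GB = 5 * 1024**3


def oversized_files(folder: str, limit_bytes: int = FIVE_GB) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for each file larger than the limit."""
    root = Path(folder)
    hits: List[Tuple[str, int]] = []
    # rglob("*") walks the folder recursively; sorting keeps the output stable.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path.relative_to(root).as_posix(), size))
    return hits
```

Calling `oversized_files('./tweet-complaints-praise')` on a folder you plan to upload returns an empty list when every file is within the limit. For the versioning item, one option is a git tag on the dataset repository, for example through `HfApi.create_tag` in huggingface_hub with `repo_type="dataset"`; consumers can then pin a release by passing `revision="v1.0"` to `load_dataset`.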
The big idea: publishing a dataset on Hugging Face is the 21st-century equivalent of publishing a paper. It is permanent, searchable, usable, and attributable. If you build a dataset, ship it. The community learns when you share.
