Who Owns the Data in a Dataset?
Ownership of data is not one question but a tangle of rights: copyright, contract, privacy, and control. Untangling them is essential for responsible use.
Section 1
Ownership Is Plural
A single photo in a training dataset can carry five different claims. The photographer holds the copyright. The people in the photo have privacy rights. The platform that hosted it has terms of service. The dataset compiler may hold rights in the compilation itself. The model trainer relies on some legal theory to use it. Any of these claims can conflict with the others.
The main layers of rights
| Right | Who holds it | What it protects |
|---|---|---|
| Copyright | The creator | Creative expression (photos, writing, code) |
| Privacy | The person depicted | Images, recordings, personal data |
| Contract (ToS) | The platform | Use of the platform's services |
| Database right (EU) | The compiler | Substantial investment in data collection |
| Publicity | Celebrities or individuals | Name, image, and likeness |
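One way to keep these layers visible in practice is to record them per item in a dataset manifest. The sketch below is only an illustration: the `RightsRecord` class, its field names, and the sample values are hypothetical, and it covers the copyright, privacy/publicity, and contract layers from the table rather than every right that may apply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RightsRecord:
    """Hypothetical per-item record; every field name here is illustrative."""
    item_id: str
    creator: Optional[str] = None          # copyright: who made the work
    license: Optional[str] = None          # copyright: terms the creator granted
    depicts_people: bool = False           # privacy / publicity: are people shown?
    consent_recorded: bool = False         # privacy / publicity: is consent on file?
    source_platform: Optional[str] = None  # contract: where the item was hosted
    tos_reviewed: bool = False             # contract: were the platform's terms checked?

    def open_questions(self) -> List[str]:
        """List the rights layers that are still unresolved for this item."""
        questions = []
        if self.creator is None:
            questions.append("copyright: creator unknown")
        if self.license is None:
            questions.append("copyright: no license recorded")
        if self.depicts_people and not self.consent_recorded:
            questions.append("privacy/publicity: people depicted, no consent recorded")
        if self.source_platform is not None and not self.tos_reviewed:
            questions.append("contract: platform terms not reviewed")
        return questions


# Sample values are invented for illustration.
photo = RightsRecord(
    item_id="photo-001",
    creator="Jane Doe",
    license="CC-BY-4.0",
    depicts_people=True,
    source_platform="example-photo-site",
)
for question in photo.open_questions():
    print(question)
```

Here the copyright layer is answered, but the privacy and contract layers are still flagged, which mirrors the point that one item can be cleared under one right and unresolved under another.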
Copyright and training data
Most training data is copyrighted. The open legal question is whether training a model on copyrighted data qualifies as fair use (US) or fair dealing (UK), or falls under the text and data mining exception (EU). Courts are actively deciding this. The New York Times v. OpenAI case, filed in December 2023, is still working through US federal courts.
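Terms of service vs. copyright
Violating a website's Terms of Service and violating copyright are different legal problems. The first is a contract matter between the user and the platform; the second concerns the creator's rights in the work itself.
Licensing your own work
If you build a dataset, how you license it determines who can use it and how. Common options include: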
- CC-BY-4.0: attribution required, any use allowed
- CC-BY-SA-4.0: attribution + share-alike (derivatives must use same license)
- CC-BY-NC-4.0: attribution required, non-commercial use only
- CC0: public domain dedication, no rights reserved
- OpenRAIL: responsible AI licenses with use-case restrictions
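The license terms above can be encoded as a small lookup table and used to filter a dataset by intended use. This is a minimal sketch under stated assumptions: the `LICENSE_TERMS` table, the `usable_for` helper, and the sample records are hypothetical, the identifiers follow SPDX-style naming (`CC-BY-NC-4.0`, `CC0-1.0`), and OpenRAIL is deliberately left out because its use-case restrictions differ from license to license and need case-by-case review.

```python
# Hypothetical lookup table summarizing the licenses listed above.
LICENSE_TERMS = {
    "CC-BY-4.0":    {"attribution": True,  "share_alike": False, "commercial": True},
    "CC-BY-SA-4.0": {"attribution": True,  "share_alike": True,  "commercial": True},
    "CC-BY-NC-4.0": {"attribution": True,  "share_alike": False, "commercial": False},
    "CC0-1.0":      {"attribution": False, "share_alike": False, "commercial": True},
}


def usable_for(records, commercial):
    """Return the ids of records whose license is known and fits the intended use.

    Records with a missing or unrecognized license are excluded rather than guessed at.
    """
    usable = []
    for record in records:
        terms = LICENSE_TERMS.get(record.get("license"))
        if terms is None:
            continue  # unknown license: treat as not cleared
        if commercial and not terms["commercial"]:
            continue  # non-commercial license, commercial use intended
        usable.append(record["id"])
    return usable


# Sample records are invented for illustration.
records = [
    {"id": "a", "license": "CC-BY-4.0"},
    {"id": "b", "license": "CC-BY-NC-4.0"},
    {"id": "c", "license": "CC0-1.0"},
    {"id": "d", "license": None},
]

commercial_ok = usable_for(records, commercial=True)
research_ok = usable_for(records, commercial=False)
print(commercial_ok, research_ok)
```

Excluding unknown licenses by default is a design choice that matches the lesson's stance on provenance: an item with no recorded license is not treated as free to use.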
The opt-out movement
Because AI training has outpaced consent, a cluster of opt-out tools has emerged. Spawning.ai's Have I Been Trained lets people check whether their work appears in major datasets. OpenAI, Google, and Anthropic all now publish the user-agent names of their crawlers, which site owners can block via robots.txt. Some dataset builders, such as Common Crawl in its newer crawls, honor these signals.
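A robots.txt opt-out can be checked with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the robots.txt content and the URL are made up, and the user-agent names `GPTBot` (OpenAI), `ClaudeBot` (Anthropic), and `CCBot` (Common Crawl) are examples supplied here, not taken from the lesson text. Check each company's current documentation before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_permissions(robots_txt, agents, url):
    """Map each user-agent name to whether robots.txt allows it to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawler_permissions(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "CCBot"],
    "https://example.com/gallery/photo.jpg",
)
for agent, allowed in permissions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this file, the two named crawlers are blocked while `CCBot` falls through to the wildcard rule and is allowed. Note that robots.txt is a request, not an enforcement mechanism: it only works when the crawler chooses to honor it.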
The big idea: data has owners, even when it feels free. Responsible practitioners treat provenance as mandatory, not optional.
