The Economics and Ethics of Training Data
Data is the strategic asset of AI. Understand the supply chain, the legal fight, and the philosophical stakes before you build anything on top.
Section 1
Data as Infrastructure
Compute gets the headlines, but data is what separates a decent open model from a frontier one. The supply chain for training data is tangled, legally contested in places, and being renegotiated in courts around the world.
The data supply chain
- Common Crawl: raw, open scrape of the public web
- Licensed corpora: stock image libraries, academic publishers, news deals
- Proprietary scrapes: everything a lab can collect and store
- Human feedback data: paid annotators worldwide
- Synthetic data: generated by existing models, then filtered
The legal landscape
Copyright law did not anticipate training data. The New York Times, Getty Images, and authors' guilds have sued AI companies, arguing that training on their works without permission is infringement. AI companies respond that training is transformative fair use. Courts are still deciding, and rulings vary by jurisdiction.
Compare the options
| Position | Core argument |
|---|---|
| Pro-training | Training is statistical learning, analogous to how humans read |
| Anti-training | Models can regurgitate and compete with original work |
| Middle | Training is fair use but outputs may infringe case-by-case |
| Opt-out | Creators should be able to exclude their work cleanly |
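One way the opt-out position shows up in practice is a site's robots.txt file. The sketch below is a minimal illustration, not a description of how any particular lab's crawler behaves: it uses Python's standard `urllib.robotparser` to check whether a given crawler is asked to stay away. The robots.txt content, the `may_fetch` helper, and the example URL are assumptions made for illustration. `GPTBot` and `CCBot` are the user-agent tokens published by OpenAI and Common Crawl. Note that robots.txt is a voluntary convention, so it signals a preference rather than enforcing one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to exclude two AI-related
# crawlers while leaving the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/articles/1"  # placeholder URL
for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
    print(agent, may_fetch(ROBOTS_TXT, agent, URL))
```

A crawler that honors the file would skip the site for the two excluded agents. Whether a signal like this counts as a "clean" opt-out is exactly what the table's last row is arguing about.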
Licensing markets are emerging
OpenAI, Google, and others have signed licensing deals, some reportedly worth eight or nine figures, with publishers and platforms such as News Corp, Axel Springer, and Reddit. This creates a two-tier data market: licensed, high-quality content for frontier labs, and the open crawl for everyone else. The competitive moat is increasingly the rolodex, not the algorithm.
Consent, compensation, and provenance
1. Do creators know their work was used?
2. Did they agree?
3. Are they compensated when the model earns revenue?
4. Can downstream users verify where outputs come from?
5. Should any of this be enforced by law, by technology, or by both?
Implications for builders
- Use models from labs with clear data provenance if liability matters
- Keep receipts for any data you add on top
- Watch for indemnification clauses in commercial model terms
- Budget for data licensing the same way you budget for compute
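"Keeping receipts" can be as simple as a manifest entry for each piece of data you add. The following is a minimal sketch under stated assumptions: the `DataReceipt` fields, the helper names, and the sample values are hypothetical, and a real project would choose fields to match its own contracts and audit needs. The content hash lets you show later that the bytes you trained on are the bytes the receipt describes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataReceipt:
    """One manifest entry describing where a piece of added data came from."""
    source: str       # where the data was obtained (URL, vendor, internal system)
    license: str      # license or contract the data is used under
    acquired_on: str  # ISO 8601 date of acquisition
    sha256: str       # hash of the exact bytes, for later verification


def make_receipt(content: bytes, source: str, license: str, acquired_on: str) -> DataReceipt:
    """Hash the content and bundle it with its provenance details."""
    digest = hashlib.sha256(content).hexdigest()
    return DataReceipt(source, license, acquired_on, digest)


def verify(content: bytes, receipt: DataReceipt) -> bool:
    """Return True if content matches the hash recorded in the receipt."""
    return hashlib.sha256(content).hexdigest() == receipt.sha256


# Hypothetical example values, for illustration only.
doc = b"Example support transcript used for fine-tuning."
receipt = make_receipt(
    doc,
    source="https://example.com/vendor-export",
    license="vendor contract (hypothetical)",
    acquired_on="2024-01-15",
)
print(json.dumps(asdict(receipt), indent=2))
```

Storing one such record per file or per batch alongside the dataset keeps the license terms and the data together, which is what matters if a provenance question comes up later.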
“The value is not in the model. The value is in the data you choose to train it on.”
The big idea: training data is the political economy of AI. The next decade of AI regulation will mostly be arguments about who owns the input, not the output.
