Lesson 24 of 1570
Your Data Is Somebody's Training Fuel
Your posts, chats, photos, and behavior have been scraped, sold, and fed to models. Here is what has actually happened and what you can actually do.
Section 1
You Have Already Contributed
If you have posted anything public on the internet since 2010, there is a real chance it is in at least one training dataset. Common Crawl, the biggest open web scrape, has indexed tens of billions of pages. Every major model has trained on some version of it.
That is not paranoia. It is how the modern AI industry was built. And until very recently, almost nobody was asked.
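You can check this for yourself: Common Crawl exposes a public CDX index you can query for any domain you control. A minimal sketch in Python — the index endpoint is real, but the crawl ID below is just an example (each monthly crawl has its own):

```python
# Hedged sketch: build a query against Common Crawl's public CDX index
# to see whether pages from a site appear in a given crawl.
# "CC-MAIN-2024-10" is an example crawl ID; check commoncrawl.org for
# the current list of crawls.
from urllib.parse import urlencode

def cc_index_query(url_pattern, crawl_id="CC-MAIN-2024-10"):
    """Build a CDX index query URL for one Common Crawl crawl."""
    base = f"https://index.commoncrawl.org/{crawl_id}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

query = cc_index_query("example.com/*")
# Fetching this URL returns one JSON record per captured page,
# or a 404 if nothing matching the pattern was crawled.
```

Run the resulting URL through `curl` or a browser; each line of output is one archived capture of your site.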
Three kinds of data AI labs want
- Public web pages: scraped via Common Crawl or directly
- Platform data: what you post on Reddit, X, YouTube — now often sold to AI labs
- Conversational data: what you type into chat products, sometimes used to improve them
The platform deals you did not sign
Reddit signed a deal with Google in 2024, reportedly worth about $60M per year, for training data. X licenses its firehose. Stack Overflow cut deals with OpenAI. Many of the users who wrote that content had no idea they were training an AI. The terms of service usually allow it: the fine print grants the platform broad licenses over what you post, and almost nobody reads the fine print.
Memorization: when training data leaks back
Models sometimes memorize exact passages from training. Researchers have extracted verbatim news articles, phone numbers, and in rare cases, personal data by carefully prompting production models. Labs have gotten better at preventing this, but it still happens for rare or repeated content.
Compare: what the big privacy laws say about AI training
Compare the options
| Law | On training data |
|---|---|
| EU GDPR (2018) | Personal data needs lawful basis — being training fuel is contested |
| EU AI Act (2025-2027) | GPAI providers must publish training data summaries |
| California CCPA/CPRA | Right to know and delete — AI training is being litigated |
| Illinois BIPA | Biometric data needs explicit consent — Facebook paid a $650M settlement in 2021 |
| Colorado AI Act (2026) | Risk assessments required for high-risk AI |
What you can actually do
1. Opt out where it exists. Every major chatbot has a setting. Use it.
2. Use the robots.txt and ai.txt mechanisms on your own sites.
3. Choose services that pay for training data instead of scraping.
4. In the EU, use your GDPR rights to request deletion (patchy but real).
5. Be cautious about what you post under your real identity.
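For item 2, opting out of the major AI crawlers takes a few lines of robots.txt. The user-agent tokens below are the ones their operators publicly document; note that compliance is voluntary, not enforced:

```
# robots.txt — ask AI crawlers to skip this site (honored voluntarily)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

GPTBot is OpenAI's training crawler, CCBot is Common Crawl, Google-Extended opts out of Gemini training (not Search indexing), and ClaudeBot is Anthropic's. The proposed ai.txt standard works similarly for generative-AI use specifically; neither mechanism is legally binding.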
Differential privacy and the future
A technique called differential privacy adds mathematical noise to data so that no single person's contribution can be recovered from the model. Apple uses it for keyboard training. Google uses it for some Gemini features. For frontier LLMs it is still mostly a research direction. It is the cleanest theoretical answer — if it can be made to scale.
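The core of differential privacy can be sketched in a few lines. The classic building block is the Laplace mechanism: add noise scaled to 1/ε, so that any single person's presence or absence barely shifts the released answer. A minimal illustration (names and numbers here are illustrative, not any production system's implementation):

```python
# Sketch of the Laplace mechanism, the classic differential-privacy
# primitive. A counting query has sensitivity 1 (one person changes
# the count by at most 1), so noise with scale 1/epsilon gives
# epsilon-differential privacy for that single release.
import math
import random

def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-DP (sensitivity 1)."""
    return true_count + laplace_sample(1.0 / epsilon, rng)

rng = random.Random(0)
noisy = private_count(1000, epsilon=1.0, rng=rng)
# Smaller epsilon means stronger privacy and a noisier answer.
```

The trade-off is visible in the last line: dial ε down and the answer gets less useful, dial it up and the privacy guarantee weakens. Scaling this idea from single counts to trillion-token LLM training runs is exactly the open problem the paragraph above describes.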
“Privacy is not about having something to hide. It is about having the right to decide what to reveal.”
The big idea: most of the web is already training data. You have limited but real controls going forward, and almost none retroactively. Privacy is now a habit more than a setting.
