Your posts, chats, photos, and behavior have been scraped, sold, and fed to models. Here is what has actually happened and what you can actually do.
If you have posted anything public on the internet since 2010, there is a real chance it is in at least one training dataset. Common Crawl, the biggest open web scrape, has indexed tens of billions of pages. Every major model has trained on some version of it.
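You can check that first claim directly. Each Common Crawl snapshot exposes a public CDX index you can query over HTTP. A minimal sketch, assuming the `requests` library; the crawl name below is one example snapshot (the current list lives at https://index.commoncrawl.org/collinfo.json), and the URL pattern is illustrative:

```python
import json
import requests

# Ask one Common Crawl snapshot's CDX index whether it captured
# pages matching a URL pattern. CC-MAIN-2024-33 is an example
# crawl name; swap in a current one from collinfo.json.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

def captures(url_pattern: str) -> list[dict]:
    resp = requests.get(
        INDEX, params={"url": url_pattern, "output": "json"}, timeout=30
    )
    if resp.status_code == 404:
        return []  # this snapshot holds nothing matching the pattern
    resp.raise_for_status()
    # The index answers with newline-delimited JSON records.
    return [json.loads(line) for line in resp.text.splitlines()]

for hit in captures("example.com/blog/*"):
    print(hit["timestamp"], hit["url"])
```

A hit tells you a page was scraped into Common Crawl. Whether it then flowed into any particular model's training run is invisible from the outside, which is part of the problem.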
That is not paranoia. It is how the modern AI industry was built. And until very recently, almost nobody was asked.
Reddit signed a $60M-per-year deal with Google in 2024 for training data. X licenses its firehose. Stack Overflow cut a deal with OpenAI. Many of the users who wrote that content had no idea they were training an AI. The terms of service usually permit it, but the permission lives in fine print that nobody reads.
Models sometimes memorize exact passages from their training data. Researchers have extracted verbatim news articles, phone numbers, and other personal details by carefully prompting production models. Labs have gotten better at preventing this, but it still happens with rare or heavily repeated content.
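The basic test behind those findings is a memorization probe: give the model the first half of a passage and measure how much of the second half it completes verbatim. A minimal sketch, assuming a local Hugging Face causal LM (the model name and passage are placeholders, and published extraction attacks are far more sophisticated than this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; real studies probe far larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

passage = "PASTE A PASSAGE YOU SUSPECT WAS IN THE TRAINING DATA"
ids = tok(passage, return_tensors="pt").input_ids[0]
prefix, target = ids[: len(ids) // 2], ids[len(ids) // 2 :]

# Greedy decoding: memorized text tends to surface verbatim
# when the model is not sampling.
out = model.generate(
    prefix.unsqueeze(0), max_new_tokens=len(target), do_sample=False
)
continuation = out[0, len(prefix):]

n = min(len(continuation), len(target))
overlap = (continuation[:n] == target[:n]).float().mean().item()
print(f"verbatim token overlap: {overlap:.0%}")
```

High overlap on text the model should not "know" is the signal researchers look for. Near-zero overlap proves little, since a model can memorize content without reproducing it on a single greedy decoding path.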
| Law | On training data |
|---|---|
| EU GDPR (2018) | Personal data needs lawful basis — being training fuel is contested |
| EU AI Act (2025-2027) | GPAI providers must publish training data summaries |
| California CCPA/CPRA | Right to know and delete — AI training is being litigated |
| Illinois BIPA | Biometric data needs explicit consent — Meta paid a $650M settlement in 2021 |
| Texas biometric law (CUBI) | Similar consent rule — Texas won a $1.4B settlement from Meta in 2024 |
| Colorado AI Act (2026) | Risk assessments required for high-risk AI |
A technique called differential privacy adds mathematical noise to data so that no single person's contribution can be recovered from the model. Apple uses it for keyboard training. Google uses it for some Gemini features. For frontier LLMs it is still mostly a research direction. It is the cleanest theoretical answer — if it can be made to scale.
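The core idea is easiest to see on a toy query rather than a neural network. Here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy (illustrative only; it is not how Apple or Google deploy it):

```python
import numpy as np

def dp_count(values, threshold, epsilon=1.0):
    """Differentially private count of values above a threshold.

    One person joining or leaving the dataset changes the true
    count by at most 1 (sensitivity = 1), so Laplace noise with
    scale 1/epsilon is enough to mask any individual's presence.
    """
    true_count = sum(v > threshold for v in values)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

salaries = [41_000, 52_000, 67_000, 88_000, 120_000]
print(dp_count(salaries, threshold=60_000, epsilon=0.5))
```

Smaller epsilon means more noise and a stronger guarantee. Carrying the same guarantee into LLM training (via DP-SGD, which clips and noises gradients during training) is the part that has not yet been made to scale.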
> Privacy is not about having something to hide. It is about having the right to decide what to reveal.
>
> — Daniel Solove
The big idea: most of the web is already training data. You have limited but real controls going forward, and almost none retroactively. Privacy is now a habit more than a setting.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-ethics-privacy-training-fuel-builders