Your posts, chats, photos, and behavior have been scraped, sold, and fed to models. Here is what has actually happened and what you can actually do.
If you have posted anything public on the internet since 2010, there is a real chance it is in at least one training dataset. Common Crawl, the biggest open web scrape, has indexed tens of billions of pages. Every major model has trained on some version of it.
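You can check that first claim directly. Each Common Crawl snapshot exposes a public CDX index you can query over HTTP. A minimal sketch, assuming the `requests` library; the crawl name below is one example snapshot (the current list lives at https://index.commoncrawl.org/collinfo.json), and the URL pattern is illustrative:

```python
import json
import requests

# Ask one Common Crawl snapshot's CDX index whether it captured
# pages matching a URL pattern. CC-MAIN-2024-33 is an example
# crawl name; swap in a current one from collinfo.json.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

def captures(url_pattern: str) -> list[dict]:
    resp = requests.get(
        INDEX, params={"url": url_pattern, "output": "json"}, timeout=30
    )
    if resp.status_code == 404:
        return []  # this snapshot holds nothing matching the pattern
    resp.raise_for_status()
    # The index answers with newline-delimited JSON records.
    return [json.loads(line) for line in resp.text.splitlines()]

for hit in captures("example.com/blog/*"):
    print(hit["timestamp"], hit["url"])
```

A hit tells you a page was scraped into Common Crawl. Whether it then flowed into any particular model's training run is invisible from the outside, which is part of the problem.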
That is not paranoia. It is how the modern AI industry was built. And until very recently, almost nobody was asked.
Reddit signed a $60M-per-year deal with Google in 2024 for training data. X licenses its firehose. Stack Overflow cut a deal with OpenAI. Many of the users who wrote that content had no idea they were training an AI. The terms of service usually permit it, but the permission lives in fine print that nobody reads.
Models sometimes memorize exact passages from their training data. Researchers have extracted verbatim news articles, phone numbers, and other personal details by carefully prompting production models. Labs have gotten better at preventing this, but it still happens with rare or heavily repeated content.
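The basic test behind those findings is a memorization probe: give the model the first half of a passage and measure how much of the second half it completes verbatim. A minimal sketch, assuming a local Hugging Face causal LM (the model name and passage are placeholders, and published extraction attacks are far more sophisticated than this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; real studies probe far larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

passage = "PASTE A PASSAGE YOU SUSPECT WAS IN THE TRAINING DATA"
ids = tok(passage, return_tensors="pt").input_ids[0]
prefix, target = ids[: len(ids) // 2], ids[len(ids) // 2 :]

# Greedy decoding: memorized text tends to surface verbatim
# when the model is not sampling.
out = model.generate(
    prefix.unsqueeze(0), max_new_tokens=len(target), do_sample=False
)
continuation = out[0, len(prefix):]

n = min(len(continuation), len(target))
overlap = (continuation[:n] == target[:n]).float().mean().item()
print(f"verbatim token overlap: {overlap:.0%}")
```

High overlap on text the model should not "know" is the signal researchers look for. Near-zero overlap proves little, since a model can memorize content without reproducing it on a single greedy decoding path.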
| Law | On training data |
|---|---|
| EU GDPR (2018) | Personal data needs lawful basis — being training fuel is contested |
| EU AI Act (2025-2027) | GPAI providers must publish training data summaries |
| California CCPA/CPRA | Right to know and delete — AI training is being litigated |
| Illinois BIPA | Biometric data needs explicit consent — Meta paid a $650M settlement in 2021 |
| Texas biometric law (CUBI) | Similar consent rule — Texas won a $1.4B settlement from Meta in 2024 |
| Colorado AI Act (2026) | Risk assessments required for high-risk AI |
A technique called differential privacy adds mathematical noise to data so that no single person's contribution can be recovered from the model. Apple uses it for keyboard training. Google uses it for some Gemini features. For frontier LLMs it is still mostly a research direction. It is the cleanest theoretical answer — if it can be made to scale.
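The core idea is easiest to see on a toy query rather than a neural network. Here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy (illustrative only; it is not how Apple or Google deploy it):

```python
import numpy as np

def dp_count(values, threshold, epsilon=1.0):
    """Differentially private count of values above a threshold.

    One person joining or leaving the dataset changes the true
    count by at most 1 (sensitivity = 1), so Laplace noise with
    scale 1/epsilon is enough to mask any individual's presence.
    """
    true_count = sum(v > threshold for v in values)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

salaries = [41_000, 52_000, 67_000, 88_000, 120_000]
print(dp_count(salaries, threshold=60_000, epsilon=0.5))
```

Smaller epsilon means more noise and a stronger guarantee. Carrying the same guarantee into LLM training (via DP-SGD, which clips and noises gradients during training) is the part that has not yet been made to scale.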
> Privacy is not about having something to hide. It is about having the right to decide what to reveal.
>
> — Daniel Solove
The big idea: most of the web is already training data. You have limited but real controls going forward, and almost none retroactively. Privacy is now a habit more than a setting.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-ethics-privacy-training-fuel-builders