Lesson 24 of 1570
Your Data Is Somebody's Training Fuel
Your posts, chats, photos, and behavior have been scraped, sold, and fed to models. Here is what has actually happened and what you can actually do.
Section 1
You Have Already Contributed
If you have posted anything public on the internet since 2010, there is a real chance it is in at least one training dataset. Common Crawl, the biggest open web scrape, has indexed tens of billions of pages. Every major model has trained on some version of it.
That is not paranoia. It is how the modern AI industry was built. And until very recently, almost nobody was asked.
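You can check this for yourself: Common Crawl exposes a public CDX index you can query for any domain you control. A minimal sketch in Python — the index endpoint is real, but the crawl ID below is just an example (each monthly crawl has its own):

```python
# Hedged sketch: build a query against Common Crawl's public CDX index
# to see whether pages from a site appear in a given crawl.
# "CC-MAIN-2024-10" is an example crawl ID; check commoncrawl.org for
# the current list of crawls.
from urllib.parse import urlencode

def cc_index_query(url_pattern, crawl_id="CC-MAIN-2024-10"):
    """Build a CDX index query URL for one Common Crawl crawl."""
    base = f"https://index.commoncrawl.org/{crawl_id}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

query = cc_index_query("example.com/*")
# Fetching this URL returns one JSON record per captured page,
# or a 404 if nothing matching the pattern was crawled.
```

Run the resulting URL through `curl` or a browser; each line of output is one archived capture of your site.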
Three kinds of data AI labs want
- Public web pages: scraped via Common Crawl or directly
- Platform data: what you post on Reddit, X, YouTube — now often sold to AI labs
- Conversational data: what you type into chat products, sometimes used to improve them
The platform deals you did not sign
Reddit signed a deal with Google in 2024, reportedly worth about $60M per year, for training data. X licenses its firehose. Stack Overflow cut deals with OpenAI. Many of the users who wrote that content had no idea they were training an AI. The terms of service usually allow it: the fine print grants the platform broad licenses over what you post, and almost nobody reads the fine print.
Memorization: when training data leaks back
Models sometimes memorize exact passages from training. Researchers have extracted verbatim news articles, phone numbers, and in rare cases, personal data by carefully prompting production models. Labs have gotten better at preventing this, but it still happens for rare or repeated content.
Compare: what the big privacy laws say about AI training
Compare the options
| Law | On training data |
|---|---|
| EU GDPR (2018) | Personal data needs lawful basis — being training fuel is contested |
| EU AI Act (2025-2027) | GPAI providers must publish training data summaries |
| California CCPA/CPRA | Right to know and delete — AI training is being litigated |
| Illinois BIPA | Biometric data needs explicit consent — Facebook paid a $650M settlement in 2021 |
| Colorado AI Act (2026) | Risk assessments required for high-risk AI |
What you can actually do
1. Opt out where it exists. Every major chatbot has a setting. Use it.
2. Use the robots.txt and ai.txt mechanisms on your own sites.
3. Choose services that pay for training data instead of scraping.
4. In the EU, use your GDPR rights to request deletion (patchy but real).
5. Be cautious about what you post under your real identity.
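For item 2, opting out of the major AI crawlers takes a few lines of robots.txt. The user-agent tokens below are the ones their operators publicly document; note that compliance is voluntary, not enforced:

```
# robots.txt — ask AI crawlers to skip this site (honored voluntarily)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

GPTBot is OpenAI's training crawler, CCBot is Common Crawl, Google-Extended opts out of Gemini training (not Search indexing), and ClaudeBot is Anthropic's. The proposed ai.txt standard works similarly for generative-AI use specifically; neither mechanism is legally binding.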
Differential privacy and the future
A technique called differential privacy adds mathematical noise to data so that no single person's contribution can be recovered from the model. Apple uses it for keyboard training. Google uses it for some Gemini features. For frontier LLMs it is still mostly a research direction. It is the cleanest theoretical answer — if it can be made to scale.
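The core of differential privacy can be sketched in a few lines. The classic building block is the Laplace mechanism: add noise scaled to 1/ε, so that any single person's presence or absence barely shifts the released answer. A minimal illustration (names and numbers here are illustrative, not any production system's implementation):

```python
# Sketch of the Laplace mechanism, the classic differential-privacy
# primitive. A counting query has sensitivity 1 (one person changes
# the count by at most 1), so noise with scale 1/epsilon gives
# epsilon-differential privacy for that single release.
import math
import random

def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-DP (sensitivity 1)."""
    return true_count + laplace_sample(1.0 / epsilon, rng)

rng = random.Random(0)
noisy = private_count(1000, epsilon=1.0, rng=rng)
# Smaller epsilon means stronger privacy and a noisier answer.
```

The trade-off is visible in the last line: dial ε down and the answer gets less useful, dial it up and the privacy guarantee weakens. Scaling this idea from single counts to trillion-token LLM training runs is exactly the open problem the paragraph above describes.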
“Privacy is not about having something to hide. It is about having the right to decide what to reveal.”
The big idea: most of the web is already training data. You have limited but real controls going forward, and almost none retroactively. Privacy is now a habit more than a setting.
