LAION and the Image Training Story
Stable Diffusion, and reportedly early Midjourney models, trace back to LAION, an open dataset of more than 5 billion image-text pairs. It changed AI and started a legal storm.
The Dataset That Built Modern Image AI
In 2021, a small German nonprofit called LAION released LAION-400M, a dataset of 400 million image-text pairs scraped from Common Crawl. A year later, LAION-5B arrived with over 5 billion pairs. Stable Diffusion was trained on subsets of LAION-5B, which makes the release a foundational moment in AI history.
How it was built
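1. Common Crawl provided billions of archived web pages
2. A parser pulled out <img> tags that had alt text, keeping each image URL and its caption
3. OpenAI's CLIP model scored how well each image matched its alt text, using the cosine similarity between the image and text embeddings
4. Pairs scoring below a similarity threshold were thrown out
5. What remained: 5.85 billion image-text pairs, distributed as image URLs and captions rather than the images themselves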
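The extract-then-filter idea can be sketched in a few lines of Python. This is a minimal illustration, not LAION's actual pipeline: the class `ImgAltExtractor`, the function `filter_pairs`, the three-number toy embeddings, and the file names are all hypothetical, and the 0.28 cutoff is used here only as an assumed example value. A real run would get the embeddings from CLIP, which this sketch does not call.

```python
import math
from html.parser import HTMLParser


class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        if src and alt:  # skip images with no usable caption
            self.pairs.append((src, alt))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(candidates, threshold=0.28):
    """Keep (url, alt) pairs whose image and text embeddings are similar enough.

    candidates: list of (url, alt, image_embedding, text_embedding).
    threshold: assumed cutoff for this sketch, not an authoritative value.
    """
    return [
        (url, alt)
        for url, alt, img_vec, txt_vec in candidates
        if cosine_similarity(img_vec, txt_vec) >= threshold
    ]


# Steps 1-2: pull image URLs and alt text out of a (made-up) page.
page = (
    '<p><img src="cat.jpg" alt="a tabby cat on a sofa">'
    '<img src="spacer.gif" alt="">'
    '<img src="dog.jpg" alt="buy cheap watches"></p>'
)
extractor = ImgAltExtractor()
extractor.feed(page)

# Step 3: stand-in embeddings (image vector, text vector); CLIP would produce these.
toy_embeddings = {
    "cat.jpg": ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),  # caption matches image
    "dog.jpg": ([0.0, 0.1, 0.9], [0.9, 0.1, 0.0]),  # caption is unrelated spam
}
candidates = [(src, alt, *toy_embeddings[src]) for src, alt in extractor.pairs]

# Step 4: drop low-scoring pairs.
kept = filter_pairs(candidates)
print(kept)
```

In this toy run the spacer image is dropped for having no alt text, and the spam-captioned image is dropped by the similarity filter, so only the cat pair survives. Note that the output is a list of URLs and captions, not image files.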
What LAION made possible
- Stable Diffusion (Stability AI, 2022) — open-source image generation
- Midjourney's early models, which reportedly drew on LAION data
- A wave of community fine-tunes and specialized models
- Open research into how image models learn concepts
The problems LAION surfaced
- Copyrighted images from Getty, artists, and photographers, included without permission
- Medical images from private health forums
- Personal photos scraped from social media
- Watermarked stock images, whose watermarks sometimes reappear in distorted form in generated outputs
The lawsuits
Getty Images sued Stability AI in 2023, pointing to outputs in which Stable Diffusion reproduced a garbled Getty watermark, which strongly suggests the model learned from Getty photos. A group of artists also filed a class action against Stability AI and other image-generator companies. These cases are still winding through the courts as of 2026.
The big idea: LAION democratized image AI and exposed the messiness of scraped data. Many of today's debates over AI and creators' rights, from artist consent to watermarks, trace back to this one dataset.
