Lesson 49 of 1570
How Diffusion Models Actually Work
An AI that paints starts with pure noise and removes it, one step at a time, until a picture appears. Here's the surprisingly beautiful math behind it.
Lesson map
The main moves in order:
1. Noise in, picture out
2. Diffusion
3. Denoising
4. Latent space
Section 1
Noise in, picture out
Almost every modern image AI — DALL-E 3, Midjourney, Stable Diffusion, Flux, Imagen — is a diffusion model. The core idea is strange and brilliant: instead of 'drawing,' the AI subtracts. It starts with a canvas of pure random noise (like TV static) and removes that noise step by step until a picture emerges. Your prompt steers which picture appears.
How a diffusion model gets trained (the forward process)
1. Take a real picture from the training data — say, a photo of a dog.
2. Add a little random noise. The dog is still obvious.
3. Add more noise. Now it's blurry.
4. Keep adding noise over many steps, until the picture is pure static — indistinguishable from random noise.
5. Train a neural network to reverse ONE step: given a noisy image, predict the noise that was added, so it can be subtracted back out.
Do that with billions of images and their captions. The network learns, deeply, what noise looks like AT EVERY LEVEL and how to peel it back toward a coherent image — guided by the caption.
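Although training conceptually adds noise step by step, the math collapses the whole forward process into one closed-form jump to any noise level. Here's a minimal PyTorch sketch of that jump and the training objective; the schedule values and the `model(x_t, t)` signature are illustrative assumptions, not any specific system's real configuration (production models are large U-Nets or transformers that also see the caption).

```python
import torch
import torch.nn.functional as F

# Toy linear noise schedule over T steps (real systems tune this carefully).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise added at each step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # signal surviving by step t

def add_noise(x0, t):
    """Forward process: jump straight to noise level t in one shot.
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    """
    eps = torch.randn_like(x0)                      # the static we mix in
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

def training_step(model, x0):
    """Pick a random noise level, noise the image, and train the network
    to predict exactly the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))         # one random level per image
    x_t, eps = add_noise(x0, t)
    eps_pred = model(x_t, t)                        # assumed signature: noisy image + step
    return F.mse_loss(eps_pred, eps)                # "how close was your guess?"
```

Sampling one random noise level per image is what keeps training cheap: the network never runs the full thousand-step chain, yet across the dataset it sees every noise level.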
Generating a new image (the reverse process)
1. Start with pure noise and your text prompt.
2. The model predicts: 'if this is a partially-noisy picture of <your prompt>, what noise should I remove?'
3. Subtract that predicted noise. The picture is now slightly less noisy.
4. Repeat, typically 20–50 times, each step getting the image closer to a real picture matching the prompt.
5. After the final step, the noise is gone and a finished picture remains.
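The loop below sketches the classic DDPM version of that reverse process in PyTorch. The `model(x, t, prompt_embedding)` signature is a hypothetical stand-in for a caption-conditioned denoiser, and the schedule mirrors the training sketch above.

```python
import torch

@torch.no_grad()
def sample(model, shape, prompt_embedding, T=1000):
    """Simplified DDPM sampling: start from static, peel off predicted noise."""
    betas = torch.linspace(1e-4, 0.02, T)              # same schedule as training
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                             # step 1: pure noise
    for t in reversed(range(T)):
        # "If this is a noisy picture of <prompt>, what noise is in it?"
        eps = model(x, torch.tensor([t]), prompt_embedding)
        a, a_bar = alphas[t], alphas_cumprod[t]
        # Subtract the predicted noise (the DDPM posterior mean).
        x = (x - (1 - a) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a)
        if t > 0:                                      # keep a little randomness
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                           # finished image (or latent)
```

The 20–50 steps you see in practice come from smarter samplers, not different training: DDIM-style updates reformulate this loop so it can stride through the schedule instead of visiting every one of the 1000 steps.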
Latent diffusion (the trick that made it fast)
Doing diffusion directly on full-resolution pixel grids is slow. Stable Diffusion's 2022 breakthrough was to work in latent space — a compressed representation learned by a separate autoencoder: a 512x512 image becomes roughly 64x64 with a few channels. Diffusion happens on that latent, which holds roughly 50x fewer values, then the decoder turns the final latent into a full image. Flux and Stable Diffusion 3.5 use the same approach.
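To see what the latent trick buys, compare tensor sizes. The sketch below uses toy convolution layers as stand-ins for the trained autoencoder (Stable Diffusion's real one is a VAE); only the shapes are the point.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real autoencoder: 8x downsampling per spatial
# dimension, 4 latent channels. Not real weights, just the right shapes.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

image = torch.randn(1, 3, 512, 512)  # 512x512 RGB: 786,432 values
latent = encoder(image)              # 16,384 values, about 48x fewer
print(latent.shape)                  # torch.Size([1, 4, 64, 64])

# Diffusion (training and the sampling loop) runs on `latent`, not `image`;
# the decoder turns the final denoised latent back into pixels.
restored = decoder(latent)
print(restored.shape)                # torch.Size([1, 3, 512, 512])
```

786,432 pixel values against 16,384 latent values is a 48x reduction, which is where the 'roughly 50x' figure comes from.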
Diffusion vs. autoregressive image models
| Diffusion (SD, Flux, Midjourney) | Autoregressive (GPT-4o image generation, others experimental) |
|---|---|
| Generates the whole image at once, then refines it over steps. | Generates one pixel or patch at a time, like text tokens. |
| Fast, parallel, high quality. | Slower, but a natural fit with LLMs. |
| Dominant approach in 2026. | Growing as multimodal LLMs improve. |
| ControlNet, LoRA, and IP-Adapter work here. | Different adapter ecosystem. |
Related lessons
Builders · 26 min
Making Music with Suno and Udio
Type a prompt, get a full song — vocals, drums, mix, even in Portuguese. Here's how Suno v5, Udio, and ElevenMusic work — and what they can't yet do.
Builders · 26 min
DALL-E vs. Midjourney vs. Flux
Five image models, five personalities. Here's when each one is the right pick — in 2026, with current strengths, costs, and quirks.
Builders · 30 min
The Craft of Image Prompting
Great image prompters aren't typing harder — they're using a mental framework. Subject, setting, style, composition, lighting, mood. Here's the system.
