Lesson 49 of 1570
How Diffusion Models Actually Work
An AI that paints starts with pure noise and removes it, one step at a time, until a picture appears. Here's the surprisingly beautiful math behind it.
Lesson map
The main moves in order:
1. Noise in, picture out
2. Diffusion
3. Denoising
4. Latent space
Section 1
Noise in, picture out
Almost every modern image AI — DALL-E 3, Midjourney, Stable Diffusion, Flux, Imagen — is a diffusion model. The core idea is strange and brilliant: instead of 'drawing,' the AI subtracts. It starts with a canvas of pure random noise (like TV static) and removes that noise step by step until a picture emerges. Your prompt steers which picture appears.
How a diffusion model gets trained (the forward process)
1. Take a real picture from the training data — say, a photo of a dog.
2. Add a little random noise. The dog is still obvious.
3. Add more noise. Now it's blurry.
4. Keep adding noise over many steps, until the picture is pure static — indistinguishable from random noise.
5. Train a neural network to reverse ONE step: given a noisy image, predict the noise that was added, so it can be subtracted back out.
Do that with billions of images and their captions. The network learns, deeply, what noise looks like AT EVERY LEVEL and how to peel it back toward a coherent image — guided by the caption.
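Although training conceptually adds noise step by step, the math collapses the whole forward process into one closed-form jump to any noise level. Here's a minimal PyTorch sketch of that jump and the training objective; the schedule values and the `model(x_t, t)` signature are illustrative assumptions, not any specific system's real configuration (production models are large U-Nets or transformers that also see the caption).

```python
import torch
import torch.nn.functional as F

# Toy linear noise schedule over T steps (real systems tune this carefully).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise added at each step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # signal surviving by step t

def add_noise(x0, t):
    """Forward process: jump straight to noise level t in one shot.
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    """
    eps = torch.randn_like(x0)                      # the static we mix in
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

def training_step(model, x0):
    """Pick a random noise level, noise the image, and train the network
    to predict exactly the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))         # one random level per image
    x_t, eps = add_noise(x0, t)
    eps_pred = model(x_t, t)                        # assumed signature: noisy image + step
    return F.mse_loss(eps_pred, eps)                # "how close was your guess?"
```

Sampling one random noise level per image is what keeps training cheap: the network never runs the full thousand-step chain, yet across the dataset it sees every noise level.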
Generating a new image (the reverse process)
1. Start with pure noise and your text prompt.
2. The model predicts: 'if this is a partially-noisy picture of <your prompt>, what noise should I remove?'
3. Subtract that predicted noise. The picture is now slightly less noisy.
4. Repeat, typically 20–50 times, each step getting the image closer to a real picture matching the prompt.
5. After the final step, the noise is gone and a finished picture remains.
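The loop below sketches the classic DDPM version of that reverse process in PyTorch. The `model(x, t, prompt_embedding)` signature is a hypothetical stand-in for a caption-conditioned denoiser, and the schedule mirrors the training sketch above.

```python
import torch

@torch.no_grad()
def sample(model, shape, prompt_embedding, T=1000):
    """Simplified DDPM sampling: start from static, peel off predicted noise."""
    betas = torch.linspace(1e-4, 0.02, T)              # same schedule as training
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                             # step 1: pure noise
    for t in reversed(range(T)):
        # "If this is a noisy picture of <prompt>, what noise is in it?"
        eps = model(x, torch.tensor([t]), prompt_embedding)
        a, a_bar = alphas[t], alphas_cumprod[t]
        # Subtract the predicted noise (the DDPM posterior mean).
        x = (x - (1 - a) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a)
        if t > 0:                                      # keep a little randomness
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                           # finished image (or latent)
```

The 20–50 steps you see in practice come from smarter samplers, not different training: DDIM-style updates reformulate this loop so it can stride through the schedule instead of visiting every one of the 1000 steps.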
Latent diffusion (the trick that made it fast)
Doing diffusion directly on full-resolution pixel grids is slow. Stable Diffusion's 2022 breakthrough was to work in latent space — a compressed representation learned by a separate autoencoder: a 512x512 image becomes roughly 64x64 with a few channels. Diffusion happens on that latent, which holds roughly 50x fewer values, then the decoder turns the final latent into a full image. Flux and Stable Diffusion 3.5 use the same approach.
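To see what the latent trick buys, compare tensor sizes. The sketch below uses toy convolution layers as stand-ins for the trained autoencoder (Stable Diffusion's real one is a VAE); only the shapes are the point.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real autoencoder: 8x downsampling per spatial
# dimension, 4 latent channels. Not real weights, just the right shapes.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

image = torch.randn(1, 3, 512, 512)  # 512x512 RGB: 786,432 values
latent = encoder(image)              # 16,384 values, about 48x fewer
print(latent.shape)                  # torch.Size([1, 4, 64, 64])

# Diffusion (training and the sampling loop) runs on `latent`, not `image`;
# the decoder turns the final denoised latent back into pixels.
restored = decoder(latent)
print(restored.shape)                # torch.Size([1, 3, 512, 512])
```

786,432 pixel values against 16,384 latent values is a 48x reduction, which is where the 'roughly 50x' figure comes from.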
Diffusion vs. autoregressive image models
| Diffusion (SD, Flux, Midjourney) | Autoregressive (GPT-4o image generation, others experimental) |
|---|---|
| Generates the whole image at once, then refines it over steps. | Generates one pixel or patch at a time, like text tokens. |
| Fast, parallel, high quality. | Slower, but a natural fit with LLMs. |
| Dominant approach in 2026. | Growing as multimodal LLMs improve. |
| ControlNet, LoRA, and IP-Adapter work here. | Different adapter ecosystem. |
Related lessons
Builders · 26 min
Making Music with Suno and Udio
Type a prompt, get a full song — vocals, drums, mix, even in Portuguese. Here's how Suno v5, Udio, and ElevenMusic work — and what they can't yet do.
Builders · 26 min
DALL-E vs. Midjourney vs. Flux
Five image models, five personalities. Here's when each one is the right pick — in 2026, with current strengths, costs, and quirks.
Builders · 30 min
The Craft of Image Prompting
Great image prompters aren't typing harder — they're using a mental framework. Subject, setting, style, composition, lighting, mood. Here's the system.
