ControlNet, IP-Adapter, LoRA — Fine-Grained Control
Base diffusion models give you creative possibilities. Adapters give you creative PRECISION. Master the three that matter most.
Lesson map
What this lesson covers, in order:
1. The control stack
2. ControlNet
3. IP-Adapter
4. LoRA
Section 1
The control stack
A bare diffusion model reads a text prompt and generates something plausible. Production creative work needs more: a specific pose, a specific character, a specific style. Three adapter families — ControlNet, IP-Adapter, and LoRA — cover 95% of professional use cases. They compose cleanly.
ControlNet — structural conditioning
ControlNet (Zhang et al., 2023) adds structural guidance to a diffusion model via an auxiliary network. You pass a conditioning image (edge map, depth map, pose skeleton, normal map, segmentation) and the model respects that structure while the text prompt fills in appearance. It's the foundation of 'put THIS character in THAT pose' and 'keep the composition, change the style.'
- Canny edges — preserve line structure of a sketch.
- Depth — preserve 3D layout of a reference photo.
- OpenPose — specify exact human body pose.
- Scribble — rough doodle controls composition.
- Segmentation — color-coded regions define what goes where.
- Tile — upscale with seamless detail.
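To make the first variant concrete, here is a minimal Canny sketch on SD 1.5 in Diffusers. The hub IDs are the standard public Canny ControlNet and SD 1.5 checkpoints; the reference image path is a placeholder.
# Minimal Canny ControlNet sketch (standard hub IDs; reference path is a placeholder)
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# 1. Turn the reference into an edge map the ControlNet understands
ref = np.array(load_image("./refs/sketch.png"))  # placeholder path
edges = cv2.Canny(ref, 100, 200)                 # low/high thresholds
edges = np.stack([edges] * 3, axis=-1)           # 1-channel -> 3-channel
control_image = Image.fromarray(edges)

# 2. Wire the ControlNet into the pipeline at construction time
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# 3. Edges fix the structure; the prompt fills in appearance
image = pipe(
    "watercolor illustration, soft morning light",
    image=control_image,
    num_inference_steps=30,
).images[0]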
IP-Adapter — image prompting
IP-Adapter (Ye et al., 2023) lets you prompt with an IMAGE, not just text. Feed it a reference image; the diffusion model's output borrows the reference's subject, style, or composition (depending on the variant). Crucial for character consistency across a comic, style matching across a brand system, and face-preserving portraits.
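A minimal IP-Adapter sketch on SD 1.5 using the Diffusers loader (hub IDs follow the Diffusers docs; the reference path is a placeholder). The scale is the main dial: low values borrow loose style, high values copy the subject closely.
# IP-Adapter sketch on SD 1.5 (hub IDs per Diffusers docs; paths are placeholders)
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the adapter, then set how strongly the reference steers generation
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # ~0.3 loose style, ~0.8 close subject copy

reference = load_image("./refs/character.png")  # placeholder reference image
image = pipe(
    prompt="the character reading in a cozy library",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]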
LoRA — lightweight fine-tuning
LoRA (Low-Rank Adaptation, Hu et al., 2021) adds a small set of trainable matrices to the diffusion model's attention layers. You can fine-tune a few MB of weights on as few as 10-30 images to teach the model a new character, object, artist style, or concept. Swap LoRAs at inference time — 'same base Flux, three different brand styles' is a one-line change.
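The update itself is simple enough to write out. A toy sketch with illustrative dimensions: the frozen weight W gets a low-rank delta B @ A, and only A and B train.
# Toy illustration of the LoRA update on one attention weight (illustrative sizes, not library code)
import torch

d, r = 1024, 16               # model dim, LoRA rank (r << d)
W = torch.randn(d, d)         # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # trainable down-projection
B = torch.zeros(d, r)         # trainable up-projection (zero init: adapter starts as a no-op)
alpha = 16.0                  # scaling hyperparameter

W_adapted = W + (alpha / r) * (B @ A)  # what the adapted layer effectively computes

# W has d*d ~ 1.0M params per layer; the delta adds only 2*d*r ~ 33K trainable params.
# That is why a whole adapter ships as a few MB of safetensors, and why swapping
# LoRAs at inference is just swapping deltas on top of a frozen W.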
Compare the options
| Tool | What it controls | When to use |
|---|---|---|
| ControlNet | Structure (pose, depth, edges). | You have a reference composition and want to re-style it. |
| IP-Adapter | Style or subject from a reference image. | You want the 'vibe' of a reference or a consistent character. |
| LoRA | A learned concept (character, style, object). | You have 10+ reference images of a specific thing and want to generate more. |
| Textual Inversion | A learned concept as a single prompt token. | Similar to LoRA but lower capacity; less common in 2026. |
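By contrast, a Textual Inversion concept is a single learned token embedding, and loading one in Diffusers is a one-liner. The concept repo below is a public example from the Diffusers docs.
# Textual Inversion: the learned concept is one token embedding
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The trigger token <cat-toy> now refers to the learned concept
image = pipe("a <cat-toy> sitting on a beach towel").images[0]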
Stacking adapters
The professional pipeline typically stacks: base model (Flux Dev) + character LoRA + style LoRA + ControlNet pose + IP-Adapter for facial consistency. Each layer adds constraint. The art is knowing when you're over-constraining (outputs look muddy, burnt) vs. under-constraining (outputs drift).
Production-style adapter stacking on Flux Dev in Diffusers (a sketch; exact loader arguments vary across diffusers versions).
# Diffusers-style pseudocode stacking adapters on Flux (APIs evolve; check current docs)
import torch
from diffusers import FluxControlNetPipeline, FluxControlNetModel
from diffusers.utils import load_image

# Pose control must be wired in when the pipeline is built, not after
controlnet = FluxControlNetModel.from_pretrained(
    "XLabs-AI/flux-controlnet-pose", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load a character LoRA (trained on 15 images of our mascot)
pipe.load_lora_weights("./loras/mascot-flux-lora.safetensors", adapter_name="mascot")
# Load a brand-style LoRA, then activate both (style at reduced strength)
pipe.load_lora_weights("./loras/brand-style-lora.safetensors", adapter_name="brand_style")
pipe.set_adapters(["mascot", "brand_style"], adapter_weights=[1.0, 0.7])

# IP-Adapter for facial consistency with the hero shot
pipe.load_ip_adapter("XLabs-AI/flux-ip-adapter")  # some versions also need weight_name / an image encoder
pipe.set_ip_adapter_scale(0.6)

pose_reference = load_image("./refs/pose-skeleton.png")  # OpenPose skeleton (placeholder path)
hero_face = load_image("./refs/hero-face.png")           # reference face (placeholder path)

image = pipe(
    prompt="The mascot standing confidently in a neon-lit lab, cinematic",
    control_image=pose_reference,
    ip_adapter_image=hero_face,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
Training a LoRA yourself
1. Collect 15-30 images of your concept. Varied angles, consistent subject.
2. Caption each image precisely; use a unique trigger token (e.g., 'TDRLMSCT' — a made-up word) for the concept.
3. Train with kohya_ss, the Replicate FLUX trainer, or the Fal trainer. Typical time: ~30 min on an H100, ~$3-10.
4. Validate: generate outputs with and without the LoRA (see the sketch after this list). Check the concept is captured without overfitting.
5. Version. Tag the LoRA with training date, base model, and trigger token.
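Step 4 can be scripted. A sketch of the A/B check, assuming the `pipe` from the stacking example above with the mascot LoRA loaded; prompts, seed, and output paths are illustrative.
# A/B validation sketch (assumes `pipe` from the stacking example; values are illustrative)
from pathlib import Path
import torch

out = Path("validation")
out.mkdir(exist_ok=True)

prompts = [
    "TDRLMSCT in a neon-lit lab, cinematic",
    "TDRLMSCT riding a bicycle, golden hour",
    "TDRLMSCT as a watercolor illustration",
]

for prompt in prompts:
    for tag, use_lora in (("lora", True), ("base", False)):
        if use_lora:
            pipe.enable_lora()
        else:
            pipe.disable_lora()
        # Same seed for both runs so the only difference is the adapter
        gen = torch.Generator("cuda").manual_seed(42)
        img = pipe(prompt, generator=gen, num_inference_steps=28).images[0]
        img.save(out / f"{prompt[:24].replace(' ', '_')}-{tag}.png")

pipe.enable_lora()  # leave the pipeline in its normal state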
