Production Incidents With an AI Co-Pilot
When prod is on fire, AI agents can be either your best partner or a dangerous distraction. Learn the incident workflow that uses AI safely under pressure — and the moments to put it down.
Section 1
The Page at 3 AM
Production is down. Customers are upset. You're tired, alone, and you have an AI that can run shell commands. This is the highest-pressure moment to use AI well — and the easiest moment to make things much worse.
The incident response loop
Six phases. AI helps in different ways in each. Knowing which is the difference between a 30-minute incident and a 3-hour one.
1. ASSESS — What's broken? How bad? Who's affected?
2. CONTAIN — Stop the bleeding (rollback, feature flag, kill switch)
3. DIAGNOSE — Why did this happen?
4. FIX — Apply the targeted fix.
5. VERIFY — Confirm prod is healthy.
6. REVIEW — Postmortem, prevent recurrence.
Where AI helps in each phase
| Phase | AI's role | Caution |
|---|---|---|
| Assess | Search logs/traces in parallel via MCP | Cross-check every finding before acting |
| Contain | Suggest the rollback or flag flip | Human runs the actual command |
| Diagnose | Generate hypotheses, read git diff, summarize change | AI is fast but speculative — verify |
| Fix | Draft the patch and tests | Read every line, test in staging if possible |
| Verify | Sanity-check dashboards, log queries | Final confirmation by human |
| Review | Draft the postmortem timeline | Edit for honesty and accuracy |
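For the Assess row, it helps to know what a useful "error pattern" summary looks like so you can cross-check what the AI reports. The following is a minimal sketch, assuming plain-text log lines that contain the word ERROR; the function names and the log format are illustrative assumptions, not part of any specific logging tool.

```python
import re
from collections import Counter


def normalize(message: str) -> str:
    """Collapse variable parts (ids, numbers) so similar errors group together."""
    # Replace hex-looking ids first, then any remaining digit runs.
    message = re.sub(r"0x[0-9a-fA-F]+|\b[0-9a-f]{8,}\b", "<id>", message)
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def top_error_patterns(lines, limit=5):
    """Return the most frequent normalized error messages with their counts."""
    # Assumption: everything after the first "ERROR" token is the message.
    errors = (line.split("ERROR", 1)[1] for line in lines if "ERROR" in line)
    return Counter(normalize(e) for e in errors).most_common(limit)


if __name__ == "__main__":
    sample = [
        "14:33:01 ERROR timeout after 3000 ms calling db-7",
        "14:33:02 ERROR timeout after 3100 ms calling db-2",
        "14:33:02 INFO request ok",
    ]
    for pattern, count in top_error_patterns(sample):
        print(count, pattern)
```

If the AI's summary names a dominant error, a count like this over the same window is a quick way to confirm it before acting.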
The rollback-first principle
When prod is on fire, your first move is almost always to roll back to the last known-good state. Diagnosing first is a luxury you can't afford while customers are seeing 500s. AI is good at suggesting a likely rollback target (for example, "the last deploy where p95 was healthy was 3:47 PM, sha b9c1d4e"), but you confirm the target and you execute the rollback. AI does not push the big red button.
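The "last deploy where p95 was healthy" rule can be sketched in a few lines, which also makes it easy to double-check an AI suggestion by hand. This is a minimal sketch under stated assumptions: the `Deploy` record, its fields, and the latency threshold are hypothetical, and real deploy history and metrics would come from your own tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    sha: str
    deployed_at: str  # ISO-8601 UTC timestamp; sorts chronologically as text
    p95_ms: float     # p95 latency observed while this deploy was live


def suggest_rollback_target(
    deploys: Iterable[Deploy], healthy_p95_ms: float
) -> Optional[Deploy]:
    """Return the most recent deploy whose p95 was within the healthy threshold.

    This only suggests a target. A human still confirms it and runs the rollback.
    """
    healthy = [d for d in deploys if d.p95_ms <= healthy_p95_ms]
    return max(healthy, key=lambda d: d.deployed_at, default=None)


if __name__ == "__main__":
    history = [
        Deploy("a1f0c22", "2024-05-01T11:00:00Z", 170.0),
        Deploy("b9c1d4e", "2024-05-01T15:47:00Z", 180.0),
        Deploy("e77d901", "2024-05-01T17:10:00Z", 2400.0),  # current, unhealthy
    ]
    target = suggest_rollback_target(history, healthy_p95_ms=300.0)
    print(target.sha if target else "no healthy deploy found")
```

A `None` result is itself a signal: if no recent deploy looks healthy, the problem may not be a deploy at all, and rollback may not contain it.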
Tight scope. Read-only. Human-in-the-loop on every action that changes prod.
# A safe incident prompt under pressure:
"Production is failing. Symptom: <description>. Started: <time>.
1. Search the last hour of logs in service X for any error pattern.
2. List the 3 most recent deploys, with commit SHA and timestamp.
3. Suggest the most likely rollback target.
4. Do NOT run any rollback or write commands. I will execute them.
5. Once I confirm rollback, help me write the customer comms."
When AI is dangerously wrong under pressure
- It will speculate confidently about the root cause from partial data. Verify before acting on it.
- It will suggest fixes that touch unrelated code. Narrow the scope before approving.
- It can misread dashboards and logs (especially long stack traces with similar errors). Re-confirm with your own eyes.
- It doesn't know your team's deploy norms. Tell it the norms up front, and require it to ask before running anything.
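One way to make "read-only, human runs the writes" mechanical rather than a matter of discipline is to route every proposed command through an allowlist. The sketch below is illustrative only: the `route` function and the allowlist contents are assumptions, and it is a convenience check, not a security boundary (some read commands accept flags that write files, so real enforcement belongs in tool permissions).

```python
import shlex

# Assumption: a small allowlist of (program, subcommand) pairs treated as reads.
READ_ONLY = {
    ("git", "log"), ("git", "diff"), ("git", "show"),
    ("kubectl", "get"), ("kubectl", "logs"), ("kubectl", "describe"),
}

# Shell metacharacters that could chain or redirect into a write.
SHELL_OPERATORS = set(";|&><`$\n")


def route(command: str) -> str:
    """Return "agent" if the command looks read-only, otherwise "human".

    Anything unrecognized defaults to "human", so the failure mode is extra
    typing for the on-call engineer rather than an unreviewed change to prod.
    """
    if SHELL_OPERATORS & set(command):
        return "human"
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return "human"
    if len(tokens) < 2:
        return "human"
    return "agent" if (tokens[0], tokens[1]) in READ_ONLY else "human"


if __name__ == "__main__":
    for cmd in ["git log -5", "kubectl rollout undo deployment/api"]:
        print(f"{route(cmd):5}  {cmd}")
```

The design choice is default-deny: a command such as `kubectl -n prod get pods` is routed to the human simply because it does not match the expected shape, which is the safe direction to be wrong in.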
The customer-comms accelerator
AI is excellent at this. Comms during incidents is a high-stress writing task; AI removes the friction.
# After containment, in a fresh chat:
"Draft a customer-facing incident message for status.example.com.
Facts:
- Service X started returning 500s at 14:33 UTC
- We rolled back at 14:51 UTC, recovered at 14:58 UTC
- Affected: ~30% of API requests during the window
- Cause: under investigation, will share in postmortem
Tone: factual, accountable, not over-promising.
Length: 2-3 sentences for status page, 1 longer paragraph for email."
Postmortem assistance
1. Paste your incident timeline (commands run, times) into the prompt
2. Ask AI to fill gaps from logs, deploy history, alert timestamps
3. Have AI draft the "what went wrong" section based on facts you provide
4. Edit it yourself: the postmortem is yours, not the AI's
5. Convert action items into tracked tickets with the agent's help
6. Most importantly, have the AI cross-check your stated root cause against the evidence: "Is this conclusion supported?"
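The first two steps above amount to merging several timestamped sources into one ordered timeline. The following is a minimal sketch of that merge, assuming each source is a list of `(timestamp, text)` pairs in UTC; the function names, the source labels, and the date are hypothetical, while the times mirror the example incident earlier in this lesson.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' as a UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def merge_timeline(*sources):
    """Merge (label, [(timestamp, text), ...]) sources into one ordered list.

    The label stays on every line so a reader can see where each fact came
    from when checking the draft postmortem against the evidence.
    """
    events = [
        (parse_utc(stamp), label, text)
        for label, entries in sources
        for stamp, text in entries
    ]
    events.sort(key=lambda event: event[0])  # stable: ties keep source order
    return [f"{when:%H:%M} UTC [{label}] {text}" for when, label, text in events]


if __name__ == "__main__":
    human = ("human", [("2024-05-01 14:51", "ran rollback")])
    alerts = ("alert", [
        ("2024-05-01 14:33", "500s alert fired"),
        ("2024-05-01 14:58", "alert cleared"),
    ])
    print("\n".join(merge_timeline(human, alerts)))
```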
What you should NOT use AI for during an incident
- Deciding whether to declare an incident (your judgment, not AI's)
- Communicating with on-call leadership (humans only)
- Authorizing emergency access changes (humans, with audit trail)
- Determining customer impact for legal or PR purposes (humans, with logs)
- Deciding when the incident is over (humans, with verification)
Pre-incident prep: the playbook is the work
Pre-built incident skills mean you're not improvising under pressure. Ship these before you need them.
# Save this in your team's runbook (and test it before you need it):
INCIDENT.md
When prod is failing:
1. Open #incidents, post: "I'm investigating <symptom>"
2. Open Claude Code with only read-only MCP tools enabled (no write tools)
3. Run /incident-triage skill (defined below)
4. After containment, post the timeline to the channel
5. Schedule postmortem within 48 hours
.claude/skills/incident-triage/SKILL.md
---
name: incident-triage
description: Read-only triage of production issues
---
1. Pull recent deploys (read git log, deploy log)
2. Search the relevant service logs for last 1 hour
3. List the 5 most likely rollback targets
4. Output a concise triage report. DO NOT execute any change.
“AI is a force multiplier for the prepared and a force multiplier of chaos for the unprepared.”
The big idea: production incidents are the highest-stakes use of AI in coding. Use AI for parallel reads — logs, traces, history — while you make the writes. Pre-build incident skills, keep AI read-only under pressure, and never let stress lower your scrutiny. The best engineers stay calm by following a checklist; AI is just one of the tools the checklist names.