When prod is on fire, AI agents can be either your best partner or a dangerous distraction. Learn the incident workflow that uses AI safely under pressure — and the moments to put it down.
Production is down. Customers are upset. You're tired, alone, and you have an AI that can run shell commands. This is the highest-pressure moment to use AI well — and the easiest moment to make things much worse.
1. ASSESS — What's broken? How bad? Who's affected?
2. CONTAIN — Stop the bleeding (rollback, feature flag, kill switch)
3. DIAGNOSE — Why did this happen?
4. FIX — Apply the targeted fix.
5. VERIFY — Confirm prod is healthy.
6. REVIEW — Postmortem, prevent recurrence.

Six phases. AI helps in different ways in each. Knowing which is the difference between a 30-minute incident and a 3-hour one.

| Phase | AI's role | Caution |
|---|---|---|
| Assess | Search logs/traces in parallel via MCP | Cross-check anything before acting |
| Contain | Suggest the rollback or flag flip | Human runs the actual command |
| Diagnose | Generate hypotheses, read git diff, summarize change | AI is fast but speculative — verify |
| Fix | Draft the patch and tests | Read every line, test in staging if possible |
| Verify | Sanity-check dashboards, log queries | Final confirmation by human |
| Review | Draft the postmortem timeline | Edit for honesty and accuracy |
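
The Caution column reduces to one rule: the agent reads, a human writes. A minimal sketch of such a gate follows, assuming a hypothetical allowlist of read-only `git` and `kubectl` subcommands. The allowlist and the names `is_read_only` and `gate` are illustrative and not part of any tool.

```python
import shlex

# Assumed allowlist of (program, subcommand) pairs treated as read-only.
# Illustrative only; replace with the commands your own stack uses.
READ_ONLY = {
    ("git", "log"), ("git", "diff"), ("git", "show"),
    ("kubectl", "get"), ("kubectl", "logs"), ("kubectl", "describe"),
}

# Characters that could chain or redirect to another command.
SHELL_META = set(";|&><`$\n")


def is_read_only(command: str) -> bool:
    """True only for a single allowlisted command with no shell metacharacters."""
    if SHELL_META & set(command):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    return len(parts) >= 2 and (parts[0], parts[1]) in READ_ONLY


def gate(command: str) -> str:
    """Return 'run' for allowlisted reads and 'needs-human' for everything else."""
    return "run" if is_read_only(command) else "needs-human"


for cmd in ["git log -n 3 --oneline", "kubectl rollout undo deployment/api"]:
    print(f"{gate(cmd):12} {cmd}")
```

The check is deliberately coarse and fails closed: anything it does not recognize goes to a human. It is not a substitute for removing write tools and credentials from the agent in the first place.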
When prod is on fire, your first move is almost always to roll back to the last known-good state. Diagnosing first is a luxury you can't afford while customers are seeing 500s. AI is great at suggesting the right rollback target — "the last deploy where p95 was healthy was 3:47 PM, sha b9c1d4e" — but you execute the rollback. AI does not push the big red button.
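Picking the rollback target is a read-only lookup that is easy to script ahead of time. A minimal sketch, assuming you can export deploy history with a p95 reading per deploy; the `Deploy` record, the 500 ms budget, the dates, and every SHA except `b9c1d4e` are hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    sha: str          # short commit SHA of the deploy
    deployed_at: str  # ISO 8601 UTC timestamp; same-format strings sort chronologically
    p95_ms: float     # p95 latency observed while this deploy was live


def last_known_good(deploys: list[Deploy], p95_budget_ms: float) -> Optional[Deploy]:
    """Return the most recent deploy whose p95 stayed within budget, or None."""
    healthy = [d for d in deploys if d.p95_ms <= p95_budget_ms]
    return max(healthy, key=lambda d: d.deployed_at, default=None)


# Hypothetical history; only the sha b9c1d4e comes from the example in the text.
history = [
    Deploy("a1f3c20", "2024-05-01T13:10:00Z", 180.0),
    Deploy("b9c1d4e", "2024-05-01T15:47:00Z", 210.0),
    Deploy("e7d2f91", "2024-05-01T16:20:00Z", 2400.0),
]

target = last_known_good(history, p95_budget_ms=500.0)
if target is None:
    print("No healthy deploy found; escalate to a human.")
else:
    print(f"Suggested rollback target: {target.sha} ({target.deployed_at})")
```

The script only suggests a target. Running the rollback stays a human action.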
# A safe incident prompt under pressure:
"Production is failing. Symptom: <description>. Started: <time>.
1. Search the last hour of logs in service X for any error pattern.
2. List the 3 most recent deploys, with commit SHA and timestamp.
3. Suggest the most likely rollback target.
4. Do NOT run any rollback or write commands. I will execute them.
5. Once I confirm rollback, help me write the customer comms."
Tight scope. Read-only. Human-in-the-loop on every action that changes prod.
# After containment, in a fresh chat:
"Draft a customer-facing incident message for status.example.com.
Facts:
- Service X started returning 500s at 14:33 UTC
- We rolled back at 14:51 UTC, recovered at 14:58 UTC
- Affected: ~30% of API requests during the window
- Cause: under investigation, will share in postmortem
Tone: factual, accountable, not over-promising.
Length: 2-3 sentences for status page, 1 longer paragraph for email."
AI is excellent at this. Comms during incidents is a high-stress writing task; AI removes the friction.
# Save this in your team's runbook (and test it before you need it):
INCIDENT.md
When prod is failing:
1. Open #incidents, post: 'I'm investigating <symptom>'
2. Open Claude Code with only read-only MCP tools enabled (no write tools)
3. Run /incident-triage skill (defined below)
4. After containment, post the timeline to the channel
5. Schedule postmortem within 48 hours
.claude/skills/incident-triage/SKILL.md
---
name: incident-triage
description: Read-only triage of production issues
---
1. Pull recent deploys (read git log, deploy log)
2. Search the relevant service logs for last 1 hour
3. List the 5 most likely rollback targets
4. Output a concise triage report — DO NOT execute any change
Pre-built incident skills mean you're not improvising under pressure. Ship these before you need them.
AI is a force multiplier for the prepared and a force multiplier of chaos for the unprepared.
— An on-call lead
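Step 2 of the triage skill asks for a log search over the last hour. One way to pre-build that step is a small summarizer that groups recent errors by pattern. This is a sketch under stated assumptions: the line format (`<ISO timestamp with offset> <LEVEL> <message>`), the sample lines, and the function name `top_error_patterns` are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed log line shape: "<ISO timestamp with offset> <LEVEL> <message>"
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def top_error_patterns(lines, now, window=timedelta(hours=1), limit=5):
    """Count ERROR messages inside the window, grouping messages that differ only in numbers."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m or m["level"] != "ERROR":
            continue
        ts = datetime.fromisoformat(m["ts"])
        if not (now - window <= ts <= now):
            continue
        # Collapse digit runs so "after 3012ms" and "after 2998ms" count as one pattern.
        counts[re.sub(r"\d+", "<n>", m["msg"])] += 1
    return counts.most_common(limit)


# Hypothetical sample data.
sample = [
    "2024-05-01T13:40:00+00:00 ERROR db timeout after 3000ms",  # older than one hour
    "2024-05-01T14:35:10+00:00 ERROR db timeout after 3012ms",
    "2024-05-01T14:35:12+00:00 ERROR db timeout after 2998ms",
    "2024-05-01T14:36:00+00:00 INFO request served in 40ms",
    "2024-05-01T14:36:05+00:00 ERROR upstream returned 502",
]
now = datetime(2024, 5, 1, 15, 0, tzinfo=timezone.utc)

for pattern, count in top_error_patterns(sample, now):
    print(count, pattern)
```

Because it only reads lines and prints counts, a helper like this fits the skill's "DO NOT execute any change" constraint.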
The big idea: production incidents are the highest-stakes use of AI in coding. Use AI for parallel reads — logs, traces, history — while you make the writes. Pre-build incident skills, keep AI read-only under pressure, and never let stress lower your scrutiny. The best engineers stay calm by following a checklist; AI is just one of the tools the checklist names.
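The "parallel reads" idea can be illustrated with a small fan-out over independent, read-only lookups. This is a sketch only: `gather_reads` and the three reader functions are hypothetical stand-ins that return canned data, where real versions would query your own log, trace, and git tooling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def gather_reads(readers: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Run independent read-only lookups concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(readers))) as pool:
        futures = {name: pool.submit(fn) for name, fn in readers.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins that return canned data.
def read_logs() -> list:
    return ["ERROR upstream returned 502"]


def read_traces() -> list:
    return ["slow span: db.query"]


def read_history() -> list:
    return ["b9c1d4e", "e7d2f91"]


report = gather_reads({"logs": read_logs, "traces": read_traces, "history": read_history})
for name, result in report.items():
    print(name, result)
```

Nothing in the fan-out changes state, so the reads can run unattended while every write still waits for a human.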
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-debug-prod-incident-with-ai-creators
What is the core idea behind "Production Incidents With an AI Co-Pilot"?
Which term best describes a foundational idea in "Production Incidents With an AI Co-Pilot"?
A learner studying Production Incidents With an AI Co-Pilot would need to understand which concept?
Which of these is directly relevant to Production Incidents With an AI Co-Pilot?
Which of the following is a key point about Production Incidents With an AI Co-Pilot?
Which of these does NOT belong in a discussion of Production Incidents With an AI Co-Pilot?
Which statement is accurate regarding Production Incidents With an AI Co-Pilot?
Which of these does NOT belong in a discussion of Production Incidents With an AI Co-Pilot?
What is the key insight about "Never let an agent push to prod during an incident" in the context of Production Incidents With an AI Co-Pilot?
What is the key insight about "Use AI for parallelism, not autonomy" in the context of Production Incidents With an AI Co-Pilot?
What is the key insight about "Stress lowers your scrutiny" in the context of Production Incidents With an AI Co-Pilot?
Which statement accurately describes an aspect of Production Incidents With an AI Co-Pilot?
What does working with Production Incidents With an AI Co-Pilot typically involve?
Which of the following is true about Production Incidents With an AI Co-Pilot?
Which best describes the scope of "Production Incidents With an AI Co-Pilot"?