When prod is on fire, AI agents can be either your best partner or a dangerous distraction. Learn the incident workflow that uses AI safely under pressure — and the moments to put it down.
Production is down. Customers are upset. You're tired, alone, and you have an AI that can run shell commands. This is the highest-pressure moment to use AI well — and the easiest moment to make things much worse.
1. ASSESS: What's broken? How bad? Who's affected?
2. CONTAIN: Stop the bleeding (rollback, feature flag, kill switch).
3. DIAGNOSE: Why did this happen?
4. FIX: Apply the targeted fix.
5. VERIFY: Confirm prod is healthy.
6. REVIEW: Postmortem, prevent recurrence.

Six phases. AI helps in different ways in each. Knowing which is the difference between a 30-minute incident and a 3-hour one.

| Phase | AI's role | Caution |
|---|---|---|
| Assess | Search logs/traces in parallel via MCP | Cross-check findings before acting |
| Contain | Suggest the rollback or flag flip | Human runs the actual command |
| Diagnose | Generate hypotheses, read git diff, summarize change | AI is fast but speculative — verify |
| Fix | Draft the patch and tests | Read every line, test in staging if possible |
| Verify | Sanity-check dashboards, log queries | Final confirmation by human |
| Review | Draft the postmortem timeline | Edit for honesty and accuracy |
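
For the Review row, a first-pass timeline is mostly sorting and arithmetic, which makes it a reasonable thing to script or delegate. The sketch below is one possible shape, not a prescribed tool: the `build_timeline` helper and the `EVENTS` list are hypothetical, the times mirror the example incident in this lesson's comms prompt, and the date is a placeholder.

```python
from datetime import datetime, timezone

# Illustrative events. The times mirror the lesson's example incident;
# the date is a placeholder and the wording is made up for this sketch.
EVENTS = [
    ("2024-01-01T14:58:00", "Error rate back to baseline (recovered)"),
    ("2024-01-01T14:33:00", "Service X starts returning 500s"),
    ("2024-01-01T14:51:00", "Rollback executed by on-call engineer"),
]


def build_timeline(events):
    """Return timeline lines sorted by time, each with minutes since the first event."""
    if not events:
        return []
    parsed = sorted(
        (datetime.fromisoformat(ts).replace(tzinfo=timezone.utc), text)
        for ts, text in events
    )
    start = parsed[0][0]
    lines = []
    for when, text in parsed:
        elapsed = int((when - start).total_seconds() // 60)
        lines.append(f"{when:%H:%M} UTC (+{elapsed} min) {text}")
    return lines


for line in build_timeline(EVENTS):
    print(line)
```

The draft still needs the human edit the table calls for: the script can order events, but it cannot tell you whether an entry is honest or complete.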
When prod is on fire, your first move is almost always to roll back to the last known-good state. Diagnosing first is a luxury you can't afford while customers are seeing 500s. AI is great at suggesting the right rollback target — "the last deploy where p95 was healthy was 3:47 PM, sha b9c1d4e" — but you execute the rollback. AI does not push the big red button.
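The "last deploy where p95 was healthy" rule can be sketched as a small selection function. This is a minimal illustration under stated assumptions: the `Deploy` record, the `suggest_rollback_target` name, the 500 ms budget, and the deploy history are all hypothetical, and only the sha `b9c1d4e` and its 3:47 PM deploy time come from the text. Note that it only returns a suggestion and never runs anything.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    sha: str
    deployed_at: datetime
    p95_ms: float  # p95 latency observed while this deploy was live


def suggest_rollback_target(deploys, incident_start, p95_budget_ms) -> Optional[Deploy]:
    """Return the newest deploy before the incident whose p95 was within budget.

    This only suggests a target. Running the rollback stays with the human.
    """
    healthy = [
        d for d in deploys
        if d.deployed_at < incident_start and d.p95_ms <= p95_budget_ms
    ]
    return max(healthy, key=lambda d: d.deployed_at, default=None)


# Hypothetical deploy history for illustration.
DAY = "2024-01-01"
deploys = [
    Deploy("a1f03c2", datetime.fromisoformat(f"{DAY}T11:05:00"), 180.0),
    Deploy("b9c1d4e", datetime.fromisoformat(f"{DAY}T15:47:00"), 210.0),
    Deploy("c7e2f90", datetime.fromisoformat(f"{DAY}T16:20:00"), 2400.0),
]
target = suggest_rollback_target(
    deploys,
    incident_start=datetime.fromisoformat(f"{DAY}T16:25:00"),
    p95_budget_ms=500.0,
)
if target is None:
    print("No healthy deploy found; escalate instead of guessing.")
else:
    print(f"Suggested rollback target: {target.sha} (deployed {target.deployed_at:%H:%M})")
```

Returning `None` when nothing qualifies is deliberate: under pressure, "no safe target found" is more useful than a confident guess.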
    # A safe incident prompt under pressure:
    "Production is failing. Symptom: <description>. Started: <time>.
    1. Search the last hour of logs in service X for any error pattern.
    2. List the 3 most recent deploys, with commit SHA and timestamp.
    3. Suggest the most likely rollback target.
    4. Do NOT run any rollback or write commands. I will execute them.
    5. Once I confirm rollback, help me write the customer comms."

Tight scope. Read-only. Human-in-the-loop on every action that changes prod.

    # After containment, in a fresh chat:
    "Draft a customer-facing incident message for status.example.com.
    Facts:
    - Service X started returning 500s at 14:33 UTC
    - We rolled back at 14:51 UTC, recovered at 14:58 UTC
    - Affected: ~30% of API requests during the window
    - Cause: under investigation, will share in postmortem
    Tone: factual, accountable, not over-promising.
    Length: 2-3 sentences for status page, 1 longer paragraph for email."

AI is excellent at this. Comms during incidents is a high-stress writing task; AI removes the friction.

    # Save this in your team's runbook (and test it before you need it):

    INCIDENT.md

    When prod is failing:
    1. Open #incidents, post: 'I'm investigating <symptom>'
    2. Open Claude Code in read-only MCP mode (no write tools)
    3. Run /incident-triage skill (defined below)
    4. After containment, post the timeline to the channel
    5. Schedule postmortem within 48 hours

    .claude/skills/incident-triage.md

    ---
    name: incident-triage
    description: Read-only triage of production issues
    ---
    1. Pull recent deploys (read git log, deploy log)
    2. Search the relevant service logs for last 1 hour
    3. List the 5 most likely rollback targets
    4. Output a concise triage report. DO NOT execute any change

Pre-built incident skills mean you're not improvising under pressure. Ship these before you need them.

AI is a force multiplier for the prepared and a force multiplier of chaos for the unprepared.
— An on-call lead
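The read-only rule from the incident prompt and triage skill can also be enforced in tooling instead of relying on the prompt alone. The sketch below is one hedged way to do that: the allowlist, the `is_read_only` and `route` helpers, and the choice of commands are assumptions for illustration, not a feature of any particular agent. It is not a security boundary either, since some allowlisted tools accept flags that write files, so a real guard would need stricter parsing.

```python
import shlex

# Illustrative allowlist of read-only command prefixes. This is an assumption
# for the sketch, not a complete or vetted security boundary.
READ_ONLY_PREFIXES = [
    ("git", "log"),
    ("git", "diff"),
    ("git", "show"),
    ("kubectl", "get"),
    ("kubectl", "logs"),
    ("grep",),
    ("tail",),
]

# Shell operators that could chain commands or redirect output into a write.
SHELL_OPERATORS = set(";|&><`$")


def is_read_only(command: str) -> bool:
    """Return True only if the command matches the allowlist and has no shell operators."""
    if any(ch in SHELL_OPERATORS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return any(tuple(tokens[: len(prefix)]) == prefix for prefix in READ_ONLY_PREFIXES)


def route(command: str) -> str:
    """Decide who runs a proposed command: the agent (reads) or the human (everything else)."""
    return "agent may run" if is_read_only(command) else "human must run"


for proposed in ["git log --oneline -n 3", "kubectl rollout undo deploy/service-x"]:
    print(f"{proposed!r}: {route(proposed)}")
```

The guard fails closed: anything it cannot parse or does not recognize goes to the human, which matches the lesson's rule that the person executes every change to prod.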
The big idea: production incidents are the highest-stakes use of AI in coding. Use AI for parallel reads — logs, traces, history — while you make the writes. Pre-build incident skills, keep AI read-only under pressure, and never let stress lower your scrutiny. The best engineers stay calm by following a checklist; AI is just one of the tools the checklist names.
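The "parallel reads" idea can be illustrated with a small fan-out over read-only sources. This is a sketch under assumptions: `recent_error_lines`, `recent_deploys`, and `slow_traces` are hypothetical stand-ins that return canned data, where a real setup would query a log store, deploy history, or tracing backend.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical read-only data sources returning canned data for illustration.
def recent_error_lines():
    return ["ERROR upstream timeout", "ERROR upstream timeout"]


def recent_deploys():
    return [{"sha": "b9c1d4e", "deployed_at": "15:47"}]


def slow_traces():
    return [{"route": "/checkout", "p95_ms": 2400}]


READS = {
    "logs": recent_error_lines,
    "deploys": recent_deploys,
    "traces": slow_traces,
}


def gather_reads(reads):
    """Run every read concurrently; record failures instead of raising."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(reads))) as pool:
        futures = {name: pool.submit(fn) for name, fn in reads.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=30)
            except Exception as exc:  # keep the other reads usable
                results[name] = f"read failed: {exc!r}"
    return results


for name, value in gather_reads(READS).items():
    print(name, value)
```

Every function here only reads. Nothing in the fan-out can change prod, so the writes stay with the person running the incident.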
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-debug-prod-incident-with-ai-creators
What is the main idea of "Production Incidents With an AI Co-Pilot"?
Which concept is most central to "Production Incidents With an AI Co-Pilot"?
Which use of AI fits this topic best?
What should a careful learner remember about "Never let an agent push to prod during an incident"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about incident response be treated?
Name one way to verify an AI answer about incident response.
Which action would help you apply "Production Incidents With an AI Co-Pilot" responsibly?