Production Incidents With an AI Co-Pilot
When prod is on fire, AI agents can be either your best partner or a dangerous distraction. Learn the incident workflow that uses AI safely under pressure — and the moments to put it down.
Creators · AI-Assisted Coding · ~8 min read
The Page at 3 AM
Production is down. Customers are upset. You're tired, alone, and you have an AI that can run shell commands. This is the highest-pressure moment to use AI well — and the easiest moment to make things much worse.
The incident response loop
Six phases. AI helps in different ways in each. Knowing which is the difference between a 30-minute incident and a 3-hour one.
1. ASSESS: What's broken? How bad? Who's affected?
2. CONTAIN: Stop the bleeding (rollback, feature flag, kill switch).
3. DIAGNOSE: Why did this happen?
4. FIX: Apply the targeted fix.
5. VERIFY: Confirm prod is healthy.
6. REVIEW: Postmortem, prevent recurrence.
Where AI helps in each phase
| Phase | AI's role | Caution |
|---|---|---|
| Assess | Search logs/traces in parallel via MCP | Cross-check every finding before acting |
| Contain | Suggest the rollback or flag flip | Human runs the actual command |
| Diagnose | Generate hypotheses, read git diff, summarize change | AI is fast but speculative — verify |
| Fix | Draft the patch and tests | Read every line, test in staging if possible |
| Verify | Sanity-check dashboards, log queries | Final confirmation by human |
| Review | Draft the postmortem timeline | Edit for honesty and accuracy |
The rollback-first principle
When prod is on fire, your first move is almost always to roll back to the last known-good state. Diagnosing first is a luxury you can't afford while customers are seeing 500s. AI is great at suggesting the right rollback target — "the last deploy where p95 was healthy was 3:47 PM, sha b9c1d4e" — but you execute the rollback. AI does not push the big red button.
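To make the "suggest, don't execute" split concrete, here is a minimal Python sketch of how a rollback target could be picked from deploy history. The `Deploy` record, the `p95_healthy` flag, the date, and every SHA except b9c1d4e are assumptions made up for illustration; in a real setup the health flag would come from your own monitoring. The function only returns a suggestion and never runs anything.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    sha: str
    deployed_at: datetime
    p95_healthy: bool  # assumption: supplied by your monitoring, not computed here


def suggest_rollback_target(
    deploys: list[Deploy], incident_start: datetime
) -> Optional[Deploy]:
    """Return the most recent healthy deploy made before the incident started."""
    candidates = [
        d for d in deploys if d.deployed_at < incident_start and d.p95_healthy
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


def at(hour: int, minute: int) -> datetime:
    """Helper for the sample data below (hypothetical date, UTC)."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical deploy history; only b9c1d4e appears in the lesson text.
history = [
    Deploy("a1f0e22", at(13, 5), True),
    Deploy("b9c1d4e", at(15, 47), True),
    Deploy("c7d9e10", at(16, 20), False),  # the suspect deploy
]

target = suggest_rollback_target(history, incident_start=at(16, 30))
if target is not None:
    # prints: Suggested rollback target: b9c1d4e (deployed 15:47 UTC)
    print(f"Suggested rollback target: {target.sha} (deployed {target.deployed_at:%H:%M} UTC)")
```

The output is a line of text for a human to read. Turning it into an actual rollback stays a manual step.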
Tight scope. Read-only. Human-in-the-loop on every action that changes prod.
    # A safe incident prompt under pressure:
    "Production is failing. Symptom: <description>. Started: <time>.
    1. Search the last hour of logs in service X for any error pattern.
    2. List the 3 most recent deploys, with commit SHA and timestamp.
    3. Suggest the most likely rollback target.
    4. Do NOT run any rollback or write commands. I will execute them.
    5. Once I confirm rollback, help me write the customer comms."
When AI is dangerously wrong under pressure
- It will speculate confidently about the root cause from partial data; verify before acting on it
- It will suggest fixes that touch unrelated code; narrow the scope before approving
- It can misread dashboards and logs (especially long stack traces with similar errors); re-confirm with your own eyes
- It doesn't know your team's deploy norms; instruct it to ask before running anything
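One way to enforce "ask before running anything" is a gate that sits between the agent and the shell. The sketch below is a deliberately conservative Python example under stated assumptions: the allowlist contents and the `is_read_only` and `gate` names are hypothetical, and a real team would tune the list to its own tooling. Treat it as a tripwire, not a security boundary, since flags on an allowlisted command can still change its behavior.

```python
import shlex

# Hypothetical allowlist of (program, subcommand) pairs treated as read-only.
READ_ONLY_SUBCOMMANDS = {
    ("git", "log"), ("git", "diff"), ("git", "show"), ("git", "status"),
    ("kubectl", "get"), ("kubectl", "describe"), ("kubectl", "logs"),
}
# Programs treated as read-only regardless of arguments (also an assumption).
READ_ONLY_PROGRAMS = {"cat", "grep", "head", "tail", "ls"}
# Any of these characters means chaining, piping, redirection, or substitution.
SHELL_OPERATORS = set(";|&><`$")


def is_read_only(command: str) -> bool:
    """True only if the command is on the allowlist and uses no shell operators."""
    if any(ch in SHELL_OPERATORS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if not tokens:
        return False
    if tokens[0] in READ_ONLY_PROGRAMS:
        return True
    return len(tokens) >= 2 and (tokens[0], tokens[1]) in READ_ONLY_SUBCOMMANDS


def gate(command: str) -> str:
    """Decide who is allowed to run the command."""
    return "agent may run" if is_read_only(command) else "human must review and run"


for cmd in ["git log -n 3", "kubectl rollout undo deploy/api", "git log; rm -rf /"]:
    print(f"{cmd!r}: {gate(cmd)}")
```

Anything the gate does not recognize falls through to the human, so an unfamiliar command costs you a few seconds of review rather than an outage.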
The customer-comms accelerator
AI is excellent at drafting customer communications. Writing comms during an incident is a high-stress task, and starting from a draft removes most of the friction.
    # After containment, in a fresh chat:
    "Draft a customer-facing incident message for status.example.com.
    Facts:
    - Service X started returning 500s at 14:33 UTC
    - We rolled back at 14:51 UTC, recovered at 14:58 UTC
    - Affected: ~30% of API requests during the window
    - Cause: under investigation, will share in postmortem
    Tone: factual, accountable, not over-promising.
    Length: 2-3 sentences for status page, 1 longer paragraph for email."
Postmortem assistance
1. Paste your incident timeline (commands run, times) into the prompt
2. Ask AI to fill gaps from logs, deploy history, alert timestamps
3. Have AI draft the "what went wrong" section based on facts you provide
4. Edit it yourself; the postmortem is yours, not the AI's
5. Convert action items into tracked tickets with the agent's help
6. Most importantly, have the AI cross-check your stated root cause against the evidence: "Is this conclusion supported?"
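The first two steps amount to merging events from several sources and looking for holes. Here is a minimal Python sketch of that idea; the `Event` record, the `build_timeline` and `find_gaps` helpers, the date, and the 10-minute threshold are all assumptions for illustration. The sample times reuse the ones from the customer-comms example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str  # e.g. "alert", "deploy", "human"
    note: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from every source into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.at)


def find_gaps(timeline: list[Event], threshold: timedelta) -> list[tuple[Event, Event]]:
    """Return consecutive event pairs separated by more than `threshold`."""
    return [(a, b) for a, b in zip(timeline, timeline[1:]) if b.at - a.at > threshold]


def at(hour: int, minute: int) -> datetime:
    """Helper for the sample data below (hypothetical date, UTC)."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


human_notes = [Event(at(14, 51), "human", "Rolled back")]
alerts = [
    Event(at(14, 58), "alert", "Error rate recovered"),
    Event(at(14, 33), "alert", "Service X started returning 500s"),
]

timeline = build_timeline(human_notes, alerts)
for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.note}")

for before, after in find_gaps(timeline, threshold=timedelta(minutes=10)):
    print(f"Gap: nothing recorded between {before.at:%H:%M} and {after.at:%H:%M}")
```

Each reported gap is a question to take to the logs or to the people on call, not something for the AI to fill in with a guess.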
What you should NOT use AI for during an incident
- Deciding whether to declare an incident (your judgment, not AI's)
- Communicating with on-call leadership (humans only)
- Authorizing emergency access changes (humans, with audit trail)
- Determining customer impact for legal or PR purposes (humans, with logs)
- Deciding when the incident is over (humans, with verification)
Pre-incident prep: the playbook is the work
Pre-built incident skills mean you're not improvising under pressure. Ship these before you need them.
    # Save this in your team's runbook (and test it before you need it):
    INCIDENT.md

    When prod is failing:
    1. Open #incidents, post: 'I'm investigating <symptom>'
    2. Open Claude Code in read-only MCP mode (no write tools)
    3. Run /incident-triage skill (defined below)
    4. After containment, post the timeline to the channel
    5. Schedule postmortem within 48 hours

    .claude/skills/incident-triage.md
    ---
    name: incident-triage
    description: Read-only triage of production issues
    ---
    1. Pull recent deploys (read git log, deploy log)
    2. Search the relevant service logs for last 1 hour
    3. List the 5 most likely rollback targets
    4. Output a concise triage report. DO NOT execute any change
“AI is a force multiplier for the prepared and a force multiplier of chaos for the unprepared.”
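Part of "test it before you need it" can be automated. The Python sketch below lints a skill file shaped like the triage skill in the runbook: it checks that the frontmatter has a name and description and that the body tells the agent not to execute changes. The `parse_skill` and `lint_skill` helpers and the specific checks are assumptions for illustration, not a feature of any tool, and the parser handles only flat `key: value` frontmatter. Check your tool's documentation for the file location and format it actually expects.

```python
def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (frontmatter fields, body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    try:
        end = next(i for i, ln in enumerate(lines[1:], start=1) if ln.strip() == "---")
    except StopIteration:
        raise ValueError("missing closing '---'") from None
    fields: dict[str, str] = {}
    for ln in lines[1:end]:
        key, sep, value = ln.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, "\n".join(lines[end + 1:])


def lint_skill(text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        fields, body = parse_skill(text)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    for required in ("name", "description"):
        if not fields.get(required):
            problems.append(f"frontmatter is missing '{required}'")
    if "do not execute" not in body.lower():
        problems.append("body never tells the agent not to execute changes")
    return problems


# The triage skill from the runbook, used here as sample input.
SKILL = """---
name: incident-triage
description: Read-only triage of production issues
---
1. Pull recent deploys (read git log, deploy log)
2. Search the relevant service logs for last 1 hour
3. List the 5 most likely rollback targets
4. Output a concise triage report. DO NOT execute any change
"""

print(lint_skill(SKILL) or "skill file passed all checks")
```

A lint like this only catches a skill that has drifted or been edited carelessly. It does not replace a dry run of the whole playbook against a staging incident.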
The big idea: production incidents are the highest-stakes use of AI in coding. Use AI for parallel reads — logs, traces, history — while you make the writes. Pre-build incident skills, keep AI read-only under pressure, and never let stress lower your scrutiny. The best engineers stay calm by following a checklist; AI is just one of the tools the checklist names.