Tendril

Production Incidents With an AI Co-Pilot

When prod is on fire, AI agents can be either your best partner or a dangerous distraction. Learn the incident workflow that uses AI safely under pressure — and the moments to put it down.

The Page at 3 AM

Production is down. Customers are upset. You're tired, alone, and you have an AI that can run shell commands. This is the highest-pressure moment to use AI well — and the easiest moment to make things much worse.

The incident response loop

Six phases. AI helps differently in each, and knowing how is the difference between a 30-minute incident and a 3-hour one.

text

1. ASSESS: What's broken? How bad? Who's affected?
2. CONTAIN: Stop the bleeding (rollback, feature flag, kill switch).
3. DIAGNOSE: Why did this happen?
4. FIX: Apply the targeted fix.
5. VERIFY: Confirm prod is healthy.
6. REVIEW: Postmortem, prevent recurrence.

Where AI helps in each phase

Phase	AI's role	Caution
Assess	Search logs/traces in parallel via MCP	Cross-check anything before acting
Contain	Suggest the rollback or flag flip	Human runs the actual command
Diagnose	Generate hypotheses, read git diff, summarize change	AI is fast but speculative — verify
Fix	Draft the patch and tests	Read every line, test in staging if possible
Verify	Sanity-check dashboards, log queries	Final confirmation by human
Review	Draft the postmortem timeline	Edit for honesty and accuracy
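
One way to keep the table in front of you during an incident is to encode it as data and print it as a checklist. The sketch below is only an illustration: the `Phase`, `PhasePolicy`, and `checklist` names are hypothetical, and the wording of each role is condensed from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    ASSESS = 1
    CONTAIN = 2
    DIAGNOSE = 3
    FIX = 4
    VERIFY = 5
    REVIEW = 6


@dataclass(frozen=True)
class PhasePolicy:
    ai_role: str      # what the AI is allowed to help with
    human_gate: str   # what the human must still do


# Condensed from the phase table; adjust the wording to your own runbook.
POLICY = {
    Phase.ASSESS: PhasePolicy("search logs and traces in parallel",
                              "cross-check anything before acting"),
    Phase.CONTAIN: PhasePolicy("suggest the rollback or flag flip",
                               "run the actual command"),
    Phase.DIAGNOSE: PhasePolicy("generate hypotheses and summarize the diff",
                                "verify each hypothesis"),
    Phase.FIX: PhasePolicy("draft the patch and tests",
                           "read every line and test in staging if possible"),
    Phase.VERIFY: PhasePolicy("sanity-check dashboards and log queries",
                              "give the final confirmation"),
    Phase.REVIEW: PhasePolicy("draft the postmortem timeline",
                              "edit for honesty and accuracy"),
}


def checklist() -> list[str]:
    """Render one line per phase, in incident order."""
    return [
        f"{phase.value}. {phase.name}: AI may {POLICY[phase].ai_role}; "
        f"human must {POLICY[phase].human_gate}"
        for phase in Phase
    ]


if __name__ == "__main__":
    print("\n".join(checklist()))
```

Keeping the human gate next to the AI role on every line makes it harder to skip the gate when you are tired.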

The rollback-first principle

When prod is on fire, your first move is almost always to roll back to the last known-good state. Diagnosing first is a luxury you can't afford while customers are seeing 500s. AI is great at suggesting the right rollback target — "the last deploy where p95 was healthy was 3:47 PM, sha b9c1d4e" — but you execute the rollback. AI does not push the big red button.
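
The "last deploy where p95 was healthy" suggestion can be sketched as a small pure function. This is a minimal illustration under assumptions: the `Deploy` record, its `p95_ms` field, and the threshold are hypothetical, and in practice the numbers would come from your own deploy log and metrics system. The function only suggests a target and never executes anything.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    sha: str
    deployed_at: str   # ISO 8601 UTC timestamp; one consistent format sorts correctly as text
    p95_ms: float      # p95 latency observed while this deploy was live (assumed input)


def suggest_rollback_target(deploys: list[Deploy],
                            healthy_p95_ms: float) -> Optional[Deploy]:
    """Return the most recent deploy whose p95 was at or below the threshold.

    Returns None when no deploy in the list looks healthy, which is a signal
    to widen the search window rather than to guess.
    """
    for deploy in sorted(deploys, key=lambda d: d.deployed_at, reverse=True):
        if deploy.p95_ms <= healthy_p95_ms:
            return deploy
    return None


if __name__ == "__main__":
    # Hypothetical history; only the sha b9c1d4e comes from the example above.
    history = [
        Deploy("a1f0e22", "2024-01-01T14:10:00Z", 180.0),
        Deploy("b9c1d4e", "2024-01-01T15:47:00Z", 210.0),
        Deploy("c3d7a90", "2024-01-01T16:20:00Z", 2400.0),
    ]
    target = suggest_rollback_target(history, healthy_p95_ms=300.0)
    print(f"Suggested rollback target: {target.sha if target else 'none found'}")
```

The output is a suggestion for a human to check against the dashboard, then execute by hand.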

Tight scope. Read-only. Human-in-the-loop on every action that changes prod.

text

# A safe incident prompt under pressure:
"Production is failing. Symptom: <description>. Started: <time>.

1. Search the last hour of logs in service X for any error pattern.
2. List the 3 most recent deploys, with commit SHA and timestamp.
3. Suggest the most likely rollback target.
4. Do NOT run any rollback or write commands. I will execute them.
5. Once I confirm rollback, help me write the customer comms."
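
A prompt instruction is a request, not an enforcement mechanism. If your agent setup lets you filter shell commands before they run, an allowlist is one way to back the prompt up. The sketch below assumes such a hook exists; the allowlists and the `is_read_only` name are hypothetical, and the check is deliberately coarse. It does not inspect flags (some can write files), so treat it as a first filter, not a guarantee.

```python
import shlex

# Hypothetical allowlists; tune them to the tools your team actually uses.
READ_ONLY_COMMANDS = {"cat", "grep", "head", "tail", "ls", "wc"}
READ_ONLY_GIT_SUBCOMMANDS = {"log", "show", "status"}

# Characters that enable pipes, redirects, chaining, or substitution.
SHELL_OPERATORS = set(";|&><`$\n")


def is_read_only(command: str) -> bool:
    """Return True only for allowlisted commands that use no shell operators."""
    if any(ch in SHELL_OPERATORS for ch in command):
        return False  # refuse anything that could chain or redirect
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: treat as unsafe
    if not tokens:
        return False
    if tokens[0] == "git":
        return len(tokens) > 1 and tokens[1] in READ_ONLY_GIT_SUBCOMMANDS
    return tokens[0] in READ_ONLY_COMMANDS


if __name__ == "__main__":
    for cmd in ["git log -n 3", "grep ERROR service.log", "git push origin main"]:
        verdict = "allow" if is_read_only(cmd) else "BLOCK (human runs this)"
        print(f"{cmd!r}: {verdict}")
```

Anything the filter blocks goes back to the human, which matches step 4 of the prompt.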

When AI is dangerously wrong under pressure

It will speculate confidently about the root cause based on partial data — verify before acting on it
It will suggest fixes that touch unrelated code — narrow the scope before approving
It can misread dashboards and logs (especially long stack traces with similar errors) — re-confirm with your eyes
It doesn't know your team's deploy norms — state them in the prompt and require it to ask before running anything
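
"Narrow the scope before approving" can be partly mechanized: before you read a proposed patch line by line, list the files it touches and flag anything outside the service under repair. The sketch below assumes the patch is a git-style unified diff; the function names and the example paths are hypothetical. A content line that happens to begin with `--- a/` or `+++ b/` would be misread as a path, so this supplements reading the patch and does not replace it.

```python
def files_touched(unified_diff: str) -> set[str]:
    """Collect paths from the '--- a/<path>' and '+++ b/<path>' header lines."""
    paths = set()
    for line in unified_diff.splitlines():
        for prefix in ("--- a/", "+++ b/"):
            if line.startswith(prefix):
                paths.add(line[len(prefix):].strip())
    return paths


def out_of_scope(unified_diff: str, allowed_prefixes: tuple[str, ...]) -> set[str]:
    """Return touched paths that do not start with any allowed prefix."""
    return {
        path for path in files_touched(unified_diff)
        if not path.startswith(allowed_prefixes)
    }


if __name__ == "__main__":
    # Hypothetical patch that fixes the API handler but also edits billing code.
    patch = """\
--- a/services/api/handler.py
+++ b/services/api/handler.py
@@ -1 +1 @@
-old
+new
--- a/billing/tax.py
+++ b/billing/tax.py
@@ -1 +1 @@
-old
+new
"""
    extra = out_of_scope(patch, allowed_prefixes=("services/api/",))
    print("Out of scope:", sorted(extra) or "nothing")
```

A non-empty result is a prompt to push back ("why does this fix touch billing?") before approving anything.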

The customer-comms accelerator

AI is excellent at this. Comms during incidents is a high-stress writing task; AI removes the friction.

text

# After containment, in a fresh chat:
"Draft a customer-facing incident message for status.example.com.

Facts:
- Service X started returning 500s at 14:33 UTC
- We rolled back at 14:51 UTC, recovered at 14:58 UTC
- Affected: ~30% of API requests during the window
- Cause: under investigation, will share in postmortem

Tone: factual, accountable, not over-promising.
Length: 2-3 sentences for status page, 1 longer paragraph for email."
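
The facts in that prompt are the part you cannot afford to get wrong, so it can help to compute the durations instead of doing arithmetic in your head at 3 AM. The sketch below builds the facts section from the timestamps in the example above; the function names are hypothetical, and it assumes all three times fall on the same UTC day.

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes between two same-day times given as HH:MM."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    return int((end - start).total_seconds() // 60)


def comms_facts(started: str, rolled_back: str, recovered: str, impact: str) -> str:
    """Render the facts section of the comms prompt, with durations included."""
    to_rollback = minutes_between(started, rolled_back)
    total = minutes_between(started, recovered)
    return "\n".join([
        "Facts:",
        f"- Service X started returning 500s at {started} UTC",
        f"- We rolled back at {rolled_back} UTC ({to_rollback} min after start)",
        f"- Recovered at {recovered} UTC ({total} min total)",
        f"- Affected: {impact}",
        "- Cause: under investigation, will share in postmortem",
    ])


if __name__ == "__main__":
    print(comms_facts("14:33", "14:51", "14:58",
                      "~30% of API requests during the window"))
```

With the example times, the rollback came 18 minutes after the first 500s and full recovery took 25 minutes.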

Postmortem assistance

1. Paste your incident timeline (commands run, times) into the prompt
2. Ask AI to fill gaps from logs, deploy history, alert timestamps
3. Have AI draft the "what went wrong" section based on facts you provide
4. Edit it yourself: the postmortem is yours, not the AI's
5. Convert action items into tracked tickets with the agent's help
6. Most importantly: have the AI cross-check your stated root cause against the evidence: "Is this conclusion supported?"
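
Steps 1 and 2 amount to merging events from several sources into one ordered timeline. A minimal sketch of that merge follows, assuming each source can be reduced to timestamped events; the `Event` type and the source labels are hypothetical. Each event keeps its source tag so that, when you edit the draft, you can tell what you recorded yourself from what was filled in from logs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str   # e.g. "human", "deploy-log", "alerts" (labels are illustrative)
    text: str


def merge_timeline(*sources: list[Event]) -> list[Event]:
    """Combine events from every source into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.at)


def render(events: list[Event]) -> str:
    """One line per event, with the source kept visible for review."""
    return "\n".join(
        f"{event.at:%H:%M} UTC [{event.source}] {event.text}" for event in events
    )


if __name__ == "__main__":
    day = (2024, 1, 1)  # hypothetical date; the times echo the comms example
    human = [Event(datetime(*day, 14, 51), "human", "ran rollback")]
    alerts = [
        Event(datetime(*day, 14, 33), "alerts", "5xx rate alert fired"),
        Event(datetime(*day, 14, 58), "alerts", "5xx rate alert resolved"),
    ]
    print(render(merge_timeline(human, alerts)))
```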

What you should NOT use AI for during an incident

Deciding whether to declare an incident (your judgment, not AI's)
Communicating with on-call leadership (humans only)
Authorizing emergency access changes (humans, with audit trail)
Determining customer impact for legal or PR purposes (humans, with logs)
Deciding when the incident is over (humans, with verification)

Pre-incident prep: the playbook is the work

Pre-built incident skills mean you're not improvising under pressure. Ship these before you need them.

text

# Save this in your team's runbook (and test it before you need it):

INCIDENT.md

When prod is failing:
1. Open #incidents, post: 'I'm investigating <symptom>'
2. Open Claude Code in read-only MCP mode (no write tools)
3. Run /incident-triage skill (defined below)
4. After containment, post the timeline to the channel
5. Schedule postmortem within 48 hours

.claude/skills/incident-triage.md

---
name: incident-triage
description: Read-only triage of production issues
---
1. Pull recent deploys (read git log, deploy log)
2. Search the relevant service logs for last 1 hour
3. List the 5 most likely rollback targets
4. Output a concise triage report. DO NOT execute any change
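
"Test it before you need it" can start with a cheap lint that runs in CI. The sketch below is a hypothetical check, not part of any tool: it assumes the skill file uses `---` delimited front matter with simple `key: value` lines, as in the example above, and it verifies only that `name` and `description` are present and that the body tells the agent not to execute changes.

```python
def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (front matter dict, body). Raises ValueError if malformed."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("missing closing '---'") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


def check_skill(text: str) -> list[str]:
    """Return a list of problems; an empty list means the lint passed."""
    try:
        meta, body = parse_front_matter(text)
    except ValueError as exc:
        return [str(exc)]
    problems = [
        f"front matter is missing '{key}'"
        for key in ("name", "description") if not meta.get(key)
    ]
    if "do not execute" not in body.lower():
        problems.append("body never tells the agent not to execute changes")
    return problems


if __name__ == "__main__":
    skill = """---
name: incident-triage
description: Read-only triage of production issues
---
4. Output a concise triage report. DO NOT execute any change
"""
    print(check_skill(skill) or "skill file looks sane")
```

A lint like this catches a broken file, not a broken workflow, so it is still worth running the skill end to end in a drill.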

“AI is a force multiplier for the prepared and a force multiplier of chaos for the unprepared.”
An on-call lead

The big idea: production incidents are the highest-stakes use of AI in coding. Use AI for parallel reads — logs, traces, history — while you make the writes. Pre-build incident skills, keep AI read-only under pressure, and never let stress lower your scrutiny. The best engineers stay calm by following a checklist; AI is just one of the tools the checklist names.
