A prototype agent and a production agent have the same LLM. What's different is everything around it — durable state, retries, idempotency, observability. The real engineering.
In a prototype, a crash is fine — you rerun. In production, a crash means a user's pizza never got ordered and a $4 LLM call got burned. Production agents must be durable, idempotent, observable, and cost-capped. Most teams discover this after shipping a demo.
| Requirement | What it means |
|---|---|
| Durable state | Every step is persisted. Process can die and resume. |
| Idempotent steps | Re-running a step is safe — no duplicate actions. |
| Retries with backoff | Transient failures retry; permanent failures surface. |
| Observability | Every tool call, every prompt, every token logged. |
| Cost + step caps | Hard ceilings prevent runaway loops and bills. |
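
The observability row is the easiest to defer and the hardest to retrofit. As a minimal sketch of what "every tool call logged" can look like, the wrapper below records one structured entry per call. `withToolLog`, `ToolLogEntry`, and the in-memory `sink` are hypothetical names for illustration, not part of any specific library; a real system would write to a log pipeline or tracing backend instead of an array.

```typescript
// Hypothetical structured-logging wrapper; names are illustrative, not a library API.
interface ToolLogEntry {
  taskId: string;
  tool: string;
  args: unknown;
  ok: boolean;
  durationMs: number;
  error?: string;
}

// Stand-in for a real log pipeline; an array keeps the sketch self-contained.
const sink: ToolLogEntry[] = [];

async function withToolLog<T>(
  taskId: string,
  tool: string,
  args: unknown,
  call: () => Promise<T>,
): Promise<T> {
  const started = Date.now();
  try {
    const result = await call();
    sink.push({ taskId, tool, args, ok: true, durationMs: Date.now() - started });
    return result;
  } catch (err) {
    sink.push({
      taskId,
      tool,
      args,
      ok: false,
      durationMs: Date.now() - started,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err; // logging must never swallow the failure
  }
}
```

Wrapping calls at a single choke point like this means a failed run can be reconstructed from the log alone, without reproducing the crash.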
Agents that run longer than a few seconds shouldn't live in memory. Checkpoint after every step. Options in 2026:
```typescript
// Vercel Workflow DevKit: modern "use workflow" directive
// Models are addressed via the AI Gateway alias format.
import { step } from 'workflow';
import { generateText } from 'ai';

export async function researchAgent(goal: string) {
  'use workflow';

  const plan = await step('plan', async () => {
    const { text } = await generateText({
      model: 'anthropic/claude-opus-4.7',
      prompt: `Break into sub-questions:\n${goal}`,
    });
    return text.split('\n');
  });

  const findings = [];
  for (const q of plan) {
    const answer = await step(`research:${q.slice(0, 20)}`, async () => {
      return await searchAndSummarize(q);
    }, { retries: 3, timeout: '60s' });
    findings.push({ q, answer });
  }

  return await step('synthesize', async () => {
    const { text } = await generateText({
      model: 'anthropic/claude-opus-4.7',
      prompt: `Write a cited answer:\n${JSON.stringify(findings)}`,
    });
    return text;
  });
}
```

Every step() is durable. If the process dies, execution resumes from the last completed step. Built-in retries + timeouts.

Any step that touches the outside world (send email, charge card, create ticket) needs an idempotency key. When the step retries, the external system recognizes the key and doesn't duplicate the action.
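
To make the mechanism concrete, here is a minimal sketch of what the receiving side does with an idempotency key. `IdempotentExecutor` and its in-memory `Map` are hypothetical stand-ins for the external system's own storage, not any real provider's implementation.

```typescript
// Hypothetical receiver-side deduplication keyed by idempotency key.
class IdempotentExecutor<T> {
  private readonly results = new Map<string, T>();

  // Runs `action` at most once per key; repeat calls return the stored result.
  execute(key: string, action: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T;
    }
    const result = action();
    this.results.set(key, result);
    return result;
  }
}

let emailsSent = 0;
const mailer = new IdempotentExecutor<string>();
const send = () => mailer.execute('task:42:step:notify', () => `email-${++emailsSent}`);

const first = send();
const retry = send(); // simulated retry after a crash: same key, no second send

console.log(first === retry, emailsSent); // true 1
```

The key is what makes the retry safe: it has to be derived from stable workflow state (task and step), not from a timestamp or random value generated at call time.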
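```typescript
// Idempotent Stripe charge
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// taskId, stepName, and custId come from the workflow's persisted state.
const chargeId = `task:${taskId}:step:${stepName}`;
const charge = await stripe.paymentIntents.create(
  { amount: 5000, currency: 'usd', customer: custId },
  { idempotencyKey: chargeId }
);
// Same chargeId on retry returns the same charge: no double-billing.
```

Every external API call gets an idempotency key derived from workflow state. Stripe supports this natively; for other providers, check the API docs, and where no key is supported, record the completed action in your own store and check it before retrying.

| Error type | Policy |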
|---|---|
| Network timeout | Retry 3x with exponential backoff (1s, 5s, 30s). |
| Rate limit (429) | Retry after Retry-After header; circuit-break after 5 attempts. |
| 5xx server error | Retry 3x; alert on repeated 503. |
| Tool schema mismatch | One retry with error fed back to model. |
| 4xx client error (other than 429) | Do NOT retry; the same request will fail the same way. |
| Auth failure | Do NOT retry — alert, stop. |
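
The policy table can be sketched as a small classifier plus a retry loop. Everything here is illustrative: `HttpError`, `classify`, and `retryWithBackoff` are hypothetical names, status `0` is an assumed convention for a network timeout, and the 429 limit is read as five total attempts. The "feed the error back to the model" row is left out because it depends on the agent loop. `sleep` is injectable so the loop can be tested without waiting.

```typescript
// Hypothetical error type carrying an HTTP status; 0 stands for a network timeout.
class HttpError extends Error {
  readonly status: number;
  readonly retryAfterMs?: number;

  constructor(status: number, retryAfterMs?: number) {
    super(`HTTP ${status}`);
    this.status = status;
    this.retryAfterMs = retryAfterMs;
  }
}

type Decision = { retry: false } | { retry: true; delayMs: number };

const BACKOFF_MS = [1_000, 5_000, 30_000]; // 1s, 5s, 30s schedule from the table
const MAX_429_ATTEMPTS = 5;

// `attempt` is the zero-based index of the call that just failed.
function classify(err: unknown, attempt: number): Decision {
  if (!(err instanceof HttpError)) return { retry: false };
  if (err.status === 429) {
    // Honor Retry-After when present; stop once the attempt budget is spent.
    return attempt + 1 < MAX_429_ATTEMPTS
      ? { retry: true, delayMs: err.retryAfterMs ?? BACKOFF_MS[Math.min(attempt, 2)] }
      : { retry: false };
  }
  const transient = err.status === 0 || err.status >= 500;
  return transient && attempt < BACKOFF_MS.length
    ? { retry: true, delayMs: BACKOFF_MS[attempt] }
    : { retry: false }; // other 4xx, including auth failures: surface immediately
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const decision = classify(err, attempt);
      if (!decision.retry) throw err;
      await sleep(decision.delayMs);
    }
  }
}
```

Keeping the classification separate from the loop makes the policy itself unit-testable, and it gives alerting a single place to hook into when a permanent failure surfaces.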
```typescript
const MAX_STEPS = 50;
const MAX_COST_USD = 2.00;
// Example per-million-token rates in USD; substitute your model's actual pricing.
const INPUT_USD_PER_MTOK = 3;
const OUTPUT_USD_PER_MTOK = 15;

let stepCount = 0;
let costUsd = 0;

// `state`, isDone, runOneStep, and applyResult come from the surrounding agent loop.
while (!isDone(state)) {
  if (++stepCount > MAX_STEPS) {
    throw new Error('Step cap exceeded: possible loop.');
  }
  const { result, usage } = await runOneStep(state);
  costUsd +=
    (usage.inputTokens * INPUT_USD_PER_MTOK + usage.outputTokens * OUTPUT_USD_PER_MTOK) / 1_000_000;
  if (costUsd > MAX_COST_USD) {
    throw new Error(`Cost cap exceeded: $${costUsd.toFixed(2)}`);
  }
  state = applyResult(state, result);
}
```

These ceilings are non-negotiable. It is cheaper to fail a task than to let a loop burn $400 of Opus calls at 3 AM.

Next: the security dimension. An agent is a new attack surface.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-production-patterns-creators
What is the main idea of "Production Agent Patterns: Queues, Retries, Idempotency"?
Which concept is most central to "Production Agent Patterns: Queues, Retries, Idempotency"?
Which use of AI fits this topic best?
What should a careful learner remember about "The pager story everyone has"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about durability be treated?
Name one way to verify an AI answer about durability.
Which action would help you apply "Production Agent Patterns: Queues, Retries, Idempotency" responsibly?