A prototype agent and a production agent use the same LLM. What differs is everything around it: durable state, retries, idempotency, observability. That is where the real engineering happens.
In a prototype, a crash is fine — you rerun. In production, a crash means a user's pizza never got ordered and a $4 LLM call got burned. Production agents must be durable, idempotent, observable, and cost-capped. Most teams discover this after shipping a demo.
| Requirement | What it means |
|---|---|
| Durable state | Every step is persisted. Process can die and resume. |
| Idempotent steps | Re-running a step is safe — no duplicate actions. |
| Retries with backoff | Transient failures retry; permanent failures surface. |
| Observability | Every tool call, every prompt, every token logged. |
| Cost + step caps | Hard ceilings prevent runaway loops and bills. |
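The observability row can be made concrete with a thin wrapper around tool calls. The sketch below is illustrative only: `AgentEvent`, `TraceLog`, and `traced` are hypothetical names, not part of any library, and the in-memory array stands in for whatever log pipeline you actually use.

```typescript
// Hypothetical event shape; adapt the fields to your own tracing backend.
interface AgentEvent {
  runId: string;
  tool: string;
  callIndex: number; // 1-based count of calls made through this wrapper
  input: unknown;
  output?: unknown;
  error?: string;
  durationMs: number;
}

// In-memory sink standing in for a real log pipeline (assumption).
class TraceLog {
  readonly events: AgentEvent[] = [];

  record(event: AgentEvent): void {
    this.events.push(event);
  }
}

// Wraps a tool so that every call, successful or not, leaves exactly one event.
function traced<I, O>(
  log: TraceLog,
  runId: string,
  tool: string,
  fn: (input: I) => Promise<O>,
): (input: I) => Promise<O> {
  let calls = 0;
  return async (input: I): Promise<O> => {
    const started = Date.now();
    const callIndex = ++calls;
    try {
      const output = await fn(input);
      log.record({ runId, tool, callIndex, input, output, durationMs: Date.now() - started });
      return output;
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      log.record({ runId, tool, callIndex, input, error, durationMs: Date.now() - started });
      throw err; // logging must not swallow the failure
    }
  };
}
```

Wrapping at the tool boundary means the agent loop itself needs no logging code. Token counts could be added to the event in the same way, assuming the model client reports usage.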
Agents that run longer than a few seconds shouldn't keep their state only in memory. Checkpoint after every step. Options in 2026 include Vercel Workflow DevKit:

    // Vercel Workflow DevKit: modern "use workflow" directive.
    // Models are addressed via the AI Gateway alias format.
    import { step } from 'workflow';
    import { generateText } from 'ai';

    // Assumed to be defined elsewhere: researches one sub-question and returns a summary.
    declare function searchAndSummarize(question: string): Promise<string>;

    export async function researchAgent(goal: string) {
      'use workflow';

      // Step 1: ask the model to split the goal into sub-questions.
      const plan = await step('plan', async () => {
        const { text } = await generateText({
          model: 'anthropic/claude-opus-4.7',
          prompt: `Break into sub-questions:\n${goal}`,
        });
        // Drop blank lines so no step runs on an empty question.
        return text.split('\n').map((line) => line.trim()).filter(Boolean);
      });

      // Step 2: research each sub-question as its own retried step.
      const findings: { q: string; answer: string }[] = [];
      for (const [i, q] of plan.entries()) {
        // The index keeps step names distinct when two questions share their first 20 characters.
        const answer = await step(`research:${i}:${q.slice(0, 20)}`, async () => {
          return await searchAndSummarize(q);
        }, { retries: 3, timeout: '60s' });
        findings.push({ q, answer });
      }

      // Step 3: synthesize the findings into one cited answer.
      return await step('synthesize', async () => {
        const { text } = await generateText({
          model: 'anthropic/claude-opus-4.7',
          prompt: `Write a cited answer:\n${JSON.stringify(findings)}`,
        });
        return text;
      });
    }

Every step() is durable. If the process dies, execution resumes from the last completed step, with built-in retries and timeouts.

Any step that touches the outside world (send email, charge card, create ticket) needs an idempotency key. When the step retries, the external system recognizes the key and doesn't duplicate the action.
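The mechanism behind an idempotency key can be sketched without any provider. The following is a minimal illustration, not a production design: `IdempotencyStore` and `stepKey` are hypothetical names, and the store is in memory, whereas a real one would have to survive process restarts (for example a database table with a unique constraint on the key).

```typescript
// Hypothetical store: maps an idempotency key to the result of a completed action.
// In-memory for illustration only; it does not survive a process restart.
class IdempotencyStore {
  private readonly results = new Map<string, unknown>();

  async runOnce<T>(key: string, action: () => Promise<T>): Promise<T> {
    if (this.results.has(key)) {
      // A previous attempt already succeeded: return its result and skip the side effect.
      return this.results.get(key) as T;
    }
    // If the action throws, nothing is recorded, so a later retry is allowed to run it.
    const result = await action();
    this.results.set(key, result);
    return result;
  }
}

// Key derived from workflow state, following the task/step naming used in this lesson.
function stepKey(taskId: string, stepName: string): string {
  return `task:${taskId}:step:${stepName}`;
}
```

This sketch does not protect against two concurrent attempts with the same key, or against a crash between the side effect and the write to the store. A key that the external system itself checks closes those gaps, which is why provider-side support is preferable when it exists.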
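
    // Idempotent Stripe charge
    import Stripe from 'stripe';

    const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

    // taskId, stepName, and custId come from the workflow state.
    const chargeId = `task:${taskId}:step:${stepName}`;
    const charge = await stripe.paymentIntents.create(
      { amount: 5000, currency: 'usd', customer: custId }, // amount is in cents
      { idempotencyKey: chargeId }
    );
    // Same chargeId on retry returns the same charge, so no double-billing.

Every external API call gets an idempotency key derived from workflow state. Some APIs support this natively, Stripe among them; for SendGrid, Twilio, and others, check the provider's documentation, since support varies by endpoint.

| Error type | Policy |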
|---|---|
| Network timeout | Retry 3x with exponential backoff (1s, 5s, 30s). |
| Rate limit (429) | Retry after Retry-After header; circuit-break after 5 attempts. |
| 5xx server error | Retry 3x; alert on repeated 503. |
| Tool schema mismatch | One retry with error fed back to model. |
| Other 4xx client error | Do NOT retry: it will fail the same way. |
| Auth failure | Do NOT retry — alert, stop. |
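The policy table can be encoded as a small decision function plus a retry loop. This is a sketch under assumptions: `ToolError`, `decide`, and `withRetry` are hypothetical names, the caller is expected to map its HTTP client's errors onto `ToolError`, and the tool-schema-mismatch row is left out because it involves the model rather than the transport.

```typescript
// Hypothetical error shape; map your HTTP client's errors onto it.
interface ToolError {
  kind: 'timeout' | 'http';
  status?: number;       // HTTP status when kind === 'http'
  retryAfterMs?: number; // parsed Retry-After header, if the server sent one
}

type Decision =
  | { retry: true; delayMs: number }
  | { retry: false; reason: string };

const BACKOFF_MS = [1_000, 5_000, 30_000]; // delays from the table above

// attempt is 1-based: the number of the attempt that just failed.
function decide(err: ToolError, attempt: number): Decision {
  if (err.kind === 'http' && err.status === 429) {
    if (attempt >= 5) {
      return { retry: false, reason: 'circuit open after 5 rate-limited attempts' };
    }
    const fallback = BACKOFF_MS[Math.min(attempt - 1, BACKOFF_MS.length - 1)];
    return { retry: true, delayMs: err.retryAfterMs ?? fallback };
  }
  if (err.kind === 'http' && err.status !== undefined && err.status < 500) {
    // Remaining 4xx, including auth failures (401/403): retrying cannot help.
    return { retry: false, reason: `client error ${err.status}` };
  }
  // Timeouts and 5xx: at most three retries.
  if (attempt > BACKOFF_MS.length) {
    return { retry: false, reason: 'retries exhausted' };
  }
  return { retry: true, delayMs: BACKOFF_MS[attempt - 1] };
}

// Runs call() until it succeeds or the policy says stop. sleep is injectable for tests.
async function withRetry<T>(
  call: () => Promise<T>,
  toToolError: (e: unknown) => ToolError,
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (e) {
      const decision = decide(toToolError(e), attempt);
      if (!decision.retry) throw e; // permanent failure: surface it
      await sleep(decision.delayMs);
    }
  }
}
```

Keeping `decide` a pure function makes the policy easy to unit-test without timers, and alerting (for repeated 503s or auth failures) can hook into the `reason` string at the point where the error is rethrown.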

    const MAX_STEPS = 50;
    const MAX_COST_USD = 2.00;
    // Illustrative per-token prices in USD; set these to your model's actual rates.
    const INPUT_USD_PER_TOKEN = 3 / 1_000_000;
    const OUTPUT_USD_PER_TOKEN = 15 / 1_000_000;

    let stepCount = 0;
    let costUsd = 0;

    // state, isDone, runOneStep, and applyResult are defined by the agent loop.
    while (!isDone(state)) {
      if (++stepCount > MAX_STEPS) {
        throw new Error('Step cap exceeded: possible loop.');
      }
      const { result, usage } = await runOneStep(state);
      costUsd += usage.inputTokens * INPUT_USD_PER_TOKEN
        + usage.outputTokens * OUTPUT_USD_PER_TOKEN;
      if (costUsd > MAX_COST_USD) {
        throw new Error(`Cost cap exceeded: $${costUsd.toFixed(2)}`);
      }
      state = applyResult(state, result);
    }

Non-negotiable ceilings. It is cheaper to fail a task than to let a loop burn $400 of Opus calls at 3 AM.

Next: the security dimension. An agent is a new attack surface.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-production-patterns-creators