Production Agent Patterns: Queues, Retries, Idempotency

A prototype agent and a production agent have the same LLM. What's different is everything around it — durable state, retries, idempotency, observability. The real engineering.

52 min · Reviewed 2026

Why production is different

In a prototype, a crash is fine — you rerun. In production, a crash means a user's pizza never got ordered and a $4 LLM call got burned. Production agents must be durable, idempotent, observable, and cost-capped. Most teams discover this after shipping a demo.

The five production requirements

Requirement	What it means
Durable state	Every step is persisted. Process can die and resume.
Idempotent steps	Re-running a step is safe — no duplicate actions.
Retries with backoff	Transient failures retry; permanent failures surface.
Observability	Every tool call, every prompt, every token logged.
Cost + step caps	Hard ceilings prevent runaway loops and bills.

Durable state pattern

Agents that run longer than a few seconds shouldn't live in memory. Checkpoint after every step. Options in 2026:

Vercel Workflow DevKit (WDK) — step-based, crash-safe, powered by Queues.
LangGraph + PostgresSaver — durable state machines.
Temporal — mature workflow engine; strong for multi-day flows.
Inngest — event-driven steps with retries and concurrency controls.
Roll your own — Postgres + a state column + a worker loop.

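// Vercel Workflow DevKit: "use workflow" directive
// Models are addressed via the AI Gateway alias format.
// NOTE: step(name, fn, opts) is shown schematically; check the WDK docs for the
// current step API. searchAndSummarize is assumed to be defined elsewhere.
import { step } from 'workflow';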
import { generateText } from 'ai';

export async function researchAgent(goal: string) {
  'use workflow';

  const plan = await step('plan', async () => {
    const { text } = await generateText({
      model: 'anthropic/claude-opus-4.7',
      prompt: `Break into sub-questions:\n${goal}`,
    });
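    // Drop blank lines so empty strings don't become sub-questions.
    return text.split('\n').map((line) => line.trim()).filter(Boolean);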
  });

  const findings = [];
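  for (const [i, q] of plan.entries()) {
    // The index keeps step names unique even when questions share a 20-char prefix.
    const answer = await step(`research:${i}:${q.slice(0, 20)}`, async () => {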
      return await searchAndSummarize(q);
    }, { retries: 3, timeout: '60s' });
    findings.push({ q, answer });
  }

  return await step('synthesize', async () => {
    const { text } = await generateText({
      model: 'anthropic/claude-opus-4.7',
      prompt: `Write a cited answer:\n${JSON.stringify(findings)}`,
    });
    return text;
  });
}

Every step() is durable. If the process dies, execution resumes from the last completed step. Built-in retries + timeouts.
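
The resume-from-last-completed-step behavior can be sketched without any framework. This is a minimal illustration, not how WDK is implemented: `Checkpoints`, `durableStep`, and `runWorkflow` are hypothetical names, and an in-memory Map stands in for what would be a Postgres table keyed by run and step.

```typescript
// Hypothetical checkpoint store. In production this would be a database table
// keyed by (runId, stepName), not an in-memory Map.
type Checkpoints = Map<string, unknown>;

// Run `fn` at most once per (runId, stepName). On a re-run after a crash,
// a completed step returns its saved result instead of executing again.
async function durableStep<T>(
  store: Checkpoints,
  runId: string,
  stepName: string,
  fn: () => Promise<T>,
): Promise<T> {
  const key = `${runId}:${stepName}`;
  if (store.has(key)) {
    return store.get(key) as T;
  }
  const result = await fn();
  store.set(key, result); // persist before moving on to the next step
  return result;
}

// Example workflow with two steps. `executed` records which step bodies ran,
// and `crashAfterPlan` simulates the process dying between the steps.
async function runWorkflow(
  store: Checkpoints,
  runId: string,
  executed: string[],
  crashAfterPlan: boolean,
): Promise<string> {
  const plan = await durableStep(store, runId, 'plan', async () => {
    executed.push('plan');
    return ['q1', 'q2'];
  });
  if (crashAfterPlan) {
    throw new Error('simulated crash');
  }
  return durableStep(store, runId, 'synthesize', async () => {
    executed.push('synthesize');
    return plan.join('+');
  });
}
```

Calling `runWorkflow` once with `crashAfterPlan` set and then again with the same `runId` executes the plan body only once; the second run reads the saved plan and goes straight to synthesis. With a real database, step results must be serializable for this to work.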

Idempotency — the underrated superpower

Any step that touches the outside world (send email, charge card, create ticket) needs an idempotency key. When the step retries, the external system recognizes the key and doesn't duplicate the action.

// Idempotent Stripe charge
const chargeId = `task:${taskId}:step:${stepName}`;
const charge = await stripe.paymentIntents.create(
  { amount: 5000, currency: 'usd', customer: custId },
  { idempotencyKey: chargeId }
);
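// Same chargeId on retry returns the same charge, so no double-billing.

Every external API call gets an idempotency key derived from workflow state. Support varies by provider: Stripe accepts idempotency keys natively, but confirm this for each API you call (SendGrid, Twilio, and so on) and deduplicate on your side where it is missing.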
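
What it means for the external system to "recognize the key" can be shown with a toy provider. The sketch below is an assumption-laden illustration, not Stripe's implementation: `FakePaymentApi` and `stepKey` are hypothetical, and the Map stands in for the provider's own storage.

```typescript
// Hypothetical in-memory payment API, used only to illustrate idempotency keys.
interface Charge {
  id: string;
  amount: number;
}

class FakePaymentApi {
  private byKey = new Map<string, Charge>();
  private nextId = 1;
  chargesCreated = 0;

  createCharge(amount: number, idempotencyKey: string): Charge {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) {
      return existing; // retry: replay the stored response, create nothing new
    }
    const charge: Charge = { id: `ch_${this.nextId++}`, amount };
    this.chargesCreated += 1;
    this.byKey.set(idempotencyKey, charge);
    return charge;
  }
}

// The key is derived from workflow state, so a retry of the same step reuses it.
function stepKey(taskId: string, stepName: string): string {
  return `task:${taskId}:step:${stepName}`;
}

const api = new FakePaymentApi();
const first = api.createCharge(5000, stepKey('t42', 'charge'));
const retry = api.createCharge(5000, stepKey('t42', 'charge'));
console.log(first.id === retry.id, api.chargesCreated); // true 1
```

The key has to be stable across retries, which is why it is built from the task and step rather than from a timestamp or random value; either of those would change on every attempt and defeat the deduplication.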

Retry policy

Error type	Policy
Network timeout	Retry 3x with exponential backoff (1s, 5s, 30s).
Rate limit (429)	Retry after Retry-After header; circuit-break after 5 attempts.
5xx server error	Retry 3x; alert on repeated 503.
Tool schema mismatch	One retry with error fed back to model.
4xx client error (except 429)	Do NOT retry; the same request will fail the same way.
Auth failure	Do NOT retry — alert, stop.
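
The table above can be expressed as a single decision function. This is a sketch under stated assumptions: `decideRetry` and `RetryDecision` are hypothetical names, 401 and 403 are treated as the "auth failure" row, a missing Retry-After header falls back to one second, and the tool-schema-mismatch row is left out because it is handled by the model loop rather than by HTTP status.

```typescript
type RetryDecision =
  | { retry: true; delayMs: number }
  | { retry: false; reason: string };

// Delays from the network-timeout row of the table: 1s, 5s, 30s.
const BACKOFF_MS = [1_000, 5_000, 30_000];

// `status` is the HTTP status code, or null for a network timeout.
// `attempt` is the number of attempts already made (1 after the first failure).
// `retryAfterSec` is the parsed Retry-After header, if the server sent one.
function decideRetry(
  status: number | null,
  attempt: number,
  retryAfterSec?: number,
): RetryDecision {
  if (status === 401 || status === 403) {
    return { retry: false, reason: 'auth failure: alert and stop' };
  }
  if (status === 429) {
    if (attempt >= 5) {
      return { retry: false, reason: 'rate limited: circuit open' };
    }
    return { retry: true, delayMs: (retryAfterSec ?? 1) * 1_000 };
  }
  if (status !== null && status >= 400 && status < 500) {
    return { retry: false, reason: 'client error: will fail the same way' };
  }
  // Network timeout (null) or 5xx: up to three retries with growing delays.
  if (attempt > BACKOFF_MS.length) {
    return { retry: false, reason: 'retries exhausted' };
  }
  return { retry: true, delayMs: BACKOFF_MS[attempt - 1] ?? 30_000 };
}
```

For example, `decideRetry(503, 2)` asks for a retry after 5000 ms, `decideRetry(429, 1, 12)` waits the 12 seconds the server requested, and `decideRetry(404, 1)` refuses to retry. Keeping the policy in one pure function makes it easy to unit-test separately from the code that actually sleeps and re-sends.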

Observability essentials

Trace every LLM call with prompt, response, token counts, cost — OpenTelemetry + LangSmith/Braintrust/Vercel Observability.
Log every tool call with args, result, latency.
Record every workflow event (start, step complete, retry, fail).
Cost dashboards per workflow/agent/user.
Alerts on cost spikes, error spikes, latency regressions.
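
The "log every tool call" item can be sketched as a wrapper that records one entry per call. This is a minimal illustration with hypothetical names (`traced`, `ToolCallRecord`); the `sink` array stands in for whatever tracing backend is in use, and LLM-specific fields such as token counts and cost would be added in the same way.

```typescript
interface ToolCallRecord {
  tool: string;
  args: unknown;
  ok: boolean;
  result?: unknown;
  error?: string;
  latencyMs: number;
}

// Wrap a tool function so every call appends one record to `sink`,
// whether the call succeeds or throws.
function traced<A, R>(
  tool: string,
  fn: (args: A) => Promise<R>,
  sink: ToolCallRecord[],
): (args: A) => Promise<R> {
  return async (args: A) => {
    const start = Date.now();
    try {
      const result = await fn(args);
      sink.push({ tool, args, ok: true, result, latencyMs: Date.now() - start });
      return result;
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      sink.push({ tool, args, ok: false, error, latencyMs: Date.now() - start });
      throw err; // logging must not swallow the failure
    }
  };
}
```

Wrapping at the tool boundary means the agent loop itself needs no logging code, and failed calls are recorded with the same shape as successful ones, which keeps error-rate and latency dashboards simple to build.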

Cost and step caps

const MAX_STEPS = 50;
const MAX_COST_USD = 2.00;
// Illustrative prices ($3 / $15 per million tokens); use your model's actual rates.
const INPUT_USD_PER_TOKEN = 3 / 1_000_000;
const OUTPUT_USD_PER_TOKEN = 15 / 1_000_000;

let stepCount = 0;
let costUsd = 0;

// state, isDone, runOneStep, and applyResult come from your agent loop.
while (!isDone(state)) {
  if (++stepCount > MAX_STEPS) {
    throw new Error('Step cap exceeded: possible loop.');
  }
  const { result, usage } = await runOneStep(state);
  costUsd += usage.inputTokens * INPUT_USD_PER_TOKEN + usage.outputTokens * OUTPUT_USD_PER_TOKEN;
  if (costUsd > MAX_COST_USD) {
    throw new Error(`Cost cap exceeded: $${costUsd.toFixed(2)}`);
  }
  state = applyResult(state, result);
}

Non-negotiable ceilings. Cheaper to fail a task than to let a loop burn $400 of Opus calls at 3 AM.
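
It helps to work out which of the two caps will trip first for a typical step. The sketch below uses the same rates as the loop above ($3 and $15 per million tokens) and an assumed step size of 10,000 input and 1,000 output tokens; `stepCostUsd` and `firstStepOverCap` are hypothetical helpers.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one step at $3 per million input tokens and $15 per million output tokens.
function stepCostUsd(u: Usage): number {
  return (u.inputTokens * 3 + u.outputTokens * 15) / 1_000_000;
}

// 1-based index of the step on which cumulative cost first exceeds the cap,
// assuming every step has the same token usage.
function firstStepOverCap(perStep: Usage, capUsd: number): number {
  return Math.floor(capUsd / stepCostUsd(perStep)) + 1;
}

const typical: Usage = { inputTokens: 10_000, outputTokens: 1_000 };
console.log(stepCostUsd(typical)); // 0.045
console.log(firstStepOverCap(typical, 2.0)); // 45
```

At $0.045 per step, the $2.00 cost cap trips on step 45, before the 50-step cap is reached. With smaller steps the step cap binds first, so both limits are worth keeping.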

Next: the security dimension. An agent is a new attack surface.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-production-patterns-creators

What is the core idea behind "Production Agent Patterns: Queues, Retries, Idempotency"?
1. A prototype agent and a production agent have the same LLM. What's different is everything around it — durable state, retries, idempotency, observability. The real engineering.
2. Eliminate injection risk entirely
3. filesystem
4. Track checklist effectiveness over time
Which term best describes a foundational idea in "Production Agent Patterns: Queues, Retries, Idempotency"?
1. idempotency key
2. durable state
3. retry backoff
4. workflow
A learner studying Production Agent Patterns: Queues, Retries, Idempotency would need to understand which concept?
1. durable state
2. retry backoff
3. idempotency key
4. workflow
Which of these is directly relevant to Production Agent Patterns: Queues, Retries, Idempotency?
1. durable state
2. idempotency key
3. workflow
4. retry backoff
Which of the following is a key point about Production Agent Patterns: Queues, Retries, Idempotency?
1. Vercel Workflow DevKit (WDK) — step-based, crash-safe, powered by Queues.
2. LangGraph + PostgresSaver — durable state machines.
3. Temporal — mature workflow engine; strong for multi-day flows.
4. Inngest — event-driven steps with retries and concurrency controls.
Which of these does NOT belong in a discussion of Production Agent Patterns: Queues, Retries, Idempotency?
1. Vercel Workflow DevKit (WDK) — step-based, crash-safe, powered by Queues.
2. Eliminate injection risk entirely
3. Temporal — mature workflow engine; strong for multi-day flows.
4. LangGraph + PostgresSaver — durable state machines.
Which statement is accurate regarding Production Agent Patterns: Queues, Retries, Idempotency?
1. Log every tool call with args, result, latency.
2. Record every workflow event (start, step complete, retry, fail).
3. Trace every LLM call with prompt, response, token counts, cost — OpenTelemetry + LangSmith/Braintrus…
4. Cost dashboards per workflow/agent/user.
Which of these does NOT belong in a discussion of Production Agent Patterns: Queues, Retries, Idempotency?
1. Log every tool call with args, result, latency.
2. Eliminate injection risk entirely
3. Record every workflow event (start, step complete, retry, fail).
4. Trace every LLM call with prompt, response, token counts, cost — OpenTelemetry + LangSmith/Braintrus…
What is the key insight about "The pager story everyone has" in the context of Production Agent Patterns: Queues, Retries, Idempotency?
1. Someone at every agent company has been paged at 3 AM because an agent went into a loop and racked up $4000 in API calls.
2. Eliminate injection risk entirely
3. filesystem
4. Track checklist effectiveness over time
What is the key insight about "Testing durability" in the context of Production Agent Patterns: Queues, Retries, Idempotency?
1. Eliminate injection risk entirely
2. Periodically (weekly, monthly) simulate a process crash mid-workflow in staging. Verify the workflow resumes correctly.
3. filesystem
4. Track checklist effectiveness over time
What is the key warning about "Scope your agents tightly" in the context of Production Agent Patterns: Queues, Retries, Idempotency?
1. Eliminate injection risk entirely
2. filesystem
3. Always define: goal, tools, permissions, and stop condition before executing.
4. Track checklist effectiveness over time
Which statement accurately describes an aspect of Production Agent Patterns: Queues, Retries, Idempotency?
1. Eliminate injection risk entirely
2. filesystem
3. Track checklist effectiveness over time
4. In a prototype, a crash is fine — you rerun. In production, a crash means a user's pizza never got ordered and a $4 LLM call got burned.
What does working with Production Agent Patterns: Queues, Retries, Idempotency typically involve?
1. Agents that run longer than a few seconds shouldn't live in memory. Checkpoint after every step. Options in 2026:
2. Eliminate injection risk entirely
3. filesystem
4. Track checklist effectiveness over time
Which of the following is true about Production Agent Patterns: Queues, Retries, Idempotency?
1. Eliminate injection risk entirely
2. Any step that touches the outside world (send email, charge card, create ticket) needs an idempotency key.
3. filesystem
4. Track checklist effectiveness over time
Which best describes the scope of "Production Agent Patterns: Queues, Retries, Idempotency"?
1. It is unrelated to agentic workflows
2. It applies only to the opposite beginner tier
3. It focuses on A prototype agent and a production agent have the same LLM. What's different is everything around it
4. It was deprecated in 2024 and no longer relevant
