Tendril

Production Agent Patterns: Queues, Retries, Idempotency

A prototype agent and a production agent use the same LLM. What differs is everything around it: durable state, retries, idempotency, observability. That is where the real engineering happens.

Why production is different

In a prototype, a crash is fine: you rerun. In production, a crash means a user's pizza never got ordered and a $4 LLM call got burned. Production agents must be durable, idempotent, retry-aware, observable, and cost-capped. Most teams discover this only after shipping a demo.

The five production requirements

Requirement	What it means
Durable state	Every step is persisted. Process can die and resume.
Idempotent steps	Re-running a step is safe — no duplicate actions.
Retries with backoff	Transient failures retry; permanent failures surface.
Observability	Every tool call, every prompt, every token logged.
Cost + step caps	Hard ceilings prevent runaway loops and bills.

Durable state pattern

Agents that run longer than a few seconds shouldn't live in memory. Checkpoint after every step. Options in 2026:

Vercel Workflow DevKit (WDK) — step-based, crash-safe, powered by Queues.
LangGraph + PostgresSaver — durable state machines.
Temporal — mature workflow engine; strong for multi-day flows.
Inngest — event-driven steps with retries and concurrency controls.
Roll your own — Postgres + a state column + a worker loop.

Every step() is durable. If the process dies, execution resumes from the last completed step. Built-in retries + timeouts.

typescript

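// Vercel Workflow DevKit, "use workflow" directive.
// Models are addressed via the AI Gateway alias format.
// The step(name, fn, options) helper follows this lesson's notation; confirm
// the exact step API against the WDK release you install.
import { step } from 'workflow';
import { generateText } from 'ai';

// searchAndSummarize is the agent's own research tool, defined elsewhere.

export async function researchAgent(goal: string) {
  'use workflow';

  const plan = await step('plan', async () => {
    const { text } = await generateText({
      model: 'anthropic/claude-opus-4.7',
      prompt: `Break into sub-questions:\n${goal}`,
    });
    // Drop blank lines so they do not become empty research steps.
    return text.split('\n').map((line) => line.trim()).filter(Boolean);
  });

  const findings = [];
  for (const [i, q] of plan.entries()) {
    // Name steps by index: two questions can share their first 20 characters,
    // and a resumed run must map each name to exactly one step.
    const answer = await step(`research:${i}`, async () => {
      return await searchAndSummarize(q);
    }, { retries: 3, timeout: '60s' });
    findings.push({ q, answer });
  }

  return await step('synthesize', async () => {
    const { text } = await generateText({
      model: 'anthropic/claude-opus-4.7',
      prompt: `Write a cited answer:\n${JSON.stringify(findings)}`,
    });
    return text;
  });
}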
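
The roll-your-own option from the list above can be sketched in a few lines. This is a minimal sketch, not a production design: `CheckpointStore`, `MemoryStore`, and `runStep` are hypothetical names, and the in-memory `Map` stands in for a Postgres table keyed by run ID and step name.

```typescript
// Hypothetical checkpoint store. In production this would be a Postgres table
// keyed by (runId, stepName) with a JSON result column.
interface CheckpointStore {
  get(runId: string, stepName: string): Promise<string | undefined>;
  put(runId: string, stepName: string, resultJson: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without a database.
class MemoryStore implements CheckpointStore {
  private rows = new Map<string, string>();

  async get(runId: string, stepName: string): Promise<string | undefined> {
    return this.rows.get(`${runId}:${stepName}`);
  }

  async put(runId: string, stepName: string, resultJson: string): Promise<void> {
    this.rows.set(`${runId}:${stepName}`, resultJson);
  }
}

// Run a step at most once per (runId, stepName). On resume, a completed step
// returns its saved result instead of executing again.
// Step results must be JSON-serializable values.
async function runStep<T>(
  store: CheckpointStore,
  runId: string,
  stepName: string,
  fn: () => Promise<T>,
): Promise<T> {
  const saved = await store.get(runId, stepName);
  if (saved !== undefined) return JSON.parse(saved) as T;
  const result = await fn();
  await store.put(runId, stepName, JSON.stringify(result));
  return result;
}
```

A crash after `fn()` finishes but before `put` completes will rerun the step on resume, which is why checkpointing alone is not enough and each step also has to be safe to repeat.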

Idempotency — the underrated superpower

Any step that touches the outside world (send email, charge card, create ticket) needs an idempotency key. When the step retries, the external system recognizes the key and doesn't duplicate the action.

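Every external API call gets an idempotency key derived from workflow state. Stripe supports this natively through its Idempotency-Key header; check each provider's documentation before relying on it, and deduplicate on your own side where support is missing.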

typescript

// Idempotent Stripe charge.
// stripe is an initialized Stripe client; taskId, stepName, and custId come
// from workflow state.
const idempotencyKey = `task:${taskId}:step:${stepName}`;
const paymentIntent = await stripe.paymentIntents.create(
  { amount: 5000, currency: 'usd', customer: custId }, // amount in cents
  { idempotencyKey }
);
// Reusing the same key on retry returns the original PaymentIntent: no double-billing.
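
When a downstream API has no native idempotency support, a similar guarantee can be approximated on the caller's side. The following is a minimal sketch under stated assumptions: `idempotencyKey` and `runOnce` are hypothetical helpers, and the in-memory `Map` stands in for a durable table with a unique constraint on the key.

```typescript
// Derive a stable key from workflow state: the same task and step always
// produce the same key, so a retry maps to the same record.
function idempotencyKey(taskId: string, stepName: string): string {
  return `task:${taskId}:step:${stepName}`;
}

// Hypothetical stand-in for a durable table with a unique key column.
const attempts = new Map<string, Promise<unknown>>();

// Perform a side effect at most once per key. Later calls with the same key,
// including concurrent ones in this process, share the first call's result.
function runOnce<T>(key: string, action: () => Promise<T>): Promise<T> {
  const existing = attempts.get(key);
  if (existing !== undefined) return existing as Promise<T>;

  const pending = action();
  attempts.set(key, pending);
  // Forget failed attempts so a retry is allowed to run the action again.
  pending.catch(() => attempts.delete(key));
  return pending;
}
```

Because the map lives in process memory, this only deduplicates within one worker. Across workers or restarts, the same check has to be an atomic insert into shared storage.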

Retry policy

Error type	Policy
Network timeout	Retry 3x with exponential backoff (1s, 5s, 30s).
Rate limit (429)	Retry after Retry-After header; circuit-break after 5 attempts.
5xx server error	Retry 3x; alert on repeated 503.
Tool schema mismatch	One retry with error fed back to model.
Other 4xx client error	Do NOT retry; it will fail the same way.
Auth failure	Do NOT retry — alert, stop.
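
The table above can be folded into one helper. This is a sketch under stated assumptions, not a complete policy: `HttpFailure`, `isRetryable`, and `withRetry` are hypothetical names, any error without a numeric `status` is treated as a network failure, and a single budget of three retries is applied to every transient error for simplicity. The `sleep` parameter is injectable so the logic can be exercised without waiting.

```typescript
// Hypothetical error shape: an HTTP status plus an optional Retry-After hint.
interface HttpFailure {
  status: number;
  retryAfterSec?: number;
}

function isHttpFailure(err: unknown): err is HttpFailure {
  return (
    typeof err === 'object' &&
    err !== null &&
    typeof (err as { status?: unknown }).status === 'number'
  );
}

// Transient: network failures (no status), 429, and 5xx.
// Permanent: every other 4xx, including auth failures.
function isRetryable(err: unknown): boolean {
  if (!isHttpFailure(err)) return true;
  return err.status === 429 || err.status >= 500;
}

// Delays before retry 1, 2, and 3, matching the schedule in the table.
const BACKOFF_MS = [1_000, 5_000, 30_000];

async function withRetry<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent errors and exhausted budgets surface to the caller.
      if (!isRetryable(err) || attempt >= BACKOFF_MS.length) throw err;
      // Prefer the server's Retry-After hint when one was provided.
      const hinted = isHttpFailure(err) ? err.retryAfterSec : undefined;
      await sleep(hinted !== undefined ? hinted * 1_000 : BACKOFF_MS[attempt]);
    }
  }
}
```

In practice the delays usually get random jitter added so that many workers failing at once do not retry in lockstep.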

Observability essentials

Trace every LLM call with prompt, response, token counts, cost — OpenTelemetry + LangSmith/Braintrust/Vercel Observability.
Log every tool call with args, result, latency.
Record every workflow event (start, step complete, retry, fail).
Cost dashboards per workflow/agent/user.
Alerts on cost spikes, error spikes, latency regressions.
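
Logging every tool call with args, result, and latency is easiest when tools are wrapped once at registration time rather than instrumented by hand. A minimal sketch, assuming a hypothetical `traced` wrapper and an in-memory `toolLog` array in place of a real tracing backend:

```typescript
// One record per tool invocation.
interface ToolCallRecord {
  tool: string;
  args: unknown[];
  ok: boolean;
  result?: unknown;
  error?: string;
  latencyMs: number;
}

// Stand-in for a tracing backend or log pipeline.
const toolLog: ToolCallRecord[] = [];

// Wrap an async tool so every call is recorded, whether it succeeds or fails.
function traced<A extends unknown[], R>(
  tool: string,
  fn: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const start = Date.now();
    try {
      const result = await fn(...args);
      toolLog.push({ tool, args, ok: true, result, latencyMs: Date.now() - start });
      return result;
    } catch (err) {
      toolLog.push({ tool, args, ok: false, error: String(err), latencyMs: Date.now() - start });
      throw err; // logging must not swallow the failure
    }
  };
}
```

Args and results can contain user data or secrets, so a real pipeline would typically redact or truncate them before they are stored.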

Cost and step caps

Non-negotiable ceilings. Cheaper to fail a task than to let a loop burn $400 of Opus calls at 3 AM.

typescript

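const MAX_STEPS = 50;
const MAX_COST_USD = 2.00;

// Example per-token rates in USD; substitute your model's actual pricing.
const INPUT_USD_PER_TOKEN = 3 / 1_000_000;
const OUTPUT_USD_PER_TOKEN = 15 / 1_000_000;

let stepCount = 0;
let costUsd = 0;

// isDone, runOneStep, and applyResult are the agent's own functions;
// state is the current workflow state.
while (!isDone(state)) {
  if (++stepCount > MAX_STEPS) {
    throw new Error('Step cap exceeded: possible loop.');
  }
  const { result, usage } = await runOneStep(state);
  costUsd += usage.inputTokens * INPUT_USD_PER_TOKEN
    + usage.outputTokens * OUTPUT_USD_PER_TOKEN;
  // Checked after the step runs, so the last step can overshoot the cap slightly.
  if (costUsd > MAX_COST_USD) {
    throw new Error(`Cost cap exceeded: $${costUsd.toFixed(2)}`);
  }
  state = applyResult(state, result);
}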

Next: the security dimension. An agent is a new attack surface.
