Lesson 872 of 2116
Debugging A Heartbeat Loop: Observability, Replay, And Failure Modes
Heartbeats fail in ways reactive agents never do — silent drift, soul-state thrash, infinite loops. Debugging them takes different tools and a different mental model.
Why heartbeats are harder to debug
A reactive agent fails in front of the user — the bug is in the message you just got. A heartbeat soul fails while you're asleep. By the time you notice, it has run hundreds of beats, mutated its own memory, called dozens of tools, and possibly recovered (or not) without telling anyone. Debugging means rebuilding the story from logs, not watching it happen.
What good observability looks like
1. Per-beat trace: every beat logs its trigger, its model input, its tool calls, its memory deltas, and its outcome
2. Beat timeline: a chart of beats per minute over time, so spikes and silences are visible at a glance
3. Soul state diff: snapshots of soul memory before/after each beat, browseable in the dashboard
4. Tool-call audit: an immutable log of every external action, with the beat ID that caused it
5. Token and cost ledger: live numbers, not 'check your bill next month'
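As a rough illustration of the beat timeline, the sketch below buckets beat start times into minutes and flags spikes and silences. It assumes each beat record carries an ISO-8601 `started_at` string; the function names and the `spike_threshold` default are hypothetical, not part of any runtime.

```python
from collections import Counter
from datetime import datetime, timedelta


def beats_per_minute(started_at_values):
    """Count beats in each one-minute bucket, keyed by the bucket's start time."""
    counts = Counter()
    for ts in started_at_values:
        # Accept the trailing "Z" used in the trace records.
        t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        counts[t.replace(second=0, microsecond=0)] += 1
    return counts


def find_anomalies(counts, spike_threshold=30):
    """Return (spikes, silences).

    Spikes are minutes with more than `spike_threshold` beats. Silences are
    minutes with no beats between the first and last observed beat.
    """
    if not counts:
        return [], []
    spikes = sorted(m for m, n in counts.items() if n > spike_threshold)
    silences = []
    minute, last = min(counts), max(counts)
    while minute <= last:
        if minute not in counts:
            silences.append(minute)
        minute += timedelta(minutes=1)
    return spikes, silences
```

A sensible threshold depends on the soul: a scheduled soul beating once a minute and an event-driven soul handling bursts need very different limits.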
A single beat's structured log. The cost of structured beats is the cost of being able to debug them — pay it.
{
  "beat_id": "b_2026_04_27_142055_pr-reviewer",
  "trigger": { "type": "event", "source": "github.pull_request.opened", "id": "PR-1842" },
  "started_at": "2026-04-27T14:20:55Z",
  "duration_ms": 4321,
  "input_tokens": 8450,
  "output_tokens": 612,
  "tool_calls": [
    { "name": "github.read_diff", "ok": true },
    { "name": "github.post_review", "ok": true }
  ],
  "memory_deltas": [
    { "key": "recent_reviews", "op": "append", "size": 1 }
  ],
  "outcome": "acted",
  "next_beat": null
}

Replay: the heartbeat-debug superpower
The single best heartbeat debugging tool is replay: re-running a past beat against the current code, with the original trigger and memory snapshot, and watching what happens. Reactive agents replay individual messages; heartbeat souls need to replay beats. A good runtime makes this a one-command operation, such as 'replay beat b_2026_04_27_142055_pr-reviewer'. The soul wakes up in a sandbox, sees what it saw then, and you watch it think.
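A minimal sketch of what replay involves, assuming a beat handler is a plain function taking a trigger, a memory dict, and a tools object. `RecordingTools`, `replay_beat`, and `review_handler` are hypothetical names for illustration, not a real runtime API.

```python
import copy


class RecordingTools:
    """Sandbox stand-in for real tools: records calls instead of performing them."""

    def __init__(self):
        self.calls = []

    def call(self, name, **kwargs):
        self.calls.append({"name": name, "args": kwargs})
        return {"ok": True, "sandboxed": True}


def replay_beat(handler, trigger, memory_snapshot):
    """Run the current handler against a past trigger and memory snapshot.

    Returns what the beat would do now, without mutating the stored
    snapshot or touching any real tool.
    """
    memory = copy.deepcopy(memory_snapshot)  # never mutate the stored snapshot
    tools = RecordingTools()
    outcome = handler(trigger, memory, tools)
    return {"outcome": outcome, "tool_calls": tools.calls, "memory_after": memory}


# Hypothetical handler standing in for the soul's current code.
def review_handler(trigger, memory, tools):
    tools.call("github.read_diff", pr=trigger["id"])
    memory.setdefault("recent_reviews", []).append(trigger["id"])
    return "acted"


snapshot = {"recent_reviews": []}
result = replay_beat(review_handler, {"type": "event", "id": "PR-1842"}, snapshot)
```

Comparing `result` with the original beat's trace shows whether the current code would behave differently from what was recorded. Because the model call itself is usually non-deterministic, a real replay may also need recorded model outputs or a fixed seed, which this sketch leaves out.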
Failure modes you will see
The table lists six failure modes; the first three get a closer look in the sections that follow.
| Failure mode | Symptom | Root-cause direction |
|---|---|---|
| Infinite loop | Beats-per-minute graph goes vertical; budget caps kick in | Self-paced soul picking tiny intervals, or recursive event trigger |
| Soul-state thrash | Memory deltas alternate forward and back every few beats | Two beats writing competing values; missing locks or stale reads |
| Drift | Soul's behavior slowly diverges from its job over days | Memory accumulating noise; bad facts learned and never corrected |
| Phantom no-ops | Soul wakes 1000 times, never acts, beats look fine | Trigger condition is always-true; soul thinks 'nothing to do' every time |
| Stuck retry | Same error every beat, error-rate breaker trips eventually | External tool returning a failure the soul doesn't recognize as fatal |
| Silent staleness | Soul keeps acting on data it stopped refreshing weeks ago | Refresh tool deprecated; soul never noticed |
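Two of the rows above, phantom no-ops and stuck retry, can be spotted from beat outcomes alone. The sketch below assumes each beat record has an `outcome` field and an optional `error` string; the `"noop"` outcome value, the `error` field, and both limits are assumptions for illustration.

```python
def trailing_run(items, predicate):
    """Count how many items at the end of the list satisfy the predicate."""
    count = 0
    for item in reversed(items):
        if not predicate(item):
            break
        count += 1
    return count


def diagnose(beats, noop_limit=100, retry_limit=5):
    """Return the names of failure modes suggested by the most recent beats."""
    findings = []
    # Phantom no-ops: a long trailing run of beats that woke up and did nothing.
    if trailing_run(beats, lambda b: b["outcome"] == "noop") >= noop_limit:
        findings.append("phantom no-ops")
    # Stuck retry: the same error repeated on every recent beat.
    if beats:
        last_error = beats[-1].get("error")
        if last_error and trailing_run(
            beats, lambda b: b.get("error") == last_error
        ) >= retry_limit:
            findings.append("stuck retry")
    return findings
```

This only flags candidates. Deciding whether a run of no-ops is a bug or a quiet week still takes a look at the trigger condition.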
Infinite loops, in detail
The classic OpenClaw infinite loop has two flavors. The first is a self-paced soul whose state-update logic accidentally always returns 'wake me in 1 second'. The rate limit catches it, but only after a noisy minute. The second is two souls beating each other: soul A sends a message that triggers soul B's event heartbeat, which sends a message that triggers soul A's event heartbeat. The rate limit floor contains both flavors, but containment is not a fix. The real fix is detecting the cycle in the trigger logs and breaking it.
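Detecting a trigger cycle can be sketched as ordinary cycle detection on a directed graph. The example assumes the trigger log can be reduced to `(source_soul, target_soul)` pairs, meaning a message from the source triggered a beat in the target; that log shape and the function name are hypothetical.

```python
def find_trigger_cycle(edges):
    """Return one cycle as a list of soul names (first == last), or None."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(graph[node]):
            if color[nxt] == GRAY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

The recursive search is fine for a handful of souls; a very large trigger graph would want an iterative version to avoid Python's recursion limit.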
Soul-state thrash
Thrash happens when two beats — usually two close-together beats from different triggers — disagree on what the memory should say, and each undoes the other's writes. You'll see memory deltas alternating forward and back. The fix is a single coordinator beat (only one type of trigger writes to a given memory key), or proper locks (a beat reads-then-writes atomically). Without one of those, your soul is in a small civil war with itself.
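The alternating pattern can be found mechanically. This sketch assumes the memory deltas can be flattened into `(key, value)` writes in beat order; that shape, the function names, and the `min_run` default are illustrative assumptions.

```python
def longest_alternation(values):
    """Length of the longest run of writes that flip between two values (A, B, A, B, ...)."""
    best = run = 0
    for i in range(2, len(values)):
        if values[i] == values[i - 2] and values[i] != values[i - 1]:
            # The first A, B, A match covers three writes; each later match adds one.
            run = run + 1 if run else 3
        else:
            run = 0
        best = max(best, run)
    return best


def thrashing_keys(write_log, min_run=4):
    """Return memory keys whose writes alternate for at least `min_run` writes."""
    history = {}
    for key, value in write_log:
        history.setdefault(key, []).append(value)
    return sorted(k for k, v in history.items() if longest_alternation(v) >= min_run)
```

Once a key is flagged, the beat IDs behind the alternating writes show which two triggers are competing for it.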
Drift
Drift is the slow killer. The soul behaves correctly on day one, mostly correctly on day seven, and oddly on day thirty. Usually the cause is accumulated memory — random facts the soul wrote during weird beats, never corrected, now distorting its sense of self. The cure is a periodic memory-consolidation heartbeat that prunes, summarizes, and corrects. Souls that never review their own memory always drift.
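The pruning half of a consolidation beat might look like the sketch below. It assumes a hypothetical memory schema where each entry has a `fact`, the `day` it was written (as a day number), and a `confirmations` count. Summarizing and correcting would need a model call and are left out.

```python
def consolidate(entries, today, max_age_days=14):
    """Keep the newest entry per fact, and drop old facts that were never reconfirmed."""
    newest = {}
    for entry in entries:
        kept = newest.get(entry["fact"])
        if kept is None or entry["day"] > kept["day"]:
            newest[entry["fact"]] = entry

    result = []
    for entry in newest.values():
        stale = today - entry["day"] > max_age_days
        if stale and entry["confirmations"] == 0:
            continue  # old and never seen again: likely noise from a weird beat
        result.append(entry)
    return sorted(result, key=lambda e: e["day"])
```

Logging what each consolidation beat removed, like any other memory delta, keeps the pruning itself debuggable.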
Apply: the four-step debug ritual
1. Pause the soul: preserve state, stop the bleeding
2. Pull the beat timeline; find the moment behavior changed
3. Replay the suspect beat in a sandbox with current code
4. Decide: is this a code fix, a config fix, a memory fix, or a trigger-logic fix? Fix the root cause, not the symptom
The big idea: heartbeats fail differently than reactive agents — silently, slowly, and at 3 AM. Per-beat traces, replay, and three named failure modes — infinite loops, thrash, drift — are the toolkit. Pause first, replay second, fix the root cause.
