Observability: Logs, Traces, And Soul Timelines
A long-running agent is a black box unless you instrument it. Logs tell you what; traces tell you why; the soul timeline tells you whether the runtime is healthy at all.
Lesson map
The main moves, in order:
1. Why agents need their own observability
2. Structured logs
3. A trace per heartbeat
4. The soul timeline
Why agents need their own observability
A web service that's slow is obvious — pages don't load. A soul that's quietly drifting — choosing the wrong skill, looping on the same heartbeat, burning model budget while you sleep — is invisible until you check. OpenClaw is opinionated here: every heartbeat emits a structured log, every skill call emits a trace span, and every soul has a timeline view in Mission Control. Use them or you're flying blind.
Three layers, three questions
Compare the options
| Layer | Question it answers | Where it lives |
|---|---|---|
| Logs | What happened in this heartbeat? | stdout / file / log drain (Loki, Datadog) |
| Traces | How long did each step take, and which step was the bottleneck? | OTLP endpoint (Jaeger, Honeycomb, Vercel Observability) |
| Soul timeline | Is this soul still healthy as a long-running thing? | Mission Control UI / Grafana dashboard |
| Audit log | Did the soul actually do what we authorized? | Append-only file in /var/openclaw/audit (lesson 1) |
What to surface in logs
OpenClaw's structured logs include heartbeat ID, soul slug, model used, token counts, skill calls, duration, and outcome. JSON-shaped, one line per event. The default level is info — keep it there. Cranking down to debug buries the useful patterns in noise; raising the floor to warn hides exactly the boring success events you need in order to spot the abnormal one.
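As a sketch of the shape these lines take, here is a stdlib-only emitter. The `log_heartbeat` helper is hypothetical, not OpenClaw's actual logger; it just shows the one-JSON-object-per-line convention the runtime follows.

```python
import json
import sys
from datetime import datetime, timezone

def log_heartbeat(record, stream=sys.stdout):
    """Emit one newline-delimited JSON log line per heartbeat event."""
    # Fill in the fields every event should carry if the caller omitted them.
    record.setdefault("ts", datetime.now(timezone.utc).isoformat())
    record.setdefault("level", "info")
    # Compact separators keep the line short; one line per event keeps it greppable.
    stream.write(json.dumps(record, separators=(",", ":")) + "\n")

log_heartbeat({
    "event": "heartbeat.complete",
    "soul": "inbox-triage",
    "interval_s": 900,
    "actual_duration_s": 12.4,
    "outcome": "success",
})
```

Because every event is a self-contained JSON object on one line, both `grep` and a Loki JSON pipeline can query it without any parsing ceremony.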
One line of OpenClaw heartbeat log. Grep-friendly, Loki-friendly, eyeball-friendly.

```json
{
  "ts": "2026-04-27T08:00:00.123Z",
  "level": "info",
  "event": "heartbeat.complete",
  "soul": "inbox-triage",
  "heartbeat_id": "hb_2k4n9",
  "interval_s": 900,
  "actual_duration_s": 12.4,
  "model": "qwen3.5:8b",
  "tokens_in": 4218,
  "tokens_out": 612,
  "skills_called": ["gmail.list", "gmail.label"],
  "approvals_pending": 0,
  "outcome": "success"
}
```

Traces: where the time actually went
A heartbeat looks like a single event in logs but is a tree of work — model call, skill call, sub-skill call, return. OTLP traces give you that tree. OpenClaw exports OpenTelemetry by default; point it at a collector (Jaeger locally, Honeycomb or Vercel Observability for hosted) and you get flame graphs of every heartbeat. The first time a soul feels 'slow,' a trace shows you it's the model — or it's that one skill that's quietly making three round-trips. Don't guess.
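To make the "tree of work" concrete, here is a stdlib-only sketch of walking a heartbeat's span tree down to its slowest leaf — the question a flame graph answers at a glance. The `Span` class and all names and timings are invented for illustration; this is not the OpenTelemetry API.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """A minimal stand-in for a trace span: a named, timed unit of work."""
    name: str
    start_s: float
    end_s: float
    children: list = field(default_factory=list)

    @property
    def duration_s(self):
        return self.end_s - self.start_s

def bottleneck(span):
    """Follow the slowest child at each level down to the hot leaf."""
    current = span
    while current.children:
        current = max(current.children, key=lambda s: s.duration_s)
    return current

# One heartbeat: a model call, then a skill that quietly makes three round-trips.
heartbeat = Span("heartbeat", 0.0, 12.4, children=[
    Span("model.call", 0.1, 3.2),
    Span("skill.gmail.list", 3.3, 12.1, children=[
        Span("http.round_trip", 3.3, 6.0),
        Span("http.round_trip", 6.1, 9.2),
        Span("http.round_trip", 9.3, 12.0),
    ]),
])
```

Here `bottleneck(heartbeat)` lands on one of the skill's HTTP round-trips, not the model call — exactly the kind of answer you want before you start tuning the wrong layer.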
The soul timeline
Mission Control's soul-timeline view is the long-running version of the trace. It plots heartbeats over hours and days — interval, duration, outcome, token spend. Patterns you can only see here: a soul whose duration is creeping up day over day (memory bloat), a soul whose token-per-tick has 10x'd since you swapped models, a soul whose interval drifts because heartbeats run longer than the gap between them.
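Duration creep is easy to compute from the same log fields the timeline plots. A sketch, with a hypothetical helper: the 7-day window and 1.5x factor are arbitrary thresholds, not anything OpenClaw ships.

```python
from statistics import median

def duration_creep(daily_durations_s, window=7, factor=1.5):
    """Flag a soul whose recent heartbeat duration has drifted well above
    its own historical baseline (the memory-bloat signature)."""
    if len(daily_durations_s) < 2 * window:
        return False  # not enough history to call it a trend
    baseline = median(daily_durations_s[:-window])  # everything before the window
    recent = median(daily_durations_s[-window:])    # the last `window` days
    return recent > factor * baseline
```

Medians rather than means keep one slow outlier day from tripping the check; you want the trend, not the spike.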
Sketch your dashboard before you build it
1. Top row: number of healthy souls, number with pending approvals, number with errors in the last hour. Big numbers, no charts.
2. Per-soul row: last heartbeat timestamp, last duration, last token cost, status dot (green / yellow / red).
3. Trend chart: tokens-per-day for each soul, last 7 days. Spot a soul whose model swap doubled its cost overnight.
4. Heartbeat-anomaly chart: actual_duration vs. interval, log scale. Anything trending toward 1.0 is a soul that's about to overlap itself.
5. Audit feed: scrolling list of the last 50 skill calls. The sanity check — does what's happening match what you authorized?
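The status dot in the per-soul row collapses a few log fields into one color. A sketch; the thresholds are illustrative assumptions (red at a missed 2x-interval window, yellow when merely late), not a fixed OpenClaw rule.

```python
def status_dot(seconds_since_last_tick, interval_s, last_outcome):
    """Collapse a soul's recent health into a green/yellow/red dashboard dot."""
    if last_outcome != "success" or seconds_since_last_tick > 2 * interval_s:
        return "red"      # errored, or missed a whole heartbeat window
    if seconds_since_last_tick > interval_s:
        return "yellow"   # late, but not yet declared missing
    return "green"
```

A dot is deliberately lossy: it exists so the top row can say "how many are not green" without anyone reading a chart.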
Alerting on heartbeat anomalies
The high-leverage alerts are not 'soul errored' — those are loud and self-announcing. The ones that catch real problems are the silent failures: a soul that hasn't ticked, a soul whose tick is taking longer than its interval, a soul whose token spend doubled overnight without a model change. Wire these as paging-grade alerts; everything else is dashboard-grade.
| Alert | Condition | Why it matters |
|---|---|---|
| Heartbeat missed | No heartbeat.complete event in 2x interval window | Soul is dead, hung, or the host is down — and you wouldn't notice otherwise |
| Tick > interval | actual_duration_s > interval_s for 3 consecutive heartbeats | Soul is overlapping itself; ticks are queuing; cost will run away |
| Token spend spike | Daily tokens_in for a soul > 2x rolling 7-day median | Model swap, prompt regression, infinite tool loop, or context bloat |
| Pending approvals piling up | approvals_pending > 5 for over an hour | Soul is stuck waiting for a human; needs attention or the gate needs tuning |
| Repeated skill error | Same skill returning error in 5 consecutive heartbeats | Skill is broken, credentials expired, or the upstream API changed |
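The two paging-grade conditions from the table can be evaluated straight off the structured log fields. A stdlib-only sketch; the function names are invented and the thresholds simply mirror the table, not any OpenClaw API.

```python
def heartbeat_missed(last_ts_s, now_s, interval_s):
    """Page: no heartbeat.complete event seen within a 2x-interval window."""
    return now_s - last_ts_s > 2 * interval_s

def tick_exceeds_interval(records, streak=3):
    """Page: actual_duration_s > interval_s for `streak` consecutive heartbeats.

    `records` is a list of parsed heartbeat.complete log lines, oldest first.
    """
    tail = records[-streak:]
    return len(tail) == streak and all(
        r["actual_duration_s"] > r["interval_s"] for r in tail
    )
```

Note both checks fire on silence or drift, not on errors — errors already announce themselves; these are for the failures that don't.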
Apply: instrument one soul this week
1. Pick the soul that runs most often.
2. Tail its logs for one full heartbeat — read every field. If anything's missing that you'd want at 3am, raise a feature request or add a log line.
3. Wire OTLP to a free Honeycomb, Jaeger, or SigNoz collector. Look at one trace.
4. Sketch your one-screen dashboard on paper before you open Grafana.
5. Set the 'heartbeat missed' and 'tick > interval' alerts. Skip the rest until you've used the dashboard for a week.
The big idea: a long-running agent without observability is a long-running mystery. Wire logs, traces, and the soul timeline before you trust a soul with anything that matters.
Related lessons
- Codex With Custom Tools And MCP — Codex's real power shows when you connect it to your own tools — internal APIs, datastores, ticketing systems — usually via Model Context Protocol.
- Debugging A Heartbeat Loop: Observability, Replay, And Failure Modes — Heartbeats fail in ways reactive agents never do — silent drift, soul-state thrash, infinite loops. Debugging them takes different tools and a different mental model.
- Beyond The Basics: Federation, Custom Runtimes, Contributing Back — Once you trust the runtime, the next moves are scaling out (multiple machines), swapping the brain (different LLM provider), and giving back (clean upstream contributions). Each step compounds the value of the rest.
