A long-running agent is a black box unless you instrument it. Logs tell you what; traces tell you why; the soul timeline tells you whether the runtime is healthy at all.
A web service that's slow is obvious — pages don't load. A soul that's quietly drifting — choosing the wrong skill, looping on the same heartbeat, burning model budget while you sleep — is invisible until you check. OpenClaw is opinionated here: every heartbeat emits a structured log, every skill call emits a trace span, and every soul has a timeline view in Mission Control. Use them or you're flying blind.
| Layer | Question it answers | Where it lives |
|---|---|---|
| Logs | What happened in this heartbeat? | stdout / file / log drain (Loki, Datadog) |
| Traces | How long did each step take, and which step was the bottleneck? | OTLP endpoint (Jaeger, Honeycomb, Vercel Observability) |
| Soul timeline | Is this soul still healthy as a long-running thing? | Mission Control UI / Grafana dashboard |
| Audit log | Did the soul actually do what we authorized? | Append-only file in /var/openclaw/audit (lesson 1) |
OpenClaw's structured logs include heartbeat ID, soul slug, model used, token count, skill calls, duration, and outcome. Each event is a single line of JSON. The default level is info, and you should keep it there. Cranking it to debug buries useful patterns in noise; raising it to warn hides exactly the boring-success events you need as a baseline for spotting the abnormal one.
{ "ts": "2026-04-27T08:00:00.123Z", "level": "info", "event": "heartbeat.complete", "soul": "inbox-triage", "heartbeat_id": "hb_2k4n9", "interval_s": 900, "actual_duration_s": 12.4, "model": "qwen3.5:8b", "tokens_in": 4218, "tokens_out": 612, "skills_called": ["gmail.list", "gmail.label"], "approvals_pending": 0, "outcome": "success" }
One line of OpenClaw heartbeat log. Grep-friendly, Loki-friendly, eyeball-friendly.
A heartbeat looks like a single event in logs but is a tree of work: model call, skill call, sub-skill call, return. OTLP traces give you that tree. OpenClaw exports OpenTelemetry by default; point it at a collector (Jaeger locally, Honeycomb or Vercel Observability for hosted) and you get flame graphs of every heartbeat. The first time a soul feels 'slow,' a trace shows you it's the model, or it's that one skill that's quietly making three round-trips. Don't guess.
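To make the "tree of work" idea concrete, here is a minimal sketch of how a trace viewer finds the bottleneck: subtract each span's children from its own duration to get its self time, then see which step owns the most. The span tuples, names, and durations below are invented for illustration and are not real OpenClaw or OTLP output; the calculation also assumes child spans run one after another rather than in parallel.

```python
from collections import defaultdict

# Hypothetical span records: (span_id, parent_id, name, duration_ms).
# Names and numbers are made up for this example.
SPANS = [
    ("s1", None, "heartbeat", 12400),
    ("s2", "s1", "model.call", 3100),
    ("s3", "s1", "skill:gmail.list", 8900),
    ("s4", "s3", "http.get", 2900),
    ("s5", "s3", "http.get", 2950),
    ("s6", "s3", "http.get", 2900),
]

def self_times(spans):
    """Return {span_id: duration minus the summed duration of direct children}."""
    child_total = defaultdict(int)
    for _span_id, parent_id, _name, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {sid: dur - child_total[sid] for sid, _p, _n, dur in spans}

def slowest_by_name(spans):
    """Sum self time per span name and return the (name, total_ms) pair with the most."""
    st = self_times(spans)
    totals = defaultdict(int)
    for sid, _p, name, _d in spans:
        totals[name] += st[sid]
    return max(totals.items(), key=lambda kv: kv[1])

print(slowest_by_name(SPANS))
```

In this made-up trace the skill span looks slow at 8.9 s, but almost all of that time belongs to its three `http.get` children, which is the "three round-trips" pattern a flame graph makes visible.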
Mission Control's soul-timeline view is the long-running version of the trace. It plots heartbeats over hours and days — interval, duration, outcome, token spend. Patterns you can only see here: a soul whose duration is creeping up day over day (memory bloat), a soul whose token-per-tick has 10x'd since you swapped models, a soul whose interval drifts because heartbeats run longer than the gap between them.
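The "duration creeping up day over day" pattern can be approximated outside Mission Control as well. The sketch below is an assumption-laden illustration, not how the timeline view is implemented: it takes heartbeat events shaped like the log line shown earlier (only `ts` and `actual_duration_s` are used), reduces them to one median per calendar day, and fits a least-squares slope. Days are treated as evenly spaced, so gaps in the data are ignored.

```python
from collections import defaultdict
from statistics import median

def daily_medians(events):
    """Group events by calendar day (first 10 chars of the ISO 'ts' field)
    and return [(day, median actual_duration_s)] sorted by day."""
    by_day = defaultdict(list)
    for e in events:
        by_day[e["ts"][:10]].append(e["actual_duration_s"])
    return [(day, median(vals)) for day, vals in sorted(by_day.items())]

def slope_per_day(points):
    """Least-squares slope of the daily medians, in seconds per day.
    Returns 0.0 when there are fewer than two days of data."""
    n = len(points)
    if n < 2:
        return 0.0
    xs = range(n)
    ys = [y for _day, y in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A persistently positive slope over a week or more is the signal worth looking at; a single slow day barely moves the median, which is why the median is used here instead of the mean.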
The high-leverage alerts are not 'soul errored' — those are loud and self-announcing. The ones that catch real problems are the silent failures: a soul that hasn't ticked, a soul whose tick is taking longer than its interval, a soul whose token spend doubled overnight without a model change. Wire these as paging-grade alerts; everything else is dashboard-grade.
| Alert | Condition | Why it matters |
|---|---|---|
| Heartbeat missed | No heartbeat.complete event in 2x interval window | Soul is dead, hung, or the host is down — and you wouldn't notice otherwise |
| Tick > interval | actual_duration_s > interval_s for 3 consecutive heartbeats | Soul is overlapping itself; ticks are queuing; cost will run away |
| Token spend spike | Daily tokens_in for a soul > 2x rolling 7-day median | Model swap, prompt regression, infinite tool loop, or context bloat |
| Pending approvals piling up | approvals_pending > 5 for over an hour | Soul is stuck waiting for a human; needs attention or the gate needs tuning |
| Repeated skill error | Same skill returning error in 5 consecutive heartbeats | Skill is broken, credentials expired, or the upstream API changed |
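The first three conditions in the table can be evaluated directly from `heartbeat.complete` log lines. The following is a rough sketch using only the field names from the sample log line; the function names and the idea of feeding in pre-aggregated daily token totals are assumptions made for this example, and a real deployment would more likely express these as rules in the log or metrics backend.

```python
import json
from datetime import datetime, timedelta
from statistics import median

def parse_events(lines):
    """Parse JSON log lines and keep only heartbeat.complete events."""
    events = [json.loads(line) for line in lines if line.strip()]
    return [e for e in events if e.get("event") == "heartbeat.complete"]

def parse_ts(ts):
    # Older Pythons' fromisoformat() rejects a trailing 'Z', so normalise it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def heartbeat_missed(events, now):
    """True if the newest event is older than 2x its own interval_s.
    `now` must be a timezone-aware datetime."""
    if not events:
        return True
    last = max(events, key=lambda e: parse_ts(e["ts"]))
    return now - parse_ts(last["ts"]) > timedelta(seconds=2 * last["interval_s"])

def tick_over_interval(events, run=3):
    """True if the last `run` heartbeats each took longer than their interval."""
    recent = sorted(events, key=lambda e: parse_ts(e["ts"]))[-run:]
    return len(recent) == run and all(
        e["actual_duration_s"] > e["interval_s"] for e in recent
    )

def token_spike(daily_tokens_in, factor=2.0):
    """daily_tokens_in: oldest-first daily totals with today last.
    True if today exceeds factor x the median of the previous 7 days."""
    if len(daily_tokens_in) < 8:
        return False  # not enough history for a 7-day baseline
    *history, today = daily_tokens_in[-8:]
    return today > factor * median(history)
```

Note that `heartbeat_missed` has to run on a schedule of its own: a dead soul emits no log line to trigger the check, which is exactly why this alert matters.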
The big idea: a long-running agent without observability is a long-running mystery. Wire logs, traces, and the soul timeline before you trust a soul with anything that matters.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-openclaw-ops-observability-creators
What is the main idea of "Observability: Logs, Traces, And Soul Timelines"?
Which concept is most central to "Observability: Logs, Traces, And Soul Timelines"?
Which use of AI fits this topic best?
What should a careful learner remember about "Logs are not optional even at hobby scale"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about structured logs be treated?
Name one way to verify an AI answer about structured logs.
Which action would help you apply "Observability: Logs, Traces, And Soul Timelines" responsibly?