System overview
Dunetrace is a pipeline of five independent services communicating through a shared Postgres database. Each service does one job.
```
Agent Code
 └─▶ Dunetrace SDK (hashes content → ingest events + OTel spans)
      ├─▶ Ingest API (POST /v1/ingest → Postgres, returns 202)
      │     ├─▶ Detector (poll → RunState → 15 detectors → signals)
      │     ├─▶ Alerts (poll → explain → Slack / webhook)
      │     └─▶ Customer API (runs, signals, explanations → dashboard)
      ├─▶ stdout NDJSON (emit_as_json=True → Loki / Grafana Alloy)
      └─▶ OTel exporter (otel_exporter=… → Tempo / Honeycomb / Datadog)
```
Services
Ingest API · port 8001
The entry point for all SDK traffic. Its only job is to accept events as fast as possible and not lose them. Validates the schema, authenticates via the `api_keys` table, returns 202 Accepted before touching the database, then writes in a background task.
Why the 202 before writing? Your agent is waiting. Round-trip latency should be as short as possible. Validation is synchronous; persistence is async.
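A minimal sketch of that pattern using FastAPI's `BackgroundTasks`; the batch shape and the `is_valid_key` helper are illustrative stand-ins, not the real implementation:

```python
from fastapi import BackgroundTasks, FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

class IngestBatch(BaseModel):       # illustrative event-batch shape
    batch_id: str
    events: list[dict]

def is_valid_key(header_value: str) -> bool:
    # Hypothetical stand-in for the api_keys lookup.
    return header_value.startswith("Bearer ")

def persist_batch(batch: IngestBatch) -> None:
    # Runs after the response is sent; the real task INSERTs into events.
    ...

@app.post("/v1/ingest", status_code=202)
async def ingest(
    batch: IngestBatch,                        # schema validation is synchronous
    background: BackgroundTasks,
    authorization: str = Header(...),
) -> dict:
    if not is_valid_key(authorization):
        raise HTTPException(status_code=401)
    background.add_task(persist_batch, batch)  # persistence is deferred
    return {"status": "accepted"}              # the 202 goes out first
```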
Detector worker
Background polling loop, every 5 seconds. The only process that runs detection logic.
- Fetches runs completed since the last poll, plus runs stalled longer than 90 s
- Skips runs already in `processed_runs`
- Reconstructs `RunState` by replaying events
- Runs 14 Tier 1 detectors against the `RunState`. `PROMPT_INJECTION_SIGNAL` is handled by the SDK on raw input, before hashing; the worker extracts the evidence from the `run.started` payload.
- Writes any `FailureSignal` rows
- UPSERTs the `issues` table for each fired signal and advances the clean-run counter; auto-resolves after 5 consecutive clean runs
- Marks the run processed
Why polling instead of streaming? A polling worker needs no message broker, survives restarts gracefully, and is trivial to reason about. At sub-100 runs/sec, 5-second latency is acceptable.
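A sketch of the loop's shape, with the database work stubbed out; the helper names are hypothetical stand-ins for the real queries:

```python
import time

POLL_INTERVAL_S = 5       # detector wakes every 5 seconds
STALL_THRESHOLD_S = 90

def fetch_candidate_runs() -> list[str]:
    """Stub: runs completed since the last poll, plus runs stalled longer
    than STALL_THRESHOLD_S, minus anything already in processed_runs."""
    return []

def process_run(run_id: str) -> None:
    """Stub: replay events into a RunState, run the Tier 1 detectors,
    write FailureSignal rows, upsert issues, mark the run processed."""

def detector_loop() -> None:
    # No broker: all state lives in Postgres, so a restart simply resumes
    # polling from whatever is still unprocessed.
    while True:
        for run_id in fetch_candidate_runs():
            process_run(run_id)
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    detector_loop()
```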
Explain layer
A library — not a service. Imported by both the alerts worker and the customer API. Takes a `FailureSignal`, returns an `Explanation` in under 1 ms. Uses deterministic string templates, not LLM calls.
Three reasons for no LLM: latency (templates are instant), cost (zero per-signal API cost), consistency (same signal → same explanation, makes testing predictable).
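A sketch of what template-based explanation looks like; the dataclass fields mirror the `failure_signals` columns, but the `LOOP_SIGNAL` type and its template text are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class FailureSignal:        # illustrative subset of the real fields
    failure_type: str
    run_id: str
    step_index: int
    confidence: float

@dataclass
class Explanation:
    summary: str

# One fixed template per failure type: same signal, same explanation.
_TEMPLATES = {
    "LOOP_SIGNAL": (        # hypothetical failure type, for illustration
        "Run {run_id} repeated step {step_index} "
        "(confidence {confidence:.2f})."
    ),
}
_FALLBACK = "{failure_type} detected at step {step_index} of run {run_id}."

def explain(signal: FailureSignal) -> Explanation:
    template = _TEMPLATES.get(signal.failure_type, _FALLBACK)
    return Explanation(summary=template.format(**vars(signal)))
```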
Alerts worker
Background polling loop, every 10 seconds. The only process that sends external notifications. Fetches unalerted signals, computes rate context concurrently, calls `explain()`, formats for Slack Block Kit or webhook JSON, POSTs with exponential backoff up to 3 attempts. Marks `alerted=TRUE` only after at least one destination succeeds, using `(run_id, failure_type, detected_at)` as the idempotency key.
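The retry shape, sketched with `requests`; destination handling is simplified and the real worker computes rate context and formats Block Kit payloads first:

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, attempts: int = 3) -> bool:
    """POST with exponential backoff: waits 1 s, then 2 s, between attempts."""
    for attempt in range(attempts):
        try:
            if requests.post(url, json=payload, timeout=5).ok:
                return True
        except requests.RequestException:
            pass                      # network error counts as a failed attempt
        if attempt < attempts - 1:
            time.sleep(2 ** attempt)  # 1 s, 2 s
    return False

# Attempt every destination, then flip alerted=TRUE only if any succeeded:
# delivered = [post_with_backoff(url, body) for url in destinations]
# mark_alerted(signal) if any(delivered) else leave for the next poll
```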
Customer API · port 8002
Read-only FastAPI service. Powers the dashboard and any customer integrations. All endpoints require `Authorization: Bearer <api_key>`, except in `AUTH_MODE=dev`. Signal responses include the full explanation inline.
| Endpoint | Purpose |
|---|---|
| `GET /v1/agents` | List agents with run counts, signal counts, failure breakdown |
| `GET /v1/agents/{id}/runs` | Paginated run list — summary only |
| `GET /v1/agents/{id}/signals` | Signals with explanations; filters: `severity`, `failure_type`, `include_shadow` |
| `GET /v1/agents/{id}/insights` | Aggregates — input patterns, daily trends, `failure_rates`, `systemic_patterns` |
| `GET /v1/agents/{id}/issues` | Open/resolved issues per (agent, failure_type). Accepts optional `status` filter (`open`, `resolved`, `reopened`) |
| `GET /v1/runs/{id}` | Full run — metadata, events, signals |
| `POST /v1/signals/{id}/explain` | Fetch Langfuse trace, run LLM analysis, return `root_cause`, `fix_content`, `fix_type`, `apply_blocked`. Requires `LANGFUSE_*` and `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` |
| `POST /v1/signals/{id}/apply-fix` | Append `fix_content` to the named Langfuse prompt and publish a new version. Blocked for `PROMPT_INJECTION_SIGNAL` (returns 403) |
| `POST /v1/signals/{id}/record-copy` | Record a clipboard-path fix in the `fixes` table without writing to Langfuse |
| `GET /v1/signals/{id}/fix-status` | Return fix history and recurrence verdict (`verified` / `likely_fixed` / `still_occurring` / `insufficient_data`) |
| `GET /health` | Service health check — returns `{"status":"ok","db":"ok"}` |
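A quick illustration of calling the API with Python `requests`; the agent id, key, and `severity` value are placeholders:

```python
import requests

BASE = "http://localhost:8002"                        # Customer API port
HEADERS = {"Authorization": "Bearer dt_example_key"}  # placeholder key

# List agents with run counts, signal counts, and failure breakdown.
agents = requests.get(f"{BASE}/v1/agents", headers=HEADERS).json()

# Signals for one agent, explanations inline; "agent-123" and the
# severity value are placeholders, not documented values.
signals = requests.get(
    f"{BASE}/v1/agents/agent-123/signals",
    headers=HEADERS,
    params={"severity": "critical"},
).json()
```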
Dashboard · port 3000
A single-page HTML app served by nginx. No build step — plain HTML/CSS/JS fetching from the Customer API. Auto-refreshes every 15 seconds. All data is computed client-side.
SDK output modes
Three independent output paths that can be combined:
| Mode | How to enable | Destination |
|---|---|---|
| HTTP ingest (default) | `endpoint="http://…"` | Ingest API → Postgres → Detector |
| Loki NDJSON | `emit_as_json=True` | stdout → Promtail/Alloy → Loki |
| OTel spans | `otel_exporter=DunetraceOTelExporter(provider)` | OTel collector → Tempo / Honeycomb / Datadog |
All three can be active at once. OTel and NDJSON are zero-cost when disabled. Pass `endpoint=None` for OTel-only or Loki-only deployments.
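A hypothetical initialization combining all three paths; the import path and constructor name are assumptions, but the parameter names match the table above:

```python
from opentelemetry.sdk.trace import TracerProvider
from dunetrace import Dunetrace, DunetraceOTelExporter  # assumed import path

provider = TracerProvider()

dt = Dunetrace(                                      # constructor name assumed
    endpoint="http://localhost:8001",                # HTTP ingest -> Ingest API
    emit_as_json=True,                               # NDJSON on stdout -> Loki
    otel_exporter=DunetraceOTelExporter(provider),   # spans -> OTel collector
)

# OTel-only / Loki-only: disable the HTTP path entirely.
dt_local = Dunetrace(endpoint=None, emit_as_json=True)
```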
Database schema
Seven tables. By design, no column can store raw prompt or output content.
```sql
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY,
batch_id TEXT NOT NULL,
event_type TEXT NOT NULL,
run_id TEXT NOT NULL,
agent_id TEXT NOT NULL,
agent_version TEXT NOT NULL,
step_index INTEGER NOT NULL,
timestamp DOUBLE PRECISION NOT NULL,
payload JSONB NOT NULL,
parent_run_id TEXT,
received_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE failure_signals (
id BIGSERIAL PRIMARY KEY,
failure_type TEXT NOT NULL,
severity TEXT NOT NULL,
run_id TEXT NOT NULL,
agent_id TEXT NOT NULL,
agent_version TEXT NOT NULL,
step_index INTEGER NOT NULL,
confidence REAL NOT NULL,
evidence JSONB NOT NULL,
detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
shadow BOOLEAN NOT NULL DEFAULT TRUE,
alerted BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE TABLE processed_runs (…);
CREATE TABLE api_keys (…);
CREATE TABLE issues (…);
CREATE TABLE digest_log (…);
CREATE TABLE fixes (
id BIGSERIAL PRIMARY KEY,
run_id TEXT NOT NULL,
signal_id BIGINT NOT NULL,
fix_content TEXT NOT NULL,
fix_type TEXT NOT NULL DEFAULT 'prompt_addition',
applied_via TEXT NOT NULL, -- 'langfuse' or 'clipboard'
langfuse_prompt_name TEXT,
langfuse_version INTEGER,
applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
Performance
| Component | Latency | Throughput |
|---|---|---|
| SDK `_emit()` | <1 μs | Millions/sec |
| SDK drain thread | 200 ms idle poll | 100 events/batch |
| Ingest API (202) | ~5 ms | ~1,000 req/sec |
| Detector poll cycle | 5 s | ~100 runs/cycle |
| Explain layer | <1 ms | synchronous |
| Alerts poll cycle | 10 s | 50 signals/cycle |
| Customer API | ~10 ms | ~500 req/sec |
Agent overhead: under 500 μs per run with default HTTP ingest. The drain thread runs entirely in the background. Even under backpressure (ingest API down), the ring buffer drops the oldest events rather than blocking the agent.
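The drop-oldest behavior can be pictured as a bounded deque; this is a sketch rather than the SDK's actual buffer, with the 10,000 capacity taken from the failure-mode notes below and the batch size from the table above:

```python
from collections import deque

BUFFER_CAPACITY = 10_000          # SDK buffer size noted under failure modes

_buffer: deque = deque(maxlen=BUFFER_CAPACITY)

def buffer_event(event: dict) -> None:
    # A bounded deque evicts its oldest entry when full, so this call
    # never blocks the agent thread, even if nothing is draining.
    _buffer.append(event)

def drain(batch_size: int = 100) -> list[dict]:
    # The drain thread pops from the front, oldest first.
    batch = []
    while _buffer and len(batch) < batch_size:
        batch.append(_buffer.popleft())
    return batch
```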
Failure modes
- Ingest API down — drain thread drops unshippable events. Agent never blocks. Events during outage are lost.
- Detector worker down — runs queue up. When the worker restarts, it catches up. Signals delayed but not lost.
- Postgres down — ingest returns 503. SDK buffers up to 10,000 events, then rolls. Observability data loss is acceptable during DB outages.
- Alerts worker down — signals accumulate as `alerted=FALSE`. On restart, delivery resumes; delivery is at-least-once.