Architecture

Five independent services, one Postgres, one static dashboard. The SDK hashes content in-process, ingest writes events, the detector replays them, the explain layer turns each signal into a plain-English explanation, and the alerts worker ships notifications to Slack.

System overview

Dunetrace is a pipeline of five independent services communicating through a shared Postgres database. Each service does one job.

Agent Code
  └─▶ Dunetrace SDK       (hashes content → ingest events + OTel spans)
        ├─▶ Ingest API    (POST /v1/ingest → Postgres, returns 202)
        │       ├─▶ Detector     (poll → RunState → 15 detectors → signals)
        │       ├─▶ Alerts       (poll → explain → Slack / webhook)
        │       └─▶ Customer API (runs, signals, explanations → dashboard)
        ├─▶ stdout NDJSON (emit_as_json=True → Loki / Grafana Alloy)
        └─▶ OTel exporter (otel_exporter=… → Tempo / Honeycomb / Datadog)

Services

Ingest API · port 8001

The entry point for all SDK traffic. Its only job is to accept events as fast as possible and not lose them. It validates the schema, authenticates the request against the api_keys table, returns 202 Accepted before touching the database, then writes in a background task.

Why the 202 before writing? Your agent is waiting. Round-trip latency should be as short as possible. Validation is synchronous; persistence is async.
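
A minimal sketch of that accept-then-persist shape, assuming FastAPI; validate_batch() and write_batch() are stand-ins for the real schema/auth check and the Postgres insert:

# Sketch of the 202-before-persist pattern. FastAPI is assumed;
# validate_batch() and write_batch() are hypothetical stand-ins.
from fastapi import BackgroundTasks, FastAPI, HTTPException

app = FastAPI()

def validate_batch(batch: dict) -> bool:
    # Hypothetical: schema validation + api_keys lookup would live here.
    return "batch_id" in batch and "events" in batch

def write_batch(batch: dict) -> None:
    # Hypothetical: the INSERT INTO events happens here, after the 202 is sent.
    ...

@app.post("/v1/ingest", status_code=202)
async def ingest(batch: dict, background: BackgroundTasks) -> dict:
    if not validate_batch(batch):              # synchronous: reject bad payloads immediately
        raise HTTPException(status_code=422, detail="invalid batch")
    background.add_task(write_batch, batch)    # asynchronous: persistence after the response
    return {"status": "accepted"}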

Detector worker

Background polling loop, every 5 seconds. The only process that runs detection logic.

  1. Fetches runs completed since last poll plus runs stalled longer than 90s
  2. Skips runs already in processed_runs
  3. Reconstructs RunState by replaying events
  4. Runs the 14 Tier 1 detectors that operate on the RunState. The 15th signal type, PROMPT_INJECTION_SIGNAL, is detected by the SDK on raw input before hashing; the worker only extracts its evidence from the run.started payload.
  5. Writes any FailureSignal rows
  6. UPSERTs the issues table for each fired signal and advances the clean-run counter; auto-resolves after 5 consecutive clean runs
  7. Marks the run processed

Why polling instead of streaming? A polling worker needs no message broker, survives restarts gracefully, and is trivial to reason about. At sub-100 runs/sec, 5-second latency is acceptable.
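
Step 3 is the interesting part: the RunState is just a fold over the run's events in step order. A minimal sketch, with an illustrative RunState shape and the run.completed event name assumed:

# Event replay sketch: rebuild a RunState by folding the run's events in step
# order. Field names follow the events table; the RunState shape is illustrative.
from dataclasses import dataclass, field

@dataclass
class RunState:
    run_id: str
    started: bool = False
    completed: bool = False
    steps: list = field(default_factory=list)   # one payload per step event

def replay(events: list[dict]) -> RunState:
    events = sorted(events, key=lambda e: (e["step_index"], e["timestamp"]))
    state = RunState(run_id=events[0]["run_id"])
    for e in events:
        if e["event_type"] == "run.started":
            state.started = True
        elif e["event_type"] == "run.completed":   # event name assumed
            state.completed = True
        else:
            state.steps.append(e["payload"])
    return state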

Explain layer

A library — not a service. Imported by both the alerts worker and the customer API. Takes a FailureSignal, returns an Explanation in under 1 ms. Uses deterministic string templates, not LLM calls.

Three reasons for no LLM: latency (templates are instant), cost (zero per-signal API cost), consistency (same signal → same explanation, makes testing predictable).
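
A minimal sketch of the template approach; the fields mirror the failure_signals table, but the template text is illustrative, not the shipped copy:

# Deterministic, template-based explanations: same signal in, same string out.
from dataclasses import dataclass

@dataclass
class FailureSignal:
    failure_type: str
    severity: str
    run_id: str
    step_index: int
    confidence: float

TEMPLATES = {
    "PROMPT_INJECTION_SIGNAL": (
        "Run {run_id} looks like a prompt-injection attempt at step {step_index} "
        "(confidence {confidence:.0%}). Review the input source for that step."
    ),
}

def explain(signal: FailureSignal) -> str:
    template = TEMPLATES.get(signal.failure_type, "Run {run_id} fired {failure_type}.")
    return template.format(**vars(signal))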

Alerts worker

Background polling loop, every 10 seconds. The only process that sends external notifications. Fetches unalerted signals, computes rate context concurrently, calls explain(), formats for Slack Block Kit or webhook JSON, POSTs with exponential backoff up to 3 attempts. Marks alerted=TRUE only after at least one destination succeeds.

At-least-once delivery. If the worker crashes between sending and marking, the signal re-sends on restart. Receivers should treat (run_id, failure_type, detected_at) as the idempotency key.
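
On the receiving side, deduplication against that key can be as simple as the sketch below; the in-memory set stands in for whatever store a real receiver would use:

# Receiver-side deduplication for at-least-once delivery, keyed on the
# recommended (run_id, failure_type, detected_at) tuple.
seen: set[tuple[str, str, str]] = set()

def handle_alert(alert: dict) -> None:
    key = (alert["run_id"], alert["failure_type"], alert["detected_at"])
    if key in seen:           # duplicate re-delivery after a worker restart
        return
    seen.add(key)
    process(alert)

def process(alert: dict) -> None:
    # Hypothetical downstream handling.
    print("new alert:", alert["failure_type"])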

Customer API · port 8002

Read-only FastAPI service. Powers the dashboard and any customer integrations. All endpoints require Authorization: Bearer <api_key>, except in AUTH_MODE=dev. Signal responses include the full explanation inline.

Endpoint                             Purpose
GET /v1/agents                       List agents with run counts, signal counts, failure breakdown
GET /v1/agents/{id}/runs             Paginated run list (summary only)
GET /v1/agents/{id}/signals          Signals with explanations; filters: severity, failure_type, include_shadow
GET /v1/agents/{id}/insights         Aggregates: input patterns, daily trends, failure_rates, systemic_patterns
GET /v1/agents/{id}/issues           Open/resolved issues per (agent, failure_type); optional status filter (open, resolved, reopened)
GET /v1/runs/{id}                    Full run: metadata, events, signals
POST /v1/signals/{id}/explain        Fetch Langfuse trace, run LLM analysis, return root_cause, fix_content, fix_type, apply_blocked. Requires LANGFUSE_* and ANTHROPIC_API_KEY or OPENAI_API_KEY
POST /v1/signals/{id}/apply-fix      Append fix_content to the named Langfuse prompt and publish a new version. Blocked for PROMPT_INJECTION_SIGNAL (returns 403)
POST /v1/signals/{id}/record-copy    Record a clipboard-path fix in the fixes table without writing to Langfuse
GET /v1/signals/{id}/fix-status      Return fix history and recurrence verdict (verified / likely_fixed / still_occurring / insufficient_data)
GET /health                          Service health check; returns {"status":"ok","db":"ok"}
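
For example, pulling high-severity signals for one agent; the base URL, API key, agent id, and filter values are placeholders, and the exact response shape is not shown here:

# Example request against the Customer API (placeholders throughout).
import requests

resp = requests.get(
    "http://localhost:8002/v1/agents/my-agent/signals",
    headers={"Authorization": "Bearer <api_key>"},
    params={"severity": "high", "include_shadow": "false"},
)
resp.raise_for_status()
print(resp.json())   # signals with their explanations inline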

Dashboard · port 3000

A single-page HTML app served by nginx. No build step — plain HTML/CSS/JS fetching from the Customer API. Auto-refreshes every 15 seconds. All data is computed client-side.

SDK output modes

Three independent output paths that can be combined:

Mode                    How to enable                                   Destination
HTTP ingest (default)   endpoint="http://…"                             Ingest API → Postgres → Detector
Loki NDJSON             emit_as_json=True                               stdout → Promtail/Alloy → Loki
OTel spans              otel_exporter=DunetraceOTelExporter(provider)   OTel collector → Tempo / Honeycomb / Datadog

All three can be active at once. OTel and NDJSON are zero-cost when disabled. Pass endpoint=None for OTel-only or Loki-only deployments.
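
Enabling all three at once might look like the sketch below. The client constructor and import path are assumptions; the three keyword arguments are the documented knobs.

# Sketch of enabling all three output paths at once. The Dunetrace client name
# and import path are assumed; the keyword arguments are the documented ones.
from opentelemetry.sdk.trace import TracerProvider
from dunetrace import Dunetrace, DunetraceOTelExporter   # import path assumed

provider = TracerProvider()

client = Dunetrace(
    endpoint="http://localhost:8001",                 # HTTP ingest → Postgres → Detector
    emit_as_json=True,                                # NDJSON on stdout → Promtail/Alloy → Loki
    otel_exporter=DunetraceOTelExporter(provider),    # spans → Tempo / Honeycomb / Datadog
)

# OTel-only or Loki-only: disable HTTP ingest entirely.
# client = Dunetrace(endpoint=None, emit_as_json=True)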

Database schema

Seven tables. No column stores raw prompt or output content; the SDK hashes content before it ever leaves the process.

CREATE TABLE events (
    id             BIGSERIAL PRIMARY KEY,
    batch_id       TEXT             NOT NULL,
    event_type     TEXT             NOT NULL,
    run_id         TEXT             NOT NULL,
    agent_id       TEXT             NOT NULL,
    agent_version  TEXT             NOT NULL,
    step_index     INTEGER          NOT NULL,
    timestamp      DOUBLE PRECISION NOT NULL,
    payload        JSONB            NOT NULL,
    parent_run_id  TEXT,
    received_at    TIMESTAMPTZ      NOT NULL DEFAULT NOW()
);

CREATE TABLE failure_signals (
    id             BIGSERIAL PRIMARY KEY,
    failure_type   TEXT        NOT NULL,
    severity       TEXT        NOT NULL,
    run_id         TEXT        NOT NULL,
    agent_id       TEXT        NOT NULL,
    agent_version  TEXT        NOT NULL,
    step_index     INTEGER     NOT NULL,
    confidence     REAL        NOT NULL,
    evidence       JSONB       NOT NULL,
    detected_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    shadow         BOOLEAN     NOT NULL DEFAULT TRUE,
    alerted        BOOLEAN     NOT NULL DEFAULT FALSE
);

CREATE TABLE processed_runs (…);
CREATE TABLE api_keys (…);
CREATE TABLE issues (…);
CREATE TABLE digest_log (…);

CREATE TABLE fixes (
    id                    BIGSERIAL    PRIMARY KEY,
    run_id                TEXT         NOT NULL,
    signal_id             BIGINT       NOT NULL,
    fix_content           TEXT         NOT NULL,
    fix_type              TEXT         NOT NULL DEFAULT 'prompt_addition',
    applied_via           TEXT         NOT NULL,   -- 'langfuse' or 'clipboard'
    langfuse_prompt_name  TEXT,
    langfuse_version      INTEGER,
    applied_at            TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

Performance

Component             Latency            Throughput
SDK _emit()           <1 μs              Millions/sec
SDK drain thread      200 ms idle poll   100 events/batch
Ingest API (202)      ~5 ms              ~1,000 req/sec
Detector poll cycle   5 s                ~100 runs/cycle
Explain layer         <1 ms              synchronous
Alerts poll cycle     10 s               50 signals/cycle
Customer API          ~10 ms             ~500 req/sec

Agent overhead: under 500 μs per run with default HTTP ingest. The drain thread runs entirely in the background. Even under backpressure (ingest API down), the ring buffer drops the oldest events rather than blocking the agent.
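
The drop-oldest behavior is the same as a bounded deque; a minimal illustration (the 10,000-event capacity is the one quoted in the failure-modes notes below):

# Drop-oldest buffering as a bounded deque: when the buffer is full, appending
# a new event silently evicts the oldest one instead of blocking the agent.
from collections import deque

BUFFER_CAPACITY = 10_000
buffer: deque[dict] = deque(maxlen=BUFFER_CAPACITY)

for i in range(BUFFER_CAPACITY + 5):
    buffer.append({"event": i})   # never blocks; the oldest 5 events are dropped

print(len(buffer), buffer[0])     # 10000 {'event': 5}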

Failure modes

  • Ingest API down — drain thread drops unshippable events. Agent never blocks. Events during outage are lost.
  • Detector worker down — runs queue up. When the worker restarts, it catches up. Signals delayed but not lost.
  • Postgres down — ingest returns 503. The SDK buffers up to 10,000 events, then drops the oldest (ring-buffer behavior). Observability data loss is acceptable during DB outages.
  • Alerts worker down — signals accumulate as alerted=FALSE. On restart, delivery resumes. At-least-once.