## Pages
Live at localhost:3000. Auto-refreshes every 15 seconds.
| Page | What it shows |
|---|---|
| Overview | Four stat cards (Critical / High / Signals / Runs) with trend deltas (1h / 24h / 7d). Risk Trend 24-hour bar chart. Token Waste Drift panel — 24h sparkline of wasted tokens vs 7-day baseline, dashed baseline line, WARNING ZONE badge when 24h exceeds baseline by >20%. Failure Posture gauge — half-circle SVG with needle at avg confidence, daily signals, avg confidence, false positive rate. Top Failure Drivers, Agent Signal Drift, live run feed. |
| All Runs | Full run table. Click any row to open run detail. |
| Alerts | Signals grouped by failure type with per-run confidence and token estimates. Shadow signals rendered below with dashed border + SHADOW badge. |
| Analytics | Estimated token cost saved this week (configurable $/1k). Cross-agent totals, top failure patterns, per-agent breakdown. |
| Risk Heatmap | Failure type × agent intensity grid. |
| Agents | Per-agent health cards — failure rate %, dominant pattern, run / critical / high counts, last seen, ungraduated shadow signal count. Each card shows an Agent Health Score badge (0–100, colour-coded green/amber/red) powered by GET /v1/agents/{id}/health-score. Each card links to a Health Record panel with 30-day per-failure-type rates, sparkline, and a SYSTEMIC badge. Clicking any failure type opens the Why is this happening? deep-dive panel. |
| Compare Runs | Side-by-side comparison. Select any two runs — metrics, signals, and max confidence shown in both panels with a colour-coded delta table (new / resolved failure types highlighted). |
| Detectors | Threshold sliders and alert-level selector. Live review panel: "with current config, N of M past runs would be flagged HIGH or above" — recomputes on every change. |
| Policies | Create, edit, toggle, and delete runtime guardrails. Each row shows trigger, operator, threshold, action type, and enabled state. One-click example templates for "cap tool calls", "cost cap", and "loop fix". Policies saved here are fetched automatically by the SDK within 60 seconds. |
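The Token Waste Drift panel's WARNING ZONE badge on the Overview page follows a simple rule: the badge appears when the 24-hour total exceeds the 7-day baseline by more than 20%. A minimal sketch of that check (the dashboard's exact comparison logic is assumed, not confirmed):

```typescript
// WARNING ZONE check for the Token Waste Drift panel.
// Assumption: "exceeds the baseline by >20%" means strictly greater
// than 1.2× the 7-day baseline value.
function isWarningZone(last24hTokens: number, baseline7d: number): boolean {
  return last24hTokens > baseline7d * 1.2;
}
```

At exactly 120% of baseline the badge stays off; only a strict excess beyond 20% triggers it under this reading.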
## Why is this happening?
Clicking any failure type in the Signal Breakdown or Systemic Patterns sidebar opens a cross-run deep-dive panel. Click the same item again or ✕ to dismiss.
| Section | What it answers |
|---|---|
| Overview | Affected runs / total runs, rate, avg confidence, severity breakdown, first and last seen |
| Fires at step | P25 / P50 / P75 / avg step index — answers "does this happen early or late in runs?" |
| Evidence patterns | Aggregated detector evidence: loop counts, token growth, RAG top scores, stall steps |
| Co-occurs with | Other failure types that fire in the same runs, ranked by co-occurrence rate |
| 14-day trend | Daily sparkline of affected_runs / rate — is this getting worse, better, or stable? |
| Highest confidence runs | Five runs with highest confidence for this failure type — each row opens run detail |
Powered by GET /v1/agents/{agent_id}/failure-patterns/{failure_type}.
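The "Fires at step" row reports P25 / P50 / P75 / avg over the step indices where the failure type fired. A sketch of that computation (the percentile convention here is an assumption; the API may use a different interpolation):

```typescript
// P25/P50/P75/avg over the step indices at which a failure type fired.
// Uses a simple index-based percentile; other conventions interpolate.
function stepStats(stepIndices: number[]) {
  const s = [...stepIndices].sort((a, b) => a - b);
  const pct = (p: number) =>
    s[Math.min(s.length - 1, Math.floor((p / 100) * s.length))];
  const avg = s.reduce((a, b) => a + b, 0) / s.length;
  return { p25: pct(25), p50: pct(50), p75: pct(75), avg };
}
```

A low P75 suggests the failure fires early in runs; a high P25 suggests it only appears late.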
## Run detail
Click any run row to open the detail panel. Three tabs:
- Analysis — step timeline, signal score cards with confidence bars, plain-English explanation + suggested fix. When Langfuse credentials are configured, an Explain with Langfuse ↗ button calls POST /v1/signals/{id}/explain for a root-cause explanation and optional prompt fix.
- Run graph — SVG node graph: green = LLM, orange = tool (ok), red = looping tool call, blue = start/end. Loop clusters highlighted with a dashed red outline.
- Event log — every event in order, expandable to full payload. Content fields shown as SHA-256 hashes.
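The Event log replaces content fields with SHA-256 hashes. One way this redaction could look, sketched in Node (the field names `content` and `prompt` are illustrative, not the actual event schema):

```typescript
import { createHash } from "node:crypto";

// Replace string content fields with their SHA-256 hex digests before
// display. Field names here are hypothetical examples.
function redactContent(
  event: Record<string, unknown>,
  contentKeys: string[] = ["content", "prompt"],
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...event };
  for (const k of contentKeys) {
    if (typeof out[k] === "string") {
      out[k] = createHash("sha256").update(out[k] as string).digest("hex");
    }
  }
  return out;
}
```

Hashing rather than truncating keeps equal payloads comparable across events without exposing their text.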
## Stat card info buttons
| Card | Threshold |
|---|---|
| Critical | conf ≥ 0.85, or prompt injection / cascading failure regardless of confidence |
| High | conf ≥ 0.70 — tool loops, retry storms, context bloat |
| Signals | All four levels: CRITICAL ≥ 0.85 · HIGH ≥ 0.70 · MEDIUM ≥ 0.50 · LOW < 0.50 |
| Total runs | Processed runs counted within one 5s detector poll |
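The cutoffs in the table above can be expressed as one mapping. A sketch (the failure-type identifiers `prompt_injection` and `cascading_failure` are assumed names, not confirmed API values):

```typescript
type Level = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

// Severity mapping implied by the stat-card table: prompt injection and
// cascading failure are CRITICAL regardless of confidence; everything
// else buckets by confidence.
function severity(conf: number, failureType?: string): Level {
  if (failureType === "prompt_injection" || failureType === "cascading_failure") {
    return "CRITICAL";
  }
  if (conf >= 0.85) return "CRITICAL";
  if (conf >= 0.7) return "HIGH";
  if (conf >= 0.5) return "MEDIUM";
  return "LOW";
}
```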
## Token waste estimates
Token waste across the dashboard is computed client-side from run step_count using a fixed estimate of 250 tokens per step. Dollar costs use a configurable rate — default $0.010/1k tokens, editable on the Analytics page. These are approximations.
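The estimate above reduces to two constants and a multiplication:

```typescript
const TOKENS_PER_STEP = 250; // fixed client-side estimate
const DEFAULT_RATE_PER_1K = 0.01; // $/1k tokens, editable on the Analytics page

// Token and dollar waste estimated from a run's step_count.
function estimateWaste(stepCount: number, ratePer1k = DEFAULT_RATE_PER_1K) {
  const tokens = stepCount * TOKENS_PER_STEP;
  return { tokens, dollars: (tokens / 1000) * ratePer1k };
}
```

So a 40-step run is estimated at 10,000 wasted tokens, or about $0.10 at the default rate.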
## Data sources
| Page | API calls |
|---|---|
| Overview, Alerts, Analytics, Heatmap, Agents | GET /v1/agents + per-agent /runs + /signals?include_shadow=true |
| All Runs, Compare Runs | Same cached data, no extra calls |
| Run detail | GET /v1/runs/{id} (events + signals) |
| Agent view (health record + runs) | GET /v1/agents/{id}/runs + /signals + /insights + /health-score |
| Why is this happening? panel | GET /v1/agents/{agent_id}/failure-patterns/{failure_type} |
| Detectors | Static — edits require updating detectors.yml and restarting the detector service |
| Policies | GET /v1/policies + POST + PUT /{id} + DELETE /{id} + PATCH /{id}/toggle |
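The Policies row lists five HTTP operations. They can be summarised as request descriptors, sketched below (the endpoint paths come from the table; payload shapes are not specified here, so `body` is left opaque):

```typescript
interface Req {
  method: "GET" | "POST" | "PUT" | "DELETE" | "PATCH";
  url: string;
  body?: unknown;
}

// Request descriptors for the Policies CRUD surface listed above.
const policies = {
  list: (): Req => ({ method: "GET", url: "/v1/policies" }),
  create: (body: unknown): Req => ({ method: "POST", url: "/v1/policies", body }),
  update: (id: string, body: unknown): Req => ({ method: "PUT", url: `/v1/policies/${id}`, body }),
  remove: (id: string): Req => ({ method: "DELETE", url: `/v1/policies/${id}` }),
  toggle: (id: string): Req => ({ method: "PATCH", url: `/v1/policies/${id}/toggle` }),
};
```

Since the SDK polls for saved policies within 60 seconds, a toggle issued here takes effect without restarting agents.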