## Slack

Add to `.env`:

```
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
SLACK_CHANNEL=#agent-alerts
SLACK_MIN_SEVERITY=HIGH   # LOW | MEDIUM | HIGH | CRITICAL
```
Get a webhook URL from api.slack.com/messaging/webhooks, then restart the alerts worker:

```
docker compose up -d --force-recreate alerts
```
Each Slack alert includes the failure type, severity, what happened, why it matters, a concrete code fix targeted at the specific failure pattern, and a one-line rate context summary.
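Delivering to an incoming webhook is a single JSON POST. Below is a minimal sketch, not Dunetrace's actual message formatting; the `alert` fields are hypothetical stand-ins for a detected signal:

```python
import os

import requests  # third-party; pip install requests

# Hypothetical field names -- the real worker derives these from a detected signal.
alert = {
    "failure_type": "TOOL_LOOP",
    "severity": "HIGH",
    "what": "web_search called repeatedly with identical arguments",
}

# Slack incoming webhooks accept a minimal {"text": ...} JSON body.
resp = requests.post(
    os.environ["SLACK_WEBHOOK_URL"],
    json={"text": f"[{alert['severity']}] {alert['failure_type']}: {alert['what']}"},
    timeout=10,
)
resp.raise_for_status()
```

Incoming webhooks also accept Block Kit `blocks` in place of `text` for richer layouts.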
### Rate context
The rate context line appears above the "What happened" block:
| Condition | Message |
|---|---|
| First time in last 7 days | ℹ First occurrence of this pattern in the last 7 days |
| Recurring but not systemic (<10%) | 📊 2/25 runs affected (8%) in the last 7 days |
| Systemic (≥10%) | ⚠ Systemic pattern — 8/12 runs affected (67%) |
Rate context is computed per (agent_id, failure_type) at alert time. If the lookup fails (for example, under DB contention), the alert is still delivered, just without the rate context line.
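As a sketch of how such a rate could be computed, assuming illustrative `runs` and `signals` tables (the table and column names here are assumptions, not the shipped schema):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def rate_context(db: sqlite3.Connection, agent_id: str, failure_type: str) -> str:
    """Build the rate context line for one (agent_id, failure_type) pair."""
    since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
    # Total runs for this agent in the window (assumed runs table).
    total = db.execute(
        "SELECT COUNT(DISTINCT run_id) FROM runs "
        "WHERE agent_id = ? AND started_at >= ?",
        (agent_id, since),
    ).fetchone()[0]
    # Runs that produced this failure type in the window (assumed signals table).
    affected = db.execute(
        "SELECT COUNT(DISTINCT run_id) FROM signals "
        "WHERE agent_id = ? AND failure_type = ? AND detected_at >= ?",
        (agent_id, failure_type, since),
    ).fetchone()[0]
    if affected <= 1:
        return "ℹ First occurrence of this pattern in the last 7 days"
    pct = round(100 * affected / total) if total else 0
    if pct >= 10:  # systemic threshold from the table above
        return f"⚠ Systemic pattern — {affected}/{total} runs affected ({pct}%)"
    return f"📊 {affected}/{total} runs affected ({pct}%) in the last 7 days"
```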
## Generic webhook

Works with PagerDuty, Linear, or any HTTP endpoint that accepts JSON.

```
WEBHOOK_URL=https://your-endpoint.example.com/alerts
WEBHOOK_SECRET=your-hmac-secret   # optional
```
When `WEBHOOK_SECRET` is set, each request includes an `X-Dunetrace-Signature` header containing `HMAC-SHA256(body, secret)`.
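A receiver can verify the header by recomputing the HMAC over the raw request body. One assumption in this sketch: that the header carries the hex digest rather than base64.

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, secret: str, header_value: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    # Assumption: the header value is the lowercase hex digest.
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking a mismatch position via timing.
    return hmac.compare_digest(expected, header_value)
```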
Example payload:

```json
{
  "schema_version": "1.0",
  "event": "signal.detected",
  "run_id": "...",
  "agent_id": "...",
  "failure_type": "TOOL_LOOP",
  "severity": "HIGH",
  "confidence": 0.95,
  "evidence": { "tool": "web_search", "count": 6, ... },
  "explanation": {
    "title": "...",
    "what": "...",
    "why_it_matters": "...",
    "suggested_fixes": [
      { "description": "...", "language": "python", "code": "..." }
    ]
  }
}
```
## Delivery guarantees
The alerts worker polls every 10 seconds for unalerted signals (`shadow = FALSE AND alerted = FALSE`). For each one it calls the explainer, formats the payload, and POSTs with exponential backoff, up to 3 attempts. A signal is marked `alerted = TRUE` only after at least one destination succeeds. Deduplication uses (run_id, failure_type, detected_at) as the key.
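A sketch of that retry loop; `deliver` and `destinations` are hypothetical names, and the 1s/2s backoff schedule is an assumption (only the attempt count is specified above):

```python
import time

import requests

MAX_ATTEMPTS = 3

def deliver(url: str, payload: dict) -> bool:
    """POST one payload with exponential backoff across up to 3 attempts."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = requests.post(url, json=payload, timeout=10)
            resp.raise_for_status()
            return True
        except requests.RequestException:
            if attempt < MAX_ATTEMPTS - 1:
                time.sleep(2 ** attempt)  # assumed schedule: 1s, then 2s
    return False

# A list comprehension (not a generator) tries every destination;
# alerted=TRUE is then set if at least one of them accepted the payload.
# delivered = any([deliver(url, payload) for url in destinations])
```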
## Weekly digest
A Monday 9am UTC summary covering the past 7 days:
- Top 5 failure types by affected run count (see the query sketch after this list)
- Top 5 agents by signal volume with dominant failure type
- Systemic patterns — failure types affecting ≥10% of runs per agent
- Issues opened and resolved this week
- Dashboard button
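As an illustration, the top-failure-types bullet corresponds to a grouped count over the hypothetical `signals` table from the rate context sketch; this query is a guess at the shape, not the shipped one:

```python
# Top 5 failure types by distinct affected runs in the window.
# Bind :since to a timestamp 7 days ago when executing.
TOP_FAILURE_TYPES_SQL = """
    SELECT failure_type, COUNT(DISTINCT run_id) AS affected_runs
    FROM signals
    WHERE detected_at >= :since
    GROUP BY failure_type
    ORDER BY affected_runs DESC
    LIMIT 5
"""
```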
Enable with:
```
DIGEST_ENABLED=true
DASHBOARD_URL=https://your-dashboard-url

# Optional overrides
DIGEST_DAY=0    # 0=Monday … 6=Sunday
DIGEST_HOUR=9   # UTC hour
```
Delivery is deduplicated via `digest_log`: if a digest was sent within the last 6 days, another will not be sent even if the worker restarts. If there were no runs in the last 7 days, the digest is skipped, but the timestamp is still logged.
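A sketch of both rules, assuming `digest_log` holds one `sent_at` timestamp per send (the schema is an assumption):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def should_send_digest(db: sqlite3.Connection) -> bool:
    """Skip if a digest was already sent within the last 6 days."""
    row = db.execute("SELECT MAX(sent_at) FROM digest_log").fetchone()
    if row and row[0]:
        last_sent = datetime.fromisoformat(row[0])
        if datetime.now(timezone.utc) - last_sent < timedelta(days=6):
            return False  # recently sent; a worker restart will not resend it
    return True

def log_digest(db: sqlite3.Connection) -> None:
    """Record the send time -- done even when the digest is skipped for lack of runs."""
    db.execute(
        "INSERT INTO digest_log (sent_at) VALUES (?)",
        (datetime.now(timezone.utc).isoformat(),),
    )
    db.commit()
```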
## Shadow mode
Shadow signals are stored and visible in the dashboard but never delivered. See the detectors page for how to graduate a detector from shadow to live.