## Slack

Add to `.env`:

```
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
SLACK_CHANNEL=#agent-alerts
SLACK_MIN_SEVERITY=HIGH   # LOW | MEDIUM | HIGH | CRITICAL
```
Get a webhook URL from api.slack.com/messaging/webhooks, then restart the alerts worker:

```
docker compose up -d --force-recreate alerts
```
Each Slack alert includes the failure type, severity, what happened, why it matters, a concrete code fix targeted at the specific failure pattern, and a one-line rate context summary.
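Delivering to an incoming webhook is a single JSON POST. Below is a minimal sketch, not Dunetrace's actual message formatting; the `alert` fields are hypothetical stand-ins for a detected signal:

```python
import os

import requests  # third-party; pip install requests

# Hypothetical field names -- the real worker derives these from a detected signal.
alert = {
    "failure_type": "TOOL_LOOP",
    "severity": "HIGH",
    "what": "web_search called repeatedly with identical arguments",
}

# Slack incoming webhooks accept a minimal {"text": ...} JSON body.
resp = requests.post(
    os.environ["SLACK_WEBHOOK_URL"],
    json={"text": f"[{alert['severity']}] {alert['failure_type']}: {alert['what']}"},
    timeout=10,
)
resp.raise_for_status()
```

Incoming webhooks also accept Block Kit `blocks` in place of `text` for richer layouts.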
### Rate context
The rate context line appears above the "What happened" block:
| Condition | Message |
|---|---|
| First time in last 7 days | ℹ First occurrence of this pattern in the last 7 days |
| Recurring but not systemic (<10%) | 📊 2/25 runs affected (8%) in the last 7 days |
| Systemic (≥10%) | ⚠ Systemic pattern — 8/12 runs affected (67%) |
Rate context is computed per (agent_id, failure_type) at alert time. If the lookup fails (for example, under DB contention), the alert is still delivered, just without the rate context line.
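As a sketch of how such a rate could be computed, assuming illustrative `runs` and `signals` tables (the table and column names here are assumptions, not the shipped schema):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def rate_context(db: sqlite3.Connection, agent_id: str, failure_type: str) -> str:
    """Build the rate context line for one (agent_id, failure_type) pair."""
    since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
    # Total runs for this agent in the window (assumed runs table).
    total = db.execute(
        "SELECT COUNT(DISTINCT run_id) FROM runs "
        "WHERE agent_id = ? AND started_at >= ?",
        (agent_id, since),
    ).fetchone()[0]
    # Runs that produced this failure type in the window (assumed signals table).
    affected = db.execute(
        "SELECT COUNT(DISTINCT run_id) FROM signals "
        "WHERE agent_id = ? AND failure_type = ? AND detected_at >= ?",
        (agent_id, failure_type, since),
    ).fetchone()[0]
    if affected <= 1:
        return "ℹ First occurrence of this pattern in the last 7 days"
    pct = round(100 * affected / total) if total else 0
    if pct >= 10:  # systemic threshold from the table above
        return f"⚠ Systemic pattern — {affected}/{total} runs affected ({pct}%)"
    return f"📊 {affected}/{total} runs affected ({pct}%) in the last 7 days"
```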
## Generic webhook

Works with PagerDuty, Linear, or any HTTP endpoint that accepts JSON.

```
WEBHOOK_URL=https://your-endpoint.example.com/alerts
WEBHOOK_SECRET=your-hmac-secret   # optional
```
When `WEBHOOK_SECRET` is set, each request includes an `X-Dunetrace-Signature` header containing `HMAC-SHA256(body, secret)`.
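A receiver can verify the header by recomputing the HMAC over the raw request body. One assumption in this sketch: that the header carries the hex digest rather than base64.

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, secret: str, header_value: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    # Assumption: the header value is the lowercase hex digest.
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking a mismatch position via timing.
    return hmac.compare_digest(expected, header_value)
```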
Example payload:

```json
{
  "schema_version": "1.0",
  "event": "signal.detected",
  "run_id": "...",
  "agent_id": "...",
  "failure_type": "TOOL_LOOP",
  "severity": "HIGH",
  "confidence": 0.95,
  "evidence": { "tool": "web_search", "count": 6, ... },
  "explanation": {
    "title": "...",
    "what": "...",
    "why_it_matters": "...",
    "suggested_fixes": [
      { "description": "...", "language": "python", "code": "..." }
    ]
  }
}
```
## Delivery guarantees
The alerts worker polls every 10 seconds for unalerted signals (`shadow = FALSE AND alerted = FALSE`). For each one it calls the explainer, formats the payload, and POSTs with exponential backoff, up to 3 attempts. A signal is marked `alerted = TRUE` only after at least one destination succeeds. Deduplication uses (run_id, failure_type, detected_at) as the key.
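A sketch of that retry loop; `deliver` and `destinations` are hypothetical names, and the 1s/2s backoff schedule is an assumption (only the attempt count is specified above):

```python
import time

import requests

MAX_ATTEMPTS = 3

def deliver(url: str, payload: dict) -> bool:
    """POST one payload with exponential backoff across up to 3 attempts."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = requests.post(url, json=payload, timeout=10)
            resp.raise_for_status()
            return True
        except requests.RequestException:
            if attempt < MAX_ATTEMPTS - 1:
                time.sleep(2 ** attempt)  # assumed schedule: 1s, then 2s
    return False

# A list comprehension (not a generator) tries every destination;
# alerted=TRUE is then set if at least one of them accepted the payload.
# delivered = any([deliver(url, payload) for url in destinations])
```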
## Weekly digest
A Monday 9am UTC summary covering the past 7 days:
- Top 5 failure types by affected run count (see the query sketch after this list)
- Top 5 agents by signal volume with dominant failure type
- Systemic patterns — failure types affecting ≥10% of runs per agent
- Issues opened and resolved this week
- Dashboard button
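As an illustration, the top-failure-types bullet corresponds to a grouped count over the hypothetical `signals` table from the rate context sketch; this query is a guess at the shape, not the shipped one:

```python
# Top 5 failure types by distinct affected runs in the window.
# Bind :since to a timestamp 7 days ago when executing.
TOP_FAILURE_TYPES_SQL = """
    SELECT failure_type, COUNT(DISTINCT run_id) AS affected_runs
    FROM signals
    WHERE detected_at >= :since
    GROUP BY failure_type
    ORDER BY affected_runs DESC
    LIMIT 5
"""
```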
Enable with:
```
DIGEST_ENABLED=true
DASHBOARD_URL=https://your-dashboard-url

# Optional overrides
DIGEST_DAY=0    # 0=Monday … 6=Sunday
DIGEST_HOUR=9   # UTC hour
```
Delivery is deduplicated via `digest_log`: if a digest was sent within the last 6 days, another will not be sent even if the worker restarts. If there were no runs in the last 7 days, the digest is skipped, but the timestamp is still logged.
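A sketch of both rules, assuming `digest_log` holds one `sent_at` timestamp per send (the schema is an assumption):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def should_send_digest(db: sqlite3.Connection) -> bool:
    """Skip if a digest was already sent within the last 6 days."""
    row = db.execute("SELECT MAX(sent_at) FROM digest_log").fetchone()
    if row and row[0]:
        last_sent = datetime.fromisoformat(row[0])
        if datetime.now(timezone.utc) - last_sent < timedelta(days=6):
            return False  # recently sent; a worker restart will not resend it
    return True

def log_digest(db: sqlite3.Connection) -> None:
    """Record the send time -- done even when the digest is skipped for lack of runs."""
    db.execute(
        "INSERT INTO digest_log (sent_at) VALUES (?)",
        (datetime.now(timezone.utc).isoformat(),),
    )
    db.commit()
```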
## Shadow mode
Shadow signals are stored and visible in the dashboard but never delivered. See the detectors page for how to graduate a detector from shadow to live.