A five-layer system that rebuilds alert replies for a mixed Slack channel of executives and engineers - each layer applies across all alert types at once, stacking a new capability on the one before it.
Requirement: alerts must either loop in the right team or deliver an actionable insight. Today they reliably do neither - replies are technically correct walls of numbers no one can act on, with no root-cause hypothesis, no memory of past fires, and no connection to internal systems.
The dashed line marks “now”. Phase 1 can be built today; Phase 5 sits last because Phases 1-4 must be stable and trusted before the bot auto-tags anyone. The Start-early rail runs underneath every band: its data accumulation begins in Phase 1 and feeds Phases 3, 4 and 5.
Goal: transform alert replies from data-dense readouts into 3-4 narrative bullets any reader can scan in 10 seconds and know what to do. Applies to A1/A1x (deposit success rate) plus one funnel-type and one operational-type alert.
| Message 1 bullet | What it contains |
|---|---|
| What happened | Plain English, one number max (the rate), compared to normal. No SE/pp unless borderline. |
| Why | Root-cause hypothesis in causal language. Name the specific provider/component; connect findings into a chain. |
| Scope | Isolated or widespread? Which sites/systems are clean? Has it recovered? |
| Action | One concrete next step with a named team (plain text, not @mention yet). |

| Message 2 section | What it contains |
|---|---|
| Session replays | 3-5 sessions hitting the failure point. Clickable PostHog replay links. |
| Activity traces | For sessions without recordings, query the event sequence for the factual failure timeline. |
| DB evidence | From Phase 2 onward: specific failed transactions, error codes, timeout logs. |
| Event links | Direct PostHog person/event links for engineers to click through. |
Message 1 tells the story. Message 2 shows the receipts. Numbers policy: at most two numbers per bullet (one in “What happened”), always with plain-English context.
:sos: SYSTEM routing/failover -- RECOVERED -- phk1 success/click 26.2% vs ~58% same-hour-7d (~30 SE), phk1 only - submit-to-success 47.9% (~75%) <- leak - providers: vcpay-ph-native 2797->6 dark; rmpay 534->2736 @16% failover-flood - scope: phk1 only (bdk1 73%, mxkzg1 63% normal) Owner: payments/platform -- phk1 routing
Deposit alert · phk1 · 17:00 (+07) | Real incident, High confidence
• What happened: Success rate dropped to 26% (normally ~58% this hour) - a significant drop, not noise.
• Why: vcpay-ph-native went dark (2,797 → 6 orders). Traffic failed over to rmpay, which handled the flood at 16% success. A single provider outage cascading through failover.
• Scope: phk1 only - bdk1 and mxkzg1 clean. No siblings firing. Recovered by 18:00.
• Action: Post-incident review with payments on vcpay reliability + rmpay failover capacity. No escalation needed.
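A reply in this shape could be assembled from a small structured record, which keeps the four bullets in a fixed order. The sketch below is illustrative only: the `Triage` class and `render_message_1` function are hypothetical names, not part of the existing bot.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    """Hypothetical container for the fields of Message 1."""
    header: str         # e.g. "Deposit alert · phk1 · 17:00 (+07)"
    verdict: str        # e.g. "Real incident, High confidence"
    what_happened: str
    why: str
    scope: str
    action: str


def render_message_1(t: Triage) -> str:
    """Return the header line followed by the four narrative bullets."""
    bullets = [
        ("What happened", t.what_happened),
        ("Why", t.why),
        ("Scope", t.scope),
        ("Action", t.action),
    ]
    lines = [f"{t.header} | {t.verdict}"]
    lines += [f"• {label}: {text}" for label, text in bullets]
    return "\n".join(lines)
```

Keeping the bullet order in code rather than in the prompt means a missing field shows up as an empty bullet instead of a silently reshuffled reply.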
Goal: connect the bot to data sources beyond PostHog so root-cause hypotheses are backed by ground-truth evidence, not just event-metric inference.
| Source | What it provides | Reply impact |
|---|---|---|
| Transaction DB | Payment status codes, provider error responses, timeout counts | “Why” cites specific error codes and timestamps |
| Operational config | Provider enabled/disabled state, routing weight changes | “Why” can attribute to config changes |
| Deploy logs | Recent deployments with timestamps and affected services | “Why” can correlate with deploys |
| User account state | KYC status, account tier, balance context | “Scope” can segment by user type |
When internal DB data contradicts PostHog metrics, internal DB wins. PostHog captures client-side events with ingestion lag; the transaction DB holds the server-side source of truth.
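The precedence rule can be sketched as a small reconciliation step. Everything here is an assumption for illustration: the function name, the `Reconciled` type, and the 2-point `tolerance` used to decide when the two sources count as contradicting each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reconciled:
    rate: float      # the success rate the reply should quote
    source: str      # which system the rate came from
    conflict: bool   # True when the two sources disagreed beyond tolerance


def reconcile_rate(
    posthog_rate: float,
    db_rate: Optional[float],
    tolerance: float = 0.02,  # assumed threshold, not from the plan
) -> Reconciled:
    """Prefer the transaction DB whenever it has a value; fall back to PostHog."""
    if db_rate is None:
        return Reconciled(posthog_rate, "posthog", False)
    conflict = abs(db_rate - posthog_rate) > tolerance
    return Reconciled(db_rate, "transaction_db", conflict)
```

Surfacing the `conflict` flag, rather than silently swapping values, would let the reply mention that the two sources disagreed.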
Goal: when multiple alerts fire in a short window, deliver one full triage on the primary incident and brief acknowledgments on the rest.
| Step | What happens |
|---|---|
| 1 · Group | Incoming alert waits a 3-5 min quiet window to collect co-firing alerts. |
| 2 · Correlate | Check sibling cascades, cross-site, persistent, and shared-upstream signals. |
| 3 · Rank | Highest user impact > root-cause indicator > broadest scope > earliest fire. |
| 4 · Respond | Full triage on primary; 1-2 line ack on secondaries linking back. |

| Edge case | Handling |
|---|---|
| Late arrival | Outside quiet window - check correlation with recent batch, ack if related. |
| No clear primary | Pick earliest fire, note “multiple alerts of equal severity”. |
| Primary false, secondary real | Note this explicitly so the real one isn’t buried. |
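The Group and Rank steps can be sketched as follows. This is a minimal illustration under stated assumptions: the `Fire` fields `user_impact` and `scope` are hypothetical numeric scores, and the quiet window is read as "a batch stays open while fires keep arriving no more than `quiet` apart", which is one possible interpretation of the 3-5 min window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Fire:
    alert_type: str
    site: str
    fired_at: datetime
    user_impact: int = 0               # hypothetical score, higher = more users affected
    root_cause_indicator: bool = False
    scope: int = 1                     # hypothetical count of affected sites/systems


def group_fires(fires, quiet=timedelta(minutes=5)):
    """Batch fires so that consecutive fires in a batch are at most `quiet` apart."""
    batches = []
    for fire in sorted(fires, key=lambda f: f.fired_at):
        if batches and fire.fired_at - batches[-1][-1].fired_at <= quiet:
            batches[-1].append(fire)
        else:
            batches.append([fire])
    return batches


def rank(batch):
    """Order by user impact, then root-cause indicator, then scope, then earliest fire."""
    return sorted(
        batch,
        key=lambda f: (-f.user_impact, not f.root_cause_indicator, -f.scope, f.fired_at),
    )
```

The first element of `rank(batch)` would receive the full triage and the rest the short acknowledgments; when every score ties, the sort falls through to the earliest fire, matching the "No clear primary" row.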
| Signal | Type | Example |
|---|---|---|
| Same site, same hour, different alerts | Sibling cascade | A1 + A3 + A4 → phk1 |
| Same alert, multiple sites | Cross-site | A1-phk1 + A1-bdk1 + A1-mxkzg1 |
| Same alert, same site, repeated | Persistent | A1-phk1 @ 15:00, 16:00, 17:00 |
| Different journeys, same site | Shared upstream | Deposit A1 + registration alert · phk1 |
| No relationship | Independent | Triage separately |
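The signal table can be read as a pairwise classifier over two fires. The sketch below assumes each fire carries a `journey` label (for example "deposit" or "registration"), which is a hypothetical field, and it checks the rows in a fixed order so that each pair maps to exactly one type.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fire:
    alert_type: str
    site: str
    journey: str        # hypothetical field, e.g. "deposit" or "registration"
    fired_at: datetime


def _hour(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def classify(a: Fire, b: Fire) -> str:
    """Return the correlation type for a pair of fires, per the signal table."""
    same_site = a.site == b.site
    same_alert = a.alert_type == b.alert_type
    if same_alert and same_site:
        return "persistent"
    if same_alert:
        return "cross-site"
    if same_site and a.journey != b.journey:
        return "shared upstream"
    if same_site and _hour(a.fired_at) == _hour(b.fired_at):
        return "sibling cascade"
    return "independent"
```

The check order is a design choice for this sketch, not something the table specifies; a real implementation might also bound how far back "persistent" looks.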
Dependencies: the Phase 1 format (the primary reply uses the narrative template) plus a stateful component that tracks recent fires within the window (KV store, Worker state, or lightweight DB).
Goal: the bot learns from its own history - remembers past verdicts and outcomes, builds context-aware baselines, and distinguishes real anomalies from normal fluctuation.
| Past pattern | Reply adjustment |
|---|---|
| False alarm 3 of last 4 fires | “Noise 3 of its last 4 fires - consider threshold tuning” |
| Same root cause 3× in 14 days | “3rd vcpay outage in 2 weeks - systemic reliability issue” |
| Prior incidents self-recovered ~40 min | “Previous vcpay outages self-recovered in ~40 min. Monitor before escalating” |
| Human marked verdict incorrect | “Note: similar incident Jun 28 was initially misdiagnosed” |
Human feedback loop: lightweight - Slack reaction (✓ correct / ✗ incorrect) or a simple command. Not a full ML pipeline.
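Two of the past-pattern rules above can be sketched as a lookup over earlier verdict records for the same alert and site. The record shape is an assumption: `fired_at` is a hypothetical timestamp field, and `"false_alarm"` is an assumed value for `verdict`.

```python
from datetime import datetime, timedelta


def history_notes(past, root_cause_category, now):
    """Build reply notes from prior verdict records (oldest first) for one alert + site.

    `past` holds dicts with "verdict", "root_cause_category" and "fired_at" keys;
    the current fire is not included in `past`.
    """
    notes = []

    # Rule: false alarm on at least 3 of the last 4 fires.
    last_four = past[-4:]
    noise = sum(1 for r in last_four if r["verdict"] == "false_alarm")
    if len(last_four) == 4 and noise >= 3:
        notes.append(f"Noise {noise} of its last 4 fires - consider threshold tuning")

    # Rule: same root cause three or more times in 14 days, counting the current fire.
    cutoff = now - timedelta(days=14)
    repeats = sum(
        1
        for r in past
        if r["fired_at"] >= cutoff and r["root_cause_category"] == root_cause_category
    )
    if repeats >= 2:
        notes.append(
            f"{root_cause_category}: occurrence {repeats + 1} in 14 days - possible systemic issue"
        )
    return notes
```

Plain counting rules like these fit the "not a full ML pipeline" constraint: each note can be traced back to specific journal records.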
| Type | What it captures |
|---|---|
| Day-of-week | Sunday dips, Monday spikes, payday effects (4-week lookback) |
| Campaign-aware | Secret Code campaigns depress rates 2-3pp for 24-48h |
| Post-deploy grace | Widen expected range for 2h after deploys |
| Site-specific | phk1 ~57%, mxkzg1 ~35%, bdk1 ~35% - formalize per-site |
| Trend-adjusted | 30-day linear trend - flag break vs. within-trend |
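The day-of-week and post-deploy rows can be combined into one expected-range calculation. This is a sketch under assumptions: `band` (the half-width of the normal range) and `grace_factor` (how much to widen it after a deploy) are hypothetical parameters, and the history is taken to be hourly `(timestamp, rate)` pairs.

```python
from datetime import datetime, timedelta
from statistics import mean


def expected_range(history, at, last_deploy=None, band=0.05, grace_factor=2.0):
    """Return (low, high) for the success rate expected at `at`, or None without data.

    The center is the mean rate at the same weekday and hour over the prior
    4 weeks; the band is widened for 2 hours after the most recent deploy.
    """
    samples = [
        rate
        for ts, rate in history
        if timedelta(0) < at - ts <= timedelta(weeks=4)
        and ts.weekday() == at.weekday()
        and ts.hour == at.hour
    ]
    if not samples:
        return None
    center = mean(samples)
    width = band
    if last_deploy is not None and timedelta(0) <= at - last_deploy <= timedelta(hours=2):
        width *= grace_factor  # post-deploy grace: tolerate a wider swing
    return (center - width, center + width)
```

A rate outside the returned range would count as a candidate anomaly; per-site baselines follow from passing each site its own history.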
{
"fire_id": "...", "alert_type": "A1", "site": "phk1",
"verdict": "real", "confidence": "high",
"root_cause": "vcpay-ph-native outage",
"root_cause_category": "provider_outage",
"action_taken": "escalated to payments",
"resolved": true, "resolution_time_minutes": 73,
"was_correct": null // set later by the human feedback loop
}
Storage: verdict journal as monthly JSON files or an internal DB table (~300 records/month at 10 fires/day). Baseline cache precomputed daily by a scheduled job, read at triage time. Dependencies: Phase 2 data improves quality but isn’t required; journal accumulation should start in Phase 1.
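One way to realise the monthly-file option is an append-only file per month with one JSON record per line. The sketch below makes several assumptions: the `verdicts-YYYY-MM.jsonl` naming, the JSON Lines layout, and the Slack reaction names in `REACTION_TO_VERDICT` are all hypothetical choices, not part of the plan.

```python
import json
from datetime import datetime
from pathlib import Path

# Hypothetical mapping from Slack reaction names to the was_correct flag.
REACTION_TO_VERDICT = {"white_check_mark": True, "x": False}


def journal_path(root: Path, fired_at: datetime) -> Path:
    """One journal file per calendar month."""
    return root / f"verdicts-{fired_at:%Y-%m}.jsonl"


def append_verdict(root: Path, record: dict, fired_at: datetime) -> Path:
    """Append one verdict record as a single JSON line and return the file path."""
    path = journal_path(root, fired_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path


def load_month(root: Path, fired_at: datetime) -> list:
    """Read back every record for the month containing `fired_at`."""
    path = journal_path(root, fired_at)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def apply_feedback(record: dict, reaction: str) -> dict:
    """Return a copy of the record with was_correct set from a known reaction."""
    updated = dict(record)
    if reaction in REACTION_TO_VERDICT:
        updated["was_correct"] = REACTION_TO_VERDICT[reaction]
    return updated
```

At roughly 300 records a month, reading a whole month into memory at triage time is cheap; an internal DB table would replace `append_verdict` and `load_month` without changing the record shape.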
Goal: auto-@mention the right team or person when the bot is confident in its diagnosis - closing the gap between “here’s what happened” and “someone is now looking at it.” Routing depends on the diagnosis, not the alert name.
"provider_outage": {
"team": "payments",
"primary": "@ace", "secondary": "@nora",
"channel": "#kz-ops-support",
"escalation_window": "15min"
}
5C · Escalation chains: no response within the window → @mention secondary → post to escalation channel. Requires tracking thread activity (reactions, replies).
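The escalation chain can be sketched as a pure function of elapsed time and thread activity. Two assumptions are made here and are not stated in the plan: each step waits one full `escalation_window`, and the registry entry's `channel` is the escalation channel. The function names are hypothetical.

```python
import re
from datetime import timedelta


def parse_window(text: str) -> timedelta:
    """Parse a window written like "15min" into a timedelta."""
    match = re.fullmatch(r"(\d+)min", text)
    if not match:
        raise ValueError(f"unsupported escalation window: {text!r}")
    return timedelta(minutes=int(match.group(1)))


def escalation_step(entry: dict, elapsed: timedelta, thread_active: bool) -> str:
    """Decide the next escalation action for a thread.

    `thread_active` is True once anyone has reacted or replied in the thread.
    """
    if thread_active:
        return "none"
    window = parse_window(entry["escalation_window"])
    if elapsed >= 2 * window:
        return f"post to {entry['channel']}"
    if elapsed >= window:
        return f"mention {entry['secondary']}"
    return "wait"
```

With the `provider_outage` entry above, a silent thread would mention `@nora` at 15 minutes and post to `#kz-ops-support` at 30 minutes.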
| Confidence | Verdict | Behavior |
|---|---|---|
| High | Real | Auto @mention primary |
| High | False alarm | No mention |
| Medium | Real | Plain team name · needs review |
| Low | Any | No mention · manual review |
• Kill switch - single flag to disable all auto-mentions instantly.
• Dry-run - first 2 weeks, plain-text team names with a “would have auto-tagged” note.
• Override registry - per alert + site + time-of-day.
• Out-of-office awareness - skip primary if marked away.
Hard dependency: Phases 1-4 stable and trusted - auto-tagging a wrong diagnosis is worse than no tag at all.
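The confidence gate and the safeguards can be sketched as one function that produces the routing line of a reply. The names `REGISTRY` and `routing_line` are hypothetical, the override registry is left out for brevity, and two behaviors are assumptions the table does not cover: a medium-confidence false alarm gets no mention, and an unknown root-cause category falls back to manual review.

```python
# Hypothetical in-memory registry keyed by root-cause category, not alert name.
REGISTRY = {
    "provider_outage": {"team": "payments", "primary": "@ace", "secondary": "@nora"},
}


def routing_line(
    confidence: str,
    verdict: str,
    root_cause_category: str,
    *,
    kill_switch: bool = False,
    dry_run: bool = False,
    primary_away: bool = False,
) -> str:
    """Return the text to append to the reply; an empty string means no mention."""
    confidence = confidence.lower()
    verdict = verdict.lower().replace("_", " ")
    if confidence == "low":
        return "manual review"
    if verdict != "real":
        return ""  # high-confidence false alarm; medium false alarm is an assumption
    entry = REGISTRY.get(root_cause_category)
    if entry is None:
        return "manual review"  # assumption: unknown owner is handled by a human
    if confidence == "medium":
        return f"{entry['team']} · needs review"
    target = entry["secondary"] if primary_away else entry["primary"]
    if kill_switch:
        return entry["team"]  # plain text only, no @mention
    if dry_run:
        return f"{entry['team']} (would have auto-tagged {target})"
    return target
```

Checking the kill switch before the dry-run flag means the single flag always wins, whatever else is configured.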
Phases 1 and 3 form an independent fast track - neither needs backend data, so they can ship while Phase 2 endpoints are still in flight. Phase 5 is deliberately last: auto-tagging on an inaccurate diagnosis erodes trust faster than no tag at all, so it waits until the diagnostic layers below it are proven.
| Action (begins Phase 1) | Feeds | How |
|---|---|---|
| Log verdict records | → Phase 4 memory | Start from the first triage under the new format - the journal only has value once it has history. |
| Track fire timestamps | → Phase 3 correlation | Log every fire with alert type, site, and time, so the correlation engine has a window to look back over. |
| Document ownership | → Phase 5 registry | As you learn who owns what during Phases 1-2, write it down in a structured format ready for routing. |
| Risk | Impact | Mitigation |
|---|---|---|
| Backend data endpoints delayed | P2 BLOCKED | Phases 1 + 3 are independent of backend; Phase 4 baselines can start with PostHog-only data. |
| Correlation logic groups unrelated alerts | WRONG CONTEXT | Conservative grouping rules + dry-run period. Independent alerts always get full triage. |
| Auto-tagging on wrong diagnosis | TRUST ERODES | Confidence gate + 2-week dry-run + kill switch. Ms. Giang’s core concern - gated hardest. |
| Ownership registry goes stale | WRONG PERSON | Monthly review cadence + override mechanism. |
| Prompt too long for all alert types | COST · LATENCY | Modular prompt architecture - shared template + per-journey query modules. |