Alert Reply Intelligence
Design Spec · 2026-06-30 · Status: Draft

From data-dense readouts to actionable alert intelligence

A five-layer system that rebuilds alert replies for a mixed Slack channel of executives and engineers - each layer applies across all alert types at once, stacking a new capability on the one before it.

5 horizontal layers
7 user journeys · 5-10 alerts each
Audience: execs + engineers, same channel
Architecture: webhook-first + internal DB

The mandate

Requirement: alerts must either loop in the right team or deliver an actionable insight. Today they reliably do neither - replies are technically correct walls of numbers no one can act on, with no root-cause hypothesis, no memory of past fires, and no connection to internal systems.

01

Roadmap timeline

Relative bands - no fixed dates. Bar length is rough effort, not a committed schedule.
Bands: Now · Next · Later
PHASE 1 · Reply Format · NOW: Narrative verdict + evidence thread
PHASE 2 · Context Enrichment · NEXT
PHASE 3 · Multi-fire Intelligence · NEXT
PHASE 4 · Memory & Baselines · LATER: Verdict journal · adaptive baselines
PHASE 5 · Smart Routing · LATER: Auto @mention · escalation
CROSS-CUT · Start early: Log verdicts · track fire timestamps · document ownership

The dashed line marks “now”. Phase 1 is live-buildable today; Phase 5 sits last because it requires Phases 1-4 to be stable and trusted before auto-tagging anyone. The Start-early rail runs underneath every band - its data accumulation begins in Phase 1 and feeds Phases 3, 4 and 5.

02

Phase detail

Reply Format Layer · NOW

Goal: transform alert replies from data-dense readouts into 3-4 narrative bullets any reader can scan in 10 seconds and know what to do. Applies to A1/A1x (deposit success rate) plus one funnel-type and one operational-type alert.

Message 1 - narrative verdict (4 bullets)
What happened: Plain English, one number max (the rate), compared to normal. No SE/pp unless borderline.
Why: Root-cause hypothesis in causal language. Name the specific provider/component; connect findings into a chain.
Scope: Isolated or widespread? Which sites/systems are clean? Has it recovered?
Action: One concrete next step with a named team (plain text, not @mention yet).
Message 2 - evidence (thread reply)
Session replays: 3-5 sessions hitting the failure point. Clickable PostHog replay links.
Activity traces: For sessions without recordings, query the event sequence for the factual failure timeline.
DB evidence (Phase 2 onward): Specific failed transactions, error codes, timeout logs.
Event links: Direct PostHog person/event links for engineers to click through.

Message 1 tells the story. Message 2 shows the receipts. Numbers policy: at most 1-2 per bullet, always with plain-English context.
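A minimal sketch of how Message 1 could be assembled so that it always has exactly four bullets in the fixed order. The `Verdict` class, its field names, and `render_message_1` are hypothetical names chosen for illustration; the spec does not define them.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical container for the four narrative bullets."""
    header: str         # e.g. alert name, site, hour, verdict label
    what_happened: str
    why: str
    scope: str
    action: str


def render_message_1(v: Verdict) -> str:
    """Render the header plus exactly four labelled bullets."""
    bullets = [
        ("What happened", v.what_happened),
        ("Why", v.why),
        ("Scope", v.scope),
        ("Action", v.action),
    ]
    lines = [v.header, ""]
    lines += [f"• {label}: {text}" for label, text in bullets]
    return "\n".join(lines)
```

Because the bullet list is fixed in code rather than left to free-form generation, a reply cannot grow a fifth bullet or drop the Action line.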

Example - current vs. new
Current - last-verdict.txt style
:sos: SYSTEM routing/failover -- RECOVERED -- phk1
success/click 26.2% vs ~58% same-hour-7d (~30 SE), phk1 only
- submit-to-success 47.9% (~75%) <- leak
- providers: vcpay-ph-native 2797->6 dark;
  rmpay 534->2736 @16% failover-flood
- scope: phk1 only (bdk1 73%, mxkzg1 63% normal)
Owner: payments/platform -- phk1 routing
New - narrative verdict
Deposit alert · phk1 · 17:00 (+07) | Real incident, High confidence

• What happened: Success rate dropped to 26% (normally ~58%
  this hour) - a significant drop, not noise.
• Why: vcpay-ph-native went dark (2,797 → 6 orders). Traffic
  failed over to rmpay, which handled the flood at 16% success.
  A single provider outage cascading through failover.
• Scope: phk1 only - bdk1 and mxkzg1 clean. No siblings firing.
  Recovered by 18:00.
• Action: Post-incident review with payments on vcpay
  reliability + rmpay failover capacity. No escalation needed.

Context Enrichment Layer · NEXT

Goal: connect the bot to data sources beyond PostHog so root-cause hypotheses are backed by ground-truth evidence, not just event-metric inference.

Data sources
Source | What it provides | Reply impact
Transaction DB | Payment status codes, provider error responses, timeout counts | “Why” cites specific error codes and timestamps
Operational config | Provider enabled/disabled state, routing weight changes | “Why” can attribute to config changes
Deploy logs | Recent deployments with timestamps and affected services | “Why” can correlate with deploys
User account state | KYC status, account tier, balance context | “Scope” can segment by user type
Evidence hierarchy

When internal DB data contradicts PostHog metrics, internal DB wins. PostHog captures client-side events with ingestion lag; the transaction DB holds the server-side source of truth.
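The precedence rule can be sketched as a small resolver. This is an illustration only: `resolve_success_rate` and the `tolerance` threshold for calling the two sources "contradictory" are assumptions, not part of the spec.

```python
from typing import Optional, Tuple


def resolve_success_rate(
    posthog_rate: Optional[float],
    db_rate: Optional[float],
    tolerance: float = 0.05,  # hypothetical: gap above which sources "contradict"
) -> Tuple[Optional[float], str, bool]:
    """Return (rate, source, contradicted). The internal DB wins when present."""
    if db_rate is not None:
        contradicted = (
            posthog_rate is not None and abs(db_rate - posthog_rate) > tolerance
        )
        return db_rate, "transaction_db", contradicted
    # No DB figure available: fall back to the PostHog metric.
    return posthog_rate, "posthog", False
```

Returning the `contradicted` flag lets the reply mention that PostHog disagreed (for example because of ingestion lag) while still quoting the DB figure.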

Multi-fire Intelligence Layer · NEXT

Goal: when multiple alerts fire in a short window, deliver one full triage on the primary incident and brief acknowledgments on the rest.

Process
1 · Group: Incoming alert waits a 3-5 min quiet window to collect co-firing alerts.
2 · Correlate: Check sibling cascades, cross-site, persistent, and shared-upstream signals.
3 · Rank: Highest user impact > root-cause indicator > broadest scope > earliest fire.
4 · Respond: Full triage on primary; 1-2 line ack on secondaries linking back.
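The Rank and Respond steps can be sketched as a sort on a tuple key that mirrors the stated priority order. The `FireImpact` fields (`affected_users` as the proxy for user impact, `sites_affected` as the proxy for scope) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class FireImpact:
    alert_id: str
    affected_users: int            # hypothetical proxy for "user impact"
    is_root_cause_indicator: bool
    sites_affected: int            # hypothetical proxy for "scope"
    fired_at: datetime


def pick_primary(fires: List[FireImpact]) -> FireImpact:
    """Highest impact > root-cause indicator > broadest scope > earliest fire."""
    return min(
        fires,
        key=lambda f: (
            -f.affected_users,
            not f.is_root_cause_indicator,  # False sorts before True
            -f.sites_affected,
            f.fired_at,
        ),
    )


def plan_replies(fires: List[FireImpact]) -> Dict[str, str]:
    """Full triage for the primary, a short acknowledgment for the rest."""
    primary = pick_primary(fires)
    return {
        f.alert_id: "full_triage" if f is primary else "ack" for f in fires
    }
```

When every earlier criterion ties, the earliest fire wins, which matches the "no clear primary" fallback.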
Edge cases
Late arrival: Outside quiet window - check correlation with recent batch, ack if related.
No clear primary: Pick earliest fire, note “multiple alerts of equal severity”.
Primary false, secondary real: Note this explicitly so the real one isn’t buried.
Correlation types
Signal | Type | Example
Same site, same hour, different alerts | Sibling cascade | A1 + A3 + A4 → phk1
Same alert, multiple sites | Cross-site | A1-phk1 + A1-bdk1 + A1-mxkzg1
Same alert, same site, repeated | Persistent | A1-phk1 @ 15:00, 16:00, 17:00
Different journeys, same site | Shared upstream | Deposit A1 + registration alert · phk1
No relationship | Independent | Triage separately
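The correlation types above can be sketched as a pairwise classifier. The `Fire` fields and the order in which the rules are checked are assumptions; the sketch also assumes both fires were already selected from the same recent look-back period, and the alert name `R1` in the usage note is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fire:
    alert_type: str  # e.g. "A1"
    journey: str     # e.g. "deposit", "registration"
    site: str        # e.g. "phk1"
    hour: int        # hour of day the alert fired


def classify_pair(a: Fire, b: Fire) -> str:
    """Map two fires to one of the five correlation types."""
    same_alert = a.alert_type == b.alert_type
    same_site = a.site == b.site
    if same_alert and same_site:
        return "persistent"        # same alert repeating on one site
    if same_alert:
        return "cross_site"        # same alert on several sites
    if same_site and a.journey != b.journey:
        return "shared_upstream"   # different journeys failing together
    if same_site and a.hour == b.hour:
        return "sibling_cascade"   # different alerts, same site and hour
    return "independent"           # triage separately
```

For example, `classify_pair(Fire("A1", "deposit", "phk1", 17), Fire("R1", "registration", "phk1", 17))` falls into the shared-upstream branch, while two fires with nothing in common come back as independent and keep their own full triage.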
Dependencies

Phase 1 format (primary reply uses the narrative template) + a stateful component for tracking recent fires within the window (KV store, Worker state, or lightweight DB).
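A minimal sketch of the stateful component, using an in-memory class as a stand-in for whichever store is chosen (KV store, Worker state, or lightweight DB). The `FireWindow` name, the 4-minute default, and the choice to measure the window from the first fire in the batch are all assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional


class FireWindow:
    """Collects co-firing alerts for a fixed quiet window after the first fire."""

    def __init__(self, quiet: timedelta = timedelta(minutes=4)):
        self.quiet = quiet  # hypothetical value inside the 3-5 min range
        self.opened_at: Optional[datetime] = None
        self.alerts: List[str] = []

    def add(self, alert_id: str, fired_at: datetime) -> str:
        """Return "batched" if the fire joined the batch, "late" otherwise."""
        if self.opened_at is None:
            self.opened_at = fired_at
        if fired_at - self.opened_at <= self.quiet:
            self.alerts.append(alert_id)
            return "batched"
        return "late"  # caller checks correlation with the recent batch

    def ready(self, now: datetime) -> bool:
        """True once the quiet window has elapsed and triage can start."""
        return self.opened_at is not None and now - self.opened_at > self.quiet
```

A "late" result corresponds to the late-arrival edge case: the fire is not added to the batch, and the caller decides whether to acknowledge it as related.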

Memory & Adaptive Baselines · LATER

Goal: the bot learns from its own history - remembers past verdicts and outcomes, builds context-aware baselines, and distinguishes real anomalies from normal fluctuation.

4A · Outcome memory - journal uses
Past pattern | Reply adjustment
False alarm 3 of last 4 fires | “Noise 3 of its last 4 fires - consider threshold tuning”
Same root cause 3× in 14 days | “3rd vcpay outage in 2 weeks - systemic reliability issue”
Prior incidents self-recovered ~40 min | “Previous vcpay outages self-recovered in ~40 min. Monitor before escalating”
Human marked verdict incorrect | “Note: similar incident Jun 28 was initially misdiagnosed”

Human feedback loop: lightweight - Slack reaction (✓ correct / ✗ incorrect) or a simple command. Not a full ML pipeline.
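A rough sketch of two journal uses: the "noise 3 of last 4" note and recording a Slack reaction as feedback. It assumes journal records are dicts kept in chronological order; the verdict value `"false_alarm"` and the reaction names are hypothetical, since the spec only shows `"real"` and the ✓ / ✗ symbols.

```python
from typing import Dict, List, Optional


def false_alarm_note(
    journal: List[Dict], alert_type: str, site: str,
    lookback: int = 4, threshold: int = 3,
) -> Optional[str]:
    """Return a tuning hint when most recent fires of this alert were noise."""
    matching = [
        r for r in journal
        if r["alert_type"] == alert_type and r["site"] == site
    ]
    recent = matching[-lookback:]  # journal is assumed oldest-first
    noise = sum(1 for r in recent if r["verdict"] == "false_alarm")
    if len(recent) == lookback and noise >= threshold:
        return f"Noise {noise} of its last {lookback} fires - consider threshold tuning"
    return None


# Hypothetical mapping from Slack reaction names to feedback values.
REACTION_TO_FEEDBACK = {"white_check_mark": True, "x": False}


def apply_feedback(record: Dict, reaction: str) -> Dict:
    """Set was_correct from a reaction; unknown reactions leave it unchanged."""
    if reaction in REACTION_TO_FEEDBACK:
        record["was_correct"] = REACTION_TO_FEEDBACK[reaction]
    return record
```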

4B · Adaptive baselines
Type | What it captures
Day-of-week | Sunday dips, Monday spikes, payday effects (4-week lookback)
Campaign-aware | Secret Code campaigns depress rates 2-3pp for 24-48h
Post-deploy grace | Widen expected range for 2h after deploys
Site-specific | phk1 ~57%, mxkzg1 ~35%, bdk1 ~35% - formalize per-site
Trend-adjusted | 30-day linear trend - flag break vs. within-trend
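Two of the baseline types can be sketched in a few lines: the day-of-week baseline as the mean of the same weekday over the previous four weeks, and the post-deploy grace as a temporarily widened band. The function names, the `band` half-width, and the `grace_factor` multiplier are assumptions for illustration.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Optional, Tuple


def day_of_week_baseline(
    history: Dict[date, float], target: date, weeks: int = 4
) -> Optional[float]:
    """Mean success rate on the same weekday over the previous `weeks` weeks."""
    samples = []
    for k in range(1, weeks + 1):
        day = target - timedelta(weeks=k)
        if day in history:
            samples.append(history[day])
    return mean(samples) if samples else None


def expected_range(
    baseline: float,
    band: float = 0.05,                          # hypothetical half-width
    minutes_since_deploy: Optional[int] = None,
    grace_factor: float = 2.0,                   # hypothetical widening
) -> Tuple[float, float]:
    """Widen the expected range for 2h (120 min) after a deploy."""
    if minutes_since_deploy is not None and minutes_since_deploy <= 120:
        band *= grace_factor
    return baseline - band, baseline + band
```

A scheduled job could precompute `day_of_week_baseline` per site and hour once a day, leaving only the cheap `expected_range` call for triage time.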
Verdict journal record (logged after each triage; was_correct stays null until the human feedback loop sets it)
{
  "fire_id": "...",
  "alert_type": "A1",
  "site": "phk1",
  "verdict": "real",
  "confidence": "high",
  "root_cause": "vcpay-ph-native outage",
  "root_cause_category": "provider_outage",
  "action_taken": "escalated to payments",
  "resolved": true,
  "resolution_time_minutes": 73,
  "was_correct": null
}

Storage: verdict journal as monthly JSON files or an internal DB table (~300 records/month at 10 fires/day). Baseline cache precomputed daily by a scheduled job, read at triage time. Dependencies: Phase 2 data improves quality but isn’t required; journal accumulation should start in Phase 1.
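If the monthly-JSON-file option is chosen, appending a record could look like the sketch below. The `append_verdict` helper, the `YYYY-MM.json` file naming, and the one-array-per-file layout are assumptions; at roughly 300 records a month, rewriting the whole file on each append stays cheap.

```python
import json
from datetime import datetime
from pathlib import Path


def append_verdict(journal_dir: Path, record: dict, fired_at: datetime) -> Path:
    """Append one verdict record to that month's JSON file and return its path."""
    journal_dir.mkdir(parents=True, exist_ok=True)
    path = journal_dir / f"{fired_at:%Y-%m}.json"  # hypothetical naming scheme
    records = json.loads(path.read_text()) if path.exists() else []
    records.append(record)
    path.write_text(json.dumps(records, indent=2))
    return path
```

This read-modify-write is not safe under concurrent triages; an internal DB table would avoid that problem.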

Smart Routing Layer · LATER

Goal: auto-@mention the right team or person when the bot is confident in its diagnosis - closing the gap between “here’s what happened” and “someone is now looking at it.” Routing depends on the diagnosis, not the alert name.

5A · Ownership registry - maps root-cause categories to owners
"provider_outage": {
  "team": "payments",
  "primary": "@ace",   "secondary": "@nora",
  "channel": "#kz-ops-support",
  "escalation_window": "15min"
}
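A minimal sketch of a registry lookup keyed on root-cause category rather than alert name. The `owner_for` and `parse_window_minutes` helpers are hypothetical, and the `"15min"` string format is taken from the entry above; other duration formats are deliberately rejected.

```python
from typing import Dict, Optional

# Registry content copied from the example entry; more categories would follow.
REGISTRY: Dict[str, Dict[str, str]] = {
    "provider_outage": {
        "team": "payments",
        "primary": "@ace",
        "secondary": "@nora",
        "channel": "#kz-ops-support",
        "escalation_window": "15min",
    },
}


def owner_for(root_cause_category: str) -> Optional[Dict[str, str]]:
    """Look up the owner entry for a diagnosis; None means no known owner."""
    return REGISTRY.get(root_cause_category)


def parse_window_minutes(value: str) -> int:
    """Convert a string such as "15min" to an integer number of minutes."""
    if not value.endswith("min"):
        raise ValueError(f"unsupported escalation window: {value!r}")
    return int(value[: -len("min")])
```

Returning `None` for an unknown category gives the caller a clear signal to fall back to manual review instead of guessing an owner.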

5B · Confidence gate
Confidence | Verdict | Behavior
High | Real | Auto @mention primary
High | False alarm | No mention
Medium | Real | Plain team name · needs review
Low | Any | No mention · manual review

5C · Escalation chains: no response within the window → @mention secondary → post to escalation channel. Requires tracking thread activity (reactions, replies).
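The confidence gate and escalation chain described in 5B and 5C can be sketched as two small decision functions. The returned labels are hypothetical. The gate table does not cover medium confidence with a false-alarm verdict, so the sketch assumes "no mention" there. Giving the secondary the same window length before the channel post is also an assumption.

```python
def mention_behavior(confidence: str, verdict: str) -> str:
    """Map (confidence, verdict) to a mention behavior per the gate table."""
    confidence, verdict = confidence.lower(), verdict.lower()
    if confidence == "low":
        return "no_mention_manual_review"
    if confidence == "high" and verdict == "real":
        return "auto_mention_primary"
    if confidence == "medium" and verdict == "real":
        return "plain_team_name_needs_review"
    # High + false alarm per the table; medium + false alarm is an assumption.
    return "no_mention"


def escalation_step(
    minutes_since_mention: int, window_minutes: int, thread_has_activity: bool
) -> str:
    """Decide the next escalation step for an auto-mentioned thread."""
    if thread_has_activity:          # a reaction or reply counts as a response
        return "none"
    if minutes_since_mention < window_minutes:
        return "wait"
    if minutes_since_mention < 2 * window_minutes:
        return "mention_secondary"
    return "post_to_escalation_channel"
```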
Safeguards

Kill switch - single flag to disable all auto-mentions instantly.
Dry-run - first 2 weeks, plain-text team names with a “would have auto-tagged” note.
Override registry - per alert + site + time-of-day.
Out-of-office awareness - skip primary if marked away.

Hard dependency: Phases 1-4 stable and trusted - auto-tagging a wrong diagnosis is worse than no tag at all.
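A sketch of how the kill switch, dry-run mode, and out-of-office check could wrap the final mention text. The `Safeguards` class and `format_owner` function are hypothetical, and the override registry is left out for brevity. The sketch assumes the owner entry has `team`, `primary`, and `secondary` keys like the registry example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Safeguards:
    auto_mentions_enabled: bool = True           # kill switch: False disables all
    dry_run: bool = True                         # first 2 weeks: plain text only
    away: Set[str] = field(default_factory=set)  # handles marked out of office


def format_owner(owner: Dict[str, str], guards: Safeguards) -> str:
    """Return the text to place in the reply for a high-confidence real verdict."""
    if not guards.auto_mentions_enabled:
        return owner["team"]                     # plain team name, no tag
    # Skip the primary if marked away; the secondary is used even if also away.
    target = owner["secondary"] if owner["primary"] in guards.away else owner["primary"]
    if guards.dry_run:
        return f"{owner['team']} (would have auto-tagged {target})"
    return target
```

Checking the kill switch first means a single flag overrides every other setting, including a live (non-dry-run) configuration.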

03

Dependency graph

What each phase needs before it can ship - and what it unblocks downstream.
PHASE 1 · Reply Format: Foundation. Independent of backend. Provides the narrative template every later phase reuses.
PHASE 2 · Context Enrichment: Needs backend data endpoints. Improves - but doesn’t gate - Phase 4 baseline quality.
PHASE 3 · Multi-fire: Needs Phase 1 template + a stateful store for recent fires. Independent of Phase 2.
PHASE 4 · Memory & Baselines: Needs verdict journal (start logging in P1) + a scheduled baseline job. Can begin PostHog-only.
PHASE 5 · Smart Routing: Hard gate: Phases 1-4 stable & trusted. Needs ownership registry + Slack @mention scopes.
Legend: Sequential build order (recommended) · Soft input - improves quality, not required to ship

Phases 1 and 3 form an independent fast track - neither needs backend data, so they can ship while Phase 2 endpoints are still in flight. Phase 5 is deliberately last: auto-tagging on an inaccurate diagnosis erodes trust faster than no tag at all, so it waits until the diagnostic layers below it are proven.
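The hard dependencies can be written down as a small graph and checked with the standard library's topological sorter. The phase keys and the `shippable` helper are hypothetical. Only phase-to-phase hard gates are modelled, so the soft Phase 2 → Phase 4 input and external needs such as backend endpoints are deliberately left out.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each phase maps to the phases that must ship before it (hard gates only).
HARD_DEPS: Dict[str, Set[str]] = {
    "phase1": set(),
    "phase2": set(),                 # gated by backend endpoints, not by a phase
    "phase3": {"phase1"},
    "phase4": {"phase1"},            # needs the journal started in Phase 1
    "phase5": {"phase1", "phase2", "phase3", "phase4"},
}


def build_order() -> List[str]:
    """Return one valid build order that respects every hard dependency."""
    return list(TopologicalSorter(HARD_DEPS).static_order())


def shippable(shipped: Set[str]) -> List[str]:
    """Phases not yet shipped whose hard dependencies are all shipped."""
    return sorted(
        p for p, deps in HARD_DEPS.items() if p not in shipped and deps <= shipped
    )
```

With only Phase 1 shipped, `shippable({"phase1"})` lists Phases 2, 3 and 4, which reflects the fast track: Phase 3 does not wait on Phase 2.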

04

Cross-cutting: start early

Three actions that begin in Phase 1 and quietly accumulate value for later phases.
Action (begins Phase 1) | Feeds | How
Log verdict records | → Phase 4 memory | Start from the first triage under the new format - the journal only has value once it has history.
Track fire timestamps | → Phase 3 correlation | Log every fire with alert type, site, and time, so the correlation engine has a window to look back over.
Document ownership | → Phase 5 registry | As you learn who owns what during Phases 1-2, write it down in a structured format ready for routing.

05

Risk register

Known risks and the mitigation already designed into the plan.
Risk | Impact | Mitigation
Backend data endpoints delayed | P2 BLOCKED | Phases 1 + 3 are independent of backend; Phase 4 baselines can start with PostHog-only data.
Correlation logic groups unrelated alerts | WRONG CONTEXT | Conservative grouping rules + dry-run period. Independent alerts always get full triage.
Auto-tagging on wrong diagnosis | TRUST ERODES | Confidence gate + 2-week dry-run + kill switch. Ms. Giang’s core concern - gated hardest.
Ownership registry goes stale | WRONG PERSON | Monthly review cadence + override mechanism.
Prompt too long for all alert types | COST · LATENCY | Modular prompt architecture - shared template + per-journey query modules.

06

Success criteria · per phase

These are the bars each layer must clear.
Phase 1 - Reply Format
Any reader understands what / why / what-to-do in under 10 seconds.
Engineers access full technical detail via the evidence thread.
Team confirms the new format is an improvement.
Each Message 1 stays at 4 bullets with 1-2 numbers per bullet - no wall of metrics.
Phase 2 - Context Enrichment
Correctly identifies external cause (deploy / config / outage) in >80% of real incidents.
False-alarm rate drops - bot distinguishes ingestion lag from real outages.
Reply credibility increases - diagnosis cites transaction-level evidence.
When internal DB contradicts PostHog, the reply uses the DB source of truth.
Phase 3 - Multi-fire
Channel noise drops 60-70% during multi-fire incidents.
One coherent story per incident, not N fragmented reports.
Every alert gets at least an acknowledgment.
Independent alerts still get full triage - no false grouping.
Primary alert is correctly ranked (highest user impact) in the multi-fire incidents reviewed.
Phase 4 - Memory & Baselines
False-alarm rate drops 40%+.
Recurring incidents flagged as "Day N" with prior dates & outcomes.
Campaign / weekend / post-deploy effects correctly accounted for.
Human feedback loop lightweight enough to actually get used.
Verdict journal accumulates a usable history (started logging in Phase 1).
Phase 5 - Smart Routing
Time from alert fire to right person acknowledging drops 50%+.
Zero complaints about being tagged on false alarms.
Ownership registry stays accurate (monthly review cadence).
2-week dry-run completed with kill switch verified before any live auto-mention.