Design Spec · 2026-06-30 · Status: Draft

From data-dense readouts to actionable alert intelligence

A five-layer system that rebuilds alert replies for a mixed Slack channel of executives and engineers - each layer applies across all alert types at once, stacking a new capability on the one before it.

5 horizontal layers · 7 user journeys · 5-10 alerts each · Audience: execs + engineers, same channel · Architecture: webhook-first + internal DB

The mandate

Requirement: alerts must either loop in the right team or deliver an actionable insight. Today they reliably do neither - replies are technically correct walls of numbers no one can act on, with no root-cause hypothesis, no memory of past fires, and no connection to internal systems.

01

Roadmap timeline

Relative bands - no fixed dates. Bar length is rough effort, not a committed schedule.

Layer	Band	Focus
Phase 1 · Reply Format	NOW	Narrative verdict + evidence thread
Phase 2 · Context Enrichment	NEXT	Internal DB · config · deploy logs
Phase 3 · Multi-fire Intelligence	NEXT	Group · correlate · one triage
Phase 4 · Memory & Baselines	LATER	Verdict journal · adaptive baselines
Phase 5 · Smart Routing	LATER	Auto @mention · escalation
Cross-cut · Start early	All bands	Log verdicts · track fire timestamps · document ownership

The dashed line marks “now”. Phase 1 is live-buildable today; Phase 5 sits last because it requires Phases 1-4 to be stable and trusted before auto-tagging anyone. The Start-early rail runs underneath every band - its data accumulation begins in Phase 1 and feeds Phases 3, 4 and 5.

02

Phase detail

Phase 1 · Reply Format Layer · NOW

Goal: transform alert replies from data-dense readouts into a 4-bullet narrative any reader can scan in 10 seconds and know what to do. Applies to A1/A1x (deposit success rate) plus one funnel-type and one operational-type alert.

Message 1 - narrative verdict (4 bullets)

What happened	Plain English, one number max (the rate), compared to normal. Omit standard error (SE) and percentage-point (pp) deltas unless the call is borderline.
Why	Root-cause hypothesis in causal language. Name the specific provider/component; connect findings into a chain.
Scope	Isolated or widespread? Which sites/systems are clean? Has it recovered?
Action	One concrete next step with a named team (plain text, not @mention yet).

Message 2 - evidence (thread reply)

Session replays	3-5 sessions hitting the failure point. Clickable PostHog replay links.
Activity traces	For sessions without recordings, query the event sequence for the factual failure timeline.
DB evidence	From Phase 2 onward: specific failed transactions, error codes, timeout logs.
Event links	Direct PostHog person/event links for engineers to click through.

Message 1 tells the story. Message 2 shows the receipts. Numbers policy: at most two per bullet, always with plain-English context.
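The Message 1 template and the numbers policy can be sketched as a small formatter plus a lint check. This is a minimal sketch, not the bot's actual implementation: the `Verdict` class, the function names, and the regex used to count numbers are all assumptions made for illustration. The regex counts standalone numeric tokens (26%, 2,797, 18:00) and skips digits inside identifiers such as phk1.

```python
import re
from dataclasses import dataclass

MAX_NUMBERS_PER_BULLET = 2  # upper bound taken from the numbers policy

# Standalone numeric tokens such as 26%, 2,797 or 18:00.
# Digits inside identifiers like "phk1" are deliberately not counted.
NUMBER = re.compile(r"(?<![\w-])\d[\d,.:]*%?(?!\w)")


@dataclass
class Verdict:
    """Hypothetical container for the four Message 1 bullets."""
    header: str
    what_happened: str
    why: str
    scope: str
    action: str

    def bullets(self) -> list[tuple[str, str]]:
        return [
            ("What happened", self.what_happened),
            ("Why", self.why),
            ("Scope", self.scope),
            ("Action", self.action),
        ]


def count_numbers(text: str) -> int:
    return len(NUMBER.findall(text))


def over_limit(verdict: Verdict) -> list[str]:
    """Labels of bullets that carry more numbers than the policy allows."""
    return [label for label, text in verdict.bullets()
            if count_numbers(text) > MAX_NUMBERS_PER_BULLET]


def render_message_1(verdict: Verdict) -> str:
    """Header line, blank line, then exactly four labeled bullets."""
    lines = [verdict.header, ""]
    lines += [f"• {label}: {text}" for label, text in verdict.bullets()]
    return "\n".join(lines)
```

Running `over_limit` before posting would let the bot trim or rephrase a bullet instead of shipping a wall of metrics. Whether a time such as 18:00 should count toward the limit is a policy choice this sketch does not settle.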

Example - current vs. new

Current - last-verdict.txt style

:sos: SYSTEM routing/failover -- RECOVERED -- phk1
success/click 26.2% vs ~58% same-hour-7d (~30 SE), phk1 only
- submit-to-success 47.9% (~75%) <- leak
- providers: vcpay-ph-native 2797->6 dark;
  rmpay 534->2736 @16% failover-flood
- scope: phk1 only (bdk1 73%, mxkzg1 63% normal)
Owner: payments/platform -- phk1 routing

New - narrative verdict

Deposit alert · phk1 · 17:00 (+07) | Real incident, High confidence

• What happened: Success rate dropped to 26% (normally ~58%
  this hour) - a significant drop, not noise.
• Why: vcpay-ph-native went dark (2,797 → 6 orders). Traffic
  failed over to rmpay, which handled the flood at 16% success.
  A single provider outage cascading through failover.
• Scope: phk1 only - bdk1 and mxkzg1 clean. No siblings firing.
  Recovered by 18:00.
• Action: Post-incident review with payments on vcpay
  reliability + rmpay failover capacity. No escalation needed.

Phase 2 · Context Enrichment Layer · NEXT

Goal: connect the bot to data sources beyond PostHog so root-cause hypotheses are backed by ground-truth evidence, not just event-metric inference.

Data sources

Source	What it provides	Reply impact
Transaction DB	Payment status codes, provider error responses, timeout counts	“Why” cites specific error codes and timestamps
Operational config	Provider enabled/disabled state, routing weight changes	“Why” can attribute to config changes
Deploy logs	Recent deployments with timestamps and affected services	“Why” can correlate with deploys
User account state	KYC status, account tier, balance context	“Scope” can segment by user type

Evidence hierarchy

When internal DB data contradicts PostHog metrics, internal DB wins. PostHog captures client-side events with ingestion lag; the transaction DB holds the server-side source of truth.
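The evidence hierarchy can be expressed as a small resolver. This is a sketch under stated assumptions: the `Reading` type, the `resolve` function, and the 0.05 agreement tolerance are hypothetical and not part of the spec. The only rule taken from the text is that the transaction DB wins whenever it is available and disagrees with PostHog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    source: str          # "posthog" or "transaction_db"
    success_rate: float  # fraction between 0 and 1


def resolve(posthog: Reading, db: Optional[Reading],
            tolerance: float = 0.05) -> tuple[Reading, str]:
    """Pick the reading the reply should cite, plus a short provenance note.

    `tolerance` is an assumed threshold for treating the two sources as
    agreeing; the spec does not define one.
    """
    if db is None:
        return posthog, "PostHog only - no transaction DB reading available"
    if abs(posthog.success_rate - db.success_rate) <= tolerance:
        return db, "PostHog and transaction DB agree"
    return db, "Sources disagree - using transaction DB (server-side source of truth)"
```

Returning the note alongside the reading lets the reply say which source backed the number, which matters most when ingestion lag makes PostHog look worse than reality.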

Phase 3 · Multi-fire Intelligence Layer · NEXT

Goal: when multiple alerts fire in a short window, deliver one full triage on the primary incident and brief acknowledgments on the rest.

Process

1 · Group	Incoming alert waits a 3-5 min quiet window to collect co-firing alerts.
2 · Correlate	Check sibling cascades, cross-site, persistent, and shared-upstream signals.
3 · Rank	Highest user impact > root-cause indicator > broadest scope > earliest fire.
4 · Respond	Full triage on primary; 1-2 line ack on secondaries linking back.
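Steps 1, 3 and 4 of the process above can be sketched as follows (step 2, correlation, is left out here). Several things are assumptions: the `Fire` fields, the use of `affected_users` and `sites_affected` as proxies for impact and scope, the 240-second default, and the reading of "quiet window" as "the batch closes once no new fire arrives for that long".

```python
from dataclasses import dataclass


@dataclass
class Fire:
    alert_id: str                      # e.g. "A1-phk1"
    fired_at: float                    # epoch seconds
    affected_users: int = 0            # proxy for user impact (assumption)
    root_cause_indicator: bool = False
    sites_affected: int = 1            # proxy for scope breadth (assumption)


def group_fires(fires: list[Fire], quiet_seconds: float = 240) -> list[list[Fire]]:
    """Step 1: a batch closes once no new fire arrives for quiet_seconds."""
    batches: list[list[Fire]] = []
    for fire in sorted(fires, key=lambda f: f.fired_at):
        if batches and fire.fired_at - batches[-1][-1].fired_at <= quiet_seconds:
            batches[-1].append(fire)
        else:
            batches.append([fire])
    return batches


def pick_primary(batch: list[Fire]) -> Fire:
    """Step 3: impact > root-cause indicator > broadest scope > earliest fire."""
    return min(batch, key=lambda f: (-f.affected_users,
                                     not f.root_cause_indicator,
                                     -f.sites_affected,
                                     f.fired_at))


def respond(batch: list[Fire]) -> dict[str, str]:
    """Step 4: full triage on the primary, short ack on everything else."""
    primary = pick_primary(batch)
    return {f.alert_id: "full triage" if f is primary
            else f"ack, see {primary.alert_id}" for f in batch}
```

The tuple sort key encodes the ranking order directly, so a tie at one level falls through to the next. If every level ties, the earliest fire wins, which matches the "no clear primary" edge case.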

Edge cases

Late arrival	Outside quiet window - check correlation with recent batch, ack if related.
No clear primary	Pick earliest fire, note “multiple alerts of equal severity”.
Primary false, secondary real	Note this explicitly so the real one isn’t buried.

Correlation types

Signal	Type	Example
Same site, same hour, different alerts	Sibling cascade	A1 + A3 + A4 → phk1
Same alert, multiple sites	Cross-site	A1-phk1 + A1-bdk1 + A1-mxkzg1
Same alert, same site, repeated	Persistent	A1-phk1 @ 15:00, 16:00, 17:00
Different journeys, same site	Shared upstream	Deposit A1 + registration alert · phk1
No relationship	Independent	Triage separately
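The correlation table can be read as a pairwise classifier. The sketch below is one possible encoding, not the spec's algorithm: the `Fire` fields, the `journey` attribute, and the rule order are assumptions. It also assumes that a sibling cascade involves alerts from the same journey, which the table implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fire:
    alert_type: str   # e.g. "A1"
    site: str         # e.g. "phk1"
    journey: str      # e.g. "deposit" (hypothetical field)
    hour: int         # hour bucket of the fire, 0-23


def correlate(a: Fire, b: Fire) -> str:
    """Classify the relationship between two fires per the correlation table."""
    same_alert = a.alert_type == b.alert_type
    same_site = a.site == b.site
    if same_alert and same_site:
        return "persistent"        # same alert, same site, repeated
    if same_alert:
        return "cross_site"        # same alert, multiple sites
    if same_site and a.journey != b.journey:
        return "shared_upstream"   # different journeys, same site
    if same_site and a.hour == b.hour:
        return "sibling_cascade"   # same site, same hour, different alerts
    return "independent"           # no relationship: triage separately
```

Anything that matches no rule falls through to "independent", which keeps the grouping conservative: an unrecognized pair gets two full triages rather than one wrongly merged story.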

Dependencies

Phase 1 format (primary reply uses the narrative template) + a stateful component for tracking recent fires within the window (KV store, Worker state, or lightweight DB).

Phase 4 · Memory & Adaptive Baselines · LATER

Goal: the bot learns from its own history - remembers past verdicts and outcomes, builds context-aware baselines, and distinguishes real anomalies from normal fluctuation.

4A · Outcome memory - journal uses

Past pattern	Reply adjustment
False alarm 3 of last 4 fires	“Noise 3 of its last 4 fires - consider threshold tuning”
Same root cause 3× in 14 days	“3rd vcpay outage in 2 weeks - systemic reliability issue”
Prior incidents self-recovered ~40 min	“Previous vcpay outages self-recovered in ~40 min. Monitor before escalating”
Human marked verdict incorrect	“Note: similar incident Jun 28 was initially misdiagnosed”

Human feedback loop: lightweight - Slack reaction (✓ correct / ✗ incorrect) or a simple command. Not a full ML pipeline.
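Two of the journal patterns above (repeated false alarms, repeated root cause) can be sketched as a lookup over past records. The record shape is assumed for illustration: a `fired_at` timestamp and a `"false_alarm"` verdict value are not defined in the spec, and the message wording is only modeled on the examples in the table.

```python
from datetime import datetime, timedelta


def history_notes(journal: list[dict], current: dict) -> list[str]:
    """Derive reply adjustments for `current` from earlier journal records.

    Each record is assumed to carry alert_type, site, verdict,
    root_cause_category and a datetime under "fired_at".
    """
    prior = sorted(
        (r for r in journal
         if r["alert_type"] == current["alert_type"] and r["site"] == current["site"]),
        key=lambda r: r["fired_at"],
    )
    notes = []

    # Pattern: false alarm in at least 3 of the last 4 fires.
    last_four = prior[-4:]
    false_alarms = sum(r["verdict"] == "false_alarm" for r in last_four)
    if len(last_four) == 4 and false_alarms >= 3:
        notes.append(f"Noise {false_alarms} of its last 4 fires - consider threshold tuning")

    # Pattern: same root-cause category 3 or more times in 14 days,
    # counting the current fire.
    cutoff = current["fired_at"] - timedelta(days=14)
    repeats = [r for r in prior
               if r["fired_at"] >= cutoff
               and r["root_cause_category"] == current["root_cause_category"]]
    if len(repeats) + 1 >= 3:
        notes.append(f"Occurrence {len(repeats) + 1} of "
                     f"{current['root_cause_category']} in 14 days - "
                     "possible systemic issue")
    return notes
```

The notes are plain strings so they can be appended to the narrative verdict without changing its four-bullet shape.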

4B · Adaptive baselines

Type	What it captures
Day-of-week	Sunday dips, Monday spikes, payday effects (4-week lookback)
Campaign-aware	Secret Code campaigns depress rates 2-3pp for 24-48h
Post-deploy grace	Widen expected range for 2h after deploys
Site-specific	phk1 ~57%, mxkzg1 ~35%, bdk1 ~35% - formalize per-site
Trend-adjusted	30-day linear trend - flag break vs. within-trend

Verdict journal record (logged after each triage)

{
  "fire_id": "...",
  "alert_type": "A1",
  "site": "phk1",
  "verdict": "real",
  "confidence": "high",
  "root_cause": "vcpay-ph-native outage",
  "root_cause_category": "provider_outage",
  "action_taken": "escalated to payments",
  "resolved": true,
  "resolution_time_minutes": 73,
  "was_correct": null
}

was_correct stays null until the human feedback loop sets it.

Storage: verdict journal as monthly JSON files or an internal DB table (~300 records/month at 10 fires/day).

Baseline cache: precomputed daily by a scheduled job, read at triage time.

Dependencies: Phase 2 data improves quality but isn’t required; journal accumulation should start in Phase 1.
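For the monthly-JSON-file storage option, a minimal journal writer might look like the sketch below. The file naming scheme (`verdicts-YYYY-MM.json`), the function names, and the choice of one JSON array per file are assumptions. At roughly 300 records a month, rewriting the whole file on each append is cheap, but the read-modify-write is not safe under concurrent writers.

```python
import json
from pathlib import Path


def journal_path(journal_dir: Path, month: str) -> Path:
    # Hypothetical naming scheme: one file per month, e.g. verdicts-2026-06.json
    return journal_dir / f"verdicts-{month}.json"


def load_month(journal_dir: Path, month: str) -> list[dict]:
    """Return the month's records, or an empty list if no file exists yet."""
    path = journal_path(journal_dir, month)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else []


def append_verdict(journal_dir: Path, month: str, record: dict) -> None:
    """Append one triage record to the month's journal file."""
    records = load_month(journal_dir, month)
    records.append(record)
    journal_path(journal_dir, month).write_text(
        json.dumps(records, indent=2), encoding="utf-8")


def set_feedback(journal_dir: Path, month: str, fire_id: str,
                 was_correct: bool) -> bool:
    """Fill in was_correct once a human reacts; False if fire_id is unknown."""
    records = load_month(journal_dir, month)
    found = False
    for record in records:
        if record["fire_id"] == fire_id:
            record["was_correct"] = was_correct
            found = True
    if found:
        journal_path(journal_dir, month).write_text(
            json.dumps(records, indent=2), encoding="utf-8")
    return found
```

If the bot runs as more than one worker, the internal DB table option avoids the concurrency problem entirely.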

Phase 5 · Smart Routing Layer · LATER

Goal: auto-@mention the right team or person when the bot is confident in its diagnosis - closing the gap between “here’s what happened” and “someone is now looking at it.” Routing depends on the diagnosis, not the alert name.

5A · Ownership registry - maps root-cause categories to owners

{
  "provider_outage": {
    "team": "payments",
    "primary": "@ace",
    "secondary": "@nora",
    "channel": "#kz-ops-support",
    "escalation_window": "15min"
  }
}

5B · Confidence gate

Confidence	Verdict	Behavior
High	Real	Auto @mention primary
High	False alarm	No mention
Medium	Real	Plain team name · needs review
Low	Any	No mention · manual review

5C · Escalation chains: no response within the window → @mention secondary → post to escalation channel. Requires tracking thread activity (reactions, replies).
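The 5B confidence gate, combined with the kill switch and dry-run safeguards, could be sketched as a single decision function. The function name and the returned behavior labels are hypothetical. Two behaviors are assumptions the spec does not cover: falling back to a plain team name when the kill switch is on, and sending any combination missing from the table (for example Medium + False alarm) to manual review.

```python
def routing_behavior(confidence: str, verdict: str, *,
                     kill_switch: bool = False, dry_run: bool = False) -> str:
    """Map (confidence, verdict) to a mention behavior per the 5B table."""
    confidence, verdict = confidence.lower(), verdict.lower()
    if verdict == "real" and confidence == "high":
        # The only path that can auto-mention, so safeguards apply here.
        if kill_switch:
            return "plain_team_name"  # assumption: fall back to plain text
        if dry_run:
            return "plain_team_name_would_have_tagged"
        return "mention_primary"
    if verdict == "real" and confidence == "medium":
        return "plain_team_name_needs_review"
    if confidence == "high":
        return "no_mention"           # high-confidence false alarm
    # Low confidence, or any combination the table does not list.
    return "no_mention_manual_review"
```

Checking the kill switch before dry-run means the switch always wins, which matches its role as the instant off control.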

Safeguards

Kill switch	Single flag to disable all auto-mentions instantly.
Dry-run	First 2 weeks: plain-text team names with a “would have auto-tagged” note.
Override registry	Per alert + site + time-of-day.
Out-of-office awareness	Skip primary if marked away.
Hard dependency	Phases 1-4 stable and trusted - auto-tagging a wrong diagnosis is worse than no tag at all.

03

Dependency graph

What each phase needs before it can ship - and what it unblocks downstream.

PHASE 1

Reply Format

Foundation. Independent of backend. Provides the narrative template every later phase reuses.

PHASE 2

Context Enrichment

Needs backend data endpoints. Improves - but doesn’t gate - Phase 4 baseline quality.

PHASE 3

Multi-fire

Needs Phase 1 template + a stateful store for recent fires. Independent of Phase 2.

PHASE 4

Memory & Baselines

Needs verdict journal (start logging in P1) + a scheduled baseline job. Can begin PostHog-only.

PHASE 5

Smart Routing

Hard gate: Phases 1-4 stable & trusted. Needs ownership registry + Slack @mention scopes.

Legend: sequential build order (recommended) · soft input - improves quality, not required to ship

Phases 1 and 3 form an independent fast track - neither needs backend data, so they can ship while Phase 2 endpoints are still in flight. Phase 5 is deliberately last: auto-tagging on an inaccurate diagnosis erodes trust faster than no tag at all, so it waits until the diagnostic layers below it are proven.

04

Cross-cutting: start early

Three actions that begin in Phase 1 and quietly accumulate value for later phases.

Action (begins Phase 1)	Feeds	How
Log verdict records	→ Phase 4 memory	Start from the first triage under the new format - the journal only has value once it has history.
Track fire timestamps	→ Phase 3 correlation	Log every fire with alert type, site, and time, so the correlation engine has a window to look back over.
Document ownership	→ Phase 5 registry	As you learn who owns what during Phases 1-2, write it down in a structured format ready for routing.

05

Risk register

Known risks and the mitigation already designed into the plan.

Risk	Impact	Mitigation
Backend data endpoints delayed	P2 BLOCKED	Phases 1 + 3 are independent of backend; Phase 4 baselines can start with PostHog-only data.
Correlation logic groups unrelated alerts	WRONG CONTEXT	Conservative grouping rules + dry-run period. Independent alerts always get full triage.
Auto-tagging on wrong diagnosis	TRUST ERODES	Confidence gate + 2-week dry-run + kill switch. Ms. Giang’s core concern - gated hardest.
Ownership registry goes stale	WRONG PERSON	Monthly review cadence + override mechanism.
Prompt too long for all alert types	COST · LATENCY	Modular prompt architecture - shared template + per-journey query modules.

06

Success criteria · per phase

These are the bars each layer must clear.

Phase 1 - Reply Format

Any reader understands what / why / what-to-do in under 10 seconds.

Engineers access full technical detail via the evidence thread.

Team confirms the new format is an improvement.

Each Message 1 stays at 4 bullets with 1-2 numbers per bullet - no wall of metrics.

Phase 2 - Context Enrichment

Correctly identifies external cause (deploy / config / outage) in >80% of real incidents.

False-alarm rate drops - bot distinguishes ingestion lag from real outages.

Reply credibility increases - diagnosis cites transaction-level evidence.

When internal DB contradicts PostHog, the reply uses the DB source of truth.

Phase 3 - Multi-fire

Channel noise drops 60-70% during multi-fire incidents.

One coherent story per incident, not N fragmented reports.

Every alert gets at least an acknowledgment.

Independent alerts still get full triage - no false grouping.

Primary alert is correctly ranked (highest user impact) in the multi-fire incidents reviewed.

Phase 4 - Memory & Baselines

False-alarm rate drops 40%+.

Recurring incidents flagged as "Day N" with prior dates & outcomes.

Campaign / weekend / post-deploy effects correctly accounted for.

Human feedback loop lightweight enough to actually get used.

Verdict journal accumulates a usable history (started logging in Phase 1).

Phase 5 - Smart Routing

Time from alert fire to right person acknowledging drops 50%+.

Zero complaints about being tagged on false alarms.

Ownership registry stays accurate (monthly review cadence).

2-week dry-run completed with kill switch verified before any live auto-mention.