Chapter 4 – How NOT to Test Sentinel — and the Exact Tests to Add Today

Hook:

  • If detections were sprinklers, would you assume they work… without ever pulling the test lever?
  • If logs were ingredients, would you bake a cake with half the labels missing and hope it rises?

This is your practical checklist for turning noisy, brittle rules into a trustworthy detection system.


Why It’s Needed (Context)

Most Sentinel rollouts fail quietly—not because detections are wrong, but because tests don’t exist. The result: use-cases that never fire, malformed logs, slow KQL (Kusto Query Language) queries, no attack replay, and alert queues that either flood analysts or go silent. In other words: assumptions > evidence. We’ll flip that.


Core Concepts Explained Simply

We’ll stick to one analogy: kitchen & recipe (ingredients = logs, recipe = KQL rule, oven = pipeline/latency, taste test = simulation).

1) Use-Case Validation

  • Technical Definition: Prove each analytic maps to a clear objective, required signals, ATT&CK technique, and expected alert outcome.
  • Everyday Example: Check the recipe actually makes cake (not bread) and yields 8 slices.
  • Technical Example: “Detect risky OAuth consent” needs Entra audit + consent events; trigger both benign and malicious grants and confirm alert fields, severity, and entities.

2) Log Validation (Format, Fields, Completeness)

  • Technical Definition: Verify incoming events conform to schema (fields, types, timestamps), parsing, time skew, and completeness.
  • Everyday Example: Make sure the ingredients are labeled, fresh, and the right quantity.
  • Technical Example: Validate TimeGenerated, UserPrincipalName, AppId exist and parse; reject or quarantine events missing required fields; track % malformed per source.
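The malformed-rate check above can be sketched in KQL. This assumes Entra sign-in data in the standard SigninLogs table; adapt the required-field list per source.

```kql
// Sketch: share of events per source missing required fields (last 24 h).
// Required fields here (UserPrincipalName, AppId) are illustrative.
SigninLogs
| where TimeGenerated > ago(24h)
| extend Malformed = isempty(UserPrincipalName) or isempty(AppId)
| summarize Total = count(), MalformedCount = countif(Malformed) by Type
| extend MalformedPct = round(100.0 * MalformedCount / Total, 2)
```

Run this per source table and alert when MalformedPct crosses your budget (e.g., 0.5%).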

3) Log Coverage (Expected Sources Present)

  • Technical Definition: Confirm every required log type (e.g., identity, endpoint, SaaS, IaaS) is actually arriving for the target scope.
  • Everyday Example: Ensure you actually bought eggs, flour, sugar—not just sugar.
  • Technical Example: Coverage matrix for subscriptions/tenants: M365 audit ✅, Entra sign-in ✅, Endpoint EDR ❌ → detection blocked until fixed.
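A coverage gate like this can be sketched as a single KQL check that flags required tables with no recent data. The table list is illustrative; substitute your own required sources.

```kql
// Sketch: flag required log tables that stopped arriving in the last 24 h.
let RequiredTables = datatable(SourceTable: string)
  ["SigninLogs", "AuditLogs", "OfficeActivity", "DeviceEvents"];
let Arriving = union isfuzzy=true withsource=SourceTable
    SigninLogs, AuditLogs, OfficeActivity, DeviceEvents
  | where TimeGenerated > ago(24h)
  | summarize LastEvent = max(TimeGenerated) by SourceTable;
RequiredTables
| join kind=leftouter Arriving on SourceTable
| extend Status = iff(isnull(LastEvent), "MISSING", "OK")
| project SourceTable, Status, LastEvent
```

isfuzzy=true keeps the query alive even if one of the listed tables doesn’t exist yet—exactly the gap you want surfaced as MISSING, not hidden by a query error.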

4) KQL Performance

  • Technical Definition: Measure query runtime, memory, and stability at 1×–5× data volume; optimize with filters, summarize, arg_max, materialized views.
  • Everyday Example: Preheat the oven and time the bake.
  • Technical Example: Replace 30-day cross-joins with pre-aggregations; keep rule runtime P95 < rule schedule interval/2.
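The “pre-aggregate instead of cross-join” idea can be sketched as follows; columns and filters are illustrative.

```kql
// Sketch: narrow the time scope first, then reduce to one latest row per
// user with arg_max instead of joining full 30-day tables.
SigninLogs
| where TimeGenerated > ago(1h)                 // time filter first
| where ResultType == "0"                       // successful sign-ins only
| summarize arg_max(TimeGenerated, IPAddress, AppDisplayName)
    by UserPrincipalName
```

Filtering on time before anything else lets the engine prune data early; arg_max collapses the working set before any downstream join.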

5) Attack Simulation / Replay

  • Technical Definition: Execute synthetic techniques or replay sanitized incident payloads to validate end-to-end detection & response.
  • Everyday Example: Taste test before serving.
  • Technical Example: Atomic test for token theft + replay of real OAuth abuse JSON; verify alert, incident, and playbook actions.
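After a replay run, the end-to-end check can be sketched as a query against Sentinel’s SecurityAlert table; the alert name here is illustrative.

```kql
// Sketch: confirm the expected alert fired within the replay window.
SecurityAlert
| where TimeGenerated > ago(1h)
| where AlertName == "Risky OAuth consent grant"   // hypothetical rule name
| project TimeGenerated, AlertName, AlertSeverity, Entities
```

Zero rows after a replay means the pipeline regressed somewhere between ingest and rule—exactly what the monthly replay set is there to catch.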

6) Volume & Latency

  • Technical Definition: Stress ingest and measure end-to-end time: event → ingestion → rule → alert → automation.
  • Everyday Example: Can the oven handle two trays at once without undercooking?
  • Technical Example: Track SLIs (Service Level Indicators): data lag, rule runtime, alert creation delay; set SLOs (Service Level Objectives) like “P95 alert latency < 3 min”.
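The data-lag SLI can be sketched with Kusto’s ingestion_time() function, which records when each row actually landed in the workspace.

```kql
// Sketch: P95 ingestion lag per hour — delay between event time
// (TimeGenerated) and arrival in the workspace.
SigninLogs
| where TimeGenerated > ago(1d)
| extend LagSeconds = datetime_diff("second", ingestion_time(), TimeGenerated)
| summarize P95LagSeconds = percentile(LagSeconds, 95) by bin(TimeGenerated, 1h)
```

Chart this per table and alert when P95 lag threatens your alert-latency SLO.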

7) False Positives (FP) Review

  • Technical Definition: Quantify precision/recall, label outcomes, tune thresholds and allow/deny lists.
  • Everyday Example: If every dish tastes “too salty,” your measuring spoon is wrong.
  • Technical Example: Weekly FP board: rule, reason, proposed tuning; ship suppressions with expiration + owner.
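Expiring suppressions can be sketched with a Sentinel watchlist. This assumes a hypothetical watchlist named "FP_Suppressions" with UPN, Owner, and ExpiresOn columns.

```kql
// Sketch: apply only suppressions that have not expired, so mutes
// cannot outlive their justification.
let ActiveSuppressions = _GetWatchlist("FP_Suppressions")
  | where todatetime(ExpiresOn) > now()
  | project SuppressedUser = tostring(UPN);
SigninLogs
| where TimeGenerated > ago(1h)
| where UserPrincipalName !in (ActiveSuppressions)
// ...rule logic continues on the unsuppressed rows...
```

Because expiry lives in data rather than in the rule, a lapsed suppression re-surfaces automatically instead of silently muting forever.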

8) Alert Volume Health

  • Technical Definition: Balance alert count with analyst capacity; enforce budgets and auto-triage.
  • Everyday Example: One chef can’t plate 300 orders in 10 minutes.
  • Technical Example: If daily alerts > (analysts × handling rate), route low-sev to batched review; auto-close stale low-value patterns with audit trail.
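The budget check can be sketched directly against the SecurityAlert table; the capacity numbers are illustrative.

```kql
// Sketch: daily alert count vs. an analyst-capacity budget.
let DailyBudget = 5 * 40;   // 5 analysts x 40 alerts/day (illustrative)
SecurityAlert
| where TimeGenerated > ago(1d)
| summarize Alerts = count()
| extend OverBudget = Alerts > DailyBudget
```

When OverBudget flips to true, that is the trigger for batched low-severity review rather than silent queue growth.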

Real-World Case Study

Failure — “The Silent Rule”

  • Situation: A team wrote a beautiful KQL detection for lateral movement but never did log coverage checks. Endpoint EDR wasn’t connected in one region.
  • Impact: Attack in that region generated zero alerts; discovery took days.
  • Lesson: No logs → no detection. Coverage gates before rule deployment.

Success — “Replay Saved the Release”

  • Situation: Another team kept a monthly replay set of sanitized OAuth abuse logs. A parser update broke AppId extraction; replay caught it within an hour.
  • Impact: Hotfix shipped same day; production detections never regressed.
  • Lesson: Known-bad payloads are your smoke tests; run them routinely.

Action Framework — Prevent → Detect → Respond

Prevent (build the right scaffolding)

  • Detection Charter per use-case: Objective, signals, ATT&CK mapping, owner, expected volume.
  • Data Gates: Ingest → Parse → Schema validate → Coverage check (fail closed on missing critical fields).
  • KQL Guardrails: Time-scoped filters first; pre-aggregate hot paths; avoid broad cross-joins.
  • SLOs: P95 rule runtime, P95 alert latency, % malformed < 0.5%.

Detect (prove it continuously)

  • Unit Tests for KQL: Given sample rows → expected rows (pass/fail).
  • Integration Tests: Ingest sample → rule fires → incident fields populated (entity, severity, tactic).
  • Replay Library: Keep sanitized JSON/CSV/PCAP from real incidents, tagged by ATT&CK.
  • Load Tests: 1×/2×/5× peak; record lag and schedule drift.
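A KQL unit test from the list above can be sketched with an inline datatable: run the rule logic over fixed sample rows and check the match count. Values and the consent-type field are illustrative.

```kql
// Sketch: "given sample rows -> expected rows" for a consent-abuse rule.
let SampleRows = datatable(TimeGenerated: datetime, UserPrincipalName: string, ConsentType: string)
[
    datetime(2024-01-01 10:00), "alice@contoso.com", "AdminConsent",  // benign
    datetime(2024-01-01 10:05), "bob@contoso.com",   "UserConsent"    // should match
];
SampleRows
| where ConsentType == "UserConsent"   // rule logic under test
| count
// Expect exactly 1 row; any other count fails the test.
```

Because the datatable is embedded, the test runs identically in CI and in the portal, with no dependency on live ingest.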

Respond (close the loop fast)

  • Playbook Tests: Enrichment, assignment, ticket creation; alert on playbook failure.
  • Weekly FP/Tuning: Track precision; suppress with expiry; re-run replay after tuning.
  • Queue Health: Alert budgets per tier; overflow routing; executive dashboard on MTTD/MTTR (Mean Time To Detect/Respond).
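The MTTD/MTTR numbers for that dashboard can be sketched from Sentinel’s SecurityIncident table.

```kql
// Sketch: average MTTD and MTTR over the last 7 days.
SecurityIncident
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber   // latest state per incident
| extend MTTDMinutes = datetime_diff("minute", CreatedTime, FirstActivityTime)
| extend MTTRMinutes = datetime_diff("minute", ClosedTime, CreatedTime)   // null while open
| summarize AvgMTTDMinutes = avg(MTTDMinutes), AvgMTTRMinutes = avg(MTTRMinutes)
```

Open incidents have no ClosedTime, so they drop out of the MTTR average automatically while still counting toward MTTD.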

ASCII Pipeline (where to measure)

[Source] -> [Ingest] -> [Parse/Normalize] -> [KQL Rule] -> [Alert] -> [Playbook] -> [Ticket]
   SLI:lag     SLI:drop%      SLI:schema ok        SLI:runtime    SLI:create    SLI:exec     SLI:ack

Key Differences to Keep in Mind

  1. Validation vs. Enablement — Turning on rules ≠ proving they catch your scenario. Example: OAuth abuse rule enabled, but consent events never ingested.
  2. Correctness vs. Timeliness — Accurate but late alerts still lose. Example: 120-second query on a 60-second schedule.
  3. Format vs. Coverage — Perfectly parsed logs from some sources aren’t enough. Example: No EDR in Region A → blind spot.
  4. Suppression vs. Tuning — Blanket mutes hide real attacks. Example: Global VPN ASN allow-list masks exfil via consumer VPNs.
  5. One-off Tests vs. Continuous Replay — Parsers change; proofs must repeat. Example: Monthly replay catches field regressions early.

Summary Table

| Concept | Definition | Everyday Example | Technical Example |
|---|---|---|---|
| Use-Case Validation | Prove rule matches a real scenario & outcome | Recipe yields cake, 8 slices | Trigger benign & malicious OAuth consent, verify alert details |
| Log Validation | Schema, fields, time, completeness | Ingredients labeled & fresh | Enforce TimeGenerated, entity fields; measure % malformed |
| Log Coverage | Required sources actually arrive | You bought eggs, flour, sugar | Coverage matrix across tenants/regions; block deploy on gaps |
| KQL Performance | Runtime & efficiency under load | Oven preheated & timed | P95 runtime < schedule/2; use summarize, materialized views |
| Attack Simulation/Replay | Synthetic or real payloads end-to-end | Taste test before serving | Atomic tests + replay of sanitized incident logs |
| Volume & Latency | E2E timing at 1×–5× | Two trays in oven still bake | Track lag, schedule drift, alert creation delay |
| False Positives Check | Measure precision; tune safely | Fix salty measuring spoon | Weekly FP board; expiring suppressions with owner |
| Alert Volume Health | Match alerts to capacity | One chef vs. 300 plates | Budgets, batching, auto-triage for low-sev |

What’s Next

Up next: “From Hypothesis to High-Fidelity: Designing one Sentinel detection with a test suite, replay pack, and SLOs.” We’ll build one end-to-end and publish the exact checklist.


🌞 The Last Sun Rays…

Answering the hooks:

  • Sprinklers without the test lever? Run replay and integration tests.
  • Half-labeled ingredients? Enforce schema + coverage gates before rules.

Your 30-minute win for tomorrow:

  1. Pick one high-value rule.
  2. Add a coverage gate (all required tables present).
  3. Add a replay test with a sanitized payload.
  4. Record P95 alert latency after one day.

Reflection: If you could show leadership just one metric next week, would it be precision, E2E latency, or queue health—and what decision will it unlock?
