Chapter 4 – How NOT to Test Sentinel — and the Exact Tests to Add Today

Hook:

  • If detections were sprinklers, would you assume they work… without ever pulling the test lever?
  • If logs were ingredients, would you bake a cake with half the labels missing and hope it rises?

This is your practical checklist for turning noisy, brittle rules into a trustworthy detection system.


Why It’s Needed (Context)

Most Sentinel rollouts fail quietly—not because detections are wrong, but because tests don’t exist. The result: use-cases that never fire, malformed logs, slow KQL (Kusto Query Language) queries, no attack replay, and alert queues that either flood analysts or go silent. In other words: assumptions > evidence. We’ll flip that.


Core Concepts Explained Simply

We’ll stick to one analogy: kitchen & recipe (ingredients = logs, recipe = KQL rule, oven = pipeline/latency, taste test = simulation).

1) Use-Case Validation

  • Technical Definition: Prove each analytic maps to a clear objective, required signals, ATT&CK technique, and expected alert outcome.
  • Everyday Example: Check the recipe actually makes cake (not bread) and yields 8 slices.
  • Technical Example: “Detect risky OAuth consent” needs Entra audit + consent events; trigger both benign and malicious grants and confirm alert fields, severity, and entities.

2) Log Validation (Format, Fields, Completeness)

  • Technical Definition: Verify incoming events conform to schema (fields, types, timestamps), parsing, time skew, and completeness.
  • Everyday Example: Make sure the ingredients are labeled, fresh, and the right quantity.
  • Technical Example: Validate TimeGenerated, UserPrincipalName, AppId exist and parse; reject or quarantine events missing required fields; track % malformed per source.
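The malformed-rate check above can be sketched in KQL. This assumes Entra sign-in data in the standard SigninLogs table; adapt the required-field list per source.

```kql
// Sketch: share of events per source missing required fields (last 24 h).
// Required fields here (UserPrincipalName, AppId) are illustrative.
SigninLogs
| where TimeGenerated > ago(24h)
| extend Malformed = isempty(UserPrincipalName) or isempty(AppId)
| summarize Total = count(), MalformedCount = countif(Malformed) by Type
| extend MalformedPct = round(100.0 * MalformedCount / Total, 2)
```

Run this per source table and alert when MalformedPct crosses your budget (e.g., 0.5%).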

3) Log Coverage (Expected Sources Present)

  • Technical Definition: Confirm every required log type (e.g., identity, endpoint, SaaS, IaaS) is actually arriving for the target scope.
  • Everyday Example: Ensure you actually bought eggs, flour, sugar—not just sugar.
  • Technical Example: Coverage matrix for subscriptions/tenants: M365 audit ✅, Entra sign-in ✅, Endpoint EDR ❌ → detection blocked until fixed.
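A coverage gate like this can be sketched as a single KQL check that flags required tables with no recent data. The table list is illustrative; substitute your own required sources.

```kql
// Sketch: flag required log tables that stopped arriving in the last 24 h.
let RequiredTables = datatable(SourceTable: string)
  ["SigninLogs", "AuditLogs", "OfficeActivity", "DeviceEvents"];
let Arriving = union isfuzzy=true withsource=SourceTable
    SigninLogs, AuditLogs, OfficeActivity, DeviceEvents
  | where TimeGenerated > ago(24h)
  | summarize LastEvent = max(TimeGenerated) by SourceTable;
RequiredTables
| join kind=leftouter Arriving on SourceTable
| extend Status = iff(isnull(LastEvent), "MISSING", "OK")
| project SourceTable, Status, LastEvent
```

isfuzzy=true keeps the query alive even if one of the listed tables doesn’t exist yet—exactly the gap you want surfaced as MISSING, not hidden by a query error.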

4) KQL Performance

  • Technical Definition: Measure query runtime, memory, and stability at 1×–5× data volume; optimize with filters, summarize, arg_max, materialized views.
  • Everyday Example: Preheat the oven and time the bake.
  • Technical Example: Replace 30-day cross-joins with pre-aggregations; keep rule runtime P95 < rule schedule interval/2.
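The “pre-aggregate instead of cross-join” idea can be sketched as follows; columns and filters are illustrative.

```kql
// Sketch: narrow the time scope first, then reduce to one latest row per
// user with arg_max instead of joining full 30-day tables.
SigninLogs
| where TimeGenerated > ago(1h)                 // time filter first
| where ResultType == "0"                       // successful sign-ins only
| summarize arg_max(TimeGenerated, IPAddress, AppDisplayName)
    by UserPrincipalName
```

Filtering on time before anything else lets the engine prune data early; arg_max collapses the working set before any downstream join.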

5) Attack Simulation / Replay

  • Technical Definition: Execute synthetic techniques or replay sanitized incident payloads to validate end-to-end detection & response.
  • Everyday Example: Taste test before serving.
  • Technical Example: Atomic test for token theft + replay of real OAuth abuse JSON; verify alert, incident, and playbook actions.
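After a replay run, the end-to-end check can be sketched as a query against Sentinel’s SecurityAlert table; the alert name here is illustrative.

```kql
// Sketch: confirm the expected alert fired within the replay window.
SecurityAlert
| where TimeGenerated > ago(1h)
| where AlertName == "Risky OAuth consent grant"   // hypothetical rule name
| project TimeGenerated, AlertName, AlertSeverity, Entities
```

Zero rows after a replay means the pipeline regressed somewhere between ingest and rule—exactly what the monthly replay set is there to catch.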

6) Volume & Latency

  • Technical Definition: Stress ingest and measure end-to-end time: event → ingestion → rule → alert → automation.
  • Everyday Example: Can the oven handle two trays at once without undercooking?
  • Technical Example: Track SLIs (Service Level Indicators): data lag, rule runtime, alert creation delay; set SLOs (Service Level Objectives) like “P95 alert latency < 3 min”.
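The data-lag SLI can be sketched with Kusto’s ingestion_time() function, which records when each row actually landed in the workspace.

```kql
// Sketch: P95 ingestion lag per hour — delay between event time
// (TimeGenerated) and arrival in the workspace.
SigninLogs
| where TimeGenerated > ago(1d)
| extend LagSeconds = datetime_diff("second", ingestion_time(), TimeGenerated)
| summarize P95LagSeconds = percentile(LagSeconds, 95) by bin(TimeGenerated, 1h)
```

Chart this per table and alert when P95 lag threatens your alert-latency SLO.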

7) False Positives (FP) Review

  • Technical Definition: Quantify precision/recall, label outcomes, tune thresholds and allow/deny lists.
  • Everyday Example: If every dish tastes “too salty,” your measuring spoon is wrong.
  • Technical Example: Weekly FP board: rule, reason, proposed tuning; ship suppressions with expiration + owner.
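Expiring suppressions can be sketched with a Sentinel watchlist. This assumes a hypothetical watchlist named "FP_Suppressions" with UPN, Owner, and ExpiresOn columns.

```kql
// Sketch: apply only suppressions that have not expired, so mutes
// cannot outlive their justification.
let ActiveSuppressions = _GetWatchlist("FP_Suppressions")
  | where todatetime(ExpiresOn) > now()
  | project SuppressedUser = tostring(UPN);
SigninLogs
| where TimeGenerated > ago(1h)
| where UserPrincipalName !in (ActiveSuppressions)
// ...rule logic continues on the unsuppressed rows...
```

Because expiry lives in data rather than in the rule, a lapsed suppression re-surfaces automatically instead of silently muting forever.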

8) Alert Volume Health

  • Technical Definition: Balance alert count with analyst capacity; enforce budgets and auto-triage.
  • Everyday Example: One chef can’t plate 300 orders in 10 minutes.
  • Technical Example: If daily alerts > (analysts × handling rate), route low-sev to batched review; auto-close stale low-value patterns with audit trail.
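The budget check can be sketched directly against the SecurityAlert table; the capacity numbers are illustrative.

```kql
// Sketch: daily alert count vs. an analyst-capacity budget.
let DailyBudget = 5 * 40;   // 5 analysts x 40 alerts/day (illustrative)
SecurityAlert
| where TimeGenerated > ago(1d)
| summarize Alerts = count()
| extend OverBudget = Alerts > DailyBudget
```

When OverBudget flips to true, that is the trigger for batched low-severity review rather than silent queue growth.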

Real-World Case Study

Failure — “The Silent Rule”

  • Situation: A team wrote a beautiful KQL detection for lateral movement but never did log coverage checks. Endpoint EDR wasn’t connected in one region.
  • Impact: Attack in that region generated zero alerts; discovery took days.
  • Lesson: No logs → no detection. Coverage gates before rule deployment.

Success — “Replay Saved the Release”

  • Situation: Another team kept a monthly replay set of sanitized OAuth abuse logs. A parser update broke AppId extraction; replay caught it within an hour.
  • Impact: Hotfix shipped same day; production detections never regressed.
  • Lesson: Known-bad payloads are your smoke tests; run them routinely.

Action Framework — Prevent → Detect → Respond

Prevent (build the right scaffolding)

  • Detection Charter per use-case: Objective, signals, ATT&CK mapping, owner, expected volume.
  • Data Gates: Ingest → Parse → Schema validate → Coverage check (fail closed on missing critical fields).
  • KQL Guardrails: Time-scoped filters first; pre-aggregate hot paths; avoid broad cross-joins.
  • SLOs: P95 rule runtime, P95 alert latency, % malformed < 0.5%.

Detect (prove it continuously)

  • Unit Tests for KQL: Given sample rows → expected rows (pass/fail).
  • Integration Tests: Ingest sample → rule fires → incident fields populated (entity, severity, tactic).
  • Replay Library: Keep sanitized JSON/CSV/PCAP from real incidents, tagged by ATT&CK.
  • Load Tests: 1×/2×/5× peak; record lag and schedule drift.
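A KQL unit test from the list above can be sketched with an inline datatable: run the rule logic over fixed sample rows and check the match count. Values and the consent-type field are illustrative.

```kql
// Sketch: "given sample rows -> expected rows" for a consent-abuse rule.
let SampleRows = datatable(TimeGenerated: datetime, UserPrincipalName: string, ConsentType: string)
[
    datetime(2024-01-01 10:00), "alice@contoso.com", "AdminConsent",  // benign
    datetime(2024-01-01 10:05), "bob@contoso.com",   "UserConsent"    // should match
];
SampleRows
| where ConsentType == "UserConsent"   // rule logic under test
| count
// Expect exactly 1 row; any other count fails the test.
```

Because the datatable is embedded, the test runs identically in CI and in the portal, with no dependency on live ingest.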

Respond (close the loop fast)

  • Playbook Tests: Enrichment, assignment, ticket creation; alert on playbook failure.
  • Weekly FP/Tuning: Track precision; suppress with expiry; re-run replay after tuning.
  • Queue Health: Alert budgets per tier; overflow routing; executive dashboard on MTTD/MTTR (Mean Time To Detect/Respond).
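The MTTD/MTTR numbers for that dashboard can be sketched from Sentinel’s SecurityIncident table.

```kql
// Sketch: average MTTD and MTTR over the last 7 days.
SecurityIncident
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber   // latest state per incident
| extend MTTDMinutes = datetime_diff("minute", CreatedTime, FirstActivityTime)
| extend MTTRMinutes = datetime_diff("minute", ClosedTime, CreatedTime)   // null while open
| summarize AvgMTTDMinutes = avg(MTTDMinutes), AvgMTTRMinutes = avg(MTTRMinutes)
```

Open incidents have no ClosedTime, so they drop out of the MTTR average automatically while still counting toward MTTD.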

ASCII Pipeline (where to measure)

[Source] -> [Ingest] -> [Parse/Normalize] -> [KQL Rule] -> [Alert] -> [Playbook] -> [Ticket]
   SLI:lag     SLI:drop%      SLI:schema ok        SLI:runtime    SLI:create    SLI:exec     SLI:ack

Key Differences to Keep in Mind

  1. Validation vs. Enablement — Turning on rules ≠ proving they catch your scenario. Example: OAuth abuse rule enabled, but consent events never ingested.
  2. Correctness vs. Timeliness — Accurate but late alerts still lose. Example: 120-second query on a 60-second schedule.
  3. Format vs. Coverage — Perfectly parsed logs from some sources aren’t enough. Example: No EDR in Region A → blind spot.
  4. Suppression vs. Tuning — Blanket mutes hide real attacks. Example: Global VPN ASN allow-list masks exfil via consumer VPNs.
  5. One-off Tests vs. Continuous Replay — Parsers change; proofs must repeat. Example: Monthly replay catches field regressions early.

Summary Table

| Concept | Definition | Everyday Example | Technical Example |
|---|---|---|---|
| Use-Case Validation | Prove rule matches a real scenario & outcome | Recipe yields cake, 8 slices | Trigger benign & malicious OAuth consent, verify alert details |
| Log Validation | Schema, fields, time, completeness | Ingredients labeled & fresh | Enforce TimeGenerated, entity fields; measure % malformed |
| Log Coverage | Required sources actually arrive | You bought eggs, flour, sugar | Coverage matrix across tenants/regions; block deploy on gaps |
| KQL Performance | Runtime & efficiency under load | Oven preheated & timed | P95 runtime < schedule/2; use summarize, materialized views |
| Attack Simulation/Replay | Synthetic or real payloads end-to-end | Taste test before serving | Atomic tests + replay of sanitized incident logs |
| Volume & Latency | E2E timing at 1×–5× | Two trays in oven still bake | Track lag, schedule drift, alert creation delay |
| False Positives Check | Measure precision; tune safely | Fix salty measuring spoon | Weekly FP board; expiring suppressions with owner |
| Alert Volume Health | Match alerts to capacity | One chef vs. 300 plates | Budgets, batching, auto-triage for low-sev |

What’s Next

Up next: “From Hypothesis to High-Fidelity: Designing one Sentinel detection with a test suite, replay pack, and SLOs.” We’ll build one end-to-end and publish the exact checklist.


🌞 The Last Sun Rays…

Answering the hooks:

  • Sprinklers without the test lever? Run replay and integration tests.
  • Half-labeled ingredients? Enforce schema + coverage gates before rules.

Your 30-minute win for tomorrow:

  1. Pick one high-value rule.
  2. Add a coverage gate (all required tables present).
  3. Add a replay test with a sanitized payload.
  4. Record P95 alert latency after one day.

Reflection: If you could show leadership just one metric next week, would it be precision, E2E latency, or queue health—and what decision will it unlock?
