How NOT to Design Microsoft Sentinel

Like Building a Fire Station Without a City Map, a Kitchen Without Labels, and a Dashboard With No Gauges

Designing Sentinel the wrong way is basically:

A fire station without addresses (no asset inventory, no ownership).
A kitchen with unlabeled jars (no standard parsing, field mismatch, custom logs chaos).
A cockpit with pretty screens but no instruments (no health metrics, no latency checks, “heartbeat” lies).

This “How NOT to…” series is a reverse blueprint: the anti-patterns that quietly turn Sentinel into an expensive alert-generator that nobody trusts.

2) Why It’s Needed (Context)

Microsoft Sentinel rarely “fails” because KQL (Kusto Query Language) is hard or detections are missing.

It fails because teams skip the boring, foundational work:

You ingest everything without knowing why → costs explode.
You enable use-cases without confirming logs exist → blind detections.
You run SOC operations without governance → chaos, alert noise, and broken workflows.
You migrate like it’s a tool swap → you carry forward legacy pain.
You measure “health” with ingestion volume only → you miss latency, drift, and silent failures.

Sentinel is a security operations platform, not a log dumpster or a rule-count trophy cabinet.

3) Core Concepts Explained Simply

Concept A: Asset Inventory

Technical Definition: A continuously updated list of systems, identities, apps, and data sources you monitor (with criticality and metadata).
Everyday Example: A city can’t plan emergency response if it doesn’t know what buildings exist.
Technical Example: You can’t validate “critical devices are reporting” if you don’t have a baseline list of critical devices + expected log sources per device.
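A minimal sketch of that "critical devices are reporting" check, assuming you can export an inventory with a tier per asset and a last-seen timestamp per device. The field names (`name`, `tier`) and the 24-hour silence window are illustrative assumptions, not Sentinel settings.

```python
from datetime import datetime, timedelta, timezone


def find_silent_assets(inventory, last_seen, now, max_silence=timedelta(hours=24)):
    """Return Tier 0 assets that have not reported within max_silence."""
    silent = []
    for asset in inventory:
        if asset["tier"] != 0:  # this sketch only checks Tier 0 assets
            continue
        seen = last_seen.get(asset["name"])
        if seen is None or now - seen > max_silence:
            silent.append(asset["name"])
    return sorted(silent)


# Hypothetical inventory and last-seen data for illustration only.
now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
inventory = [
    {"name": "dc01", "tier": 0},
    {"name": "dc02", "tier": 0},
    {"name": "print01", "tier": 2},
]
last_seen = {
    "dc01": now - timedelta(minutes=5),
    "print01": now - timedelta(days=3),
}

print(find_silent_assets(inventory, last_seen, now))  # ['dc02']
```

Without the inventory there is no way to notice dc02: a device that never reported leaves no log line to search for.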

Concept B: Ownership Model + RACI

Technical Definition: Clear accountability for data sources, detections, incident workflows, and platform changes (RACI = Responsible, Accountable, Consulted, Informed).
Everyday Example: If a shared kitchen has no owner for cleaning and restocking, it becomes unusable.
Technical Example: Connector breaks (token expiry) → nobody renews credentials → logs stop → detections silently fail.
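One way to keep that token-expiry failure from being silent is a small register of connector credentials with an owner and an expiry date. The sketch below assumes you maintain such a register yourself; the connector names, owners, and the 30-day warning window are hypothetical.

```python
from datetime import date


def expiring_credentials(credentials, today, warn_days=30):
    """Return (connector, owner, days_left) for credentials due within warn_days.

    A negative days_left means the credential has already expired.
    """
    due = []
    for cred in credentials:
        days_left = (cred["expires"] - today).days
        if days_left <= warn_days:
            due.append((cred["connector"], cred["owner"], days_left))
    return sorted(due, key=lambda item: item[2])


# Hypothetical register entries for illustration only.
credentials = [
    {"connector": "firewall-api", "owner": "network-team", "expires": date(2026, 3, 10)},
    {"connector": "saas-audit", "owner": None, "expires": date(2026, 2, 20)},
    {"connector": "edr-export", "owner": "endpoint-team", "expires": date(2026, 12, 1)},
]

for connector, owner, days_left in expiring_credentials(credentials, date(2026, 3, 1)):
    print(connector, owner or "NO OWNER", days_left)
```

The entry with no owner is the one that matters: it is already expired and nobody is accountable for renewing it.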

Concept C: Purpose-Driven Data Collection

Technical Definition: Collect logs based on threat coverage + business risk + detection goals, not “collect everything.”
Everyday Example: Don’t store every grocery receipt for life—store the ones you need for taxes or warranty.
Technical Example: Ingesting verbose debug logs into Analytics when they should be Basic Logs / archive / not collected at all.

Concept D: Log Source Strategy (Connectors, DCR, Retention)

Technical Definition: A plan for how data enters Sentinel (native connectors vs custom), how it’s normalized, and how long it’s kept (retention + archival).
Everyday Example: Mail sorting: letters go to the right address, duplicates are rejected, important documents are filed properly.
Technical Example: Duplicate telemetry from multiple pipelines (e.g., same firewall logs via agent + syslog forwarder) → double ingestion cost + duplicate alerts.
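A rough sketch of how duplicate pipelines can be spotted, assuming you can sample events and tag each with the pipeline it arrived through. The field names (`timestamp`, `device`, `message`, `pipeline`) are assumptions for illustration; real events would need normalizing before comparison.

```python
import hashlib
from collections import defaultdict


def count_duplicated_events(events, key_fields=("timestamp", "device", "message")):
    """Count distinct events that arrived through more than one pipeline."""
    pipelines_by_event = defaultdict(set)
    for event in events:
        raw = "|".join(str(event[field]) for field in key_fields)
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        pipelines_by_event[digest].add(event["pipeline"])
    return sum(1 for pipelines in pipelines_by_event.values() if len(pipelines) > 1)


# Hypothetical sample: the same firewall event arrives via agent and syslog.
events = [
    {"timestamp": "2026-03-01T10:00:00", "device": "fw01", "message": "deny tcp", "pipeline": "agent"},
    {"timestamp": "2026-03-01T10:00:00", "device": "fw01", "message": "deny tcp", "pipeline": "syslog"},
    {"timestamp": "2026-03-01T10:00:05", "device": "fw01", "message": "allow udp", "pipeline": "syslog"},
]

print(count_duplicated_events(events))  # 1
```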

Concept E: Detection Engineering (Use-cases + KQL Hygiene)

Technical Definition: Detections mapped to attacker behavior, validated against available logs, tuned for signal, and grouped into meaningful incidents.
Everyday Example: A smoke alarm that triggers on toast every morning gets ignored—even when there’s a real fire.
Technical Example: “Everything High severity” + no incident grouping + alerts with zero context → analyst fatigue and missed real attacks.

Concept F: Sentinel Health Monitoring

Technical Definition: Measuring whether Sentinel is receiving the right data, on time, in the right shape—and whether detections are running as expected.
Everyday Example: A fitness tracker that only counts steps, not heart rate, sleep, or oxygen—looks “fine” until you collapse.
Technical Example: “Heartbeat present” does not guarantee critical logs are landing, parsed, and usable. Latency (P95 delays) can ruin response.
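To make the P95 idea concrete, here is a small sketch that computes P95 ingestion delay from pairs of event time and ingestion time. In a workspace this delay is commonly measured in KQL as `ingestion_time()` minus `TimeGenerated`; the Python below only illustrates the arithmetic, using the nearest-rank percentile method and made-up sample delays.

```python
from datetime import datetime, timedelta


def p95_latency_seconds(records):
    """records: list of (event_time, ingested_time) datetime pairs."""
    delays = sorted((ingested - event).total_seconds() for event, ingested in records)
    if not delays:
        raise ValueError("no records to measure")
    rank = (95 * len(delays) + 99) // 100  # nearest-rank: ceil(0.95 * n)
    return delays[rank - 1]


# Hypothetical sample: four fast events and one that arrived two hours late.
base = datetime(2026, 3, 1, 12, 0, 0)
records = [(base, base + timedelta(seconds=s)) for s in (30, 45, 60, 90, 7200)]

print(p95_latency_seconds(records))  # 7200.0
```

The median here is 60 seconds, which looks healthy. The P95 exposes the two-hour straggler that would make a detection fire too late.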

4) Real-World Case Study

Failure Case: “We enabled 200 rules on Day 1”

Situation: A team migrated to Sentinel and enabled as many analytic rules as possible (quantity over quality). They also ingested every available log (“collect everything”).

Impact:

Costs spiked immediately.
Alerts fired constantly, many with zero context (missing logs / field mismatch).
Analysts started ignoring Sentinel (“it’s noisy and useless”).
A real credential theft event blended into the noise.

Lesson: Rule count is not security. Coverage + context + tuning is security.

Success Case: “Risk-first ingestion and coverage mapping”

Situation: Another team built:

asset inventory + ownership model,
log-source → use-case mapping,
MITRE ATT&CK mapping,
validation tests (log format + use-case trigger + false positive checks),
health metrics (latency, drift, connector auth expiry).

Impact:

Lower ingestion, lower cost.
Higher signal alerts, faster triage.
Clear ownership for broken connectors and rule changes.
Executives got meaningful metrics (MTTD/MTTR and alert noise).

Lesson: Sentinel becomes powerful when it’s run like a product, not installed like a tool.

5) Action Framework — Prevent → Detect → Respond

Prevent (stop the mess before it happens)

Build asset inventory with criticality tiers (Tier 0/1/2).
Define ownership + RACI (data owners, detection owners, platform owners).
Create purpose-driven ingestion policy:
- “If we collect it, what decision does it enable?”
Create retention + archival strategy (hot vs archive; compliance vs investigation needs).
Prefer native connectors + normalization over custom logs (unless truly needed).
Implement RBAC (Role-Based Access Control) standards early (least privilege, separation of duties).

Detect (prove it works, continuously)

Log validation:
- Are logs landing?
- Are fields parsed correctly?
- Are timestamps sane?
Use-case validation:
- Does the detection trigger when expected?
- Does it include context (entities, enrichment)?
Coverage mapping:
- log-source → use-case coverage
- MITRE ATT&CK mapping
- attacker path mapping (how an attacker actually moves in your environment)
KQL performance testing (avoid slow, expensive queries).
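The log-source → use-case coverage mapping above can start as something very small. This sketch assumes each use-case declares the log sources it needs; the use-case and source names are hypothetical.

```python
def coverage_gaps(use_cases, available_sources):
    """Return {use_case: [missing sources]} for use-cases that cannot work yet."""
    available = set(available_sources)
    gaps = {}
    for name, required in use_cases.items():
        missing = sorted(set(required) - available)
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical mapping for illustration only.
use_cases = {
    "risky-oauth-consent": ["entra_audit"],
    "lateral-movement": ["edr", "entra_signin"],
}
available_sources = ["entra_audit", "entra_signin"]

print(coverage_gaps(use_cases, available_sources))  # {'lateral-movement': ['edr']}
```

A use-case with a non-empty gap list should not be enabled yet, or should be reported as a known blind detection rather than counted as coverage.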

Respond (operate like a mature SOC)

Incident grouping standards (reduce alert spam).
Triage playbooks + automation only where it’s safe.
Change control + documentation as non-negotiable.
Metrics:
- MTTD (Mean Time To Detect)
- MTTR (Mean Time To Respond)
- alert volume + false positive rate + “noisy top rules”
Health monitoring:
- ingestion latency (P95)
- connector auth expiry
- DCR drift (Data Collection Rule drift)
- disabled/modified analytic rules
- “critical devices reporting” checks
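The drift and "disabled/modified analytic rules" checks in the list above can be sketched as a comparison between an approved baseline and a current export. This assumes you can export rule (or DCR) definitions as dictionaries; the `enabled` and `query` fields are illustrative, not a real export schema.

```python
import hashlib
import json


def fingerprint(config):
    """Stable hash of a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(baseline, current):
    """Compare current configs with the approved baseline."""
    drift = {}
    for name, config in baseline.items():
        if name not in current:
            drift[name] = "missing"
        elif not current[name].get("enabled", True):
            drift[name] = "disabled"
        elif fingerprint(config) != fingerprint(current[name]):
            drift[name] = "modified"
    return drift


# Hypothetical baseline and current export for illustration only.
baseline = {
    "rule-a": {"enabled": True, "query": "q1"},
    "rule-b": {"enabled": True, "query": "q2"},
    "rule-c": {"enabled": True, "query": "q3"},
    "rule-d": {"enabled": True, "query": "q4"},
}
current = {
    "rule-a": {"query": "q1", "enabled": True},
    "rule-b": {"enabled": False, "query": "q2"},
    "rule-c": {"enabled": True, "query": "q3 | where 1 == 1"},
}

print(detect_drift(baseline, current))
```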

Quick visual (what good looks like):

[Business Risks] → [Assets & Owners] → [Log Sources] → [Use-Cases] → [Detections] → [Incidents] → [Metrics]
        |                |                |              |              |             |             |
     (map)            (RACI)          (DCR/Conn)      (coverage)     (KQL)        (grouping)   (MTTD/MTTR)

6) Key Differences to Keep in Mind

“Collect everything” vs “Collect with purpose”
- Scenario: You ingest all verbose logs → bills spike; you still can’t answer “Is Tier-0 authentication monitored?”
“Heartbeat exists” vs “Critical logs are usable”
- Scenario: Heartbeat shows alive, but ingestion latency (P95) is 2 hours → detections fire too late to matter.
“Rule count” vs “Coverage and quality”
- Scenario: 300 rules enabled, but no mapping to MITRE ATT&CK or attacker paths → huge gaps + tons of noise.
“Lift-and-shift migration” vs “Modernization migration”
- Scenario: You migrate every legacy rule → you import legacy SIEM problems into a new platform.

7) Summary Table

Concept	Definition	Everyday Example	Technical Example
Asset Inventory	List of monitored assets with criticality + metadata	City map of buildings	Baseline “critical devices reporting” checks
Ownership Model / RACI	Clear responsibility for platform, data, detections	Shared kitchen ownership	Connector auth expiry gets fixed fast
Purpose-Driven Ingestion	Collect logs to support decisions/detections	Keep receipts you actually need	Avoid “collect everything” ingestion cost bomb
Log Source Strategy	Plan for connectors, parsing, retention, archival	Mail sorting + filing	Native connectors, avoid duplicate pipelines
Detection Engineering	High-signal detections with context, mapped coverage	Smoke alarm that isn’t toast-sensitive	KQL hygiene, incident grouping, MITRE mapping
Testing & Validation	Prove logs and detections work	Test fire drills	Replay attacks, log validation, latency tests
Governance	Change control, docs, process, metrics	Operating a factory safely	RBAC standards, rule lifecycle, SOC runbooks
Health Monitoring	Monitor latency, drift, disabled rules	Cockpit instruments	P95 latency, DCR drift, rule modifications

8) 🌞 The Last Sun Rays…

If Sentinel feels “bad,” it’s usually not because Microsoft shipped weak detections.

It’s because teams:

built no map (no inventory),
assigned no drivers (no owners),
collected random ingredients (no purpose),
measured the wrong things (ingestion volume ≠ health),
and called it “done” without testing.

If you had to pick one thing to fix first:
Would you rather build (1) an asset + ownership map, or (2) a log-source → use-case coverage matrix—and why?

Surya

By profession, a Cloud Security Consultant; by passion, a storyteller. Through SunExplains, I explain security in simple, relatable terms, connecting technology, trust, and everyday life.

sunexplains.com

March 2, 2026

Chapter 6: How Not to Migrate to Microsoft Sentinel

1. Title + Hook

Migrating to Microsoft Sentinel isn’t “moving your SIEM to the cloud.”

It’s closer to:

Switching from a landline call-center to an omnichannel support platform — if you only move phone scripts, you miss chat, automation, and analytics.
Replacing a filing cabinet with a searchable data lake — if you keep the same folders, you waste the power of indexing and correlation.
Upgrading from a smoke alarm to a smart home security system — if you only use the siren, you ignore cameras, motion patterns, and automation.

The tool will work.
The real question is whether your detection capability improves.

2. Why It’s Needed (Context)

Sentinel migrations fail in a specific way: they “succeed” technically (logs ingest, rules run), but security posture doesn’t improve.

Common outcomes when teams carry a legacy mindset:

Alert noise increases (and analysts burn out)
Identity and cloud threats are under-detected
Costs spike because ingestion is enabled without design
SOC processes become inconsistent: “Who owns what? What’s the triage path?”

Sentinel is cloud-native and correlation-rich — but only if you design for it.

3. Core Concepts Explained Simply

Concept 1: Lift-and-Shift Migration Is a Trap (Mistake #1)

Technical Definition
Lift-and-shift is porting legacy rules, dashboards, and searches into Sentinel with minimal redesign.

Everyday Example
Translating a cookbook from French to English but never adjusting for different ingredients or ovens.

Technical Example
Exporting old SIEM correlation rules → converting syntax to KQL (Kusto Query Language) → rebuilding dashboards → declaring success, even though Sentinel’s schemas, enrichment, and correlation patterns differ.

Concept 2: SIEM is an Operating Model, Not a Product (Mistake #2)

Technical Definition
A SIEM program includes threat modeling, data onboarding, detection lifecycle, SOC workflows, automation, governance, and cost management — not just alerts.

Everyday Example
Buying a hospital MRI machine doesn’t create a radiology department.

Technical Example
Migrating rules without migrating case management, triage standards, escalation paths, tuning ownership, and change control causes inconsistent response and alert fatigue.

Concept 3: Threat Model Must Be Revalidated During Migration (Mistake #3)

Technical Definition
Threat modeling aligns detections and telemetry to current attack surfaces (cloud, identity, endpoints, SaaS).

Everyday Example
Upgrading locks but ignoring the open window.

Technical Example
Porting network-focused detections while missing identity-centric attack paths (token theft, consent abuse, privilege escalation, conditional access bypass attempts).

Concept 4: Data Engineering Is Security Engineering (Mistake #4)

Technical Definition
Sentinel detections are only as strong as ingestion design: connectors, normalization, table choice, enrichment, retention, and filtering.

Everyday Example
A GPS is useless if the map data is wrong.

Technical Example
Wrong connector configuration or inconsistent fields → KQL rules become brittle; incident investigation fails due to missing entity context (user/device/IP correlation).

Concept 5: Cost Is a Security Requirement (Mistake #5)

Technical Definition
Sentinel cost is driven largely by the volume of data ingested and how long it is retained, so the architecture must include cost controls (filtering, tiered retention, and deciding which data types go to which tier).

Everyday Example
Buying cloud storage without lifecycle policies — your bill becomes your surprise.

Technical Example
Enabling every diagnostic log, keeping it all “hot,” no retention segmentation, and no forecasting → budget blowout → leadership distrust → reduced logging later (which creates blind spots).
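Forecasting does not need to be sophisticated to be useful. The sketch below is plain arithmetic: daily volume × price × days. The price per GB is a placeholder assumption, not a real Sentinel rate, and the source names and volumes are made up; substitute your own figures.

```python
PRICE_PER_GB = 2.50  # placeholder value, not an actual price


def monthly_cost(gb_per_day, price_per_gb=PRICE_PER_GB, days=30):
    """Estimated monthly ingestion cost for one source."""
    return gb_per_day * price_per_gb * days


def total_monthly_cost(daily_gb):
    return sum(monthly_cost(gb) for gb in daily_gb.values())


# Hypothetical daily volumes in GB.
daily_gb = {"firewall_verbose": 120.0, "identity_signin": 15.0, "edr": 40.0}

# Scenario: filter the verbose firewall feed down to 30% of its volume.
filtered_gb = dict(daily_gb, firewall_verbose=daily_gb["firewall_verbose"] * 30 / 100)

print(total_monthly_cost(daily_gb))     # 13125.0
print(total_monthly_cost(filtered_gb))  # 6825.0
```

Even a rough model like this lets you have the budget conversation before ingestion is enabled instead of after the first invoice.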

Concept 6: Big Bang Cutovers Cause Blind Spots (Mistake #6)

Technical Definition
A cutover without parallel validation risks missed detections due to schema gaps, logic differences, and tuning immaturity.

Everyday Example
Turning off the old security cameras before testing the new ones at night.

Technical Example
Disabling legacy SIEM on day 1 → Sentinel rules aren’t tuned → noisy alerts drown real incidents → gaps aren’t discovered until post-incident review.

Concept 7: “Go-Live” Is Not a Success Metric (Mistake #7)

Technical Definition
Success is measurable improvement: validated coverage, reduced noise, stable SOC throughput, governance, and predictable cost.

Everyday Example
Launching an app isn’t the same as users being happy and retained.

Technical Example
Workspace is live but:

detection coverage isn’t mapped to threats
false positives are high
analyst time per incident is worse
→ migration failed.

Concept 8: Don’t Ignore Sentinel’s Native Strengths (Mistake #8)

Technical Definition
Sentinel includes built-in analytics, correlation, UEBA, and deep Microsoft ecosystem integration.

Everyday Example
Buying a power drill and using it as a screwdriver.

Technical Example
Rebuilding manual rules for scenarios already covered by built-in analytics + Microsoft Defender integration + correlation features, instead of enabling, validating, tuning, and extending.

Concept 9: Migrating Every Legacy Rule Is a Mistake (Mistake #9)

Technical Definition
Legacy SIEM rule sets often contain duplicates, obsolete detections, and low-value noise generators.

Everyday Example
Moving every item from your junk drawer into a new house.

Technical Example
Copying hundreds of rules without rationalization → increased alert volume with little added detection value.

Concept 10: Sentinel Won’t Behave Like an On-Prem SIEM (Mistake #10)

Technical Definition
Sentinel is cloud-native, elastic, and data-lake-backed; it encourages different detection patterns and operational workflows.

Everyday Example
Expecting a streaming service to behave like a DVD shelf.

Technical Example
Designing searches and dashboards as if compute/storage is fixed and local → inefficiency, cost spikes, poor performance patterns, and missed platform capabilities.

Concept 11: Migration is Mostly Planning (Mistake #11)

Technical Definition
The highest leverage work is done before implementation: ingestion blueprint, detection rationalization, cost modeling, governance, success metrics.

Everyday Example
In construction, a bad blueprint scales mistakes across the whole building.

Technical Example
Skipping architecture and rushing execution → bad logging choices and rule structure multiply at cloud scale.

Concept 12: The Legacy Lens is the Silent Killer (Mistake #12)

Technical Definition
The “legacy lens” is trying to recreate old dashboards, correlation logic, and SOC workflows instead of embracing Sentinel’s strengths and modern detection engineering principles.

Everyday Example
Buying a hybrid car and insisting it only runs in first gear because it feels familiar.

Technical Example
Forcing identical dashboard parity and correlation design:

increases complexity
prevents tuning for identity + cloud signals
blocks automation adoption
→ you underuse Sentinel and miss optimization opportunities.

4. Real-World Case Study

Failure Case: “Translated Everything, Improved Nothing”

Situation

Ported rules, rebuilt dashboards, went live fast

Impact

Noise increased
Identity threats were still weakly covered
Costs spiked
Analysts lost time and confidence

Lesson
You migrated syntax, not detection capability.

Success Case: “Rationalize → Design → Validate → Cut Over”

Situation

Started from threat scenarios
Built logging blueprint + cost model
Enabled built-in Sentinel capabilities first
Ran parallel validation

Impact

Fewer rules, better signal
Stable SOC efficiency
Predictable spending

Lesson
Migration is an opportunity to modernize operations, not just change tools.

5. Action Framework: Prevent → Detect → Respond

Prevent

Threat model refresh (cloud + identity + endpoint first)
Logging blueprint (what signals, why, where filtered)
Cost model (hot vs cold retention tiers, filtering rules)
Governance (ownership, naming, change control)

Detect

Enable built-ins → validate → tune → extend
Rationalize detections (remove duplicates/obsolete)
Coverage mapping to threat scenarios
Quality metrics: false positive rate, coverage %, MTTD

Respond

SOC workflow redesign (triage → investigation → escalation)
Automation playbooks for repetitive tasks
Parallel run comparisons (alerts, misses, workload)
Response metrics: MTTR + analyst effort per incident
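The parallel run comparison above can be sketched as a set comparison, assuming alerts from both platforms can be reduced to a shared key. Here the key is a (scenario, entity) pair, which is an assumption for illustration; real alerts would need mapping to a common vocabulary first.

```python
def compare_parallel_run(legacy_alerts, sentinel_alerts):
    """Compare alerts from the legacy SIEM and Sentinel over the same period."""
    legacy, sentinel = set(legacy_alerts), set(sentinel_alerts)
    return {
        "both": sorted(legacy & sentinel),
        "missed_by_sentinel": sorted(legacy - sentinel),
        "new_in_sentinel": sorted(sentinel - legacy),
    }


# Hypothetical alerts keyed by (scenario, entity).
legacy_alerts = [("impossible-travel", "alice"), ("brute-force", "bob")]
sentinel_alerts = [("impossible-travel", "alice"), ("oauth-consent", "carol")]

result = compare_parallel_run(legacy_alerts, sentinel_alerts)
print(result["missed_by_sentinel"])  # [('brute-force', 'bob')]
```

Each entry under "missed_by_sentinel" needs an explicit decision before cutover: fix the gap, or retire the detection on purpose.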

ASCII flow (migration pipeline):

Threat Model → Logging Blueprint → Cost Model → Governance
      ↓               ↓               ↓
Built-ins Enable → Validate/Tune → Custom Detections
      ↓
Parallel Run → Metrics Review → Cutover

6. Key Differences to Keep in Mind

Rule Translation vs Capability Redesign
Scenario: Same detection logic doesn’t work because Sentinel tables and enrichment differ.
More Logs vs Better Signals
Scenario: Ingesting everything increases cost/noise without improving incidents.
Go-Live vs Measured Outcomes
Scenario: Workspace live but analysts slower and coverage unclear.
Legacy Dashboards vs Decision Dashboards
Scenario: “Alerts by severity” looks nice; “top false positives + owners” improves operations.

7. Summary Table

Concept	Definition	Everyday Example	Technical Example
Lift-and-shift trap	Porting artifacts without redesign	Translating a recipe without adapting ingredients	Converting legacy rules to KQL without schema redesign
SIEM operating model	Tool + people + process + governance	MRI machine ≠ radiology dept	Rules moved but workflows/playbooks absent
Threat model refresh	Align to modern attack surface	Locking doors, window open	Missing identity and cloud detections
Data engineering	Ingestion quality drives detection quality	GPS with wrong map	Bad connectors/fields → brittle KQL
Cost planning	Security includes financial design	No storage lifecycle policy	Ingest-all → surprise bill → logging cuts
Parallel validation	Avoid blind cutover	Test cameras at night	Run both SIEMs, compare misses/noise
Outcomes > go-live	Measure improvements	App launch ≠ adoption	Coverage + fidelity + SOC efficiency
Use built-ins	Don’t rebuild what exists	Power drill used as screwdriver	Enable/tune built-in analytics + correlations
Rule rationalization	Quality over quantity	Junk drawer migration	Remove duplicates/obsolete rules
Cloud-native mindset	Different architecture	Streaming vs DVDs	Avoid on-prem performance assumptions
Planning first	Architecture is leverage	Bad blueprint scales	No ingestion blueprint/cost model/governance
Legacy lens	Recreating old behavior	Hybrid car stuck in 1st gear	Force parity dashboards, ignore automation

8. What’s Next

Next post: “Sentinel Migration Blueprint: A Step-by-Step Plan (Threat Model → Logging → Detections → SOC Ops → Cost)”, including a checklist and example success metrics.

9. 🌞 The Last Sun Rays…

So yes — migration is not copying the past. It’s redesigning detection for a cloud-native world.

Lift-and-shift? Easy — and usually noisy.
Redesign? Harder — but that’s where posture improves.
Success isn’t “we went live.” It’s “we detect more, waste less, and respond faster — predictably.”

Reflective question: If you had to pick one thing to prove your migration actually improved security — coverage, false positive rate, MTTD, MTTR, or cost predictability — which would you put on the dashboard first?

Surya

February 17, 2026

Chapter 5 – How NOT to Govern Microsoft Sentinel Operations

Why “Just Turn It On” Becomes “Why Is Everything On Fire?”

Think of Sentinel governance like:

Air traffic control without radar — planes are flying, but no one knows who’s landing or crashing.
A hospital ER with no triage — everyone is “urgent,” so nothing actually is.
A city with traffic lights but no traffic rules — motion everywhere, safety nowhere.

Sentinel will run without governance.
It just won’t protect you.

Why This Matters (Context)

Most Sentinel failures don’t happen because of bad analytics or missing logs.
They happen because operations were never governed.

When governance is missing:

SOC teams burn out
Alerts pile up unchecked
Leadership loses trust in security metrics
Incidents take longer — or never get resolved

Sentinel becomes expensive visibility, not operational security.

Core Governance Anti-Patterns (Explained Simply)

Let’s break down the most common ways Sentinel governance fails — and why each one hurts.

1. No Change Control

Technical Definition
Changes to analytics rules, playbooks, data connectors, or workbooks are made without approval, tracking, or rollback.

Everyday Example
Anyone can move the furniture in a fire station — including blocking the exits.

Technical Example

SOC analyst edits a detection rule in production
False positives spike
No one knows who changed what or why

Result: Unstable detections and incident chaos.

2. No Documentation

Technical Definition
Sentinel configurations exist only in people’s heads, chats, or tribal memory.

Everyday Example
A recipe passed by word of mouth — until the chef quits.

Technical Example

Alerts fire with cryptic names
No runbooks
No explanation of logic, thresholds, or response steps

Result: Slow response and dependency on “that one person.”

3. Too Much or No Governance

Technical Definition
Either every change requires bureaucracy, or nothing is controlled at all.

Everyday Example

Too much: You need a board meeting to change a light bulb
Too little: Anyone rewires the building

Technical Example

Over-governance: SOC can’t tune noisy rules
Under-governance: Junior analysts disable detections to reduce noise

Result: Either stagnation or silent security gaps.

4. No Measurement Loop (MTTD / MTTR / Noise)

Technical Definition
No metrics exist to measure Sentinel’s effectiveness.

Everyday Example
A fitness plan without a scale, stopwatch, or mirror.

Technical Example

No Mean Time To Detect (MTTD)
No Mean Time To Respond (MTTR)
No alert-to-incident ratio tracking

Result: Leadership asks, “Is Sentinel working?”
And no one can answer.
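A first version of the measurement loop can be a few lines. This sketch assumes each incident record carries three timestamps: when the activity occurred, when it was detected, and when it was resolved. It measures MTTR from detection to resolution, which is one common convention; adjust it if your SOC defines "respond" differently.

```python
from datetime import datetime
from statistics import mean


def mttd_mttr_minutes(incidents):
    """Return (MTTD, MTTR) in minutes for a list of incident records."""
    detect = [(i["detected"] - i["occurred"]).total_seconds() / 60 for i in incidents]
    respond = [(i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents]
    return mean(detect), mean(respond)


# Hypothetical incident records for illustration only.
incidents = [
    {
        "occurred": datetime(2026, 2, 1, 10, 0),
        "detected": datetime(2026, 2, 1, 10, 30),
        "resolved": datetime(2026, 2, 1, 12, 30),
    },
    {
        "occurred": datetime(2026, 2, 2, 9, 0),
        "detected": datetime(2026, 2, 2, 9, 10),
        "resolved": datetime(2026, 2, 2, 9, 40),
    },
]

print(mttd_mttr_minutes(incidents))  # (20.0, 75.0)
```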

5. No Content Lifecycle Ownership

Technical Definition
Analytics rules and playbooks are deployed but never reviewed, tuned, or retired.

Everyday Example
Smoke alarms installed once — never tested again.

Technical Example

Rules fire on legacy systems that no longer exist
Playbooks reference deprecated APIs
No owner reviews detection quality quarterly

Result: Noise increases while value decreases.

6. No RACI (Responsible, Accountable, Consulted, Informed)

Technical Definition
Ownership of Sentinel components is unclear.

Everyday Example
Everyone assumes someone else is taking out the trash.

Technical Example

Who owns detections?
Who approves changes?
Who tunes false positives?

Result: Alerts get ignored because “that’s not my job.”

7. No Established Process

Technical Definition
Incident handling, tuning, onboarding, and offboarding are ad-hoc.

Everyday Example
Fire drills invented during the fire.

Technical Example

No standard incident workflow
No tuning cadence
No onboarding checklist for new data sources

Result: Inconsistent outcomes and analyst fatigue.

8. No Framework Alignment

Technical Definition
Sentinel detections are not mapped to security frameworks.

Everyday Example
Training for a marathon without knowing the race distance.

Technical Example

Detections not aligned to MITRE ATT&CK
No coverage visibility
Leadership can’t assess risk reduction

Result: Security theater instead of security strategy.

Real-World Case Study

❌ Failure Case: “Alert Avalanche”

Situation
A global company enabled Sentinel rapidly during cloud migration.

What Went Wrong

No RACI
No metrics
No lifecycle ownership

Impact

18,000 alerts/week
Analysts ignored high-severity incidents
Leadership questioned Sentinel ROI

Lesson
Visibility without governance increases risk.

✅ Success Case: “Measured, Managed SOC”

Situation
Another organization paused expansion and fixed governance first.

What They Did

Defined RACI
Implemented MTTD/MTTR tracking
Assigned rule owners
Quarterly detection reviews

Impact

65% noise reduction
Faster incident closure
Clear executive reporting

Lesson
Governance amplifies Sentinel’s value.

Action Framework: Prevent → Detect → Respond

[ Design ] → [ Measure ] → [ Improve ]
     ↓           ↓            ↓
 Governance   Metrics     Continuous Tuning

Prevent

Enforce change control
Define RACI
Align detections to frameworks

Detect

Track MTTD / MTTR
Monitor alert noise
Review rule effectiveness

Respond

Document playbooks
Test automation
Retire stale content

Key Differences to Keep in Mind

Visibility vs Security
Seeing alerts ≠ stopping threats
Governance vs Bureaucracy
Controls should enable speed, not kill it
Metrics vs Vanity Numbers
Alert count means nothing without context

Summary Table

Concept	Definition	Everyday Example	Technical Example
Change Control	Managed configuration updates	Keeping the fire exits clear	Approved rule edits
Documentation	Shared operational knowledge	Written recipe	Runbooks
Metrics	Effectiveness measurement	Fitness tracking	MTTD / MTTR
RACI	Ownership clarity	Assigned chores	Rule ownership
Lifecycle	Ongoing content care	Smoke alarm testing	Rule reviews
Frameworks	Strategic alignment	Training plan	MITRE mapping

What’s Next

In the next post, we’ll flip the script:

“How to Build a Sentinel Governance Model That Actually Works”
→ Roles
→ Metrics
→ Operating cadence
→ Executive-ready dashboards

🌞 The Last Sun Rays…

Sentinel doesn’t fail because it lacks features.
It fails because operations lack structure.

If you had to choose one governance metric to put on your SOC dashboard tomorrow —
would it measure noise, speed, or accountability?

☀️

Surya

February 10, 2026

Chapter 4 – How NOT to Test Sentinel — and the Exact Tests to Add Today

Hook:

If detections were sprinklers, would you assume they work… without ever pulling the test lever?
If logs were ingredients, would you bake a cake with half the labels missing and hope it rises?

This is your practical checklist for turning noisy, brittle rules into a trustworthy detection system.

Why It’s Needed (Context)

Most Sentinel rollouts fail quietly—not because detections are wrong, but because tests don’t exist. The result: untriggered use-cases, malformed logs, slow KQL (Kusto Query Language) queries, no attack replay, and alert queues that either flood analysts or go silent. In other words: assumptions > evidence. We’ll flip that.

Core Concepts Explained Simply

We’ll stick to one analogy: kitchen & recipe (ingredients = logs, recipe = KQL rule, oven = pipeline/latency, taste test = simulation).

1) Use-Case Validation

Technical Definition: Prove each analytic maps to a clear objective, required signals, ATT&CK technique, and expected alert outcome.
Everyday Example: Check the recipe actually makes cake (not bread) and yields 8 slices.
Technical Example: “Detect risky OAuth consent” needs Entra audit + consent events; trigger both benign and malicious grants and confirm alert fields, severity, and entities.

2) Log Validation (Format, Fields, Completeness)

Technical Definition: Verify incoming events conform to schema (fields, types, timestamps), parsing, time skew, and completeness.
Everyday Example: Make sure the ingredients are labeled, fresh, and the right quantity.
Technical Example: Validate TimeGenerated, UserPrincipalName, AppId exist and parse; reject or quarantine events missing required fields; track % malformed per source.
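A minimal sketch of that field check, using the three field names from the example above. It assumes events arrive as dictionaries with ISO 8601 timestamps; the sample events are made up.

```python
from datetime import datetime

REQUIRED_FIELDS = ("TimeGenerated", "UserPrincipalName", "AppId")


def is_well_formed(event):
    """True if all required fields are present and the timestamp parses."""
    for field in REQUIRED_FIELDS:
        if not event.get(field):
            return False
    try:
        datetime.fromisoformat(event["TimeGenerated"])
    except (TypeError, ValueError):
        return False
    return True


def malformed_percent(events):
    """Percentage of events failing validation (0.0 for an empty batch)."""
    if not events:
        return 0.0
    bad = sum(1 for event in events if not is_well_formed(event))
    return 100.0 * bad / len(events)


# Hypothetical sample batch for illustration only.
events = [
    {"TimeGenerated": "2026-02-10T08:00:00+00:00", "UserPrincipalName": "a@example.com", "AppId": "app-1"},
    {"TimeGenerated": "2026-02-10T08:01:00+00:00", "UserPrincipalName": "b@example.com", "AppId": ""},
    {"TimeGenerated": "not-a-date", "UserPrincipalName": "c@example.com", "AppId": "app-2"},
    {"TimeGenerated": "2026-02-10T08:03:00+00:00", "UserPrincipalName": "d@example.com", "AppId": "app-3"},
]

print(malformed_percent(events))  # 50.0
```

Tracking this percentage per source, rather than overall, is what tells you which pipeline to fix.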

3) Log Coverage (Expected Sources Present)

Technical Definition: Confirm every required log type (e.g., identity, endpoint, SaaS, IaaS) is actually arriving for the target scope.
Everyday Example: Ensure you actually bought eggs, flour, sugar—not just sugar.
Technical Example: Coverage matrix for subscriptions/tenants: M365 audit ✅, Entra sign-in ✅, Endpoint EDR ❌ → detection blocked until fixed.

4) KQL Performance

Technical Definition: Measure query runtime, memory, and stability at 1×–5× data volume; optimize with early time filters, summarize, arg_max, and pre-aggregated summary tables.
Everyday Example: Preheat the oven and time the bake.
Technical Example: Replace 30-day cross-joins with pre-aggregations; keep rule runtime P95 < rule schedule interval/2.

5) Attack Simulation / Replay

Technical Definition: Execute synthetic techniques or replay sanitized incident payloads to validate end-to-end detection & response.
Everyday Example: Taste test before serving.
Technical Example: Atomic test for token theft + replay of real OAuth abuse JSON; verify alert, incident, and playbook actions.

6) Volume & Latency

Technical Definition: Stress ingest and measure end-to-end time: event → ingestion → rule → alert → automation.
Everyday Example: Can the oven handle two trays at once without undercooking?
Technical Example: Track SLIs (Service Level Indicators): data lag, rule runtime, alert creation delay; set SLOs (Service Level Objectives) like “P95 alert latency < 3 min”.

7) False Positives (FP) Review

Technical Definition: Quantify precision/recall, label outcomes, tune thresholds and allow/deny lists.
Everyday Example: If every dish tastes “too salty,” your measuring spoon is wrong.
Technical Example: Weekly FP board: rule, reason, proposed tuning; ship suppressions with expiration + owner.
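The "suppressions with expiration + owner" rule can be enforced mechanically. This sketch assumes suppressions are kept in a simple list you control; the rule names, owners, and dates are hypothetical.

```python
from datetime import date


def active_suppressions(suppressions, today):
    """Return (active, rejected) rule names.

    Entries without an owner or an expiry date are rejected outright.
    Entries past their expiry date are dropped from both lists.
    """
    active, rejected = [], []
    for entry in suppressions:
        if not entry.get("owner") or entry.get("expires") is None:
            rejected.append(entry["rule"])
        elif entry["expires"] >= today:
            active.append(entry["rule"])
    return active, rejected


# Hypothetical suppression list for illustration only.
suppressions = [
    {"rule": "vpn-geo", "owner": "soc-tier2", "expires": date(2026, 3, 15)},
    {"rule": "svc-account-logon", "owner": "soc-tier2", "expires": date(2026, 1, 31)},
    {"rule": "legacy-scanner", "owner": None, "expires": None},
]

print(active_suppressions(suppressions, date(2026, 2, 10)))
```

Because expired entries fall away on their own, a forgotten suppression cannot quietly hide a detection forever.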

8) Alert Volume Health

Technical Definition: Balance alert count with analyst capacity; enforce budgets and auto-triage.
Everyday Example: One chef can’t plate 300 orders in 10 minutes.
Technical Example: If daily alerts > (analysts × handling rate), route low-sev to batched review; auto-close stale low-value patterns with audit trail.
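The capacity check above is simple arithmetic: capacity = analysts × handling rate. A rough sketch follows, assuming each alert carries a severity label. The severity names and the choice to fill the live queue with the highest severities first are assumptions for illustration.

```python
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def split_queue(alerts, analysts, handling_rate):
    """Split alerts into (live_queue, batched_review) based on daily capacity."""
    capacity = analysts * handling_rate
    ranked = sorted(alerts, key=lambda alert: SEVERITY_ORDER[alert["severity"]])
    return ranked[:capacity], ranked[capacity:]


# Hypothetical day: five alerts, one analyst who can handle three.
alerts = [
    {"id": 1, "severity": "low"},
    {"id": 2, "severity": "high"},
    {"id": 3, "severity": "medium"},
    {"id": 4, "severity": "low"},
    {"id": 5, "severity": "high"},
]

live, batched = split_queue(alerts, analysts=1, handling_rate=3)
print([a["id"] for a in live])     # [2, 5, 3]
print([a["id"] for a in batched])  # [1, 4]
```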

Real-World Case Study

Failure — “The Silent Rule”

Situation: A team wrote a beautiful KQL detection for lateral movement but never did log coverage checks. Endpoint EDR wasn’t connected in one region.
Impact: Attack in that region generated zero alerts; discovery took days.
Lesson: No logs → no detection. Coverage gates before rule deployment.

Success — “Replay Saved the Release”

Situation: Another team kept a monthly replay set of sanitized OAuth abuse logs. A parser update broke AppId extraction; replay caught it within an hour.
Impact: Hotfix shipped same day; production detections never regressed.
Lesson: Known-bad payloads are your smoke test; use them routinely.

Action Framework — Prevent → Detect → Respond

Prevent (build the right scaffolding)

Detection Charter per use-case: Objective, signals, ATT&CK mapping, owner, expected volume.
Data Gates: Ingest → Parse → Schema validate → Coverage check (fail closed on missing critical fields).
KQL Guardrails: Time-scoped filters first; pre-aggregate hot paths; avoid broad cross-joins.
SLOs: P95 rule runtime, P95 alert latency, % malformed < 0.5%.
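"Fail closed" is the key phrase in the data gates above: when in doubt, block the deploy. Here is a minimal sketch, assuming sample rows have been exported as dictionaries; the required field list, the `GateFailure` exception, and the function names are hypothetical choices for this example.

```python
class GateFailure(Exception):
    """Raised when a data gate fails; callers should block deployment."""


# Assumed required fields for a sign-in based detection.
REQUIRED_FIELDS = ("TimeGenerated", "UserPrincipalName", "IPAddress")


def schema_gate(rows, required=REQUIRED_FIELDS, max_malformed=0.005):
    """Return the malformed ratio, or raise if it breaches the 0.5% SLO."""
    if not rows:
        raise GateFailure("no rows to validate (failing closed)")
    malformed = sum(1 for row in rows if any(not row.get(f) for f in required))
    ratio = malformed / len(rows)
    if ratio >= max_malformed:
        raise GateFailure(f"{ratio:.1%} of rows are missing required fields")
    return ratio


def coverage_gate(required_tables, tables_with_data):
    """Raise if any table the rule depends on has no data."""
    missing = sorted(set(required_tables) - set(tables_with_data))
    if missing:
        raise GateFailure("missing required tables: " + ", ".join(missing))


good_row = {
    "TimeGenerated": "2026-01-01T00:00:00Z",
    "UserPrincipalName": "a@contoso.com",
    "IPAddress": "10.0.0.1",
}
print(schema_gate([good_row] * 1000))  # 0.0
coverage_gate(["SigninLogs"], ["SigninLogs", "AuditLogs"])  # passes silently
```

Note that an empty sample raises instead of passing: no evidence is treated as a failure, not as a clean bill of health.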

Detect (prove it continuously)

Unit Tests for KQL: Given sample rows → expected rows (pass/fail).
Integration Tests: Ingest sample → rule fires → incident fields populated (entity, severity, tactic).
Replay Library: Keep sanitized JSON/CSV/PCAP from real incidents, tagged by ATT&CK.
Load Tests: 1×/2×/5× peak; record lag and schedule drift.

Respond (close the loop fast)

Playbook Tests: Enrichment, assignment, ticket creation; alert on playbook failure.
Weekly FP/Tuning: Track precision; suppress with expiry; re-run replay after tuning.
Queue Health: Alert budgets per tier; overflow routing; executive dashboard on MTTD/MTTR (Mean Time To Detect/Respond).

ASCII Pipeline (where to measure)

[Source] -> [Ingest] -> [Parse/Normalize] -> [KQL Rule] -> [Alert] -> [Playbook] -> [Ticket]
SLI:lag     SLI:drop%   SLI:schema ok        SLI:runtime   SLI:create SLI:exec      SLI:ack

Key Differences to Keep in Mind

Validation vs. Enablement — Turning on rules ≠ proving they catch your scenario. Example: OAuth abuse rule enabled, but consent events never ingested.
Correctness vs. Timeliness — Accurate but late alerts still lose. Example: 120-second query on a 60-second schedule.
Format vs. Coverage — Perfectly parsed logs from some sources aren’t enough. Example: No EDR in Region A → blind spot.
Suppression vs. Tuning — Blanket mutes hide real attacks. Example: Global VPN ASN allow-list masks exfil via consumer VPNs.
One-off Tests vs. Continuous Replay — Parsers change; proofs must repeat. Example: Monthly replay catches field regressions early.

Summary Table

Concept	Definition	Everyday Example	Technical Example
Use-Case Validation	Prove rule matches a real scenario & outcome	Recipe yields cake, 8 slices	Trigger benign & malicious OAuth consent, verify alert details
Log Validation	Schema, fields, time, completeness	Ingredients labeled & fresh	Enforce `TimeGenerated`, entity fields; measure % malformed
Log Coverage	Required sources actually arrive	You bought eggs, flour, sugar	Coverage matrix across tenants/regions; block deploy on gaps
KQL Performance	Runtime & efficiency under load	Oven preheated & timed	P95 runtime < schedule/2; use `summarize`, pre-aggregated summary tables
Attack Simulation/Replay	Synthetic or real payloads end-to-end	Taste test before serving	Atomic tests + replay of sanitized incident logs
Volume & Latency	E2E timing at 1×–5×	Two trays in oven still bake	Track lag, schedule drift, alert creation delay
False Positives Check	Measure precision; tune safely	Fix salty measuring spoon	Weekly FP board; expiring suppressions with owner
Alert Volume Health	Match alerts to capacity	One chef vs. 300 plates	Budgets, batching, auto-triage for low-sev

What’s Next

Up next: “From Hypothesis to High-Fidelity: Designing one Sentinel detection with a test suite, replay pack, and SLOs.” We’ll build one end-to-end and publish the exact checklist.

🌞 The Last Sun Rays…

Answering the hooks:

Sprinklers without the test lever? Run replay and integration tests.
Half-labeled ingredients? Enforce schema + coverage gates before rules.

Your 30-minute win for tomorrow:

Pick one high-value rule.
Add a coverage gate (all required tables present).
Add a replay test with a sanitized payload.
Record P95 alert latency after one day.

Reflection: If you could show leadership just one metric next week, would it be precision, E2E latency, or queue health—and what decision will it unlock?

Surya

By profession, a Cloud Security Consultant; by passion, a storyteller. Through SunExplains, I explain security in simple, relatable terms, connecting technology, trust, and everyday life.

sunexplains.com

February 3, 2026

Chapter 3 — How Not to Design Detection Use-Cases (and What to Do Instead)

1) Title + Hook

Building detections without the right logs is like writing movie reviews without watching the films.
Marking everything “High” severity is a smoke alarm that screams for toast and wildfires alike.
Ten sloppy rules beat one attacker—once. One precise rule beats ten attackers—daily.

This guide spotlights the anti-patterns that quietly wreck detection programs—and the fixes that make them resilient.

2) Why It’s Needed (Context)

Detection use-cases are your SIEM/SOAR’s north star. When they’re vague, noisy, or unmoored from telemetry, you pay in three currencies: alert fatigue, missed intrusions, and lost credibility with engineering and leadership. We’ll decode the classic mistakes and give you a playbook to align detections with MITRE ATT&CK (Adversarial Tactics, Techniques & Common Knowledge) and real attacker paths.

3) Core Concepts Explained Simply

A) Use-Cases Enabled but Logs Missing

Technical definition: Analytics rules exist, but prerequisite telemetry (tables/fields) is absent, late, or malformed.
Everyday example: Setting up a coffee machine with no water line.
Technical example: A credential-stuffing rule depends on SigninLogs risk state, but the RiskyUsers log category isn’t being collected, so the rule never fires.

B) Everything Marked “High” Severity

Technical definition: Flat severity model (all High/Critical) that ignores confidence, impact, and enrichment.
Everyday example: All emails marked “urgent”—soon, none are.
Technical example: Port scan, failed logins, and confirmed egress beaconing all assigned “High,” drowning triage.

C) No Incident / Alert Grouping

Technical definition: Alerts remain atomic; no correlation by entity/time/TTP (Tactics, Techniques, and Procedures).
Everyday example: Treating 20 notifications from the same delivery as 20 separate packages.
Technical example: Multiple SecurityEvent 4625 failures from one host generate 50 incidents instead of one grouped brute-force case.

D) Alerts with Zero Context

Technical definition: Alerts lack entity resolution, enrichment, or links to playbooks and knowledge.
Everyday example: A fire alarm with no floor or room number.
Technical example: “Suspicious PowerShell” with no command line, user SID, parent process, or MITRE technique tag.

E) No Standard Parsing / Field Mismatch

Technical definition: Inconsistent schemas; fields named differently across sources; missing ASIM (Advanced Security Information Model) normalization.
Everyday example: Mixing metric and imperial tools in the same toolbox.
Technical example: src_ip vs SourceIP vs ClientIP break joins; URL field sometimes base64, sometimes plain.

F) Poor KQL Hygiene

Technical definition: Inefficient or brittle KQL (Kusto Query Language): wildcard scans, no summarization windows, time drift, or unbounded joins.
Everyday example: Searching a library by reading every page of every book.
Technical example: Running `| where tostring(CommandLine) contains "mimikatz"` across all tables (`*`) with no time or table scoping.

G) Quantity Over Quality (Rule Count Vanity)

Technical definition: Optimizing for number of rules, not precision, recall, or mean time to detect (MTTD).
Everyday example: Owning 50 kitchen knives but still using a butter knife.
Technical example: 300 rules with <0.5% true-positive rate; no retirement of deadweight rules.

H) No MITRE ATT&CK Coverage Mapping

Technical definition: Detections aren’t mapped to techniques/sub-techniques; gaps unknown.
Everyday example: Playing chess without knowing pieces or the board.
Technical example: Great coverage for execution (T1059) but nothing for discovery (T1087) or privilege escalation (T1068).

I) No Log-Source → Use-Case Coverage Mapping

Technical definition: No matrix that shows which use-cases rely on which sources/fields.
Everyday example: Not knowing which ingredient makes which dish.
Technical example: Disabling DNS logs breaks exfiltration detections—but no one realizes until after an incident.
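A source-to-use-case matrix does not need a special tool; a dictionary already answers the question "what breaks if I turn this off?". The sketch below is one way to model it. The use-case names and their table dependencies are illustrative assumptions, not a recommended mapping.

```python
# Hypothetical matrix: use-case -> tables it depends on.
COVERAGE = {
    "DNS exfiltration": {"DnsEvents"},
    "Brute force": {"SecurityEvent", "SigninLogs"},
    "OAuth consent abuse": {"AuditLogs"},
}


def broken_use_cases(coverage, disabled_source):
    """Use-cases that lose an input if this source is disabled."""
    return sorted(uc for uc, sources in coverage.items() if disabled_source in sources)


def unused_sources(coverage, ingested_sources):
    """Ingested sources that no use-case references."""
    used = set().union(*coverage.values()) if coverage else set()
    return sorted(set(ingested_sources) - used)


print(broken_use_cases(COVERAGE, "DnsEvents"))          # ['DNS exfiltration']
print(unused_sources(COVERAGE, ["DnsEvents", "Syslog"]))  # ['Syslog']
```

The same matrix works in both directions: it protects detections before a source is dropped, and it exposes sources you pay for but never query.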

J) Detections Not Mapped to Attacker Paths

Technical definition: Rules exist in isolation, not aligned to attack chains/kill-chains or common adversary playbooks.
Everyday example: Locking your front door but leaving the windows wide open.
Technical example: Excellent ransomware encryption alerts, but zero coverage for initial access (phish), lateral movement (RDP), or data staging.

4) Real-World Case Study

Failure — The “Everything High” Breach

Situation: A healthcare provider had 240 Sentinel rules. 80% were “High.” No grouping, weak enrichment.
Impact: Analysts ignored 30+ failed-login bursts tied to a compromised VPN account; beaconing went unnoticed for 5 days.
Lesson: Severity discipline + grouping + enrichment would have collapsed 120 noisy alerts into 3 actionable incidents.

Success — Use-Case Contracts & ATT&CK Map

Situation: A fintech created “Detection Contracts”: each rule listed required fields, data sources, ATT&CK technique, severity rubric, and sample incidents. Built a source↔use-case matrix and an ATT&CK heatmap.
Impact: -42% alert volume, +31% true-positive rate, MTTD down from 6h to 90m.
Lesson: Treat detections as products with inputs/outputs and SLOs.

5) Action Framework — Prevent → Detect → Respond

Prevent (Design Right)

Define detection contracts:
- Intent: threat, ATT&CK T#
- Inputs: tables + fields (with ASIM names)
- Logic: KQL with test cases
- Severity rubric: impact × confidence
- Ops: owner, SLOs (latency, FP rate), links to runbooks
Build the Coverage Matrix: Use-case (rows) × Sources/Fields (columns). Color by criticality.
Normalize early: Enforce ASIM (or your schema) at ingest; ban ad-hoc field names.
Set a severity policy: E.g., Critical = confirmed malicious + material impact; High = high confidence + privileged entity; Medium/Low with clear auto-closure criteria.
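The "impact × confidence" rubric can be written down so that two analysts score the same alert the same way. This is a minimal sketch, assuming both inputs are scored 1 to 3; the scale and the score thresholds are assumptions you would tune to your own policy.

```python
def severity(impact, confidence):
    """Map impact (1-3) and confidence (1-3) to a severity label.

    Possible scores are 1, 2, 3, 4, 6, and 9. The cut-offs below are
    illustrative, not a standard.
    """
    if not (1 <= impact <= 3 and 1 <= confidence <= 3):
        raise ValueError("impact and confidence must be between 1 and 3")
    score = impact * confidence
    if score >= 9:
        return "Critical"
    if score >= 6:
        return "High"
    if score >= 3:
        return "Medium"
    return "Low"


print(severity(impact=3, confidence=3))  # Critical: confirmed and material
print(severity(impact=1, confidence=2))  # Low: e.g., a routine port scan
```

With a rubric like this, a port scan and a confirmed beacon can no longer land in the same bucket by default.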

Detect (Run Well)

Group intelligently: Entity-based (user, host, IP), time-windowed (e.g., 30–60 min), TTP-aware correlation.
Enrich alerts: Entity resolution (UEBA), asset tags, geolocation, exposure (internet-facing), vuln context (CVSS).
Harden KQL:
- Scope tables & time (project before join).
- Use make-series, summarize with bins, toscalar for thresholds.
- Add null/format checks and time-zone normalization.
Measure quality: Track precision, recall, false-positive (FP) rate, false-negative rate (FNR), and rule runtime. Retire or refactor rules quarterly.
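Entity-based, time-windowed grouping is simpler than it sounds. The sketch below shows the idea outside of Sentinel, assuming each alert has an `entity` and a `time`; it is an illustration of the grouping logic, not how Sentinel's own incident grouping is implemented.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(minutes=30)):
    """Group alerts per entity while each arrives within `window` of the last one."""
    incidents = []
    open_by_entity = {}
    for alert in sorted(alerts, key=lambda a: a["time"]):
        current = open_by_entity.get(alert["entity"])
        if current and alert["time"] - current[-1]["time"] <= window:
            current.append(alert)  # same story, keep adding evidence
        else:
            current = [alert]      # new story for this entity
            open_by_entity[alert["entity"]] = current
            incidents.append(current)
    return incidents


# Hypothetical data: a 50-alert burst on host-a, one alert on host-b,
# and a separate host-a alert three hours later.
base = datetime(2026, 1, 1, 9, 0, 0)
burst = [{"entity": "host-a", "time": base + timedelta(seconds=10 * i)} for i in range(50)]
other = [{"entity": "host-b", "time": base}]
later = [{"entity": "host-a", "time": base + timedelta(hours=3)}]

incidents = group_alerts(burst + other + later)
print(len(incidents))                       # 3
print(sorted(len(i) for i in incidents))    # [1, 1, 50]
```

Fifty failed-login alerts become one incident, while the alert three hours later starts a fresh one instead of being buried in the old case.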

Respond (Improve Fast)

Playbooks (SOAR): Map each severity to a minimal response checklist; automate enrichment and ticketing.
Drill with simulations: Use Atomic Red Team/ATT&CK emulations; confirm end-to-end (log present → rule fires → grouped → playbook runs).
Feedback loop: Every false positive updates the contract (logic or enrichment). Every miss creates a backlog item with ATT&CK mapping.

6) Key Differences to Keep in Mind

Severity vs Priority — Severity = inherent risk; Priority = queue order (contextual).
- Scenario: A “Medium” alert on a domain admin in production becomes top priority.
Alert vs Incident — Alerts are signals; incidents are stories (grouped evidence).
- Scenario: 15 brute-force alerts across users → 1 incident with attacker IP, timeframe, and impact.
Rule Count vs Coverage Quality — More rules ≠ better defense.
- Scenario: 60 well-mapped detections covering ATT&CK tactics beat 300 shallow ones.
Detection Logic vs Enrichment — Logic finds; enrichment explains.
- Scenario: A hash match (logic) + EDR verdict + VirusTotal (VT) score + asset criticality (enrichment) drives faster action.
Schema Normalization vs Parser Sprawl — One language, fewer bugs.
- Scenario: ASIM fields (SrcIpAddr, DstIpAddr, User) enable reusable joins and content packs.

7) Summary Table

Concept	Definition	Everyday Example	Technical Example
Logs missing	Rule needs data that isn’t there	Coffee machine w/o water	`SigninLogs` dependencies not connected
All High severity	Flat model; no nuance	Everything marked “urgent”	Port scan = High same as C2 beacon
No alert grouping	No correlation into incidents	20 packages treated separately	50 4625s = 50 incidents, not 1
Zero context	No enrichment/links	Fire alarm w/o floor	No command line, no parent PID
Field mismatch	Inconsistent schemas	Metric vs imperial mix	`src_ip` vs `SourceIP` breaks joins
Poor KQL hygiene	Inefficient/brittle queries	Reading every page to search	Unbounded `contains` across `*`
Rule vanity	Optimize for count not quality	50 knives, use one	300 rules, <0.5% TP
No ATT&CK mapping	No technique coverage view	Playing chess blind	Gaps in discovery/priv-esc
No source mapping	No data→use-case matrix	Unknown ingredients	DNS disabled breaks exfil rules
Not on attacker paths	No kill-chain alignment	Lock door, open windows	Encrypt detect but no lateral move detect

8) ASCII Diagram — Detection Product Loop

[ATT&CK Technique] → [Detection Contract] → [KQL Logic]
        ↓                     ↓                   ↓
 [Required Sources/Fields] → [Normalization/ASIM] → [Alert Enrichment]
        ↓                     ↓                   ↓
     [Grouping/Incidents] → [Severity Policy] → [SOAR Playbook]
        ↓
   [Metrics: Precision | Recall | FP/FN | Latency]
        ↓
   [Refactor/Retire]  ←——  [Purple Team Tests]

9) What’s Next

Next in this series: “Detection Contracts in Practice: A Step-by-Step Template (with KQL patterns and ATT&CK mapping).” We’ll publish a fill-in-the-blanks worksheet plus sample tests.

🌞 The Last Sun Rays…

Hook answers:

Don’t write reviews without watching the film—connect detections to telemetry and verify it’s present.
Don’t let every toaster trip the fire alarm—calibrate severity and group signals into incidents.
Don’t collect knives for the drawer—optimize for coverage quality, not rule count.

Your turn: If you could only fix one thing this week, would you choose severity discipline, schema normalization, or source↔use-case mapping—and how would you prove it worked (which metric first)?

Surya

January 20, 2026

Chapter 2 – How Not to Design Log Sources (with Microsoft Sentinel)

1) Title + Hook

Hook:

Treating Microsoft Sentinel like a “Dropbox for logs” is like buying a cargo ship to mail a postcard.
Pouring every signal into your Security Information and Event Management (SIEM) is like turning on every light in a stadium to find your keys—bright, expensive, and still not helpful.

This post shows the anti-patterns that quietly destroy SIEM value—and what to do instead.

2) Why It’s Needed (Context)

Security teams love visibility. Finance teams hate surprise bills. Engineering hates noise.
When log-source design is sloppy, you get: runaway costs, alert fatigue, blind spots, and weak investigations.
Microsoft Sentinel is powerful, but it’s metered. Bad choices at the ingest layer ripple into detect, respond, and retain layers.

3) Core Concepts Explained Simply

A) “Collect-Everything” Ingestion → Huge Costs

Technical definition: Ingesting all available telemetry without scoping by use case, severity, or deduplication, often into high-volume tables (e.g., SecurityEvent, CommonSecurityLog, Syslog with verbose facilities).
Everyday example: Subscribing to every streaming service “just in case,” then watching YouTube.
Technical example: Forwarding full Endpoint Detection and Response (EDR) raw telemetry and verbose Windows Event Forwarding (WEF) for the same hosts, plus firewall flows at 1:1 cadence—no filters.

B) Logs Collected but Not Used

Technical definition: Sources ingested with no mapped analytics rules, hunting queries, or workbooks.
Everyday example: Paying for a gym you never visit.
Technical example: Shipping detailed DNS logs but no detections/queries reference them; no Kusto Query Language (KQL) saved searches.

C) No Retention & Archival Strategy

Technical definition: Single retention setting for all tables; no hot/cold split, no Azure Data Explorer (ADX) or Azure Blob/Archive offload, and no legal hold mapping.
Everyday example: Keeping all photos on your phone forever—until it’s full right before a trip.
Technical example: 180-day retention for chatty Syslog/CommonSecurityLog tables when only 30 days are needed for detections; no archive to cheaper storage.

D) Custom Logs over Native Connectors

Technical definition: Using custom ingestion (HTTP API, custom tables) instead of Microsoft Sentinel data connectors that provide schemas, Advanced Security Information Model (ASIM) normalization, and content packs.
Everyday example: Cooking from scratch when a healthy, cheaper meal kit exists.
Technical example: Parsing Palo Alto logs via custom functions instead of the native connector and ASIM mapping—losing built-in analytics.

E) Duplicate Telemetry from Multiple Pipelines

Technical definition: Same events reaching Sentinel via parallel paths (e.g., agent + syslog forwarder + third-party pipeline), creating cost bloat and duplicate alerts.
Everyday example: Getting the same bank alerts by SMS, email, app, and phone call—annoying and redundant.
Technical example: Windows events ingested from both Azure Monitor Agent (AMA) and a legacy Log Analytics agent (MMA); cloud audit logs via both native connector and a custom ingestion app.

F) No Log Validation

Technical definition: Lack of pre-ingest checks for schema, timestamps, severity, and required fields; no Service Level Objectives (SLOs) for delay, completeness, or deduplication.
Everyday example: Accepting every delivery without checking the box contents.
Technical example: Timestamps ingested in local time, breaking correlation; device hostname missing → entity mapping fails; uneven daily volume with silent drops.

4) Real-World Case Study

Failure — The $180k Surprise

Situation: A global SaaS firm enabled “everything” from firewalls, proxies, endpoints, and cloud audit logs. No content mapped; no filtering; 180-day retention on all tables.
Impact: Monthly Sentinel bill spiked by 60%. Analysts drowned in duplicate alerts; incident MTTR (Mean Time To Remediate) rose from 9h to 16h.
Lesson: Cost without context adds negative value. Start with use cases → data needed → retention tiering.

Success — Use-Case-Driven Design

Situation: A fintech defined 12 priority detections (credential misuse, exfiltration, MFA bypass). They mapped required fields to ASIM schemas and trimmed sources to those fields.
Impact: 37% ingest reduction, +22% detection precision, 2× faster hunts due to consistent entity mapping.
Lesson: Design sources to serve detections, not the other way around.

5) Action Framework — Prevent → Detect → Respond

Prevent (Design & Cost Control)

Define top 15–20 detections first; list required fields (IP, User, Device, App, Action, Result, Timestamp TZ).
Prefer native connectors + ASIM; only custom when absolutely necessary.
Build ingestion policies: include tables, exclude noise (facility/level filters, sampling for flows).
Implement tiered retention:
- Hot (30–60 days): detection & investigation.
- Cold/Archive (6–12 months+): compliance, rare hunts (use ADX/Blob).
Prevent duplicates: one authoritative pipeline per source; document routing.

Detect (Quality & Coverage)

For each table, create at least one analytics rule and one saved hunting query that uses it.
Enforce schema validation in parsing functions; normalize to ASIM.
Track signal health KPIs: daily event count deltas, null critical fields, late arrivals (>10 min), duplication rate.
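The signal health KPIs above can be computed from a sample of events. Here is a minimal sketch, assuming each event carries `source`, `event_time`, `ingested_at`, and `message`; those field names and the duplicate key are hypothetical choices for the example.

```python
from datetime import datetime, timedelta


def daily_delta(today_count, yesterday_count):
    """Relative day-over-day change, or None when there is no baseline."""
    if yesterday_count == 0:
        return None
    return (today_count - yesterday_count) / yesterday_count


def signal_health(events, critical_fields, late_after=timedelta(minutes=10)):
    """Ratios of null critical fields, late arrivals, and duplicates."""
    total = len(events)
    if total == 0:
        return {"null_critical": 0.0, "late": 0.0, "duplicate": 0.0}
    null_critical = sum(
        1 for e in events if any(e.get(f) in (None, "") for f in critical_fields)
    )
    late = sum(1 for e in events if e["ingested_at"] - e["event_time"] > late_after)
    unique = {(e["source"], e["event_time"], e.get("message")) for e in events}
    return {
        "null_critical": null_critical / total,
        "late": late / total,
        "duplicate": (total - len(unique)) / total,
    }


t = datetime(2026, 1, 1, 8, 0, 0)
e1 = {"source": "fw1", "event_time": t, "ingested_at": t + timedelta(minutes=1),
      "host": "a", "message": "x"}
e2 = dict(e1)  # exact duplicate of e1
e3 = {"source": "fw1", "event_time": t + timedelta(seconds=1),
      "ingested_at": t + timedelta(minutes=20), "host": "b", "message": "y"}  # late
e4 = {"source": "fw2", "event_time": t + timedelta(seconds=2),
      "ingested_at": t + timedelta(minutes=1), "host": "", "message": "z"}  # no host

print(signal_health([e1, e2, e3, e4], critical_fields=["host"]))
print(daily_delta(60, 100))  # -0.4, i.e. a 40% drop
```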

Respond (Operate & Improve)

Build a workbook: cost by table, events by connector, rule hits by source.
Automate feedback loops: when an analytic fires with low confidence, refine source fields/filters.
Quarterly table review: drop unused sources, move low-value logs to archive, merge pipelines.

6) Key Differences to Keep in Mind

Native vs Custom Ingest — Native brings schemas/content; custom brings flexibility & maintenance.
- Scenario: Choose native for popular firewalls; custom only when niche vendor lacks support.
Hot vs Cold Retention — Hot is for speed; cold is for savings.
- Scenario: Keep 30 days hot for IR (Incident Response); move month 2–12 to archive.
Field Completeness vs Volume — Fewer, richer events beat many shallow events.
- Scenario: Keep DNS with query, response, client IP; drop verbose debug flags.
One Pipeline vs Many — Single route is traceable; multiple routes multiply duplicates.
- Scenario: Consolidate to AMA; retire MMA and third-party forwarders.
Use-Case vs Curiosity — Detections drive data; curiosity drives cost.
- Scenario: Only ingest proxy categories needed for DLP (Data Loss Prevention) alerts.

7) Summary Table

Concept	Definition	Everyday Example	Technical Example
Collect-everything ingestion	Ingest all signals without scoping/filters	Subscribing to every streaming service	EDR + WEF + flow logs all verbose to Sentinel
Unused logs	Data with no rules/queries/workbooks	Paying for a gym you don’t use	DNS ingested but no KQL uses it
No retention strategy	One-size retention; no hot/cold	Keeping all photos on phone forever	180 days on Syslog with no archive
Custom over native	DIY ingestion instead of connectors	Cooking from scratch vs meal kit	Custom Palo Alto parsing vs native + ASIM
Duplicate telemetry	Same events via multiple routes	Bank alerts by SMS/email/app/phone	AMA + MMA + syslog duplicating Windows events
No validation	No checks for schema/time/fields	Accepting packages uninspected	Local-time timestamps; missing hostname

8) ASCII Diagram (Signal Health Funnel)

[Sources] --(validated, deduped)--> [Normalization/ASIM]
             \--x duplicates drop--/          |
                                              v
                                   [Analytic Rules & Hunts]
                                              |
                                              v
                                    [Incidents & Response]
                                              |
                                              v
                           [Retention: Hot 30-60d | Archive 6-12m+]

9) What’s Next

Next in this series: “Designing a Use-Case-First Log Strategy for Sentinel: From Detections to Data Contracts.” We’ll publish a field-tested worksheet to map detections → fields → connectors → retention.

🌞 The Last Sun Rays…

Hook answers:

Sentinel isn’t a dump truck for logs; it’s a tuned sensor grid.
More light (data) isn’t better if it blinds you; focused beams (use-cases) win.

Your move:
What one log source would you drop, filter, or archive tomorrow to improve both signal quality and cost—and what detection would stay intact after that change?

Surya

January 20, 2026

Chapter 7 – How Your Platform Health Suite Protects Outcomes, Not Just Logs

Turning “Sentinel Noise” into an Executive Radar: How Your Platform Health Suite Protects Outcomes, Not Just Logs

Think of your platform like an airport: collectors are runways, Azure Monitor Agent (AMA) instances are ground crews, Data Collection Rules (DCRs) are flight plans, and analytic rules are the control tower. If any one falters, flights (events) stack up or vanish.
Or like a hospital: ingestion is triage, tables are departments, rules are diagnostic protocols, and audit is infection control. A delay at triage hides the real emergencies.
Or like a supply chain: collectors are loading docks, EPS (events per second) is trucks-per-minute, DCRs are routing labels, and rules are QA checkpoints. Mislabel one box and the whole chain gets blind spots.

This session shows executives how your components form one radar that tells them: Are we safe, is the telemetry flowing, and will detections fire when it matters?

Why It’s Needed (Context)

Security leaders don’t buy features; they buy assurance. Modern threats exploit blind spots: missing telemetry, delayed ingestion, disabled rules, or misconfigured agents. Your suite closes those gaps by:

Quantifying telemetry health (EPS, size, latency)
Surfacing blind spots (non-reporting devices, tables not ingesting)
Protecting detection integrity (analytic rule tampering, disabled rules)
Assuring platform reliability (Sentinel health, audit, connectors)

Value to execs: fewer surprises, faster incident confidence, and measurable resilience. In plain terms: “Are we seeing what matters, fast enough, with rules that still work?”

Core Concepts Explained Simply

Below, each concept has: Technical Definition → Everyday Example → Technical Example (we’ll reuse the airport analogy).

All Log Collector Ingestion & EPS (events per second)

Technical: Measures event throughput per collector to spot saturation/backpressure.
Everyday: Runway landings per minute—too few or too many means trouble.
Technical ex: 8K EPS baseline; spike to 20K EPS triggers auto-scale review.

Log Size by Log Collector

Technical: Tracks daily/rolling log volume per collector for anomalies.
Everyday: Cargo tonnage per runway day-over-day.
Technical ex: 40% drop on Collector-03 flags upstream firewall change.

Abnormal Workspace Spikes & Dips

Technical: Detects ingestion anomalies at the workspace level.
Everyday: Airport sees an unexpected lull or surge in flights.
Technical ex: z-score/seasonality anomaly on daily ingestion volume from the Usage table.
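A z-score check on daily ingestion volume is a reasonable first anomaly detector. The sketch below assumes you already have a short history of daily volumes; the numbers and the threshold of 3 are illustrative, and it deliberately ignores seasonality (weekday vs. weekend), which a real monitor would need to handle.

```python
import math
from statistics import mean, stdev


def zscore(history, today):
    """How many sample standard deviations today sits from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two history points
    if sigma == 0:
        return 0.0 if today == mu else math.copysign(math.inf, today - mu)
    return (today - mu) / sigma


def is_anomalous(history, today, threshold=3.0):
    """True for both spikes and dips beyond the threshold."""
    return abs(zscore(history, today)) >= threshold


# Hypothetical daily ingestion volumes in GB.
history_gb = [100, 102, 98, 101, 99, 100, 100]
print(is_anomalous(history_gb, 60))   # True: a sharp dip
print(is_anomalous(history_gb, 101))  # False: within normal variation
```

Checking the absolute value matters: a dip (a silent source) is at least as dangerous as a spike (a cost surprise).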

Ingestion Delays

Technical: Measures latency from source timestamp to ingestion time.
Everyday: Planes circling because runways are jammed.
Technical ex: P95 delay > 15 minutes = raise incident sev-2.

OOTB (out-of-the-box) Data Connector Monitor

Technical: Checks health/config for native connectors.
Everyday: Prebuilt jetways—are they powered and attached?
Technical ex: Office 365 connector shows auth failure after token expiry.

Identify Warnings (Incident, Workspace)

Technical: Aggregates SOC warnings across incidents and workspace health.
Everyday: Tower alerts: “Runway lights flickering, storm inbound.”
Technical ex: Sentinel “Data Collection partially degraded” alert surfaces.

Critical Devices Monitoring

Technical: Ensures top-tier assets continuously report (DCs, firewalls, crown jewels).
Everyday: VIP planes (organ transplants, heads of state) tracked end-to-end.
Technical ex: Domain controller event gap >10 min triggers page.

Devices Not Reporting (Windows/Linux/Network)

Technical: Detects endpoints missing expected heartbeat/events.
Everyday: A parked plane went silent on the tarmac.
Technical ex: Syslog source silent for 60 min → create problem record.

Sentinel Health & Audit Monitoring

Technical: Internal service checks, API limits, configuration drift, audit events.
Everyday: Airport power, radios, and control systems diagnostics.
Technical ex: Audit log shows permission changes to Analytics blade.

Unhealthy AMA (Azure Monitor Agent) Agents

Technical: Flags agent install/health/config failures.
Everyday: Ground crew short-staffed or missing tools.
Technical ex: AMA heartbeat present but data channels failing (DCR mismatch).

Data Collection Rule (DCR) Monitoring

Technical: Validates DCR consistency and scope across resources.
Everyday: Flight plans correctly applied to the right aircraft.
Technical ex: New subnet lacks DCR mapping → 0 logs from that segment.

Unauthorized Modification of Use Case

Technical: Detects tampering to detection rules (query edits, schedules).
Everyday: Someone rewrote tower procedures without approval.
Technical ex: KQL (Kusto Query Language) diff shows removed join on identity table.
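Rule tampering is easiest to spot when the approved query lives in version control and the deployed one is compared against it. Here is a minimal sketch using only the Python standard library; the two KQL strings are made-up examples, and fetching the deployed query from your workspace is left out on purpose.

```python
import difflib
import hashlib


def fingerprint(query):
    """Stable hash of a query, ignoring leading/trailing whitespace per line."""
    normalized = "\n".join(line.strip() for line in query.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def query_diff(approved, deployed):
    """Unified diff lines between the approved and deployed query."""
    return list(difflib.unified_diff(
        approved.strip().splitlines(),
        deployed.strip().splitlines(),
        fromfile="approved", tofile="deployed", lineterm="",
    ))


# Hypothetical approved rule (from Git) and deployed rule (from the workspace).
approved = """
SigninLogs
| where ResultType != 0
| join kind=inner (IdentityInfo) on $left.UserPrincipalName == $right.AccountUPN
| summarize Failures = count() by UserPrincipalName
"""
deployed = """
SigninLogs
| where ResultType != 0
| summarize Failures = count() by UserPrincipalName
"""

if fingerprint(approved) != fingerprint(deployed):
    print("\n".join(query_diff(approved, deployed)))
```

The hash gives a cheap yes/no answer on every run; the diff is only produced when something changed, and it shows the reviewer exactly which line (here, the identity join) went missing.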

New Log Collector Out of Intended List

Technical: Flags newly-registered collectors not in the approved inventory.
Everyday: An unlisted aircraft lands without a flight plan.
Technical ex: Unknown syslog IP starts forwarding—validate source & owner.

Collector Health (No Heartbeat)

Technical: Collector process/host unavailable.
Everyday: Runway lights off—no signals.
Technical ex: VM down event correlates with EPS collapse.

Collector Health (No Logs)

Technical: Collector up but not sending logs.
Everyday: Runway open, but no planes using it.
Technical ex: Ingest pipeline credentials expired; heartbeat OK, EPS 0.
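The difference between "No Heartbeat" and "No Logs" decides who gets paged and what they check first. The sketch below captures that triage in code, assuming you have a heartbeat flag and a current EPS value per collector; the state names are invented, and the next steps are taken from the playbooks in this chapter.

```python
def collector_state(heartbeat_ok, eps):
    """Classify a collector from its heartbeat and events-per-second."""
    if not heartbeat_ok:
        return "no_heartbeat"  # host or collector process is down
    if eps == 0:
        return "no_logs"       # host is alive, pipeline is broken
    return "healthy"


# First response per state, mirroring the playbooks in this chapter.
NEXT_STEP = {
    "no_heartbeat": "restart or scale the collector VM; reroute sources",
    "no_logs": "refresh tokens, test the pipeline, inject a sample event",
    "healthy": "no action",
}

for name, heartbeat_ok, eps in [("COL-01", True, 7800), ("COL-02", False, 0), ("COL-03", True, 0)]:
    state = collector_state(heartbeat_ok, eps)
    print(name, state, "->", NEXT_STEP[state])
```

Heartbeat is checked first on purpose: a dead host also reports zero EPS, so testing EPS first would send the responder to the pipeline when the VM is the problem.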

Tables Not Ingesting Logs

Technical: Detects schema-level gaps (e.g., SecurityEvent empty).
Everyday: A department with no patients for hours—impossible.
Technical ex: FirewallLogs table flatline after parser update.

Analytic Rule Disabled/Deleted

Technical: Ensures detections remain active & intact.
Everyday: Tower turned off weather radar.
Technical ex: High-severity rule disabled during a change window without approval.

Sentinel Health & Audit (Platform performance)

Technical: (Aggregated view) Platform performance, limits, and governance.
Everyday: Airport operations dashboard for execs.
Technical ex: API throttle near limits during IR surge; scale-out advised.

Real-World Case Study

Failure Case — Ransomware Quietly Gained Time

Situation: Daily EPS totals looked “normal,” but P95 ingestion delay grew to 35–40 minutes after a network change. Analytic rules were intact, but they fired late.
Impact: SOC saw lateral movement an hour after the fact. Restores needed; downtime cost and reputational hit.
Lesson: Volume ≠ timeliness. Latency is a first-class SLO (service-level objective). The “Ingestion Delays” and “Abnormal Workspace Spikes & Dips” monitors would have caught this earlier.

Success Case — Misconfigured DCR Contained in Minutes

Situation: A new Linux subnet went live; DCR Monitoring flagged no mapping. Devices Not Reporting confirmed silent hosts; Tables Not Ingesting showed a flat Syslog table.
Impact: Fix within 20 minutes; coverage gap avoided during a vendor compromise alert wave.
Lesson: Layered monitors (DCR + device heartbeat + table health) shrink MTTR (mean time to repair).

Action Framework — Prevent → Detect → Respond

Prevent

Standardize collector baselines (EPS, log size).
Enforce DCR-as-code with approvals; alert on drift.
Lock analytic rules (change control + audit).
Maintain approved collector inventory; block unknowns.

Detect

SLOs: EPS ±25% anomaly, P95 delay <15 min, table freshness <10 min.
Triangulate: device heartbeat + table freshness + rule health.
Prioritize critical devices and OOTB connectors with higher alert sensitivity.

Respond

Playbooks:
1. No Heartbeat: auto-scale or restart collector VM; reroute sources.
2. No Logs: token refresh, pipeline test, sample event injection.
3. DCR Drift: auto-rollback via Git; notify change owner.
4. Rule Tampering: revert from versioned store; open P1; audit who/when.
Dashboards for execs: “Are we blind anywhere?” + “How fast are we seeing?” (MTTD/MTTR for telemetry issues).

Key Differences to Keep in Mind

No Heartbeat vs No Logs — Down host vs live host with broken pipeline.
Scenario: Heartbeat OK but EPS 0 → investigate tokens/parsers, not VM.
Volume vs Timeliness — Total GB ≠ real-time visibility.
Scenario: Daily volume steady but P95 delay 30 min → IR effectiveness drops.
Device Health vs Table Freshness — Endpoints alive doesn’t mean data landed.
Scenario: AMA OK; SecurityEvent table flat → DCR scope missing.
Anomaly vs Planned Change — Spikes can be normal during patch night.
Scenario: Annotate change windows to suppress false noise.
Connector Status vs Detection Integrity — Data present doesn’t ensure rules run.
Scenario: Rule disabled after tuning—coverage illusion.
Approved vs Rogue Collectors — Inventory matters.
Scenario: New IP starts sending logs → verify ownership before trusting data.

(ASCII) Executive Dashboard Sketch

+---------------- Sentinel Health & Telemetry Radar ----------------+
|  Ingestion SLOs     |  Collectors          |  Coverage            |
|  EPS: 7.8k (↔)      |  HB: 12/12 (✓)       |  Critical: 48/50 (⚠) |
|  P95 Delay: 7m (✓)  |  No Logs: 1 (⚠)      |  Tables Fresh: 96%   |
|  Spikes/Dips: OK    |  Unknown: 0 (✓)      |  DCR Drift: 0 (✓)    |
+------------------------------------------------+------------------+
|  Detection Integrity                           |  Audit & Changes |
|  Rules Disabled: 0 (✓)  Tamper Attempts: 1 (⚠) |  Priv Changes: 2 |
+------------------------------------------------+------------------+
|  Top Alerts: Ingestion Delay @ COL-03 (sev-2)  |  ETA fix: 10m    |
+-------------------------------------------------------------------+

Summary Table

Concept	Definition	Everyday Example	Technical Example
All Log Collector Ingestion & EPS	Event throughput per collector	Landings/minute on a runway	Baseline 8K EPS; sustained 20K triggers scale
Log Size by Log Collector	Daily/rolling volume per collector	Cargo tonnage per runway	40% drop on COL-03 after firewall change
Abnormal Workspace Spikes & Dips	Workspace-level ingestion anomalies	Airport-wide lull/surge	z-score anomaly on _LogManagement
Ingestion Delays	Source-to-ingest latency	Planes circling	P95 delay >15m = sev-2
OOTB Data Connector Monitor	Health of native connectors	Prebuilt jetways attached	O365 token expiry alarm
Identify Warnings (Incident, Workspace)	Aggregated SOC/platform warnings	Tower status alerts	Sentinel “partial degradation” surfaced
Critical Devices Monitoring	Coverage for crown-jewel assets	VIP flights tracked	DC event gap >10m page
Devices Not Reporting	Missing endpoint telemetry	Silent plane on tarmac	Syslog source 60m silent
Sentinel Health & Audit Monitoring	Internal checks & audit	Airport systems diagnostics	Permission change on Analytics blade
Unhealthy AMA Agents	Agent failure/misconfig	Ground crew missing tools	Heartbeat OK; channel fail
Data Collection Rule Monitoring	DCR consistency & scope	Correct flight plans	New subnet lacks DCR
Unauthorized Modification of Use Case	Rule tampering detection	Unapproved tower procedure	KQL diff shows removed join
New Log Collector Out of Intended List	Unapproved collectors	Unlisted aircraft lands	Unknown syslog IP sending
Collector Health (No Heartbeat)	Collector host down	Runway lights off	VM down + EPS collapse
Collector Health (No Logs)	Host up, no events sent	Open runway, no planes	Token expired or parser broken
Tables Not Ingesting Logs	Schema/table freshness gap	Department with no patients	FirewallLogs flatline
Analytic Rule Disabled/Deleted	Detections turned off/removed	Weather radar off	High-sev rule disabled
Sentinel Health & Audit (Performance)	Aggregated platform performance	Airport ops dashboard	Near API throttle during IR
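
The "z-score anomaly" in the spikes-and-dips row can be sketched in a few lines. This assumes you already have a trailing window of daily ingestion volumes; the threshold of 3 is a common starting point, not a recommendation from this series.

```python
from statistics import mean, pstdev


def ingestion_z_score(history_gb, today_gb):
    """Z-score of today's volume against a trailing window; None if the window is flat."""
    mu = mean(history_gb)
    sigma = pstdev(history_gb)
    if sigma == 0:
        return None  # no variance in the window: handle separately
    return (today_gb - mu) / sigma


def is_anomalous(history_gb, today_gb, threshold=3.0):
    """True for a spike or a dip whose absolute z-score meets the threshold."""
    z = ingestion_z_score(history_gb, today_gb)
    return z is not None and abs(z) >= threshold
```

Using the absolute value catches dips as well as spikes, which matters because a silent drop is usually the more dangerous of the two.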

What’s Next

In the next post of this mini-series, we’ll go from health to outcomes: mapping telemetry SLOs to detection KPIs (true positive rate, MTTD/MTTR for security events), and how to automate executive scorecards that tie telemetry health → detection fidelity → business risk.

🌞 The Last Sun Rays…

Hook answers:

Airports, hospitals, supply chains—all fail the same way: silent delays and hidden blind spots. Your suite exposes both in real time and proves the control tower (Sentinel) is awake.
Executives get a single radar: Are we seeing the right data, fast enough, with rules that still work—and will tomorrow?

Your move: If you could add one metric to your exec dashboard tomorrow, which would it be—P95 ingestion delay, critical device freshness, or rules integrity drift?

Surya

By profession, a Cloud Security Consultant; by passion, a storyteller. Through SunExplains, I explain security in simple, relatable terms, connecting technology, trust, and everyday life.

sunexplains.com

January 15, 2026

Chapter 1 — How NOT to Plan a Sentinel Deployment

(Where security programs quietly fail before day one)

1) Title + Hook

Before we talk Sentinel, picture these everyday slip-ups that create invisible risk:

Analogy 1: Moving into a new house without labeling boxes.
Everything’s technically there… but you can’t find what matters.
Urgency becomes guesswork.
Analogy 2: Installing a home security system but forgetting which door each sensor protects.
Your phone keeps saying “Sensor triggered!”
Useful? Not if you don’t know where.
Analogy 3: Turning on notifications for every app on your phone.
The constant pinging forces you to ignore everything — including the important ones.

Security fails in the same quiet way: not dramatically, but by missing clarity, ownership, and context when you need them most.

2) Why It’s Needed (Context)

Most Sentinel deployments fail long before the first alert.

Why? Because security tools don’t create security — structure and intent do.

When planning lacks:

visibility into what exists,
clarity on who owns decisions,
defined purpose for each log collected,
disciplined priority-setting,
and least-privilege boundaries,

…then Sentinel becomes a sophisticated storage device instead of a decision platform.

The team gets data.
But not direction.

The result?
A system that looks operational yet struggles to protect anything that matters.

3) Core Concepts — Explained Simply Through “How NOT to Plan Sentinel”

Each anti-pattern includes:

What fails
A relatable everyday example
A simple technical example

Each one also quietly reinforces principles such as classification, ownership, least privilege, and business alignment.

A. No Asset Inventory → “Protect everything, understand nothing.”

What fails:
No clear picture of systems, data, owners, or importance levels.
Without classification, everything becomes equally urgent — and equally neglected.
Everyday Example:
Trying to account for people after a fire drill without a guest list.
“Is everyone out?” → “We think so.”
Technical Example:
Sentinel fires an alert on Server-07, but:
- Is it a payroll server?
- A lab VM?
- An abandoned machine from last year?
No one knows.
Triage becomes guesswork.
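
What the missing inventory would have provided can be sketched as a simple lookup. The `Asset` fields and tier labels below are assumptions for illustration; the point is that an unknown host is flagged as an inventory gap instead of being silently triaged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    owner: str
    criticality: str        # e.g. "tier-0", "tier-1", "lab" (illustrative labels)
    business_service: str


def enrich_alert(alert: dict, inventory: dict) -> dict:
    """Return a copy of the alert with owner, criticality, and service attached."""
    asset = inventory.get(alert["host"].lower())  # inventory keys are lowercase hostnames
    enriched = dict(alert)
    if asset is None:
        enriched.update(owner=None, criticality="unknown",
                        business_service=None, inventory_gap=True)
    else:
        enriched.update(owner=asset.owner, criticality=asset.criticality,
                        business_service=asset.business_service, inventory_gap=False)
    return enriched
```

With an entry for `server-07` in the inventory, the Server-07 alert arrives already labeled as payroll, lab, or abandoned, and triage starts from facts.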

B. No Operating Model → “Sentinel is live, but no one knows what to do.”

What fails:
No defined responsibilities, escalation criteria, or decision authority.
When everyone owns alerts, no one owns the outcome.
Everyday Example:
“Back to office Monday.”
No seating plan. No timings. No team norms.
Everyone arrives, but no one functions well.
Technical Example:
A high-severity incident lands in Sentinel.
No assignment logic.
Five analysts assume someone else will handle it.
The clock keeps ticking.

C. No Purpose Behind Data Collection → “Logs as hoarding, not protection.”

What fails:
Logs are onboarded “just because,” not tied to risks, outcomes, or decisions.
You get volume instead of value.
Everyday Example:
Installing CCTV cameras all over the house but never mapping what each one covers.
When something happens, the footage exists but insight doesn’t.
Technical Example:
You ingest massive firewall logs (hundreds of GB/day) but have zero detections built on them.
You’re buying storage, not reducing risk.
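
One way to spot this pattern is to compare what you ingest with what your rules actually query. A minimal sketch, assuming you can export daily volume per table and the tables each rule references (both inputs are hypothetical dictionaries here):

```python
def unused_sources(daily_gb: dict, rule_tables: dict) -> list:
    """Tables that are ingested but referenced by no rule, largest first."""
    referenced = set()
    for tables in rule_tables.values():
        referenced.update(tables)
    unused = [(table, gb) for table, gb in daily_gb.items() if table not in referenced]
    return sorted(unused, key=lambda item: item[1], reverse=True)
```

A table on this list is not automatically waste (it may serve hunting or compliance), but it should have a written purpose or a detection on the roadmap.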

D. Cost Cutting Without Classification → “Saving money by removing emergency exits.”

What fails:
Budget reviews remove logs based on cost instead of importance.
Critical data disappears because no one defined which logs protect essential functions.
Everyday Example:
Removing emergency exit signs in a building to lower electricity bills.
Looks fine until it matters.
Technical Example:
To save money, identity logs are disabled.
Result:
Account takeover goes undetected — attackers blend in as normal users.
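
Cost cutting with classification could look something like the sketch below. The tier numbering (1 = non-negotiable, higher = more optional) and the source names in the usage example are assumptions; the only rule that matters is that tier 1 is never a candidate.

```python
def cut_candidates(sources, target_gb):
    """Pick sources to drop or downsample until target_gb per day is saved.

    sources: iterable of (name, tier, daily_gb); tier 1 = non-negotiable,
    higher numbers = more optional. Tier 1 is never selected.
    """
    eligible = [s for s in sources if s[1] > 1]
    # Most optional tier first, then largest volume first.
    eligible.sort(key=lambda s: (-s[1], -s[2]))
    selected, saved = [], 0.0
    for name, _tier, daily_gb in eligible:
        if saved >= target_gb:
            break
        selected.append(name)
        saved += daily_gb
    return selected, saved
```

If the target cannot be met without touching tier 1, the function returns less than the target. That shortfall is the conversation to have with the budget owner, instead of quietly disabling identity logs.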

E. No Standard RBAC → “Everyone gets the master key.”

What fails:
Access is granted ad-hoc.
High-risk permissions spread unintentionally.
Integrity of the system erodes.
Everyday Example:
A shared Google Sheet where everyone can edit.
By week’s end, formulas break, data changes, and no one knows how.
Technical Example:
An analyst modifies a rule meant for engineers.
Suppression breaks.
An alert storm floods the SOC for days.
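
The idea behind a standard RBAC model is deny-by-default: an action is allowed only if the role explicitly lists it. The role and action names below are invented for illustration and do not correspond to Sentinel's built-in roles.

```python
# Hypothetical role-to-permission map; anything not listed is denied.
ROLE_PERMISSIONS = {
    "analyst": {"read_incident", "update_incident"},
    "engineer": {"read_incident", "update_incident", "edit_rule"},
}


def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles and unlisted actions return False."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Under this model the analyst in the example above could work the incident but could not have touched the rule.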

F. No Business Risk Mapping → “Treating all systems as equal.”

What fails:
Alerts are prioritized by technical severity, not business impact.
You lose sight of what truly matters.
Everyday Example:
Treating a broken coffee machine and a payroll outage as identical problems.
Both create noise — only one creates consequences.
Technical Example:
A medium-severity alert on the payroll server should outrank a high-severity alert on a test VM.
Without mapping, Sentinel can’t tell the difference.

4) Real-World Case Study

Failure — “The Breach Hidden by Cost Savings”

Situation:
A fintech ingested everything early on.
When bills grew, they removed logs based purely on cost.
Identity logs went first.
Impact:
OAuth abuse persisted quietly for 11 days.
No user-based anomalies, no geographic risk flags, no token alerts.
Governance Lesson (subtle):
Decisions made without classification or purpose create blind spots attackers love.

Success — “The Alert That Knew Its Importance”

Situation:
A healthcare organization:
- Classified assets
- Mapped business services
- Defined clear roles
- Enforced least privilege
Impact:
A lateral movement alert was auto-tagged as reaching an EHR (electronic health record) cluster, the highest tier.
Instantly escalated to P1.
The right team responded. Containment in 30 minutes.
Governance Lesson (subtle):
Clarity enables the right response at the right time.

5) Action Framework — Prevent → Detect → Respond

Prevent (Before Sentinel ingests anything)

Build a real asset list with ownership and criticality.
Define roles, authority, and escalation logic.
Create data purpose contracts for every log source.
Classify telemetry into tiers (non-negotiable → optional).
Set least-privilege access upfront.
Tag systems with business impact levels.

Detect (Once Sentinel is receiving data)

Write detections tied to risks and outcomes, not logs.
Enrich each alert with context: owner, criticality, business service.
Prioritize incidents using business impact × technical severity.
Build dashboards that reveal gaps:
- Unmonitored key assets
- Tier-0 coverage
- Alert quality indicators
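
The "business impact × technical severity" idea can be sketched as a simple multiplication. The weights below are placeholders chosen for illustration; what matters is that impact can outweigh severity.

```python
# Illustrative weights only; tune them to your own classification scheme.
SEVERITY = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"lab": 1, "standard": 2, "business-critical": 4}


def priority_score(severity: str, impact_tier: str) -> int:
    """Business impact multiplied by technical severity; higher means handle first."""
    return SEVERITY[severity] * IMPACT[impact_tier]
```

With these weights, a medium-severity alert on a business-critical system scores 8 and a high-severity alert on a lab VM scores 3, so the payroll server wins the queue.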

Respond

Automate responses for critical systems.
Auto-route based on business impact and role.
After each incident, refine the inventory, rules, and purpose contracts.
Track metrics that matter (time to detect, time to respond, false-positive trend, cost vs value).
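
Auto-routing by business impact can start as a lookup table. The queue names, tier labels, and priority mapping here are hypothetical; the incident dictionary is assumed to be already enriched with `business_service` and `criticality`.

```python
# Hypothetical routing tables; replace with your own services and queues.
ROUTES = {"ehr": "clinical-systems-oncall", "payroll": "finance-it-oncall"}
PRIORITY_BY_TIER = {"tier-0": "P1", "tier-1": "P2"}
DEFAULT_QUEUE = "soc-triage"
DEFAULT_PRIORITY = "P3"


def route_incident(incident: dict) -> dict:
    """Choose a queue from the business service and a priority from the asset tier."""
    queue = ROUTES.get(incident.get("business_service"), DEFAULT_QUEUE)
    priority = PRIORITY_BY_TIER.get(incident.get("criticality"), DEFAULT_PRIORITY)
    return {"queue": queue, "priority": priority}
```

Incidents with no mapping fall back to the general triage queue, so nothing is dropped while the routing table is still incomplete.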

ASCII Flow

[Classified Assets] 
      ↓
[Purpose-Based Data Collection]
      ↓
[Clear Roles + Least Privilege]
      ↓
      Sentinel
      ↓
[Context-Enriched, Business-Aware Alerts]
      ↓
[Correct Team → Fast Response → Reduced Risk]

6) Key Differences to Keep in Mind

Severity vs Priority
Severity = how loud the alert looks
Priority = severity weighed against how important the underlying asset is
Scenario:
High-severity alert on a lab VM < Medium-severity alert on payroll server.
Logs vs Insight
More logs don’t equal more protection — relevance does.
Scenario:
1 TB of firewall logs with no detections is noise, not security.
Equal Coverage vs Informed Coverage
You cannot treat every machine the same.
Classification drives protection.
Scenario:
Crown-jewel systems get 24×7 watch.
Test environments get baseline visibility.

7) Summary Table

Concept (Failure Pattern)	What It Means	Everyday Example	Simple Technical Example
No Asset Inventory	No clarity on what’s important	Fire drill without guest list	Alerts lack owner/criticality
No Operating Model	Undefined roles & escalation	“Office Monday” with no plan	High-severity incidents unassigned
Purpose-Free Logs	Collecting data without outcome	CCTV installed everywhere	Logs ingested but unused
Blind Cost Cutting	Removing logs that matter	Removing exit signs	Disabling identity logs
No Standard RBAC	Over-permissioning	Shared sheet with full access	Analysts editing detection rules
No Risk Mapping	All systems treated equal	Coffee vs payroll outage	Test VM alert = payroll alert

8) What’s Next

Chapter 2 — Building the Asset Truth Layer: The Security Foundation Sentinel Depends On
We’ll design classification, ownership, metadata enrichment, and automated freshness checks: the bedrock of meaningful detection.

9) 🌞 The Last Sun Rays…

Remember our everyday mistakes?

Labeled boxes turn chaos into clarity.
Door sensors only help when you know which door they belong to.
Notifications are useful only when prioritized.

Security works the same way:
Clarity, classification, and accountability reduce risk long before technology does.

Surya

January 1, 2026