Tag: Log Analytics

  • Chapter 2 — How Not to Design Log Sources (with Microsoft Sentinel)

    1) Title + Hook

    Hook:

    • Treating Microsoft Sentinel like a “Dropbox for logs” is like buying a cargo ship to mail a postcard.
    • Pouring every signal into your Security Information and Event Management (SIEM) is like turning on every light in a stadium to find your keys—bright, expensive, and still not helpful.

    This post shows the anti-patterns that quietly destroy SIEM value—and what to do instead.


    2) Why It’s Needed (Context)

    Security teams love visibility. Finance teams hate surprise bills. Engineering hates noise.
    When log-source design is sloppy, you get: runaway costs, alert fatigue, blind spots, and weak investigations.
    Microsoft Sentinel is powerful, but it’s metered. Bad choices at the ingest layer ripple into the detect, respond, and retain layers.


    3) Core Concepts Explained Simply

    A) “Collect-Everything” Ingestion → Huge Costs

    • Technical definition: Ingesting all available telemetry without scoping by use case, severity, or deduplication—often at high-cost data tables (e.g., SecurityAlert, CommonSecurityLog, Syslog with verbose facilities).
    • Everyday example: Subscribing to every streaming service “just in case,” then watching YouTube.
    • Technical example: Forwarding full Endpoint Detection and Response (EDR) raw telemetry and verbose Windows Event Forwarding (WEF) for the same hosts, plus firewall flows at 1:1 cadence—no filters.

    B) Logs Collected but Not Used

    • Technical definition: Sources ingested with no mapped analytics rules, hunting queries, or workbooks.
    • Everyday example: Paying for a gym you never visit.
    • Technical example: Shipping detailed DNS logs but no detections/queries reference them; no Kusto Query Language (KQL) saved searches.
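    To make "collected but not used" concrete, here is a minimal Python sketch of the cross-check: take an inventory of ingested tables and the text of your saved detections, and flag any table no query references. The inventories are hard-coded for illustration; in practice you would pull table names from billing/usage data and rule bodies from the Sentinel API.

```python
# Hypothetical sketch: flag tables that are ingested but referenced by no
# analytics rule, hunting query, or workbook. Both inventories below are
# made up for illustration, not pulled from a real workspace.

def find_unused_tables(ingested_tables, query_texts):
    """Return ingested tables whose names appear in no saved query."""
    referenced = {t for t in ingested_tables
                  if any(t in q for q in query_texts)}
    return sorted(set(ingested_tables) - referenced)

ingested = ["SecurityEvent", "DnsEvents", "CommonSecurityLog", "Syslog"]
queries = [
    "SecurityEvent | where EventID == 4625 | summarize count() by Account",
    "CommonSecurityLog | where DeviceAction == 'deny'",
]

print(find_unused_tables(ingested, queries))  # ['DnsEvents', 'Syslog']
```

    A substring match is crude (table names can appear in comments), but even this level of check surfaces the "gym membership" tables worth reviewing.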

    C) No Retention & Archival Strategy

    • Technical definition: Single retention setting for all tables; no hot/cold split, no Azure Data Explorer (ADX) or Azure Blob/Archive offload, and no legal hold mapping.
    • Everyday example: Keeping all photos on your phone forever—until it’s full right before a trip.
    • Technical example: 180-day retention for chatty Syslog/CommonSecurityLog tables when only 30 days are needed for detections; no archive to cheaper storage.

    D) Custom Logs over Native Connectors

    • Technical definition: Using custom ingestion (HTTP API, custom tables) instead of Microsoft Sentinel data connectors that provide schemas, Advanced Security Information Model (ASIM) normalization, and content packs.
    • Everyday example: Cooking from scratch when a healthy, cheaper meal kit exists.
    • Technical example: Parsing Palo Alto logs via custom functions instead of the native connector and ASIM mapping—losing built-in analytics.

    E) Duplicate Telemetry from Multiple Pipelines

    • Technical definition: Same events reaching Sentinel via parallel paths (e.g., agent + syslog forwarder + third-party pipeline), creating cost bloat and duplicate alerts.
    • Everyday example: Getting the same bank alerts by SMS, email, app, and phone call—annoying and redundant.
    • Technical example: Windows events ingested from both Azure Monitor Agent (AMA) and a legacy Log Analytics agent (MMA); cloud audit logs via both native connector and a custom ingestion app.
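    The duplicate-pipeline problem can be quantified with a simple identity hash: fingerprint each event on its identity fields while ignoring pipeline-specific metadata, and count how often the same fingerprint arrives twice. The field names below are assumptions for illustration, not a real Sentinel schema.

```python
# Hypothetical sketch: detect the same event arriving via two pipelines
# (e.g., AMA and legacy MMA) by hashing identity fields and ignoring
# pipeline metadata. Field names are illustrative.
import hashlib

def event_key(event):
    ident = f"{event['host']}|{event['event_id']}|{event['timestamp']}"
    return hashlib.sha256(ident.encode()).hexdigest()

def dedupe(events):
    seen, unique, dupes = set(), [], 0
    for e in events:
        k = event_key(e)
        if k in seen:
            dupes += 1          # same event, different route
        else:
            seen.add(k)
            unique.append(e)
    return unique, dupes

events = [
    {"host": "dc01", "event_id": 4624, "timestamp": "2024-05-01T10:00:00Z", "pipeline": "AMA"},
    {"host": "dc01", "event_id": 4624, "timestamp": "2024-05-01T10:00:00Z", "pipeline": "MMA"},
    {"host": "web01", "event_id": 4625, "timestamp": "2024-05-01T10:00:01Z", "pipeline": "AMA"},
]
unique, dupes = dedupe(events)
print(len(unique), dupes)  # 2 1
```

    A sustained non-zero duplicate rate is usually the first measurable symptom that two routing paths overlap.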

    F) No Log Validation

    • Technical definition: Lack of pre-ingest checks for schema, timestamps, severity, and required fields; no Service Level Objectives (SLOs) for delay, completeness, or deduplication.
    • Everyday example: Accepting every delivery without checking the box contents.
    • Technical example: Timestamps ingested in local time, breaking correlation; device hostname missing → entity mapping fails; uneven daily volume with silent drops.
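    The validation gaps above can be caught with a handful of cheap pre-ingest checks. The sketch below is a hedged illustration, not a Sentinel schema: the field names and rules are assumptions, and a real pipeline would run equivalent checks in the forwarder before events ever reach the workspace.

```python
# Hypothetical pre-ingest validation sketch: reject or flag records that
# would break correlation downstream (local-time timestamps, missing
# hostnames). Field names and rules are assumptions for illustration.

REQUIRED = ("timestamp", "hostname", "severity")

def validate(record):
    problems = []
    for field in REQUIRED:
        if not record.get(field):
            problems.append(f"missing:{field}")
    ts = record.get("timestamp", "")
    # Require an explicit UTC marker/offset so local-time stamps are caught early.
    if ts and not (ts.endswith("Z") or "+" in ts[10:]):
        problems.append("timestamp:not-utc")
    return problems

good = {"timestamp": "2024-05-01T10:00:00Z", "hostname": "fw01", "severity": "high"}
bad = {"timestamp": "2024-05-01 10:00:00", "hostname": "", "severity": "high"}
print(validate(good))  # []
print(validate(bad))   # ['missing:hostname', 'timestamp:not-utc']
```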

    4) Real-World Case Study

    Failure — The $180k Surprise

    • Situation: A global SaaS firm enabled “everything” from firewalls, proxies, endpoints, and cloud audit logs. No content mapped; no filtering; 180-day retention on all tables.
    • Impact: Monthly Sentinel bill spiked by 60%. Analysts drowned in duplicate alerts; incident MTTR (Mean Time To Remediate) rose from 9h to 16h.
    • Lesson: Cost without context adds negative value. Start with use cases → data needed → retention tiering.

    Success — Use-Case-Driven Design

    • Situation: A fintech defined 12 priority detections (credential misuse, exfiltration, MFA bypass). They mapped required fields to ASIM schemas and trimmed sources to those fields.
    • Impact: 37% ingest reduction, +22% detection precision, 2× faster hunts due to consistent entity mapping.
    • Lesson: Design sources to serve detections, not the other way around.

    5) Action Framework — Prevent → Detect → Respond

    Prevent (Design & Cost Control)

    • Define top 15–20 detections first; list required fields (IP, User, Device, App, Action, Result, Timestamp TZ).
    • Prefer native connectors + ASIM; only custom when absolutely necessary.
    • Build ingestion policies: include tables, exclude noise (facility/level filters, sampling for flows).
    • Implement tiered retention:
      • Hot (30–60 days): detection & investigation.
      • Cold/Archive (6–12 months+): compliance, rare hunts (use ADX/Blob).
    • Prevent duplicates: one authoritative pipeline per source; document routing.
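    The tiered-retention point is easiest to argue with a back-of-envelope comparison. The per-GB rates below are placeholders, not Azure list prices; substitute your region's actual analytics and archive rates before drawing conclusions.

```python
# Back-of-envelope sketch of all-hot vs tiered retention cost.
# The $0.10/GB hot and $0.02/GB archive rates are placeholder assumptions,
# not real Azure pricing.

def monthly_retention_cost(gb_per_day, hot_days, total_days,
                           hot_rate_gb, archive_rate_gb):
    hot_gb = gb_per_day * hot_days
    archive_gb = gb_per_day * max(total_days - hot_days, 0)
    return hot_gb * hot_rate_gb + archive_gb * archive_rate_gb

# 50 GB/day: 180 days all-hot vs 30 days hot + 150 days archived.
all_hot = monthly_retention_cost(50, 180, 180, 0.10, 0.02)
tiered = monthly_retention_cost(50, 30, 180, 0.10, 0.02)
print(round(all_hot), round(tiered))  # 900 300
```

    Even with made-up rates, the shape of the result holds: most of the savings comes from the fact that chatty tables only need to be hot for as long as detections actually query them.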

    Detect (Quality & Coverage)

    • For each table, create at least one analytic rule and one scheduled query that uses it.
    • Enforce schema validation in parsing functions; normalize to ASIM.
    • Track signal health KPIs: daily event count deltas, null critical fields, late arrivals (>10 min), duplication rate.
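    Those KPIs can be computed per table with very little code. The sketch below assumes a simplified record shape (an `id` and a `hostname` field) purely for illustration; thresholds and field names would come from your own data contract.

```python
# Hypothetical signal-health KPI sketch for one table: day-over-day volume
# delta, null rate on a critical field, and duplicate rate. Record shape
# and field names are assumptions for illustration.

def signal_health(today, yesterday_count, key_field="hostname"):
    count = len(today)
    delta = (count - yesterday_count) / yesterday_count if yesterday_count else 1.0
    nulls = sum(1 for r in today if not r.get(key_field)) / count if count else 0.0
    ids = [r["id"] for r in today]
    dup_rate = 1 - len(set(ids)) / len(ids) if ids else 0.0
    return {
        "volume_delta": round(delta, 2),
        "null_rate": round(nulls, 2),
        "dup_rate": round(dup_rate, 2),
    }

today = [
    {"id": "a", "hostname": "fw01"},
    {"id": "b", "hostname": ""},       # null critical field
    {"id": "b", "hostname": "fw02"},   # duplicate id
    {"id": "c", "hostname": "fw03"},
]
print(signal_health(today, yesterday_count=8))
# {'volume_delta': -0.5, 'null_rate': 0.25, 'dup_rate': 0.25}
```

    Alerting on these three numbers per table catches silent drops, broken entity mapping, and pipeline overlap long before an analyst notices.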

    Respond (Operate & Improve)

    • Build a workbook: cost by table, events by connector, rule hits by source.
    • Automate feedback loops: when an analytic fires with low confidence, refine source fields/filters.
    • Quarterly table review: drop unused sources, move low-value logs to archive, merge pipelines.

    6) Key Differences to Keep in Mind

    1. Native vs Custom Ingest — Native brings schemas/content; custom brings flexibility and a maintenance burden.
      • Scenario: Choose native for popular firewalls; custom only when niche vendor lacks support.
    2. Hot vs Cold Retention — Hot is for speed; cold is for savings.
      • Scenario: Keep 30 days hot for IR (Incident Response); move month 2–12 to archive.
    3. Field Completeness vs Volume — Fewer, richer events beat many shallow events.
      • Scenario: Keep DNS with query, response, client IP; drop verbose debug flags.
    4. One Pipeline vs Many — Single route is traceable; multiple routes multiply duplicates.
      • Scenario: Consolidate to AMA; retire MMA and third-party forwarders.
    5. Use-Case vs Curiosity — Detections drive data; curiosity drives cost.
      • Scenario: Only ingest proxy categories needed for DLP (Data Loss Prevention) alerts.

    7) Summary Table

    | Concept | Definition | Everyday Example | Technical Example |
    | --- | --- | --- | --- |
    | Collect-everything ingestion | Ingest all signals without scoping/filters | Subscribing to every streaming service | EDR + WEF + flow logs all verbose to Sentinel |
    | Unused logs | Data with no rules/queries/workbooks | Paying for a gym you don’t use | DNS ingested but no KQL uses it |
    | No retention strategy | One-size retention; no hot/cold | Keeping all photos on phone forever | 180 days on Syslog with no archive |
    | Custom over native | DIY ingestion instead of connectors | Cooking from scratch vs meal kit | Custom Palo Alto parsing vs native + ASIM |
    | Duplicate telemetry | Same events via multiple routes | Bank alerts by SMS/email/app/phone | AMA + MMA + syslog duplicating Windows events |
    | No validation | No checks for schema/time/fields | Accepting packages uninspected | Local-time timestamps; missing hostname |

    8) ASCII Diagram (Signal Health Funnel)

    [Sources] --(validated, deduped)--> [Normalization/ASIM]
                 \--x duplicates drop--/          |
                                                  v
                                       [Analytic Rules & Hunts]
                                                  |
                                                  v
                                        [Incidents & Response]
                                                  |
                                                  v
                               [Retention: Hot 30-60d | Archive 6-12m+]
    

    9) What’s Next

    Next in this series: “Designing a Use-Case-First Log Strategy for Sentinel: From Detections to Data Contracts.” We’ll publish a field-tested worksheet to map detections → fields → connectors → retention.


    🌞 The Last Sun Rays…

    Hook answers:

    • Sentinel isn’t a dump truck for logs; it’s a tuned sensor grid.
    • More light (data) isn’t better if it blinds you; focused beams (use-cases) win.

    Your move:
    What one log source would you drop, filter, or archive tomorrow to improve both signal quality and cost—and what detection would stay intact after that change?

  • Chapter 7 – How Your Platform Health Suite Protects Outcomes, Not Just Logs

    Turning “Sentinel Noise” into an Executive Radar: How Your Platform Health Suite Protects Outcomes, Not Just Logs

    • Think of your platform like an airport: collectors are runways, AMA agents are ground crews, DCRs are flight plans, and analytic rules are the control tower. If any one falters, flights (events) stack up or vanish.
    • Or like a hospital: ingestion is triage, tables are departments, rules are diagnostic protocols, and audit is infection control. A delay at triage hides the real emergencies.
    • Or like a supply chain: collectors are loading docks, EPS is trucks-per-minute, DCRs are routing labels, and rules are QA checkpoints. Mislabel one box and the whole chain gets blind spots.

    This post shows executives how your components form one radar that tells them: Are we safe, is the telemetry flowing, and will detections fire when it matters?


    Why It’s Needed (Context)

    Security leaders don’t buy features; they buy assurance. Modern threats exploit blind spots: missing telemetry, delayed ingestion, disabled rules, or misconfigured agents. Your suite closes those gaps by:

    • Quantifying telemetry health (EPS, size, latency)
    • Surfacing blind spots (non-reporting devices, tables not ingesting)
    • Protecting detection integrity (analytic rule tampering, disabled rules)
    • Assuring platform reliability (Sentinel health, audit, connectors)

    Value to execs: fewer surprises, faster incident confidence, and measurable resilience. In plain terms: “Are we seeing what matters, fast enough, with rules that still work?”


    Core Concepts Explained Simply

    Below, each concept has: Technical Definition → Everyday Example → Technical Example (we’ll reuse the airport analogy).

    1. All Log Collector Ingestion & EPS (events per second)
    • Technical: Measures event throughput per collector to spot saturation/backpressure.
    • Everyday: Runway landings per minute—too few or too many means trouble.
    • Technical ex: 8K EPS baseline; spike to 20K EPS triggers auto-scale review.
    2. Log Size by Log Collector
    • Technical: Tracks daily/rolling log volume per collector for anomalies.
    • Everyday: Cargo tonnage per runway day-over-day.
    • Technical ex: 40% drop on Collector-03 flags upstream firewall change.
    3. Abnormal Workspace Spikes & Dips
    • Technical: Detects ingestion anomalies at the workspace level.
    • Everyday: Airport sees an unexpected lull or surge in flights.
    • Technical ex: z-score/seasonality anomaly on _LogManagement table.
    4. Ingestion Delays
    • Technical: Measures latency from source timestamp to ingestion time.
    • Everyday: Planes circling because runways are jammed.
    • Technical ex: P95 delay > 15 minutes = raise incident sev-2.
    5. OOTB (out-of-the-box) Data Connector Monitor
    • Technical: Checks health/config for native connectors.
    • Everyday: Prebuilt jetways—are they powered and attached?
    • Technical ex: Office 365 connector shows auth failure after token expiry.
    6. Identify Warnings (Incident, Workspace)
    • Technical: Aggregates SOC warnings across incidents and workspace health.
    • Everyday: Tower alerts: “Runway lights flickering, storm inbound.”
    • Technical ex: Sentinel “Data Collection partially degraded” alert surfaces.
    7. Critical Devices Monitoring
    • Technical: Ensures top-tier assets continuously report (DCs, firewalls, crown jewels).
    • Everyday: VIP planes (organ transplants, heads of state) tracked end-to-end.
    • Technical ex: Domain controller event gap >10 min triggers page.
    8. Devices Not Reporting (Windows/Linux/Network)
    • Technical: Detects endpoints missing expected heartbeat/events.
    • Everyday: A parked plane went silent on the tarmac.
    • Technical ex: Syslog source silent for 60 min → create problem record.
    9. Sentinel Health & Audit Monitoring
    • Technical: Internal service checks, API limits, configuration drift, audit events.
    • Everyday: Airport power, radios, and control systems diagnostics.
    • Technical ex: Audit log shows permission changes to Analytics blade.
    10. Unhealthy AMA (Azure Monitor Agent) Agents
    • Technical: Flags agent install/health/config failures.
    • Everyday: Ground crew short-staffed or missing tools.
    • Technical ex: AMA heartbeat present but data channels failing (DCR mismatch).
    11. Data Collection Rule (DCR) Monitoring
    • Technical: Validates DCR consistency and scope across resources.
    • Everyday: Flight plans correctly applied to the right aircraft.
    • Technical ex: New subnet lacks DCR mapping → 0 logs from that segment.
    12. Unauthorized Modification of Use Case
    • Technical: Detects tampering to detection rules (query edits, schedules).
    • Everyday: Someone rewrote tower procedures without approval.
    • Technical ex: KQL (Kusto Query Language) diff shows removed join on identity table.
    13. New Log Collector Out of Intended List
    • Technical: Flags newly registered collectors not in the approved inventory.
    • Everyday: An unlisted aircraft lands without a flight plan.
    • Technical ex: Unknown syslog IP starts forwarding—validate source & owner.
    14. Collector Health (No Heartbeat)
    • Technical: Collector process/host unavailable.
    • Everyday: Runway lights off—no signals.
    • Technical ex: VM down event correlates with EPS collapse.
    15. Collector Health (No Logs)
    • Technical: Collector up but not sending logs.
    • Everyday: Runway open, but no planes using it.
    • Technical ex: Ingest pipeline credentials expired; heartbeat OK, EPS 0.
    16. Tables Not Ingesting Logs
    • Technical: Detects schema-level gaps (e.g., SecurityEvent empty).
    • Everyday: A department with no patients for hours—impossible.
    • Technical ex: FirewallLogs table flatline after parser update.
    17. Analytic Rule Disabled/Deleted
    • Technical: Ensures detections remain active & intact.
    • Everyday: Tower turned off weather radar.
    • Technical ex: High-severity rule disabled by change window w/o approval.
    18. Sentinel Health & Audit (Platform performance)
    • Technical: (Aggregated view) Platform performance, limits, and governance.
    • Everyday: Airport operations dashboard for execs.
    • Technical ex: API throttle near limits during IR surge; scale-out advised.
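    Two of the monitors above reduce to small, testable calculations: P95 ingestion delay (concept 4) and a z-score check on daily event counts (concept 3). The sketch below uses made-up data and the thresholds quoted in the text; it is an illustration of the math, not the production implementation.

```python
# Sketch of two health signals from the list above: nearest-rank P95
# ingestion delay, and a z-score anomaly check on daily counts.
# Data and thresholds are illustrative.
import math
import statistics

def p95(values):
    ordered = sorted(values)
    idx = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank percentile
    return ordered[idx]

def zscore(latest, history):
    return (latest - statistics.mean(history)) / statistics.stdev(history)

delays_min = [2, 3, 3, 4, 5, 5, 6, 7, 8, 40]   # minutes; one straggler batch
history = [100, 102, 98, 101, 99]              # daily counts (thousands)

print(p95(delays_min))                # 40 -> breaches the 15-minute SLO
print(round(zscore(60, history), 1))  # large negative z -> abnormal dip
```

    Note how the P95 exposes what a daily total hides: nine of ten batches landed in minutes, but the straggler is exactly the gap the ransomware case study below exploits.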

    Real-World Case Study

    Failure Case — Ransomware Quietly Gained Time

    • Situation: EPS looked “normal” per day totals, but ingestion delays at P95 grew to 35–40 minutes after a network change. Analytic rules were fine, but they fired late.
    • Impact: SOC saw lateral movement an hour after the fact. Restores needed; downtime cost and reputational hit.
    • Lesson: Volume ≠ timeliness. Latency is a first-class SLO (service-level objective). Your “Ingestion Delays” and “Abnormal Spikes & Dips” would’ve caught this earlier.

    Success Case — Misconfigured DCR Contained in Minutes

    • Situation: A new Linux subnet went live; DCR Monitoring flagged no mapping. Devices Not Reporting confirmed silent hosts; Tables Not Ingesting showed flat SecurityEventLinux table.
    • Impact: Fix within 20 minutes; coverage gap avoided during a vendor compromise alert wave.
    • Lesson: Layered monitors (DCR + device heartbeat + table health) shrink MTTR (mean time to repair).

    Action Framework — Prevent → Detect → Respond

    Prevent

    • Standardize collector baselines (EPS, log size).
    • Enforce DCR-as-code with approvals; alert on drift.
    • Lock analytic rules (change control + audit).
    • Maintain approved collector inventory; block unknowns.

    Detect

    • SLOs: EPS ±25% anomaly, P95 delay <15 min, table freshness <10 min.
    • Triangulate: device heartbeat + table freshness + rule health.
    • Prioritize critical devices and OOTB connectors with higher alert sensitivity.
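    The three SLOs above can be evaluated in a single pass per collector or table. The sketch below is a hedged illustration with invented metric names; the thresholds mirror the ones stated in the bullets.

```python
# Hypothetical single-pass SLO evaluation: EPS within +/-25% of baseline,
# P95 ingestion delay under 15 minutes, table freshness under 10 minutes.
# Metric names are assumptions for illustration.

SLOS = {
    "eps_deviation_max": 0.25,
    "p95_delay_min_max": 15,
    "freshness_min_max": 10,
}

def evaluate_slos(eps, eps_baseline, p95_delay_min, freshness_min):
    breaches = []
    if abs(eps - eps_baseline) / eps_baseline > SLOS["eps_deviation_max"]:
        breaches.append("eps")
    if p95_delay_min > SLOS["p95_delay_min_max"]:
        breaches.append("ingestion_delay")
    if freshness_min > SLOS["freshness_min_max"]:
        breaches.append("table_freshness")
    return breaches

# Healthy collector vs one that should page the SOC.
print(evaluate_slos(eps=7800, eps_baseline=8000, p95_delay_min=7, freshness_min=4))
print(evaluate_slos(eps=20000, eps_baseline=8000, p95_delay_min=35, freshness_min=12))
```

    Returning the list of breached SLOs (rather than a single pass/fail) is what lets the Respond playbooks below route each breach to a different runbook.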

    Respond

    • Playbooks:
      1. No Heartbeat: auto-scale or restart collector VM; reroute sources.
      2. No Logs: token refresh, pipeline test, sample event injection.
      3. DCR Drift: auto-rollback via Git; notify change owner.
      4. Rule Tampering: revert from versioned store; open P1; audit who/when.
    • Dashboards for execs: “Are we blind anywhere?” + “How fast are we seeing?” (MTTD/MTTR for telemetry issues).

    Key Differences to Keep in Mind

    1. No Heartbeat vs No Logs — Down host vs live host with broken pipeline.
      Scenario: Heartbeat OK but EPS 0 → investigate tokens/parsers, not VM.
    2. Volume vs Timeliness — Total GB ≠ real-time visibility.
      Scenario: Daily volume steady but P95 delay 30 min → IR effectiveness drops.
    3. Device Health vs Table Freshness — Endpoints alive doesn’t mean data landed.
      Scenario: AMA OK; SecurityEvent table flat → DCR scope missing.
    4. Anomaly vs Planned Change — Spikes can be normal during patch night.
      Scenario: Annotate change windows to suppress false noise.
    5. Connector Status vs Detection Integrity — Data present doesn’t ensure rules run.
      Scenario: Rule disabled after tuning—coverage illusion.
    6. Approved vs Rogue Collectors — Inventory matters.
      Scenario: New IP starts sending logs → verify ownership before trusting data.
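    Differences 1–3 above are really one triage decision tree: heartbeat, EPS, and table freshness combine into a single diagnosis. A minimal sketch, with invented states and thresholds, to show how the signals compose:

```python
# Sketch of collector triage from three signals. States, messages, and the
# 10-minute freshness threshold are assumptions, not a product API.

def triage(heartbeat_ok, eps, table_fresh_min):
    if not heartbeat_ok:
        return "host-down"          # restart/reroute collector VM
    if eps == 0:
        return "pipeline-broken"    # check tokens, parsers, DCR scope
    if table_fresh_min > 10:
        return "landing-delayed"    # check ingestion latency / routing
    return "healthy"

print(triage(False, 0, 0))      # host-down
print(triage(True, 0, 0))       # pipeline-broken
print(triage(True, 5000, 25))   # landing-delayed
print(triage(True, 5000, 2))    # healthy
```

    The key point is ordering: a live heartbeat with zero EPS sends you to tokens and parsers, not to the VM, exactly as in the first scenario above.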

    (ASCII) Executive Dashboard Sketch

    +---------------- Sentinel Health & Telemetry Radar ----------------+
    |  Ingestion SLOs     |  Collectors          |  Coverage           |
    |  EPS: 7.8k (↔)      |  HB: 12/12 (✓)       |  Critical: 48/50 (⚠)|
    |  P95 Delay: 7m (✓)  |  No Logs: 1 (⚠)      |  Tables Fresh: 96%  |
    |  Spikes/Dips: OK    |  Unknown: 0 (✓)      |  DCR Drift: 0 (✓)   |
    +------------------------------------------------+------------------+
    |  Detection Integrity                             |  Audit & Changes|
    |  Rules Disabled: 0 (✓)  Tamper Attempts: 1 (⚠)  |  Priv Changes: 2 |
    +------------------------------------------------+------------------+
    |  Top Alerts: Ingestion Delay @ COL-03 (sev-2) | ETA fix: 10m     |
    +-------------------------------------------------------------------+
    

    Summary Table

    | Concept | Definition | Everyday Example | Technical Example |
    | --- | --- | --- | --- |
    | All Log Collector Ingestion & EPS | Event throughput per collector | Landings/minute on a runway | Baseline 8K EPS; sustained 20K triggers scale |
    | Log Size by Log Collector | Daily/rolling volume per collector | Cargo tonnage per runway | 40% drop on COL-03 after firewall change |
    | Abnormal Workspace Spikes & Dips | Workspace-level ingestion anomalies | Airport-wide lull/surge | z-score anomaly on _LogManagement |
    | Ingestion Delays | Source-to-ingest latency | Planes circling | P95 delay >15m = sev-2 |
    | OOTB Data Connector Monitor | Health of native connectors | Prebuilt jetways attached | O365 token expiry alarm |
    | Identify Warnings (Incident, Workspace) | Aggregated SOC/platform warnings | Tower status alerts | Sentinel “partial degradation” surfaced |
    | Critical Devices Monitoring | Coverage for crown-jewel assets | VIP flights tracked | DC event gap >10m page |
    | Devices Not Reporting | Missing endpoint telemetry | Silent plane on tarmac | Syslog source 60m silent |
    | Sentinel Health & Audit Monitoring | Internal checks & audit | Airport systems diagnostics | Permission change on Analytics blade |
    | Unhealthy AMA Agents | Agent failure/misconfig | Ground crew missing tools | Heartbeat OK; channel fail |
    | Data Collection Rule Monitoring | DCR consistency & scope | Correct flight plans | New subnet lacks DCR |
    | Unauthorized Modification of Use Case | Rule tampering detection | Unapproved tower procedure | KQL diff shows removed join |
    | New Log Collector Out of Intended List | Unapproved collectors | Unlisted aircraft lands | Unknown syslog IP sending |
    | Collector Health (No Heartbeat) | Collector host down | Runway lights off | VM down + EPS collapse |
    | Collector Health (No Logs) | Host up, no events sent | Open runway, no planes | Token/parsers expired |
    | Tables Not Ingesting Logs | Schema/table freshness gap | Department with no patients | FirewallLogs flatline |
    | Analytic Rule Disabled/Deleted | Detections turned off/removed | Weather radar off | High-sev rule disabled |
    | Sentinel Health & Audit (Performance) | Aggregated platform performance | Airport ops dashboard | Near API throttle during IR |

    What’s Next

    In the next post of this mini-series, we’ll go from health to outcomes: mapping telemetry SLOs to detection KPIs (true positive rate, MTTD/MTTR for security events), and how to automate executive scorecards that tie telemetry health → detection fidelity → business risk.


    🌞 The Last Sun Rays…

    Hook answers:

    • Airports, hospitals, supply chains—all fail the same way: silent delays and hidden blind spots. Your suite exposes both in real time and proves the control tower (Sentinel) is awake.
    • Executives get a single radar: Are we seeing the right data, fast enough, with rules that still work—and will tomorrow?

    Your move: If you could add one metric to your exec dashboard tomorrow, which would it be—P95 ingestion delay, critical device freshness, or rules integrity drift?