Turning “Sentinel Noise” into an Executive Radar: How Your Platform Health Suite Protects Outcomes, Not Just Logs
- Think of your platform like an airport: collectors are runways, AMA agents are ground crews, DCRs are flight plans, and analytic rules are the control tower. If any one falters, flights (events) stack up or vanish.
- Or like a hospital: ingestion is triage, tables are departments, rules are diagnostic protocols, and audit is infection control. A delay at triage hides the real emergencies.
- Or like a supply chain: collectors are loading docks, EPS is trucks-per-minute, DCRs are routing labels, and rules are QA checkpoints. Mislabel one box and the whole chain gets blind spots.
This session shows executives how your components form one radar that tells them: Are we safe, is the telemetry flowing, and will detections fire when it matters?
Why It’s Needed (Context)
Security leaders don’t buy features; they buy assurance. Modern threats exploit blind spots: missing telemetry, delayed ingestion, disabled rules, or misconfigured agents. Your suite closes those gaps by:
- Quantifying telemetry health (EPS, size, latency)
- Surfacing blind spots (non-reporting devices, tables not ingesting)
- Protecting detection integrity (analytic rule tampering, disabled rules)
- Assuring platform reliability (Sentinel health, audit, connectors)
Value to execs: fewer surprises, faster incident confidence, and measurable resilience. In plain terms: “Are we seeing what matters, fast enough, with rules that still work?”
Core Concepts Explained Simply
Below, each concept has: Technical Definition → Everyday Example → Technical Example (we’ll reuse the airport analogy).
- All Log Collector Ingestion & EPS (events per second)
  - Technical: Measures event throughput per collector to spot saturation/backpressure.
  - Everyday: Runway landings per minute; too few or too many means trouble.
  - Technical ex: 8K EPS baseline; spike to 20K EPS triggers auto-scale review.
- Log Size by Log Collector
  - Technical: Tracks daily/rolling log volume per collector for anomalies.
  - Everyday: Cargo tonnage per runway day-over-day.
  - Technical ex: 40% drop on Collector-03 flags upstream firewall change.
- Abnormal Workspace Spikes & Dips
  - Technical: Detects ingestion anomalies at the workspace level.
  - Everyday: Airport sees an unexpected lull or surge in flights.
  - Technical ex: z-score/seasonality anomaly on LogManagement volume in the Usage table.
- Ingestion Delays
  - Technical: Measures latency from source timestamp to ingestion time.
  - Everyday: Planes circling because runways are jammed.
  - Technical ex: P95 delay > 15 minutes = raise incident sev-2.
- OOTB (out-of-the-box) Data Connector Monitor
  - Technical: Checks health/config for native connectors.
  - Everyday: Prebuilt jetways: are they powered and attached?
  - Technical ex: Office 365 connector shows auth failure after token expiry.
- Identify Warnings (Incident, Workspace)
  - Technical: Aggregates SOC warnings across incidents and workspace health.
  - Everyday: Tower alerts: “Runway lights flickering, storm inbound.”
  - Technical ex: Sentinel “Data Collection partially degraded” alert surfaces.
- Critical Devices Monitoring
  - Technical: Ensures top-tier assets continuously report (DCs, firewalls, crown jewels).
  - Everyday: VIP planes (organ transplants, heads of state) tracked end-to-end.
  - Technical ex: Domain controller event gap >10 min triggers page.
- Devices Not Reporting (Windows/Linux/Network)
  - Technical: Detects endpoints missing expected heartbeat/events.
  - Everyday: A parked plane went silent on the tarmac.
  - Technical ex: Syslog source silent for 60 min → create problem record.
- Sentinel Health & Audit Monitoring
  - Technical: Internal service checks, API limits, configuration drift, audit events.
  - Everyday: Airport power, radios, and control systems diagnostics.
  - Technical ex: Audit log shows permission changes to Analytics blade.
- Unhealthy AMA (Azure Monitor Agent) Agents
  - Technical: Flags agent install/health/config failures.
  - Everyday: Ground crew short-staffed or missing tools.
  - Technical ex: AMA heartbeat present but data channels failing (DCR mismatch).
- Data Collection Rule (DCR) Monitoring
  - Technical: Validates DCR consistency and scope across resources.
  - Everyday: Flight plans correctly applied to the right aircraft.
  - Technical ex: New subnet lacks DCR mapping → 0 logs from that segment.
- Unauthorized Modification of Use Case
  - Technical: Detects tampering to detection rules (query edits, schedules).
  - Everyday: Someone rewrote tower procedures without approval.
  - Technical ex: KQL (Kusto Query Language) diff shows removed join on identity table.
- New Log Collector Out of Intended List
  - Technical: Flags newly registered collectors not in the approved inventory.
  - Everyday: An unlisted aircraft lands without a flight plan.
  - Technical ex: Unknown syslog IP starts forwarding; validate source and owner.
- Collector Health (No Heartbeat)
  - Technical: Collector process/host unavailable.
  - Everyday: Runway lights off, no signals.
  - Technical ex: VM down event correlates with EPS collapse.
- Collector Health (No Logs)
  - Technical: Collector up but not sending logs.
  - Everyday: Runway open, but no planes using it.
  - Technical ex: Ingest pipeline credentials expired; heartbeat OK, EPS 0.
- Tables Not Ingesting Logs
  - Technical: Detects schema-level gaps (e.g., SecurityEvent empty).
  - Everyday: A department with no patients for hours: impossible.
  - Technical ex: FirewallLogs table flatline after parser update.
- Analytic Rule Disabled/Deleted
  - Technical: Ensures detections remain active and intact.
  - Everyday: Tower turned off weather radar.
  - Technical ex: High-severity rule disabled during a change window without approval.
- Sentinel Health & Audit (Platform performance)
  - Technical: (Aggregated view) Platform performance, limits, and governance.
  - Everyday: Airport operations dashboard for execs.
  - Technical ex: API throttle near limits during IR surge; scale-out advised.
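To make the "Abnormal Workspace Spikes & Dips" idea concrete, here is a minimal Python sketch of a z-score check on hourly event counts. The function name `zscore_anomaly`, the threshold of 3, and the sample numbers are assumptions for illustration and are not taken from the suite itself. A production monitor would also need to handle seasonality, which this sketch ignores.

```python
from statistics import mean, stdev


def zscore_anomaly(history, current, threshold=3.0):
    """Return (z, label) for the current volume measured against recent history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: treat any change as an extreme deviation.
        if current == mu:
            z = 0.0
        else:
            z = float("inf") if current > mu else float("-inf")
    else:
        z = (current - mu) / sigma
    if z >= threshold:
        return z, "spike"
    if z <= -threshold:
        return z, "dip"
    return z, "normal"


# Hypothetical hourly event counts for one workspace (illustrative only).
hourly_events = [7900, 8100, 8000, 8200, 7800, 8050, 7950, 8000]

for candidate in (8100, 20000, 1200):
    z, label = zscore_anomaly(hourly_events, candidate)
    print(candidate, label)
```

The same check can run per collector (EPS or log size) as well as per workspace; only the input series changes.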
Real-World Case Study
Failure Case — Ransomware Quietly Gained Time
- Situation: EPS looked “normal” in daily totals, but P95 ingestion delay grew to 35–40 minutes after a network change. Analytic rules were intact, but they fired late.
- Impact: The SOC saw lateral movement an hour after the fact. Restores were needed, with downtime costs and a reputational hit.
- Lesson: Volume ≠ timeliness. Latency is a first-class SLO (service-level objective). Your “Ingestion Delays” and “Abnormal Spikes & Dips” would’ve caught this earlier.
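The "volume ≠ timeliness" lesson can be illustrated with a short Python sketch. It computes a nearest-rank P95 of the gap between source timestamp and ingestion time for two days that carry the same number of events. The function name, the sample timestamps, and the latency mix are hypothetical; only the 15-minute threshold and the roughly 38-minute degraded delay echo the figures used in this post.

```python
import math
from datetime import datetime, timedelta


def p95_delay_minutes(events):
    """events: list of (source_time, ingested_time). Returns nearest-rank P95 in minutes."""
    delays = sorted((ingested - source).total_seconds() / 60 for source, ingested in events)
    rank = math.ceil(95 * len(delays) / 100)  # 1-based nearest rank
    return delays[rank - 1]


base = datetime(2024, 1, 1, 9, 0)
# Same event count on both days; only the latency profile differs.
healthy = [(base, base + timedelta(minutes=2))] * 100
degraded = ([(base, base + timedelta(minutes=2))] * 90
            + [(base, base + timedelta(minutes=38))] * 10)

for name, day in (("healthy", healthy), ("degraded", degraded)):
    p95 = p95_delay_minutes(day)
    status = "breach" if p95 > 15 else "ok"
    print(f"{name}: {len(day)} events, P95 delay {p95:.0f} min, {status}")
```

A volume-only monitor would report both days as identical, which is why latency needs its own objective.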
Success Case — Misconfigured DCR Contained in Minutes
- Situation: A new Linux subnet went live; DCR Monitoring flagged no mapping. Devices Not Reporting confirmed silent hosts; Tables Not Ingesting showed a flat Syslog table.
- Impact: Fix within 20 minutes; coverage gap avoided during a vendor compromise alert wave.
- Lesson: Layered monitors (DCR + device heartbeat + table health) shrink MTTR (mean time to repair).
Action Framework — Prevent → Detect → Respond
Prevent
- Standardize collector baselines (EPS, log size).
- Enforce DCR-as-code with approvals; alert on drift.
- Lock analytic rules (change control + audit).
- Maintain approved collector inventory; block unknowns.
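Two of the preventive controls above, drift alerts for DCR-as-code and an approved collector inventory, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the rule dictionaries are a simplified, hypothetical shape (a real DCR has a richer schema), and the function names and IP addresses are invented for the example.

```python
import hashlib
import json


def fingerprint(rule):
    """Hash a rule definition in canonical form so key order does not matter."""
    canonical = json.dumps(rule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(approved_rule, deployed_rule):
    """True when the deployed definition no longer matches the approved one."""
    return fingerprint(approved_rule) != fingerprint(deployed_rule)


def unapproved_collectors(approved, observed):
    """Observed collector addresses that are missing from the approved inventory."""
    return sorted(set(observed) - set(approved))


# Hypothetical, simplified rule shape for illustration only.
approved_dcr = {"name": "dcr-linux-syslog", "streams": ["Syslog"],
                "scope": ["subnet-a", "subnet-b"]}
deployed_dcr = {"scope": ["subnet-a"], "streams": ["Syslog"],
                "name": "dcr-linux-syslog"}

print("DCR drift:", has_drifted(approved_dcr, deployed_dcr))
print("Unknown collectors:", unapproved_collectors(
    {"10.0.1.10", "10.0.1.11"}, ["10.0.1.10", "10.0.9.77", "10.0.1.10"]))
```

Hashing a canonical form keeps the comparison cheap; the approved copy would normally come from the version-controlled repository.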
Detect
- SLOs: EPS ±25% anomaly, P95 delay <15 min, table freshness <10 min.
- Triangulate: device heartbeat + table freshness + rule health.
- Prioritize critical devices and OOTB connectors with higher alert sensitivity.
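The three SLOs listed above can be expressed as one small check. The Python sketch below assumes a hypothetical `TelemetrySnapshot` structure and breach labels; the thresholds (EPS within ±25% of baseline, P95 delay under 15 minutes, table freshness under 10 minutes) are the ones stated in this section.

```python
from dataclasses import dataclass


@dataclass
class TelemetrySnapshot:
    eps: float                  # current events per second
    eps_baseline: float         # expected events per second
    p95_delay_min: float        # P95 source-to-ingest latency, minutes
    table_freshness_min: float  # minutes since the newest row landed


def slo_breaches(snapshot):
    """Return the list of SLOs the snapshot violates (empty list means healthy)."""
    breaches = []
    deviation = abs(snapshot.eps - snapshot.eps_baseline) / snapshot.eps_baseline
    if deviation > 0.25:
        breaches.append("eps_anomaly")
    if snapshot.p95_delay_min >= 15:
        breaches.append("p95_delay")
    if snapshot.table_freshness_min >= 10:
        breaches.append("table_freshness")
    return breaches


# Illustrative values only.
print(slo_breaches(TelemetrySnapshot(7800, 8000, 7, 4)))
print(slo_breaches(TelemetrySnapshot(5000, 8000, 38, 12)))
```

Returning a list rather than a single flag lets a dashboard show which objective failed, not just that something did.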
Respond
- Playbooks:
  - No Heartbeat: auto-scale or restart collector VM; reroute sources.
  - No Logs: token refresh, pipeline test, sample event injection.
  - DCR Drift: auto-rollback via Git; notify change owner.
  - Rule Tampering: revert from versioned store; open P1; audit who/when.
- Dashboards for execs: “Are we blind anywhere?” + “How fast are we seeing?” (MTTD/MTTR for telemetry issues).
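For the Rule Tampering playbook, the core step is comparing the deployed rule text with the version held in the approved store. A minimal Python sketch using the standard-library `difflib` module follows. The KQL query is an illustrative example written for this sketch, not a rule from the suite, and the helper names are hypothetical.

```python
import difflib


def rule_diff(approved_query, deployed_query):
    """Unified diff between the approved and the deployed rule text."""
    return list(difflib.unified_diff(
        approved_query.splitlines(),
        deployed_query.splitlines(),
        fromfile="approved",
        tofile="deployed",
        lineterm="",
    ))


def removed_lines(diff):
    """Lines present in the approved rule but missing from the deployed one."""
    # The first two entries are the '---' / '+++' file headers, so skip them.
    return [line[1:] for line in diff[2:] if line.startswith("-")]


# Illustrative query text only.
approved = "\n".join([
    "SigninLogs",
    '| where ResultType != "0"',
    "| join kind=inner (IdentityInfo) on $left.UserPrincipalName == $right.AccountUPN",
    "| summarize Failures = count() by UserPrincipalName",
])
deployed = "\n".join([
    "SigninLogs",
    '| where ResultType != "0"',
    "| summarize Failures = count() by UserPrincipalName",
])

diff = rule_diff(approved, deployed)
for line in diff:
    print(line)
if removed_lines(diff):
    print("ALERT: deployed rule is missing approved logic; revert and audit the change")
```

An empty diff means the rule is intact; any removed line is a reason to revert from the versioned store and check the audit trail for who made the change and when.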
Key Differences to Keep in Mind
- No Heartbeat vs No Logs: down host vs live host with broken pipeline.
  - Scenario: Heartbeat OK but EPS 0 → investigate tokens/parsers, not VM.
- Volume vs Timeliness: total GB ≠ real-time visibility.
  - Scenario: Daily volume steady but P95 delay 30 min → IR effectiveness drops.
- Device Health vs Table Freshness: endpoints alive doesn’t mean data landed.
  - Scenario: AMA OK; SecurityEvent table flat → DCR scope missing.
- Anomaly vs Planned Change: spikes can be normal during patch night.
  - Scenario: Annotate change windows to suppress false noise.
- Connector Status vs Detection Integrity: data present doesn’t ensure rules run.
  - Scenario: Rule disabled after tuning, which creates a coverage illusion.
- Approved vs Rogue Collectors: inventory matters.
  - Scenario: New IP starts sending logs → verify ownership before trusting data.
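The first three distinctions above reduce to a small decision order: check the host, then the pipeline, then whether data actually landed. The Python sketch below is one hypothetical way to encode that order; the function name, inputs, and return strings are assumptions for illustration.

```python
def triage(heartbeat_ok, eps, table_fresh):
    """Map three health signals to the most likely failure and first action."""
    if not heartbeat_ok:
        return "no_heartbeat: check or restart the collector host, reroute sources"
    if eps == 0:
        return "no_logs: host is up, check pipeline tokens and parsers"
    if not table_fresh:
        return "table_stale: events are sent but not landing, check DCR scope"
    return "healthy"


# Illustrative signal combinations.
print(triage(heartbeat_ok=False, eps=0, table_fresh=False))
print(triage(heartbeat_ok=True, eps=0, table_fresh=False))
print(triage(heartbeat_ok=True, eps=7800, table_fresh=False))
print(triage(heartbeat_ok=True, eps=7800, table_fresh=True))
```

Ordering matters here: a down host also shows EPS 0 and stale tables, so testing heartbeat first avoids chasing a token problem when the VM is simply offline.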
(ASCII) Executive Dashboard Sketch
+---------------- Sentinel Health & Telemetry Radar ----------------+
| Ingestion SLOs        | Collectors        | Coverage              |
| EPS: 7.8k (↔)         | HB: 12/12 (✓)     | Critical: 48/50 (⚠)   |
| P95 Delay: 7m (✓)     | No Logs: 1 (⚠)    | Tables Fresh: 96%     |
| Spikes/Dips: OK       | Unknown: 0 (✓)    | DCR Drift: 0 (✓)      |
+------------------------------------------------+------------------+
| Detection Integrity                            | Audit & Changes  |
| Rules Disabled: 0 (✓)  Tamper Attempts: 1 (⚠)  | Priv Changes: 2  |
+------------------------------------------------+------------------+
| Top Alerts: Ingestion Delay @ COL-03 (sev-2)   | ETA fix: 10m     |
+-------------------------------------------------------------------+
Summary Table
| Concept | Definition | Everyday Example | Technical Example |
|---|---|---|---|
| All Log Collector Ingestion & EPS | Event throughput per collector | Landings/minute on a runway | Baseline 8K EPS; sustained 20K triggers scale |
| Log Size by Log Collector | Daily/rolling volume per collector | Cargo tonnage per runway | 40% drop on COL-03 after firewall change |
| Abnormal Workspace Spikes & Dips | Workspace-level ingestion anomalies | Airport-wide lull/surge | z-score anomaly on LogManagement volume (Usage table) |
| Ingestion Delays | Source-to-ingest latency | Planes circling | P95 delay >15m = sev-2 |
| OOTB Data Connector Monitor | Health of native connectors | Prebuilt jetways attached | O365 token expiry alarm |
| Identify Warnings (Incident, Workspace) | Aggregated SOC/platform warnings | Tower status alerts | Sentinel “partial degradation” surfaced |
| Critical Devices Monitoring | Coverage for crown-jewel assets | VIP flights tracked | DC event gap >10m page |
| Devices Not Reporting | Missing endpoint telemetry | Silent plane on tarmac | Syslog source 60m silent |
| Sentinel Health & Audit Monitoring | Internal checks & audit | Airport systems diagnostics | Permission change on Analytics blade |
| Unhealthy AMA Agents | Agent failure/misconfig | Ground crew missing tools | Heartbeat OK; channel fail |
| Data Collection Rule Monitoring | DCR consistency & scope | Correct flight plans | New subnet lacks DCR |
| Unauthorized Modification of Use Case | Rule tampering detection | Unapproved tower procedure | KQL diff shows removed join |
| New Log Collector Out of Intended List | Unapproved collectors | Unlisted aircraft lands | Unknown syslog IP sending |
| Collector Health (No Heartbeat) | Collector host down | Runway lights off | VM down + EPS collapse |
| Collector Health (No Logs) | Host up, no events sent | Open runway, no planes | Pipeline credentials expired; heartbeat OK, EPS 0 |
| Tables Not Ingesting Logs | Schema/table freshness gap | Department with no patients | FirewallLogs flatline |
| Analytic Rule Disabled/Deleted | Detections turned off/removed | Weather radar off | High-sev rule disabled |
| Sentinel Health & Audit (Performance) | Aggregated platform performance | Airport ops dashboard | Near API throttle during IR |
What’s Next
In the next post of this mini-series, we’ll go from health to outcomes: mapping telemetry SLOs to detection KPIs (true positive rate, MTTD/MTTR for security events), and how to automate executive scorecards that tie telemetry health → detection fidelity → business risk.
🌞 The Last Sun Rays…
Hook answers:
- Airports, hospitals, supply chains—all fail the same way: silent delays and hidden blind spots. Your suite exposes both in real time and proves the control tower (Sentinel) is awake.
- Executives get a single radar: Are we seeing the right data, fast enough, with rules that still work—and will tomorrow?
Your move: If you could add one metric to your exec dashboard tomorrow, which would it be—P95 ingestion delay, critical device freshness, or rules integrity drift?

By profession, a Cloud Security Consultant; by passion, a storyteller. Through SunExplains, I explain security in simple, relatable terms, connecting technology, trust, and everyday life.