Tag: Log Analytics

  • Chapter 2 — How Not to Design Log Sources (with Microsoft Sentinel)

    1) Title + Hook

    Hook:

    • Treating Microsoft Sentinel like a “Dropbox for logs” is like buying a cargo ship to mail a postcard.
    • Pouring every signal into your Security Information and Event Management (SIEM) is like turning on every light in a stadium to find your keys—bright, expensive, and still not helpful.

    This post shows the anti-patterns that quietly destroy SIEM value—and what to do instead.


    2) Why It’s Needed (Context)

    Security teams love visibility. Finance teams hate surprise bills. Engineering hates noise.
    When log-source design is sloppy, you get: runaway costs, alert fatigue, blind spots, and weak investigations.
    Microsoft Sentinel is powerful, but it’s metered. Bad choices at the ingest layer ripple into the detect, respond, and retain layers.


    3) Core Concepts Explained Simply

    A) “Collect-Everything” Ingestion → Huge Costs

    • Technical definition: Ingesting all available telemetry without scoping by use case, severity, or deduplication—often at high-cost data tables (e.g., SecurityAlert, CommonSecurityLog, Syslog with verbose facilities).
    • Everyday example: Subscribing to every streaming service “just in case,” then watching YouTube.
    • Technical example: Forwarding full Endpoint Detection and Response (EDR) raw telemetry and verbose Windows Event Forwarding (WEF) for the same hosts, plus firewall flows at 1:1 cadence—no filters.

    B) Logs Collected but Not Used

    • Technical definition: Sources ingested with no mapped analytics rules, hunting queries, or workbooks.
    • Everyday example: Paying for a gym you never visit.
    • Technical example: Shipping detailed DNS logs but no detections/queries reference them; no Kusto Query Language (KQL) saved searches.
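    To make "collected but not used" concrete, here is a minimal Python sketch of the cross-check: take an inventory of ingested tables and the text of your saved detections, and flag any table no query references. The inventories are hard-coded for illustration; in practice you would pull table names from billing/usage data and rule bodies from the Sentinel API.

```python
# Hypothetical sketch: flag tables that are ingested but referenced by no
# analytics rule, hunting query, or workbook. Both inventories below are
# made up for illustration, not pulled from a real workspace.

def find_unused_tables(ingested_tables, query_texts):
    """Return ingested tables whose names appear in no saved query."""
    referenced = {t for t in ingested_tables
                  if any(t in q for q in query_texts)}
    return sorted(set(ingested_tables) - referenced)

ingested = ["SecurityEvent", "DnsEvents", "CommonSecurityLog", "Syslog"]
queries = [
    "SecurityEvent | where EventID == 4625 | summarize count() by Account",
    "CommonSecurityLog | where DeviceAction == 'deny'",
]

print(find_unused_tables(ingested, queries))  # ['DnsEvents', 'Syslog']
```

    A substring match is crude (table names can appear in comments), but even this level of check surfaces the "gym membership" tables worth reviewing.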

    C) No Retention & Archival Strategy

    • Technical definition: Single retention setting for all tables; no hot/cold split, no Azure Data Explorer (ADX) or Azure Blob/Archive offload, and no legal hold mapping.
    • Everyday example: Keeping all photos on your phone forever—until it’s full right before a trip.
    • Technical example: 180-day retention for chatty Syslog/CommonSecurityLog tables when only 30 days are needed for detections; no archive to cheaper storage.

    D) Custom Logs over Native Connectors

    • Technical definition: Using custom ingestion (HTTP API, custom tables) instead of Microsoft Sentinel data connectors that provide schemas, Advanced Security Information Model (ASIM) normalization, and content packs.
    • Everyday example: Cooking from scratch when a healthy, cheaper meal kit exists.
    • Technical example: Parsing Palo Alto logs via custom functions instead of the native connector and ASIM mapping—losing built-in analytics.

    E) Duplicate Telemetry from Multiple Pipelines

    • Technical definition: Same events reaching Sentinel via parallel paths (e.g., agent + syslog forwarder + third-party pipeline), creating cost bloat and duplicate alerts.
    • Everyday example: Getting the same bank alerts by SMS, email, app, and phone call—annoying and redundant.
    • Technical example: Windows events ingested from both Azure Monitor Agent (AMA) and a legacy Log Analytics agent (MMA); cloud audit logs via both native connector and a custom ingestion app.
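    The duplicate-pipeline problem can be quantified with a simple identity hash: fingerprint each event on its identity fields while ignoring pipeline-specific metadata, and count how often the same fingerprint arrives twice. The field names below are assumptions for illustration, not a real Sentinel schema.

```python
# Hypothetical sketch: detect the same event arriving via two pipelines
# (e.g., AMA and legacy MMA) by hashing identity fields and ignoring
# pipeline metadata. Field names are illustrative.
import hashlib

def event_key(event):
    ident = f"{event['host']}|{event['event_id']}|{event['timestamp']}"
    return hashlib.sha256(ident.encode()).hexdigest()

def dedupe(events):
    seen, unique, dupes = set(), [], 0
    for e in events:
        k = event_key(e)
        if k in seen:
            dupes += 1          # same event, different route
        else:
            seen.add(k)
            unique.append(e)
    return unique, dupes

events = [
    {"host": "dc01", "event_id": 4624, "timestamp": "2024-05-01T10:00:00Z", "pipeline": "AMA"},
    {"host": "dc01", "event_id": 4624, "timestamp": "2024-05-01T10:00:00Z", "pipeline": "MMA"},
    {"host": "web01", "event_id": 4625, "timestamp": "2024-05-01T10:00:01Z", "pipeline": "AMA"},
]
unique, dupes = dedupe(events)
print(len(unique), dupes)  # 2 1
```

    A sustained non-zero duplicate rate is usually the first measurable symptom that two routing paths overlap.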

    F) No Log Validation

    • Technical definition: Lack of pre-ingest checks for schema, timestamps, severity, and required fields; no Service Level Objectives (SLOs) for delay, completeness, or deduplication.
    • Everyday example: Accepting every delivery without checking the box contents.
    • Technical example: Timestamps ingested in local time, breaking correlation; device hostname missing → entity mapping fails; uneven daily volume with silent drops.
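    The validation gaps above can be caught with a handful of cheap pre-ingest checks. The sketch below is a hedged illustration, not a Sentinel schema: the field names and rules are assumptions, and a real pipeline would run equivalent checks in the forwarder before events ever reach the workspace.

```python
# Hypothetical pre-ingest validation sketch: reject or flag records that
# would break correlation downstream (local-time timestamps, missing
# hostnames). Field names and rules are assumptions for illustration.

REQUIRED = ("timestamp", "hostname", "severity")

def validate(record):
    problems = []
    for field in REQUIRED:
        if not record.get(field):
            problems.append(f"missing:{field}")
    ts = record.get("timestamp", "")
    # Require an explicit UTC marker/offset so local-time stamps are caught early.
    if ts and not (ts.endswith("Z") or "+" in ts[10:]):
        problems.append("timestamp:not-utc")
    return problems

good = {"timestamp": "2024-05-01T10:00:00Z", "hostname": "fw01", "severity": "high"}
bad = {"timestamp": "2024-05-01 10:00:00", "hostname": "", "severity": "high"}
print(validate(good))  # []
print(validate(bad))   # ['missing:hostname', 'timestamp:not-utc']
```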

    4) Real-World Case Study

    Failure — The $180k Surprise

    • Situation: A global SaaS firm enabled “everything” from firewalls, proxies, endpoints, and cloud audit logs. No content mapped; no filtering; 180-day retention on all tables.
    • Impact: Monthly Sentinel bill spiked by 60%. Analysts drowned in duplicate alerts; incident MTTR (Mean Time To Remediate) rose from 9h to 16h.
    • Lesson: Cost without context adds negative value. Start with use cases → data needed → retention tiering.

    Success — Use-Case-Driven Design

    • Situation: A fintech defined 12 priority detections (credential misuse, exfiltration, MFA bypass). They mapped required fields to ASIM schemas and trimmed sources to those fields.
    • Impact: 37% ingest reduction, +22% detection precision, 2× faster hunts due to consistent entity mapping.
    • Lesson: Design sources to serve detections, not the other way around.

    5) Action Framework — Prevent → Detect → Respond

    Prevent (Design & Cost Control)

    • Define top 15–20 detections first; list required fields (IP, User, Device, App, Action, Result, Timestamp TZ).
    • Prefer native connectors + ASIM; only custom when absolutely necessary.
    • Build ingestion policies: include tables, exclude noise (facility/level filters, sampling for flows).
    • Implement tiered retention:
      • Hot (30–60 days): detection & investigation.
      • Cold/Archive (6–12 months+): compliance, rare hunts (use ADX/Blob).
    • Prevent duplicates: one authoritative pipeline per source; document routing.
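    The tiered-retention point is easiest to argue with a back-of-envelope comparison. The per-GB rates below are placeholders, not Azure list prices; substitute your region's actual analytics and archive rates before drawing conclusions.

```python
# Back-of-envelope sketch of all-hot vs tiered retention cost.
# The $0.10/GB hot and $0.02/GB archive rates are placeholder assumptions,
# not real Azure pricing.

def monthly_retention_cost(gb_per_day, hot_days, total_days,
                           hot_rate_gb, archive_rate_gb):
    hot_gb = gb_per_day * hot_days
    archive_gb = gb_per_day * max(total_days - hot_days, 0)
    return hot_gb * hot_rate_gb + archive_gb * archive_rate_gb

# 50 GB/day: 180 days all-hot vs 30 days hot + 150 days archived.
all_hot = monthly_retention_cost(50, 180, 180, 0.10, 0.02)
tiered = monthly_retention_cost(50, 30, 180, 0.10, 0.02)
print(round(all_hot), round(tiered))  # 900 300
```

    Even with made-up rates, the shape of the result holds: most of the savings comes from the fact that chatty tables only need to be hot for as long as detections actually query them.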

    Detect (Quality & Coverage)

    • For each table, create at least one analytic rule and one scheduled query that uses it.
    • Enforce schema validation in parsing functions; normalize to ASIM.
    • Track signal health KPIs: daily event count deltas, null critical fields, late arrivals (>10 min), duplication rate.
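    Those KPIs can be computed per table with very little code. The sketch below assumes a simplified record shape (an `id` and a `hostname` field) purely for illustration; thresholds and field names would come from your own data contract.

```python
# Hypothetical signal-health KPI sketch for one table: day-over-day volume
# delta, null rate on a critical field, and duplicate rate. Record shape
# and field names are assumptions for illustration.

def signal_health(today, yesterday_count, key_field="hostname"):
    count = len(today)
    delta = (count - yesterday_count) / yesterday_count if yesterday_count else 1.0
    nulls = sum(1 for r in today if not r.get(key_field)) / count if count else 0.0
    ids = [r["id"] for r in today]
    dup_rate = 1 - len(set(ids)) / len(ids) if ids else 0.0
    return {
        "volume_delta": round(delta, 2),
        "null_rate": round(nulls, 2),
        "dup_rate": round(dup_rate, 2),
    }

today = [
    {"id": "a", "hostname": "fw01"},
    {"id": "b", "hostname": ""},       # null critical field
    {"id": "b", "hostname": "fw02"},   # duplicate id
    {"id": "c", "hostname": "fw03"},
]
print(signal_health(today, yesterday_count=8))
# {'volume_delta': -0.5, 'null_rate': 0.25, 'dup_rate': 0.25}
```

    Alerting on these three numbers per table catches silent drops, broken entity mapping, and pipeline overlap long before an analyst notices.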

    Respond (Operate & Improve)

    • Build a workbook: cost by table, events by connector, rule hits by source.
    • Automate feedback loops: when an analytic fires with low confidence, refine source fields/filters.
    • Quarterly table review: drop unused sources, move low-value logs to archive, merge pipelines.

    6) Key Differences to Keep in Mind

    1. Native vs Custom Ingest — Native brings schemas/content; custom brings flexibility and a maintenance burden.
      • Scenario: Choose native for popular firewalls; custom only when niche vendor lacks support.
    2. Hot vs Cold Retention — Hot is for speed; cold is for savings.
      • Scenario: Keep 30 days hot for IR (Incident Response); move month 2–12 to archive.
    3. Field Completeness vs Volume — Fewer, richer events beat many shallow events.
      • Scenario: Keep DNS with query, response, client IP; drop verbose debug flags.
    4. One Pipeline vs Many — Single route is traceable; multiple routes multiply duplicates.
      • Scenario: Consolidate to AMA; retire MMA and third-party forwarders.
    5. Use-Case vs Curiosity — Detections drive data; curiosity drives cost.
      • Scenario: Only ingest proxy categories needed for DLP (Data Loss Prevention) alerts.

    7) Summary Table

    | Concept | Definition | Everyday Example | Technical Example |
    | --- | --- | --- | --- |
    | Collect-everything ingestion | Ingest all signals without scoping/filters | Subscribing to every streaming service | EDR + WEF + flow logs all verbose to Sentinel |
    | Unused logs | Data with no rules/queries/workbooks | Paying for a gym you don’t use | DNS ingested but no KQL uses it |
    | No retention strategy | One-size retention; no hot/cold | Keeping all photos on phone forever | 180 days on Syslog with no archive |
    | Custom over native | DIY ingestion instead of connectors | Cooking from scratch vs meal kit | Custom Palo Alto parsing vs native + ASIM |
    | Duplicate telemetry | Same events via multiple routes | Bank alerts by SMS/email/app/phone | AMA + MMA + syslog duplicating Windows events |
    | No validation | No checks for schema/time/fields | Accepting packages uninspected | Local-time timestamps; missing hostname |

    8) ASCII Diagram (Signal Health Funnel)

    [Sources] --(validated, deduped)--> [Normalization/ASIM]
                 \--x duplicates drop--/          |
                                                  v
                                       [Analytic Rules & Hunts]
                                                  |
                                                  v
                                        [Incidents & Response]
                                                  |
                                                  v
                               [Retention: Hot 30-60d | Archive 6-12m+]
    

    9) What’s Next

    Next in this series: “Designing a Use-Case-First Log Strategy for Sentinel: From Detections to Data Contracts.” We’ll publish a field-tested worksheet to map detections → fields → connectors → retention.


    🌞 The Last Sun Rays…

    Hook answers:

    • Sentinel isn’t a dump truck for logs; it’s a tuned sensor grid.
    • More light (data) isn’t better if it blinds you; focused beams (use-cases) win.

    Your move:
    What one log source would you drop, filter, or archive tomorrow to improve both signal quality and cost—and what detection would stay intact after that change?

  • Chapter 7 – How Your Platform Health Suite Protects Outcomes, Not Just Logs

    Turning “Sentinel Noise” into an Executive Radar: How Your Platform Health Suite Protects Outcomes, Not Just Logs

    • Think of your platform like an airport: collectors are runways, AMA agents are ground crews, DCRs are flight plans, and analytic rules are the control tower. If any one falters, flights (events) stack up or vanish.
    • Or like a hospital: ingestion is triage, tables are departments, rules are diagnostic protocols, and audit is infection control. A delay at triage hides the real emergencies.
    • Or like a supply chain: collectors are loading docks, EPS is trucks-per-minute, DCRs are routing labels, and rules are QA checkpoints. Mislabel one box and the whole chain gets blind spots.

    This post shows executives how your components form one radar that tells them: Are we safe, is the telemetry flowing, and will detections fire when it matters?


    Why It’s Needed (Context)

    Security leaders don’t buy features; they buy assurance. Modern threats exploit blind spots: missing telemetry, delayed ingestion, disabled rules, or misconfigured agents. Your suite closes those gaps by:

    • Quantifying telemetry health (EPS, size, latency)
    • Surfacing blind spots (non-reporting devices, tables not ingesting)
    • Protecting detection integrity (analytic rule tampering, disabled rules)
    • Assuring platform reliability (Sentinel health, audit, connectors)

    Value to execs: fewer surprises, faster incident confidence, and measurable resilience. In plain terms: “Are we seeing what matters, fast enough, with rules that still work?”


    Core Concepts Explained Simply

    Below, each concept has: Technical Definition → Everyday Example → Technical Example (we’ll reuse the airport analogy).

    1. All Log Collector Ingestion & EPS (events per second)
    • Technical: Measures event throughput per collector to spot saturation/backpressure.
    • Everyday: Runway landings per minute—too few or too many means trouble.
    • Technical ex: 8K EPS baseline; spike to 20K EPS triggers auto-scale review.
    2. Log Size by Log Collector
    • Technical: Tracks daily/rolling log volume per collector for anomalies.
    • Everyday: Cargo tonnage per runway day-over-day.
    • Technical ex: 40% drop on Collector-03 flags upstream firewall change.
    3. Abnormal Workspace Spikes & Dips
    • Technical: Detects ingestion anomalies at the workspace level.
    • Everyday: Airport sees an unexpected lull or surge in flights.
    • Technical ex: z-score/seasonality anomaly on _LogManagement table.
    4. Ingestion Delays
    • Technical: Measures latency from source timestamp to ingestion time.
    • Everyday: Planes circling because runways are jammed.
    • Technical ex: P95 delay > 15 minutes = raise incident sev-2.
    5. OOTB (out-of-the-box) Data Connector Monitor
    • Technical: Checks health/config for native connectors.
    • Everyday: Prebuilt jetways—are they powered and attached?
    • Technical ex: Office 365 connector shows auth failure after token expiry.
    6. Identify Warnings (Incident, Workspace)
    • Technical: Aggregates SOC warnings across incidents and workspace health.
    • Everyday: Tower alerts: “Runway lights flickering, storm inbound.”
    • Technical ex: Sentinel “Data Collection partially degraded” alert surfaces.
    7. Critical Devices Monitoring
    • Technical: Ensures top-tier assets continuously report (DCs, firewalls, crown jewels).
    • Everyday: VIP planes (organ transplants, heads of state) tracked end-to-end.
    • Technical ex: Domain controller event gap >10 min triggers page.
    8. Devices Not Reporting (Windows/Linux/Network)
    • Technical: Detects endpoints missing expected heartbeat/events.
    • Everyday: A parked plane went silent on the tarmac.
    • Technical ex: Syslog source silent for 60 min → create problem record.
    9. Sentinel Health & Audit Monitoring
    • Technical: Internal service checks, API limits, configuration drift, audit events.
    • Everyday: Airport power, radios, and control systems diagnostics.
    • Technical ex: Audit log shows permission changes to Analytics blade.
    10. Unhealthy AMA (Azure Monitor Agent) Agents
    • Technical: Flags agent install/health/config failures.
    • Everyday: Ground crew short-staffed or missing tools.
    • Technical ex: AMA heartbeat present but data channels failing (DCR mismatch).
    11. Data Collection Rule (DCR) Monitoring
    • Technical: Validates DCR consistency and scope across resources.
    • Everyday: Flight plans correctly applied to the right aircraft.
    • Technical ex: New subnet lacks DCR mapping → 0 logs from that segment.
    12. Unauthorized Modification of Use Case
    • Technical: Detects tampering to detection rules (query edits, schedules).
    • Everyday: Someone rewrote tower procedures without approval.
    • Technical ex: KQL (Kusto Query Language) diff shows removed join on identity table.
    13. New Log Collector Out of Intended List
    • Technical: Flags newly registered collectors not in the approved inventory.
    • Everyday: An unlisted aircraft lands without a flight plan.
    • Technical ex: Unknown syslog IP starts forwarding—validate source & owner.
    14. Collector Health (No Heartbeat)
    • Technical: Collector process/host unavailable.
    • Everyday: Runway lights off—no signals.
    • Technical ex: VM down event correlates with EPS collapse.
    15. Collector Health (No Logs)
    • Technical: Collector up but not sending logs.
    • Everyday: Runway open, but no planes using it.
    • Technical ex: Ingest pipeline credentials expired; heartbeat OK, EPS 0.
    16. Tables Not Ingesting Logs
    • Technical: Detects schema-level gaps (e.g., SecurityEvent empty).
    • Everyday: A department with no patients for hours—impossible.
    • Technical ex: FirewallLogs table flatline after parser update.
    17. Analytic Rule Disabled/Deleted
    • Technical: Ensures detections remain active & intact.
    • Everyday: Tower turned off weather radar.
    • Technical ex: High-severity rule disabled by change window w/o approval.
    18. Sentinel Health & Audit (Platform performance)
    • Technical: (Aggregated view) Platform performance, limits, and governance.
    • Everyday: Airport operations dashboard for execs.
    • Technical ex: API throttle near limits during IR surge; scale-out advised.
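    Two of the monitors above reduce to small, testable calculations: P95 ingestion delay (concept 4) and a z-score check on daily event counts (concept 3). The sketch below uses made-up data and the thresholds quoted in the text; it is an illustration of the math, not the production implementation.

```python
# Sketch of two health signals from the list above: nearest-rank P95
# ingestion delay, and a z-score anomaly check on daily counts.
# Data and thresholds are illustrative.
import math
import statistics

def p95(values):
    ordered = sorted(values)
    idx = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank percentile
    return ordered[idx]

def zscore(latest, history):
    return (latest - statistics.mean(history)) / statistics.stdev(history)

delays_min = [2, 3, 3, 4, 5, 5, 6, 7, 8, 40]   # minutes; one straggler batch
history = [100, 102, 98, 101, 99]              # daily counts (thousands)

print(p95(delays_min))                # 40 -> breaches the 15-minute SLO
print(round(zscore(60, history), 1))  # large negative z -> abnormal dip
```

    Note how the P95 exposes what a daily total hides: nine of ten batches landed in minutes, but the straggler is exactly the gap the ransomware case study below exploits.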

    Real-World Case Study

    Failure Case — Ransomware Quietly Gained Time

    • Situation: EPS looked “normal” per day totals, but ingestion delays at P95 grew to 35–40 minutes after a network change. Analytic rules were fine, but they fired late.
    • Impact: SOC saw lateral movement an hour after the fact. Restores needed; downtime cost and reputational hit.
    • Lesson: Volume ≠ timeliness. Latency is a first-class SLO (service-level objective). Your “Ingestion Delays” and “Abnormal Spikes & Dips” would’ve caught this earlier.

    Success Case — Misconfigured DCR Contained in Minutes

    • Situation: A new Linux subnet went live; DCR Monitoring flagged no mapping. Devices Not Reporting confirmed silent hosts; Tables Not Ingesting showed flat SecurityEventLinux table.
    • Impact: Fix within 20 minutes; coverage gap avoided during a vendor compromise alert wave.
    • Lesson: Layered monitors (DCR + device heartbeat + table health) shrink MTTR (mean time to repair).

    Action Framework — Prevent → Detect → Respond

    Prevent

    • Standardize collector baselines (EPS, log size).
    • Enforce DCR-as-code with approvals; alert on drift.
    • Lock analytic rules (change control + audit).
    • Maintain approved collector inventory; block unknowns.

    Detect

    • SLOs: EPS ±25% anomaly, P95 delay <15 min, table freshness <10 min.
    • Triangulate: device heartbeat + table freshness + rule health.
    • Prioritize critical devices and OOTB connectors with higher alert sensitivity.
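    The three SLOs above can be evaluated in a single pass per collector or table. The sketch below is a hedged illustration with invented metric names; the thresholds mirror the ones stated in the bullets.

```python
# Hypothetical single-pass SLO evaluation: EPS within +/-25% of baseline,
# P95 ingestion delay under 15 minutes, table freshness under 10 minutes.
# Metric names are assumptions for illustration.

SLOS = {
    "eps_deviation_max": 0.25,
    "p95_delay_min_max": 15,
    "freshness_min_max": 10,
}

def evaluate_slos(eps, eps_baseline, p95_delay_min, freshness_min):
    breaches = []
    if abs(eps - eps_baseline) / eps_baseline > SLOS["eps_deviation_max"]:
        breaches.append("eps")
    if p95_delay_min > SLOS["p95_delay_min_max"]:
        breaches.append("ingestion_delay")
    if freshness_min > SLOS["freshness_min_max"]:
        breaches.append("table_freshness")
    return breaches

# Healthy collector vs one that should page the SOC.
print(evaluate_slos(eps=7800, eps_baseline=8000, p95_delay_min=7, freshness_min=4))
print(evaluate_slos(eps=20000, eps_baseline=8000, p95_delay_min=35, freshness_min=12))
```

    Returning the list of breached SLOs (rather than a single pass/fail) is what lets the Respond playbooks below route each breach to a different runbook.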

    Respond

    • Playbooks:
      1. No Heartbeat: auto-scale or restart collector VM; reroute sources.
      2. No Logs: token refresh, pipeline test, sample event injection.
      3. DCR Drift: auto-rollback via Git; notify change owner.
      4. Rule Tampering: revert from versioned store; open P1; audit who/when.
    • Dashboards for execs: “Are we blind anywhere?” + “How fast are we seeing?” (MTTD/MTTR for telemetry issues).

    Key Differences to Keep in Mind

    1. No Heartbeat vs No Logs — Down host vs live host with broken pipeline.
      Scenario: Heartbeat OK but EPS 0 → investigate tokens/parsers, not VM.
    2. Volume vs Timeliness — Total GB ≠ real-time visibility.
      Scenario: Daily volume steady but P95 delay 30 min → IR effectiveness drops.
    3. Device Health vs Table Freshness — Endpoints alive doesn’t mean data landed.
      Scenario: AMA OK; SecurityEvent table flat → DCR scope missing.
    4. Anomaly vs Planned Change — Spikes can be normal during patch night.
      Scenario: Annotate change windows to suppress false noise.
    5. Connector Status vs Detection Integrity — Data present doesn’t ensure rules run.
      Scenario: Rule disabled after tuning—coverage illusion.
    6. Approved vs Rogue Collectors — Inventory matters.
      Scenario: New IP starts sending logs → verify ownership before trusting data.
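    Differences 1–3 above are really one triage decision tree: heartbeat, EPS, and table freshness combine into a single diagnosis. A minimal sketch, with invented states and thresholds, to show how the signals compose:

```python
# Sketch of collector triage from three signals. States, messages, and the
# 10-minute freshness threshold are assumptions, not a product API.

def triage(heartbeat_ok, eps, table_fresh_min):
    if not heartbeat_ok:
        return "host-down"          # restart/reroute collector VM
    if eps == 0:
        return "pipeline-broken"    # check tokens, parsers, DCR scope
    if table_fresh_min > 10:
        return "landing-delayed"    # check ingestion latency / routing
    return "healthy"

print(triage(False, 0, 0))      # host-down
print(triage(True, 0, 0))       # pipeline-broken
print(triage(True, 5000, 25))   # landing-delayed
print(triage(True, 5000, 2))    # healthy
```

    The key point is ordering: a live heartbeat with zero EPS sends you to tokens and parsers, not to the VM, exactly as in the first scenario above.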

    (ASCII) Executive Dashboard Sketch

    +---------------- Sentinel Health & Telemetry Radar ----------------+
    |  Ingestion SLOs     |  Collectors          |  Coverage           |
    |  EPS: 7.8k (↔)      |  HB: 12/12 (✓)       |  Critical: 48/50 (⚠)|
    |  P95 Delay: 7m (✓)  |  No Logs: 1 (⚠)      |  Tables Fresh: 96%  |
    |  Spikes/Dips: OK    |  Unknown: 0 (✓)      |  DCR Drift: 0 (✓)   |
    +------------------------------------------------+------------------+
    |  Detection Integrity                             |  Audit & Changes|
    |  Rules Disabled: 0 (✓)  Tamper Attempts: 1 (⚠)  |  Priv Changes: 2 |
    +------------------------------------------------+------------------+
    |  Top Alerts: Ingestion Delay @ COL-03 (sev-2) | ETA fix: 10m     |
    +-------------------------------------------------------------------+
    

    Summary Table

    | Concept | Definition | Everyday Example | Technical Example |
    | --- | --- | --- | --- |
    | All Log Collector Ingestion & EPS | Event throughput per collector | Landings/minute on a runway | Baseline 8K EPS; sustained 20K triggers scale |
    | Log Size by Log Collector | Daily/rolling volume per collector | Cargo tonnage per runway | 40% drop on COL-03 after firewall change |
    | Abnormal Workspace Spikes & Dips | Workspace-level ingestion anomalies | Airport-wide lull/surge | z-score anomaly on _LogManagement |
    | Ingestion Delays | Source-to-ingest latency | Planes circling | P95 delay >15m = sev-2 |
    | OOTB Data Connector Monitor | Health of native connectors | Prebuilt jetways attached | O365 token expiry alarm |
    | Identify Warnings (Incident, Workspace) | Aggregated SOC/platform warnings | Tower status alerts | Sentinel “partial degradation” surfaced |
    | Critical Devices Monitoring | Coverage for crown-jewel assets | VIP flights tracked | DC event gap >10m page |
    | Devices Not Reporting | Missing endpoint telemetry | Silent plane on tarmac | Syslog source 60m silent |
    | Sentinel Health & Audit Monitoring | Internal checks & audit | Airport systems diagnostics | Permission change on Analytics blade |
    | Unhealthy AMA Agents | Agent failure/misconfig | Ground crew missing tools | Heartbeat OK; channel fail |
    | Data Collection Rule Monitoring | DCR consistency & scope | Correct flight plans | New subnet lacks DCR |
    | Unauthorized Modification of Use Case | Rule tampering detection | Unapproved tower procedure | KQL diff shows removed join |
    | New Log Collector Out of Intended List | Unapproved collectors | Unlisted aircraft lands | Unknown syslog IP sending |
    | Collector Health (No Heartbeat) | Collector host down | Runway lights off | VM down + EPS collapse |
    | Collector Health (No Logs) | Host up, no events sent | Open runway, no planes | Token/parsers expired |
    | Tables Not Ingesting Logs | Schema/table freshness gap | Department with no patients | FirewallLogs flatline |
    | Analytic Rule Disabled/Deleted | Detections turned off/removed | Weather radar off | High-sev rule disabled |
    | Sentinel Health & Audit (Performance) | Aggregated platform performance | Airport ops dashboard | Near API throttle during IR |

    What’s Next

    In the next post of this mini-series, we’ll go from health to outcomes: mapping telemetry SLOs to detection KPIs (true positive rate, MTTD/MTTR for security events), and how to automate executive scorecards that tie telemetry health → detection fidelity → business risk.


    🌞 The Last Sun Rays…

    Hook answers:

    • Airports, hospitals, supply chains—all fail the same way: silent delays and hidden blind spots. Your suite exposes both in real time and proves the control tower (Sentinel) is awake.
    • Executives get a single radar: Are we seeing the right data, fast enough, with rules that still work—and will tomorrow?

    Your move: If you could add one metric to your exec dashboard tomorrow, which would it be—P95 ingestion delay, critical device freshness, or rules integrity drift?