Category: Blogs

Blog articles covering cybersecurity topics, CISSP domains, security tools, and practical security implementation guides.

  • Chapter 2: Security Alignment & Governance

    Security alignment & governance is like:

    • Seatbelt: doesn’t stop the trip — it makes the trip survivable.
    • GPS: routes you around risk based on destination + constraints.
    • Business contract: defines who decides, who executes, and who audits.

    2. Why It’s Needed (Context)

    Most orgs don’t “fail security” because they lack tools. They fail because:

    • Security is treated like a departmental opinion, not a business decision.
    • The board asks “Are we secure?” but what they really mean is: “Are we within risk appetite?”
    • Teams measure what’s easy (KPIs) instead of what predicts pain (KRIs).
    • Nobody knows who owns what, so incidents become a blame relay race.

    If you align security to business strategy and put governance around it, you get:
    ✅ faster decisions, ✅ defensible budgets, ✅ fewer surprise risks, ✅ cleaner audits, ✅ calmer incident response.


    3. Core Concepts Explained Simply

    A) Security as Business Enabler

    Technical Definition: Security is integrated into strategy to enable safe growth, innovation, and resilience — not to block outcomes.
    Everyday Example: A mall adds CCTV + fire exits so it can stay open longer hours safely (more revenue, less risk).
    Technical Example: Designing secure cloud landing zones so the business can launch products fast without chaos (identity, segmentation, logging, guardrails).

    Exam brain: “BEST way security supports growth?” → enable business outcomes inside risk appetite.


    B) Alignment of Security to Business Strategy

    Technical Definition: Security goals, investments, and controls directly support mission/vision and strategic objectives.
    Everyday Example: A restaurant expanding delivery checks packaging, payment fraud, and delivery partner reliability before scaling.
    Technical Example: Security roadmap mapped to enterprise roadmap (e.g., new market entry → data residency, geopolitical risk, third-party risk, regulatory controls).

    Exam brain: “What does CISO do FIRST for new market/initiative?” → understand objective, do risk assessment aligned to strategy.


    C) Risk-Based Decision Making

    Technical Definition: Controls and investments are prioritized by likelihood × impact × tolerance (risk appetite), not fear or checkbox compliance.
    Everyday Example: You lock your front door every day, but only install a vault if you store diamonds.
    Technical Example: MFA (Multi-Factor Authentication) first for admin accounts + remote access, not necessarily every kiosk on day 1.

    Exam brain: “MOST appropriate control?” → the one that reduces risk to acceptable level with cost/benefit + appetite.


    D) KPIs (Key Performance Indicators)

    Technical Definition: Metrics that measure performance of security processes/control execution.
    Everyday Example: Gym dashboard: workouts completed this week (activity/performance).
    Technical Example: % critical patches applied within SLA (Service Level Agreement), mean time to remediate (MTTR).

    Exam brain: “BEST measures program effectiveness/performance?” → KPI tied to objectives, measurable, trendable.


    E) KRIs (Key Risk Indicators)

    Technical Definition: Metrics that give early warning signals that risk exposure is increasing.
    Everyday Example: Weather forecast + dark clouds = early warning you might get soaked.
    Technical Example: Rising count of unpatched critical vulns on internet-facing systems; spike in third-party incidents; growing number of policy exceptions.

    Exam brain: “EARLY WARNING of risk?” → KRI (predictive risk signal), not KPI (process performance).
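For teams running Microsoft Sentinel, a KRI can be trended directly in KQL. A minimal sketch using the standard SecurityIncident table; treat it as one example signal, since true KRIs such as policy-exception counts live in whatever system tracks them:

// Week-over-week trend of high-severity incidents; a rising line signals rising exposure
SecurityIncident
| where Severity == "High"
| summarize HighSevIncidents = dcount(IncidentNumber) by Week = bin(CreatedTime, 7d)
| order by Week asc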


    F) Governance Models

    Technical Definition: Frameworks that define how security is directed, controlled, and monitored to meet objectives (who decides, who’s accountable, how oversight works).
    Everyday Example: City government: elected officials set direction, departments execute, watchdogs audit.
    Technical Example: Board risk committee sets risk appetite; CISO runs program; management reports metrics; audit validates.

    Exam brain: “ULTIMATELY responsible for governance?” → Senior leadership / Board.


    G) Three Lines of Defense (3LoD) Model

    Technical Definition: Splits responsibilities into operations (own risk), oversight (monitor/guide), and independent assurance (audit).
    Everyday Example:

    • Store manager runs the store (1st)
    • Compliance team checks rules (2nd)
    • External/internal auditor verifies independently (3rd)

    Technical Example:

    • 1st: IT/SecOps implements controls
    • 2nd: Risk/Compliance defines policies + monitors
    • 3rd: Internal audit validates effectiveness and reports to audit committee

    Exam brain (trap-proof):

    • “Who is responsible?” → 1st line
    • “Who oversees?” → 2nd line
    • “Who independently verifies?” → 3rd line

    4. Real-World Case Study

    Failure Case: “KPI Theater” → Breach Surprise

    Situation: Org reports “98% security training completion” and “95% patch compliance.”
    What actually happened: The missing 5% included internet-facing legacy systems and a privileged admin workflow with weak MFA. KRIs (exception count, critical vulns exposed, admin account anomalies) weren’t tracked.
    Impact: Attacker exploited the exposed weak spot → lateral movement → data theft → board asks why dashboard looked “green.”
    Lesson: KPIs show activity; KRIs show rising danger. Governance should force risk-based prioritization, not vanity metrics.

    Success Case: New Market Entry Done Right

    Situation: Company expands into a new country with strict data localization + higher third-party risk.
    What went right:

    • CISO aligned security plan to business goal (growth)
    • Risk assessment identified top risks (data residency, supplier ecosystem, fraud)
    • Governance body approved risk treatment options (mitigate/transfer/accept)
    • KRIs tracked early signals (3rd-party incidents, policy exceptions, vuln exposure)

    Impact: Faster launch with fewer surprises; audit outcomes strong; board confidence improved.
    Lesson: Alignment + governance turns security from “No” to “Yes, safely.”

    5. Action Framework — Prevent → Detect → Respond

    Prevent (reduce likelihood)

    • Tie controls to business objectives + risk appetite (not “best practice for everything”).
    • Build a risk-based control baseline (admin > internet-facing > crown jewels).
    • Require exception management (time-bound, approved, tracked as KRI).

    Detect (spot drift early)

    • KPI set: patch SLA, incident response drill completion, logging coverage.
    • KRI set: critical vulns backlog, privileged access anomalies, third-party incident trend, exception count trend.
    • Board reporting: trends + risk narrative, not raw numbers.

    Respond (limit impact)

    • Pre-define decision rights: who declares incident severity, who approves containment tradeoffs.
    • Map response to 3LoD:
      • 1st line executes containment
      • 2nd line ensures compliance/risk posture
      • 3rd line reviews effectiveness post-incident
    • Post-incident governance: lessons learned → control updates → metric updates.

    6. Key Differences to Keep in Mind

    1. KPI vs KRI
    • Difference: KPI = performance; KRI = risk warning.
    • Scenario: “95% patched” (KPI) but “critical internet-facing vulns rising” (KRI) = danger.
    2. Governance vs Management
    • Difference: Governance decides direction/accountability; management executes.
    • Scenario: Board sets risk appetite (governance); CISO implements program (management).
    3. Risk-based vs Compliance-based Security
    • Difference: Risk-based optimizes for reduction of real risk; compliance-based optimizes for passing audits.
    • Scenario: You can be compliant and still breached if controls don’t cover actual threats.
    4. 3LoD Roles (Responsible vs Oversight vs Assurance)
    • Difference: 1st owns; 2nd monitors/defines; 3rd independently verifies.
    • Scenario: Audit can’t “implement controls” or it loses independence.

    7. Summary Table

    Concept | Definition | Everyday Example | Technical Example
    Security as Business Enabler | Security enables safe growth, not blocks it | Seatbelt lets you drive, not avoid driving | Secure cloud landing zone enabling fast delivery
    Alignment to Business Strategy | Security goals map to mission/strategy | Delivery expansion needs fraud + partner checks | Market entry risk assessment + roadmap mapping
    Risk-Based Decision Making | Prioritize controls by likelihood × impact × tolerance | Vault for diamonds, lock for door | MFA first for admins/internet access
    KPI | Measures performance of security operations | Workouts completed | % patched within SLA, MTTR
    KRI | Early warning of rising risk | Storm clouds warning | Rising critical vulns, rising exceptions
    Governance Models | Define direction, oversight, accountability | City governance structure | Board risk appetite → CISO program → audit check
    Three Lines of Defense | Ops owns risk; oversight monitors; audit verifies | Manager vs compliance vs auditor | 1st IT/SecOps, 2nd Risk/Compliance, 3rd Internal Audit

    ASCII Diagram (Governance Flow)

    Business Strategy → Risk Appetite → Security Strategy → Controls + Metrics → Assurance
           |                 |               |               |                  |
         Board            Board/Risk       CISO/Exec       1st+2nd Line       3rd Line
    

    8. 🌞 The Last Sun Rays…

    So what’s the real punchline?

    • Security is not a brake pedal — it’s the seatbelt + GPS that lets the business go faster without flying off a cliff.
    • KPIs tell you if the engine is running; KRIs tell you if the bridge ahead is collapsing.
    • Governance decides who has the steering wheel, and the Three Lines of Defense ensures nobody marks their own homework.

    Reflective challenge: If you could put one metric on your security dashboard tomorrow — would you choose a KPI that proves activity, or a KRI that predicts pain? Which one, specifically?

  • Chapter 6: How Not to Migrate to Microsoft Sentinel


    1. Title + Hook

    Migrating to Microsoft Sentinel isn’t “moving your SIEM to the cloud.”

    It’s closer to:

    • Switching from a landline call-center to an omnichannel support platform — if you only move phone scripts, you miss chat, automation, and analytics.
    • Replacing a filing cabinet with a searchable data lake — if you keep the same folders, you waste the power of indexing and correlation.
    • Upgrading from a smoke alarm to a smart home security system — if you only use the siren, you ignore cameras, motion patterns, and automation.

    The tool will work.
    The real question is whether your detection capability improves.


    2. Why It’s Needed (Context)

    Sentinel migrations fail in a specific way: they “succeed” technically (logs ingest, rules run), but security posture doesn’t improve.

    Common outcomes when teams carry a legacy mindset:

    • Alert noise increases (and analysts burn out)
    • Identity and cloud threats are under-detected
    • Costs spike because ingestion is enabled without design
    • SOC processes become inconsistent: “Who owns what? What’s the triage path?”

    Sentinel is cloud-native and correlation-rich — but only if you design for it.


    3. Core Concepts Explained Simply

    Concept 1: Lift-and-Shift Migration Is a Trap (Mistake #1)

    Technical Definition
    Lift-and-shift is porting legacy rules, dashboards, and searches into Sentinel with minimal redesign.

    Everyday Example
    Translating a cookbook from French to English but never adjusting for different ingredients or ovens.

    Technical Example
    Exporting old SIEM correlation rules → converting syntax to KQL (Kusto Query Language) → rebuilding dashboards → declaring success, even though Sentinel’s schemas, enrichment, and correlation patterns differ.


    Concept 2: SIEM is an Operating Model, Not a Product (Mistake #2)

    Technical Definition
    A SIEM program includes threat modeling, data onboarding, detection lifecycle, SOC workflows, automation, governance, and cost management — not just alerts.

    Everyday Example
    Buying a hospital MRI machine doesn’t create a radiology department.

    Technical Example
    Migrating rules without migrating case management, triage standards, escalation paths, tuning ownership, and change control causes inconsistent response and alert fatigue.


    Concept 3: Threat Model Must Be Revalidated During Migration (Mistake #3)

    Technical Definition
    Threat modeling aligns detections and telemetry to current attack surfaces (cloud, identity, endpoints, SaaS).

    Everyday Example
    Upgrading locks but ignoring the open window.

    Technical Example
    Porting network-focused detections while missing identity-centric attack paths (token theft, consent abuse, privilege escalation, conditional access bypass attempts).


    Concept 4: Data Engineering Is Security Engineering (Mistake #4)

    Technical Definition
    Sentinel detections are only as strong as ingestion design: connectors, normalization, table choice, enrichment, retention, and filtering.

    Everyday Example
    A GPS is useless if the map data is wrong.

    Technical Example
    Wrong connector configuration or inconsistent fields → KQL rules become brittle; incident investigation fails due to missing entity context (user/device/IP correlation).


    Concept 5: Cost Is a Security Requirement (Mistake #5)

    Technical Definition
    Sentinel pricing is ingestion-based, so architecture must include cost controls (filtering, tiered retention, data types).

    Everyday Example
    Buying cloud storage without lifecycle policies — your bill becomes your surprise.

    Technical Example
    Enabling every diagnostic log, keeping it all “hot,” no retention segmentation, and no forecasting → budget blowout → leadership distrust → reduced logging later (which creates blind spots).
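The ingestion picture behind this failure mode is queryable from day one. A minimal sketch using the built-in Usage table in Log Analytics (Quantity is reported in MB):

// Billable ingestion per table over the last 30 days, largest first
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| order by IngestedGB desc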


    Concept 6: Big Bang Cutovers Cause Blind Spots (Mistake #6)

    Technical Definition
    A cutover without parallel validation risks missed detections due to schema gaps, logic differences, and tuning immaturity.

    Everyday Example
    Turning off the old security cameras before testing the new ones at night.

    Technical Example
    Disabling legacy SIEM on day 1 → Sentinel rules aren’t tuned → noisy alerts drown real incidents → gaps aren’t discovered until post-incident review.


    Concept 7: “Go-Live” Is Not a Success Metric (Mistake #7)

    Technical Definition
    Success is measurable improvement: validated coverage, reduced noise, stable SOC throughput, governance, and predictable cost.

    Everyday Example
    Launching an app isn’t the same as users being happy and retained.

    Technical Example
    Workspace is live but:

    • detection coverage isn’t mapped to threats
    • false positives are high
    • analyst time per incident is worse
      → migration failed.

    Concept 8: Don’t Ignore Sentinel’s Native Strengths (Mistake #8)

    Technical Definition
    Sentinel includes built-in analytics, correlation, UEBA, and deep Microsoft ecosystem integration.

    Everyday Example
    Buying a power drill and using it as a screwdriver.

    Technical Example
    Rebuilding manual rules for scenarios already covered by built-in analytics + Microsoft Defender integration + correlation features, instead of enabling, validating, tuning, and extending.


    Concept 9: Migrating Every Legacy Rule Is a Mistake (Mistake #9)

    Technical Definition
    Legacy SIEM rule sets often contain duplicates, obsolete detections, and low-value noise generators.

    Everyday Example
    Moving every item from your junk drawer into a new house.

    Technical Example
    Copying hundreds of rules without rationalization → increased alert volume with little added detection value.
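Rationalization can be data-driven rather than opinion-driven. A minimal sketch against Sentinel's SecurityAlert table to surface the noisiest detections first (judging true-positive value still requires analyst labeling):

// Top alert producers in the last 30 days: prime candidates for tuning or retirement
SecurityAlert
| where TimeGenerated > ago(30d)
| summarize AlertCount = count() by AlertName, ProviderName
| order by AlertCount desc
| take 20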


    Concept 10: Sentinel Won’t Behave Like an On-Prem SIEM (Mistake #10)

    Technical Definition
    Sentinel is cloud-native, elastic, and data-lake-backed; it encourages different detection patterns and operational workflows.

    Everyday Example
    Expecting a streaming service to behave like a DVD shelf.

    Technical Example
    Designing searches and dashboards as if compute/storage is fixed and local → inefficiency, cost spikes, poor performance patterns, and missed platform capabilities.


    Concept 11: Migration is Mostly Planning (Mistake #11)

    Technical Definition
    The highest leverage work is done before implementation: ingestion blueprint, detection rationalization, cost modeling, governance, success metrics.

    Everyday Example
    In construction, a bad blueprint scales mistakes across the whole building.

    Technical Example
    Skipping architecture and rushing execution → bad logging choices and rule structure multiply at cloud scale.


    Concept 12: The Legacy Lens is the Silent Killer (Mistake #12)

    Technical Definition
    The “legacy lens” is trying to recreate old dashboards, correlation logic, and SOC workflows instead of embracing Sentinel’s strengths and modern detection engineering principles.

    Everyday Example
    Buying a hybrid car and insisting it only runs in first gear because it feels familiar.

    Technical Example
    Forcing identical dashboard parity and correlation design:

    • increases complexity
    • prevents tuning for identity + cloud signals
    • blocks automation adoption
      → you underuse Sentinel and miss optimization opportunities.

    4. Real-World Case Study

    Failure Case: “Translated Everything, Improved Nothing”

    Situation

    • Ported rules, rebuilt dashboards, went live fast

    Impact

    • Noise increased
    • Identity threats were still weakly covered
    • Costs spiked
    • Analysts lost time and confidence

    Lesson
    You migrated syntax, not detection capability.

    Success Case: “Rationalize → Design → Validate → Cut Over”

    Situation

    • Started from threat scenarios
    • Built logging blueprint + cost model
    • Enabled built-in Sentinel capabilities first
    • Ran parallel validation

    Impact

    • Fewer rules, better signal
    • Stable SOC efficiency
    • Predictable spending

    Lesson
    Migration is an opportunity to modernize operations, not just change tools.

    5. Action Framework: Prevent → Detect → Respond

    Prevent

    • Threat model refresh (cloud + identity + endpoint first)
    • Logging blueprint (what signals, why, where filtered)
    • Cost model (hot vs cold retention tiers, filtering rules)
    • Governance (ownership, naming, change control)

    Detect

    • Enable built-ins → validate → tune → extend
    • Rationalize detections (remove duplicates/obsolete)
    • Coverage mapping to threat scenarios
    • Quality metrics: false positive rate, coverage %, MTTD

    Respond

    • SOC workflow redesign (triage → investigation → escalation)
    • Automation playbooks for repetitive tasks
    • Parallel run comparisons (alerts, misses, workload)
    • Response metrics: MTTR + analyst effort per incident

    ASCII flow (migration pipeline):

    Threat Model → Logging Blueprint → Cost Model → Governance
          ↓               ↓               ↓
    Built-ins Enable → Validate/Tune → Custom Detections
          ↓
    Parallel Run → Metrics Review → Cutover
    

    6. Key Differences to Keep in Mind

    1. Rule Translation vs Capability Redesign
      Scenario: Same detection logic doesn’t work because Sentinel tables and enrichment differ.
    2. More Logs vs Better Signals
      Scenario: Ingesting everything increases cost/noise without improving incidents.
    3. Go-Live vs Measured Outcomes
      Scenario: Workspace live but analysts slower and coverage unclear.
    4. Legacy Dashboards vs Decision Dashboards
      Scenario: “Alerts by severity” looks nice; “top false positives + owners” improves operations.

    7. Summary Table

    Concept | Definition | Everyday Example | Technical Example
    Lift-and-shift trap | Porting artifacts without redesign | Translating a recipe without adapting ingredients | Converting legacy rules to KQL without schema redesign
    SIEM operating model | Tool + people + process + governance | MRI machine ≠ radiology dept | Rules moved but workflows/playbooks absent
    Threat model refresh | Align to modern attack surface | Locking doors, window open | Missing identity and cloud detections
    Data engineering | Ingestion quality drives detection quality | GPS with wrong map | Bad connectors/fields → brittle KQL
    Cost planning | Security includes financial design | No storage lifecycle policy | Ingest-all → surprise bill → logging cuts
    Parallel validation | Avoid blind cutover | Test cameras at night | Run both SIEMs, compare misses/noise
    Outcomes > go-live | Measure improvements | App launch ≠ adoption | Coverage + fidelity + SOC efficiency
    Use built-ins | Don’t rebuild what exists | Power drill used as screwdriver | Enable/tune built-in analytics + correlations
    Rule rationalization | Quality over quantity | Junk drawer migration | Remove duplicates/obsolete rules
    Cloud-native mindset | Different architecture | Streaming vs DVDs | Avoid on-prem performance assumptions
    Planning first | Architecture is leverage | Bad blueprint scales | No ingestion blueprint/cost model/governance
    Legacy lens | Recreating old behavior | Hybrid car stuck in 1st gear | Force parity dashboards, ignore automation

    8. What’s Next

    Next blog idea: “Sentinel Migration Blueprint: A Step-by-Step Plan (Threat Model → Logging → Detections → SOC Ops → Cost)”
    Including a checklist and example success metrics.


    9. 🌞 The Last Sun Rays…

    So yes — migration is not copying the past. It’s redesigning detection for a cloud-native world.

    • Lift-and-shift? Easy — and usually noisy.
    • Redesign? Harder — but that’s where posture improves.
    • Success isn’t “we went live.” It’s “we detect more, waste less, and respond faster — predictably.”

    Reflective question: If you had to pick one thing to prove your migration actually improved security — coverage, false positive rate, MTTD, MTTR, or cost predictability — which would you put on the dashboard first?

  • Chapter 5 – How NOT to Govern Microsoft Sentinel Operations

    Why “Just Turn It On” Becomes “Why Is Everything On Fire?”

    Think of Sentinel governance like:

    • Air traffic control without radar — planes are flying, but no one knows who’s landing or crashing.
    • A hospital ER with no triage — everyone is “urgent,” so nothing actually is.
    • A city with traffic lights but no traffic rules — motion everywhere, safety nowhere.

    Sentinel will run without governance.
    It just won’t protect you.


    Why This Matters (Context)

    Most Sentinel failures don’t happen because of bad analytics or missing logs.
    They happen because operations were never governed.

    When governance is missing:

    • SOC teams burn out
    • Alerts pile up unchecked
    • Leadership loses trust in security metrics
    • Incidents take longer — or never get resolved

    Sentinel becomes expensive visibility, not operational security.


    Core Governance Anti-Patterns (Explained Simply)

    Let’s break down the most common ways Sentinel governance fails — and why each one hurts.


    1. No Change Control

    Technical Definition
    Changes to analytics rules, playbooks, data connectors, or workbooks are made without approval, tracking, or rollback.

    Everyday Example
    Anyone can move the furniture in a fire station — including blocking the exits.

    Technical Example

    • SOC analyst edits a detection rule in production
    • False positives spike
    • No one knows who changed what or why

    Result: Unstable detections and incident chaos.


    2. No Documentation

    Technical Definition
    Sentinel configurations exist only in people’s heads, chats, or tribal memory.

    Everyday Example
    A recipe passed by word of mouth — until the chef quits.

    Technical Example

    • Alerts fire with cryptic names
    • No runbooks
    • No explanation of logic, thresholds, or response steps

    Result: Slow response and dependency on “that one person.”


    3. Too Much or No Governance

    Technical Definition
    Either every change requires bureaucracy, or nothing is controlled at all.

    Everyday Example

    • Too much: You need a board meeting to change a light bulb
    • Too little: Anyone rewires the building

    Technical Example

    • Over-governance: SOC can’t tune noisy rules
    • Under-governance: Junior analysts disable detections to reduce noise

    Result: Either stagnation or silent security gaps.


    4. No Measurement Loop (MTTD / MTTR / Noise)

    Technical Definition
    No metrics exist to measure Sentinel’s effectiveness.

    Everyday Example
    A fitness plan without a scale, stopwatch, or mirror.

    Technical Example

    • No Mean Time To Detect (MTTD)
    • No Mean Time To Respond (MTTR)
    • No alert-to-incident ratio tracking

    Result: Leadership asks, “Is Sentinel working?”
    And no one can answer.
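The answer can come from Sentinel itself. A minimal sketch over the SecurityIncident table, assuming analysts actually close incidents in Sentinel (otherwise ClosedTime is meaningless):

// Mean time to respond (creation to closure) per severity, last 30 days
SecurityIncident
| where TimeGenerated > ago(30d)
| summarize arg_max(TimeGenerated, Status, Severity, CreatedTime, ClosedTime) by IncidentNumber
| where Status == "Closed"
| extend HoursToClose = datetime_diff('hour', ClosedTime, CreatedTime)
| summarize MTTR_Hours = avg(HoursToClose), ClosedIncidents = count() by Severity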


    5. No Content Lifecycle Ownership

    Technical Definition
    Analytics rules and playbooks are deployed but never reviewed, tuned, or retired.

    Everyday Example
    Smoke alarms installed once — never tested again.

    Technical Example

    • Rules fire on legacy systems that no longer exist
    • Playbooks reference deprecated APIs
    • No owner reviews detection quality quarterly

    Result: Noise increases while value decreases.


    6. No RACI (Responsible, Accountable, Consulted, Informed)

    Technical Definition
    Ownership of Sentinel components is unclear.

    Everyday Example
    Everyone assumes someone else is taking out the trash.

    Technical Example

    • Who owns detections?
    • Who approves changes?
    • Who tunes false positives?

    Result: Alerts get ignored because “that’s not my job.”


    7. No Established Process

    Technical Definition
    Incident handling, tuning, onboarding, and offboarding are ad-hoc.

    Everyday Example
    Fire drills invented during the fire.

    Technical Example

    • No standard incident workflow
    • No tuning cadence
    • No onboarding checklist for new data sources

    Result: Inconsistent outcomes and analyst fatigue.


    8. No Framework Alignment

    Technical Definition
    Sentinel detections are not mapped to security frameworks.

    Everyday Example
    Training for a marathon without knowing the race distance.

    Technical Example

    • Detections not aligned to MITRE ATT&CK
    • No coverage visibility
    • Leadership can’t assess risk reduction

    Result: Security theater instead of security strategy.


    Real-World Case Study

    Failure Case: “Alert Avalanche”

    Situation
    A global company enabled Sentinel rapidly during cloud migration.

    What Went Wrong

    • No RACI
    • No metrics
    • No lifecycle ownership

    Impact

    • 18,000 alerts/week
    • Analysts ignored high-severity incidents
    • Leadership questioned Sentinel ROI

    Lesson
    Visibility without governance increases risk.


    Success Case: “Measured, Managed SOC”

    Situation
    Another organization paused expansion and fixed governance first.

    What They Did

    • Defined RACI
    • Implemented MTTD/MTTR tracking
    • Assigned rule owners
    • Quarterly detection reviews

    Impact

    • 65% noise reduction
    • Faster incident closure
    • Clear executive reporting

    Lesson
    Governance amplifies Sentinel’s value.


    Action Framework: Prevent → Detect → Respond

    [ Design ] → [ Measure ] → [ Improve ]
         ↓           ↓            ↓
     Governance   Metrics     Continuous Tuning
    

    Prevent

    • Enforce change control
    • Define RACI
    • Align detections to frameworks

    Detect

    • Track MTTD / MTTR
    • Monitor alert noise
    • Review rule effectiveness

    Respond

    • Document playbooks
    • Test automation
    • Retire stale content

    Key Differences to Keep in Mind

    1. Visibility vs Security
      Seeing alerts ≠ stopping threats
    2. Governance vs Bureaucracy
      Controls should enable speed, not kill it
    3. Metrics vs Vanity Numbers
      Alert count means nothing without context

    Summary Table

    Concept | Definition | Everyday Example | Technical Example
    Change Control | Managed configuration updates | Locking emergency exits | Approved rule edits
    Documentation | Shared operational knowledge | Written recipe | Runbooks
    Metrics | Effectiveness measurement | Fitness tracking | MTTD / MTTR
    RACI | Ownership clarity | Assigned chores | Rule ownership
    Lifecycle | Ongoing content care | Smoke alarm testing | Rule reviews
    Frameworks | Strategic alignment | Training plan | MITRE mapping

    What’s Next

    In the next post, we’ll flip the script:

    “How to Build a Sentinel Governance Model That Actually Works”
    → Roles
    → Metrics
    → Operating cadence
    → Executive-ready dashboards


    🌞 The Last Sun Rays…

    Sentinel doesn’t fail because it lacks features.
    It fails because operations lack structure.

    If you had to choose one governance metric to put on your SOC dashboard tomorrow —
    would it measure noise, speed, or accountability?

    ☀️

  • Chapter 4 – How NOT to Test Sentinel — and the Exact Tests to Add Today

    Hook:

    • If detections were sprinklers, would you assume they work… without ever pulling the test lever?
    • If logs were ingredients, would you bake a cake with half the labels missing and hope it rises?

    This is your practical checklist for turning noisy, brittle rules into a trustworthy detection system.


    Why It’s Needed (Context)

    Most Sentinel rollouts fail quietly—not because detections are wrong, but because tests don’t exist. The result: untriggered use-cases, malformed logs, slow KQL (Kusto Query Language) queries, no attack replay, and alert queues that either flood analysts or go silent. In other words: assumptions > evidence. We’ll flip that.


    Core Concepts Explained Simply

    We’ll stick to one analogy: kitchen & recipe (ingredients = logs, recipe = KQL rule, oven = pipeline/latency, taste test = simulation).

    1) Use-Case Validation

    • Technical Definition: Prove each analytic maps to a clear objective, required signals, ATT&CK technique, and expected alert outcome.
    • Everyday Example: Check the recipe actually makes cake (not bread) and yields 8 slices.
    • Technical Example: “Detect risky OAuth consent” needs Entra audit + consent events; trigger both benign and malicious grants and confirm alert fields, severity, and entities.

    2) Log Validation (Format, Fields, Completeness)

    • Technical Definition: Verify incoming events conform to schema (fields, types, timestamps), parsing, time skew, and completeness.
    • Everyday Example: Make sure the ingredients are labeled, fresh, and the right quantity.
    • Technical Example: Validate TimeGenerated, UserPrincipalName, AppId exist and parse; reject or quarantine events missing required fields; track % malformed per source.
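As a concrete instance of the "% malformed per source" metric, a minimal KQL sketch against the standard SigninLogs schema (swap in the required fields for your own sources):

// Share of sign-in events missing required fields, last 24 hours
SigninLogs
| where TimeGenerated > ago(1d)
| summarize
    Total = count(),
    MissingUpn = countif(isempty(UserPrincipalName)),
    MissingAppId = countif(isempty(AppId))
| extend MalformedPct = round(100.0 * (MissingUpn + MissingAppId) / Total, 2)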

    3) Log Coverage (Expected Sources Present)

    • Technical Definition: Confirm every required log type (e.g., identity, endpoint, SaaS, IaaS) is actually arriving for the target scope.
    • Everyday Example: Ensure you actually bought eggs, flour, sugar—not just sugar.
    • Technical Example: Coverage matrix for subscriptions/tenants: M365 audit ✅, Entra sign-in ✅, Endpoint EDR ❌ → detection blocked until fixed.
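The coverage matrix can be backed by a freshness query. A minimal sketch, assuming the detection needs SigninLogs, AuditLogs, and SecurityEvent (substitute your own required tables; a table that is missing entirely produces no row at all, which is itself a gap to flag):

// Last-seen timestamp per required table; silence beyond the threshold blocks deployment
union isfuzzy=true SigninLogs, AuditLogs, SecurityEvent
| summarize LastSeen = max(TimeGenerated) by Type
| extend HoursSilent = datetime_diff('hour', now(), LastSeen)
| where HoursSilent > 4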

    4) KQL Performance

    • Technical Definition: Measure query runtime, memory, and stability at 1×–5× data volume; optimize with filters, summarize, arg_max, materialized views.
    • Everyday Example: Preheat the oven and time the bake.
    • Technical Example: Replace 30-day cross-joins with pre-aggregations; keep rule runtime P95 < rule schedule interval/2.

    5) Attack Simulation / Replay

    • Technical Definition: Execute synthetic techniques or replay sanitized incident payloads to validate end-to-end detection & response.
    • Everyday Example: Taste test before serving.
    • Technical Example: Atomic test for token theft + replay of real OAuth abuse JSON; verify alert, incident, and playbook actions.

    6) Volume & Latency

    • Technical Definition: Stress ingest and measure end-to-end time: event → ingestion → rule → alert → automation.
    • Everyday Example: Can the oven handle two trays at once without undercooking?
    • Technical Example: Track SLIs (Service Level Indicators): data lag, rule runtime, alert creation delay; set SLOs (Service Level Objectives) like “P95 alert latency < 3 min”.
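The "event → ingestion" leg of that SLI chain is measurable with the built-in ingestion_time() function. A minimal sketch (alert-creation delay would need SecurityAlert timestamps on top):

// P95 ingestion lag in seconds for sign-in events, last hour
SigninLogs
| where TimeGenerated > ago(1h)
| extend LagSeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize P95LagSeconds = percentile(LagSeconds, 95)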

    7) False Positives (FP) Review

    • Technical Definition: Quantify precision/recall, label outcomes, tune thresholds and allow/deny lists.
    • Everyday Example: If every dish tastes “too salty,” your measuring spoon is wrong.
    • Technical Example: Weekly FP board: rule, reason, proposed tuning; ship suppressions with expiration + owner.

    8) Alert Volume Health

    • Technical Definition: Balance alert count with analyst capacity; enforce budgets and auto-triage.
    • Everyday Example: One chef can’t plate 300 orders in 10 minutes.
    • Technical Example: If daily alerts > (analysts × handling rate), route low-sev to batched review; auto-close stale low-value patterns with audit trail.
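Budgets need a measured baseline first. A minimal sketch of daily alert volume by severity, ready to compare against analyst capacity:

// Daily alert volume by severity, last 14 days
SecurityAlert
| where TimeGenerated > ago(14d)
| summarize Alerts = count() by AlertSeverity, Day = bin(TimeGenerated, 1d)
| order by Day asc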

    Real-World Case Study

    Failure — “The Silent Rule”

    • Situation: A team wrote a beautiful KQL detection for lateral movement but never did log coverage checks. Endpoint EDR wasn’t connected in one region.
    • Impact: Attack in that region generated zero alerts; discovery took days.
    • Lesson: No logs → no detection. Coverage gates before rule deployment.

    Success — “Replay Saved the Release”

    • Situation: Another team kept a monthly replay set of sanitized OAuth abuse logs. A parser update broke AppId extraction; replay caught it within an hour.
    • Impact: Hotfix shipped same day; production detections never regressed.
    • Lesson: Known-bad payloads are your smoke—use them routinely.

    Action Framework — Prevent → Detect → Respond

    Prevent (build the right scaffolding)

    • Detection Charter per use-case: Objective, signals, ATT&CK mapping, owner, expected volume.
    • Data Gates: Ingest → Parse → Schema validate → Coverage check (fail closed on missing critical fields).
    • KQL Guardrails: Time-scoped filters first; pre-aggregate hot paths; avoid broad cross-joins.
    • SLOs: P95 rule runtime, P95 alert latency, % malformed < 0.5%.

    Detect (prove it continuously)

    • Unit Tests for KQL: Given sample rows → expected rows (pass/fail).
    • Integration Tests: Ingest sample → rule fires → incident fields populated (entity, severity, tactic).
    • Replay Library: Keep sanitized JSON/CSV/PCAP from real incidents, tagged by ATT&CK.
    • Load Tests: 1×/2×/5× peak; record lag and schedule drift.

    Respond (close the loop fast)

    • Playbook Tests: Enrichment, assignment, ticket creation; alert on playbook failure.
    • Weekly FP/Tuning: Track precision; suppress with expiry; re-run replay after tuning.
    • Queue Health: Alert budgets per tier; overflow routing; executive dashboard on MTTD/MTTR (Mean Time To Detect/Respond).

    ASCII Pipeline (where to measure)

    [Source] -> [Ingest] -> [Parse/Normalize] -> [KQL Rule] -> [Alert] -> [Playbook] -> [Ticket]
       SLI:lag     SLI:drop%      SLI:schema ok        SLI:runtime    SLI:create    SLI:exec     SLI:ack
    

    Key Differences to Keep in Mind

    1. Validation vs. Enablement — Turning on rules ≠ proving they catch your scenario. Example: OAuth abuse rule enabled, but consent events never ingested.
    2. Correctness vs. Timeliness — Accurate but late alerts still lose. Example: 120-second query on a 60-second schedule.
    3. Format vs. Coverage — Perfectly parsed logs from some sources aren’t enough. Example: No EDR in Region A → blind spot.
    4. Suppression vs. Tuning — Blanket mutes hide real attacks. Example: Global VPN ASN allow-list masks exfil via consumer VPNs.
    5. One-off Tests vs. Continuous Replay — Parsers change; proofs must repeat. Example: Monthly replay catches field regressions early.

    Summary Table

    Concept | Definition | Everyday Example | Technical Example
    Use-Case Validation | Prove rule matches a real scenario & outcome | Recipe yields cake, 8 slices | Trigger benign & malicious OAuth consent, verify alert details
    Log Validation | Schema, fields, time, completeness | Ingredients labeled & fresh | Enforce TimeGenerated, entity fields; measure % malformed
    Log Coverage | Required sources actually arrive | You bought eggs, flour, sugar | Coverage matrix across tenants/regions; block deploy on gaps
    KQL Performance | Runtime & efficiency under load | Oven preheated & timed | P95 runtime < schedule/2; use summarize, materialized views
    Attack Simulation/Replay | Synthetic or real payloads end-to-end | Taste test before serving | Atomic tests + replay of sanitized incident logs
    Volume & Latency | E2E timing at 1×–5× | Two trays in oven still bake | Track lag, schedule drift, alert creation delay
    False Positives Check | Measure precision; tune safely | Fix salty measuring spoon | Weekly FP board; expiring suppressions with owner
    Alert Volume Health | Match alerts to capacity | One chef vs. 300 plates | Budgets, batching, auto-triage for low-sev

    What’s Next

    Up next: “From Hypothesis to High-Fidelity: Designing one Sentinel detection with a test suite, replay pack, and SLOs.” We’ll build one end-to-end and publish the exact checklist.


    🌞 The Last Sun Rays…

    Answering the hooks:

    • Sprinklers without the test lever? Run replay and integration tests.
    • Half-labeled ingredients? Enforce schema + coverage gates before rules.

    Your 30-minute win for tomorrow:

    1. Pick one high-value rule.
    2. Add a coverage gate (all required tables present).
    3. Add a replay test with a sanitized payload.
    4. Record P95 alert latency after one day.

    Reflection: If you could show leadership just one metric next week, would it be precision, E2E latency, or queue health—and what decision will it unlock?

  • Chapter 3 — How Not to Design Detection Use-Cases (and What to Do Instead)

    1) Title + Hook

    • Building detections without the right logs is like writing movie reviews without watching the films.
    • Marking everything “High” severity is a smoke alarm that screams for toast and wildfires alike.
    • Ten sloppy rules beat one attacker—once. One precise rule beats ten attackers—daily.

    This guide spotlights the anti-patterns that quietly wreck detection programs—and the fixes that make them resilient.


    2) Why It’s Needed (Context)

    Detection use-cases are your SIEM/SOAR’s north star. When they’re vague, noisy, or unmoored from telemetry, you pay in three currencies: alert fatigue, missed intrusions, and lost credibility with engineering and leadership. We’ll decode the classic mistakes and give you a playbook to align detections with MITRE ATT&CK (Adversarial Tactics, Techniques & Common Knowledge) and real attacker paths.


    3) Core Concepts Explained Simply

    A) Use-Cases Enabled but Logs Missing

    • Technical definition: Analytics rules exist, but prerequisite telemetry (tables/fields) is absent, late, or malformed.
    • Everyday example: Setting up a coffee machine with no water line.
    • Technical example: A credential-stuffing rule depends on SigninLogs risk state, but RiskyUsers connector isn’t enabled—rule never fires.

    B) Everything Marked “High” Severity

    • Technical definition: Flat severity model (all High/Critical) that ignores confidence, impact, and enrichment.
    • Everyday example: All emails marked “urgent”—soon, none are.
    • Technical example: Port scan, failed logins, and confirmed egress beaconing all assigned “High,” drowning triage.

    C) No Incident / Alert Grouping

    • Technical definition: Alerts remain atomic; no correlation by entity/time/TTP (Tactics, Techniques, and Procedures).
    • Everyday example: Treating 20 notifications from the same delivery as 20 separate packages.
    • Technical example: Multiple SecurityEvent 4625 failures from one host generate 50 incidents instead of one grouped brute-force case.
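The grouped form of that brute-force case looks like the sketch below, using the standard SecurityEvent schema (Sentinel's built-in incident grouping settings can achieve a similar effect without custom KQL):

// One grouped signal per account/host/30-minute window instead of 50 atomic alerts
SecurityEvent
| where TimeGenerated > ago(1d)
| where EventID == 4625   // failed logon
| summarize FailedLogons = count(), SourceIPs = make_set(IpAddress, 10)
    by TargetAccount, Computer, bin(TimeGenerated, 30m)
| where FailedLogons >= 20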

    D) Alerts with Zero Context

    • Technical definition: Alerts lack entity resolution, enrichment, or links to playbooks and knowledge.
    • Everyday example: A fire alarm with no floor or room number.
    • Technical example: “Suspicious PowerShell” with no command line, user SID, parent process, or MITRE technique tag.

    E) No Standard Parsing / Field Mismatch

    • Technical definition: Inconsistent schemas; fields named differently across sources; missing ASIM (Advanced Security Information Model) normalization.
    • Everyday example: Mixing metric and imperial tools in the same toolbox.
    • Technical example: src_ip vs SourceIP vs ClientIP break joins; URL field sometimes base64, sometimes plain.
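Until full ASIM normalization is enforced at ingest, a parser function can at least reconcile the variants. A minimal sketch; the MyVendorLogs_CL table and its raw field names are hypothetical, while SrcIpAddr follows ASIM naming:

// Coalesce inconsistent source-IP field names into one ASIM-style column
MyVendorLogs_CL
| extend SrcIpAddr = case(
    isnotempty(column_ifexists("src_ip_s", "")), column_ifexists("src_ip_s", ""),
    isnotempty(column_ifexists("SourceIP_s", "")), column_ifexists("SourceIP_s", ""),
    column_ifexists("ClientIP_s", ""))
| where isnotempty(SrcIpAddr)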

    F) Poor KQL Hygiene

    • Technical definition: Inefficient or brittle KQL (Kusto Query Language): wildcard scans, no summarization windows, time drift, or unbounded joins.
    • Everyday example: Searching a library by reading every page of every book.
    • Technical example: | where tostring(CommandLine) contains "mimikatz" across * tables without time or table scoping.
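The scoped equivalent pins one table and a tight time window before any string matching. A minimal sketch, assuming Defender for Endpoint process telemetry (DeviceProcessEvents) is connected:

// Table- and time-scoped hunt; 'has' uses indexed term matching, unlike 'contains'
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| where ProcessCommandLine has "mimikatz"
| project TimeGenerated, DeviceName, AccountName, ProcessCommandLine, InitiatingProcessFileName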

    G) Quantity Over Quality (Rule Count Vanity)

    • Technical definition: Optimizing for number of rules, not precision, recall, or mean time to detect (MTTD).
    • Everyday example: Owning 50 kitchen knives but still using a butter knife.
    • Technical example: 300 rules with <0.5% true-positive rate; no retirement of deadweight rules.

    H) No MITRE ATT&CK Coverage Mapping

    • Technical definition: Detections aren’t mapped to techniques/sub-techniques; gaps unknown.
    • Everyday example: Playing chess without knowing pieces or the board.
    • Technical example: Great coverage for execution (T1059) but nothing for discovery (T1087) or privilege escalation (T1068).

    I) No Log-Source → Use-Case Coverage Mapping

    • Technical definition: No matrix that shows which use-cases rely on which sources/fields.
    • Everyday example: Not knowing which ingredient makes which dish.
    • Technical example: Disabling DNS logs breaks exfiltration detections—but no one realizes until after an incident.

    J) Detections Not Mapped to Attacker Paths

    • Technical definition: Rules exist in isolation, not aligned to attack chains/kill-chains or common adversary playbooks.
    • Everyday example: Locking your front door but leaving the windows wide open.
    • Technical example: Excellent ransomware encryption alerts, but zero coverage for initial access (phish), lateral movement (RDP), or data staging.

    4) Real-World Case Study

    Failure — The “Everything High” Breach

    • Situation: A healthcare provider had 240 Sentinel rules. 80% were “High.” No grouping, weak enrichment.
    • Impact: Analysts ignored 30+ failed-login bursts tied to a compromised VPN account; beaconing went unnoticed for 5 days.
    • Lesson: Severity discipline + grouping + enrichment would have collapsed 120 noisy alerts into 3 actionable incidents.

    Success — Use-Case Contracts & ATT&CK Map

    • Situation: A fintech created “Detection Contracts”: each rule listed required fields, data sources, ATT&CK technique, severity rubric, and sample incidents. Built a source↔use-case matrix and an ATT&CK heatmap.
    • Impact: -42% alert volume, +31% true-positive rate, MTTD down from 6h to 90m.
    • Lesson: Treat detections as products with inputs/outputs and SLOs.

    5) Action Framework — Prevent → Detect → Respond

    Prevent (Design Right)

    • Define detection contracts:
      • Intent: threat, ATT&CK T#
      • Inputs: tables + fields (with ASIM names)
      • Logic: KQL with test cases
      • Severity rubric: impact × confidence
      • Ops: owner, SLOs (latency, FP rate), links to runbooks
    • Build the Coverage Matrix: Use-case (rows) × Sources/Fields (columns). Color by criticality.
    • Normalize early: Enforce ASIM (or your schema) at ingest; ban ad-hoc field names.
    • Set a severity policy: E.g., Critical = confirmed malicious + material impact; High = high confidence + privileged entity; Medium/Low with clear auto-closure criteria.

    Detect (Run Well)

    • Group intelligently: Entity-based (user, host, IP), time-windowed (e.g., 30–60 min), TTP-aware correlation.
    • Enrich alerts: Entity resolution (UEBA), asset tags, geolocation, exposure (internet-facing), vuln context (CVSS).
    • Harden KQL (one concrete sketch follows this list):
      • Scope tables & time (project before join).
      • Use make-series, summarize with bins, toscalar for thresholds.
      • Add null/format checks and time-zone normalization.
    • Measure quality: Track precision, recall, FP rate, FNR, and rule runtime. Retire or refactor rules quarterly.
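As one concrete instance of the KQL-hardening bullet above, a minimal sketch of a time-bucketed series with gaps filled, assuming SigninLogs is ingested (in SigninLogs a non-zero ResultType indicates a failed sign-in):

// Hourly failed sign-in series per user; default = 0 keeps thresholds stable across gaps
SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType != "0"
| make-series FailedSignins = count() default = 0
    on TimeGenerated from ago(1d) to now() step 1h
    by UserPrincipalName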

    Respond (Improve Fast)

    • Playbooks (SOAR): Map each severity to a minimal response checklist; automate enrichment and ticketing.
    • Drill with simulations: Use Atomic Red Team/ATT&CK emulations; confirm end-to-end (log present → rule fires → grouped → playbook runs).
    • Feedback loop: Every false positive updates the contract (logic or enrichment). Every miss creates a backlog item with ATT&CK mapping.

    6) Key Differences to Keep in Mind

    1. Severity vs Priority — Severity = inherent risk; Priority = queue order (contextual).
      • Scenario: A “Medium” alert on a domain admin in production becomes top priority.
    2. Alert vs Incident — Alerts are signals; incidents are stories (grouped evidence).
      • Scenario: 15 brute-force alerts across users → 1 incident with attacker IP, timeframe, and impact.
    3. Rule Count vs Coverage Quality — More rules ≠ better defense.
      • Scenario: 60 well-mapped detections covering ATT&CK tactics beat 300 shallow ones.
    4. Detection Logic vs Enrichment — Logic finds; enrichment explains.
      • Scenario: A hash match (logic) + EDR verdict + VT score + asset criticality (enrichment) drives faster action.
    5. Schema Normalization vs Parser Sprawl — One language, fewer bugs.
      • Scenario: ASIM fields (SrcIp, DstIp, User) enable reusable joins and content packs.

    7) Summary Table

    Concept | Definition | Everyday Example | Technical Example
    Logs missing | Rule needs data that isn’t there | Coffee machine w/o water | SigninLogs dependencies not connected
    All High severity | Flat model; no nuance | Everything marked “urgent” | Port scan = High same as C2 beacon
    No alert grouping | No correlation into incidents | 20 packages treated separately | 50 4625s = 50 incidents, not 1
    Zero context | No enrichment/links | Fire alarm w/o floor | No command line, no parent PID
    Field mismatch | Inconsistent schemas | Metric vs imperial mix | src_ip vs SourceIP breaks joins
    Poor KQL hygiene | Inefficient/brittle queries | Reading every page to search | Unbounded contains across *
    Rule vanity | Optimize for count not quality | 50 knives, use one | 300 rules, <0.5% TP
    No ATT&CK mapping | No technique coverage view | Playing chess blind | Gaps in discovery/priv-esc
    No source mapping | No data→use-case matrix | Unknown ingredients | DNS disabled breaks exfil rules
    Not on attacker paths | No kill-chain alignment | Lock door, open windows | Encrypt detect but no lateral move detect

    8) ASCII Diagram — Detection Product Loop

    [ATT&CK Technique] → [Detection Contract] → [KQL Logic]
            ↓                     ↓                   ↓
     [Required Sources/Fields] → [Normalization/ASIM] → [Alert Enrichment]
            ↓                     ↓                   ↓
         [Grouping/Incidents] → [Severity Policy] → [SOAR Playbook]
            ↓
       [Metrics: Precision | Recall | FP/FN | Latency]
            ↓
       [Refactor/Retire]  ←——  [Purple Team Tests]
    

    9) What’s Next

    Next in this series: “Detection Contracts in Practice: A Step-by-Step Template (with KQL patterns and ATT&CK mapping).” We’ll publish a fill-in-the-blanks worksheet plus sample tests.


    🌞 The Last Sun Rays…

    Hook answers:

    • Don’t write reviews without watching the film—connect detections to telemetry and verify it’s present.
    • Don’t let every toaster trip the fire alarm—calibrate severity and group signals into incidents.
    • Don’t collect knives for the drawer—optimize for coverage quality, not rule count.

    Your turn: If you could only fix one thing this week, would you choose severity discipline, schema normalization, or source↔use-case mapping—and how would you prove it worked (which metric first)?

  • Chapter 2 — How Not to Design Log Sources (with Microsoft Sentinel)

    1) Title + Hook

    Hook:

    • Treating Microsoft Sentinel like a “Dropbox for logs” is like buying a cargo ship to mail a postcard.
    • Pouring every signal into your Security Information and Event Management (SIEM) is like turning on every light in a stadium to find your keys—bright, expensive, and still not helpful.

    This post shows the anti-patterns that quietly destroy SIEM value—and what to do instead.


    2) Why It’s Needed (Context)

    Security teams love visibility. Finance teams hate surprise bills. Engineering hates noise.
    When log-source design is sloppy, you get: runaway costs, alert fatigue, blind spots, and weak investigations.
    Microsoft Sentinel is powerful, but it’s metered. Bad choices at the ingest layer ripple into detect, respond, and retain layers.


    3) Core Concepts Explained Simply

    A) “Collect-Everything” Ingestion → Huge Costs

    • Technical definition: Ingesting all available telemetry without scoping by use case, severity, or deduplication—often at high-cost data tables (e.g., SecurityAlert, CommonSecurityLog, Syslog with verbose facilities).
    • Everyday example: Subscribing to every streaming service “just in case,” then watching YouTube.
    • Technical example: Forwarding full Endpoint Detection and Response (EDR) raw telemetry and verbose Windows Event Forwarding (WEF) for the same hosts, plus firewall flows at 1:1 cadence—no filters.

    B) Logs Collected but Not Used

    • Technical definition: Sources ingested with no mapped analytics rules, hunting queries, or workbooks.
    • Everyday example: Paying for a gym you never visit.
    • Technical example: Shipping detailed DNS logs but no detections/queries reference them; no Kusto Query Language (KQL) saved searches.

    C) No Retention & Archival Strategy

    • Technical definition: Single retention setting for all tables; no hot/cold split, no Azure Data Explorer (ADX) or Azure Blob/Archive offload, and no legal hold mapping.
    • Everyday example: Keeping all photos on your phone forever—until it’s full right before a trip.
    • Technical example: 180-day retention for chatty Syslog/CommonSecurityLog tables when only 30 days are needed for detections; no archive to cheaper storage.

    D) Custom Logs over Native Connectors

    • Technical definition: Using custom ingestion (HTTP API, custom tables) instead of Microsoft Sentinel data connectors that provide schemas, Advanced Security Information Model (ASIM) normalization, and content packs.
    • Everyday example: Cooking from scratch when a healthy, cheaper meal kit exists.
    • Technical example: Parsing Palo Alto logs via custom functions instead of the native connector and ASIM mapping—losing built-in analytics.

    E) Duplicate Telemetry from Multiple Pipelines

    • Technical definition: Same events reaching Sentinel via parallel paths (e.g., agent + syslog forwarder + third-party pipeline), creating cost bloat and duplicate alerts.
    • Everyday example: Getting the same bank alerts by SMS, email, app, and phone call—annoying and redundant.
    • Technical example: Windows events ingested from both Azure Monitor Agent (AMA) and a legacy Log Analytics agent (MMA); cloud audit logs via both native connector and a custom ingestion app.

    F) No Log Validation

    • Technical definition: Lack of pre-ingest checks for schema, timestamps, severity, and required fields; no Service Level Objectives (SLOs) for delay, completeness, or deduplication.
    • Everyday example: Accepting every delivery without checking the box contents.
    • Technical example: Timestamps ingested in local time, breaking correlation; device hostname missing → entity mapping fails; uneven daily volume with silent drops.
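Even without a pre-ingest gate, field-level checks can run as a scheduled health query. A minimal sketch over CEF/syslog data in CommonSecurityLog (adjust the required fields per source):

// Daily health check: missing hostnames and future-dated timestamps per vendor
CommonSecurityLog
| where TimeGenerated > ago(1d)
| summarize
    Events = count(),
    MissingHostname = countif(isempty(Computer)),
    FutureTimestamps = countif(TimeGenerated > now())
    by DeviceVendor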

    4) Real-World Case Study

    Failure — The $180k Surprise

    • Situation: A global SaaS firm enabled “everything” from firewalls, proxies, endpoints, and cloud audit logs. No content mapped; no filtering; 180-day retention on all tables.
    • Impact: Monthly Sentinel bill spiked by 60%. Analysts drowned in duplicate alerts; incident MTTR (Mean Time To Remediate) rose from 9h to 16h.
    • Lesson: Cost without context adds negative value. Start with use cases → data needed → retention tiering.

    Success — Use-Case-Driven Design

    • Situation: A fintech defined 12 priority detections (credential misuse, exfiltration, MFA bypass). They mapped required fields to ASIM schemas and trimmed sources to those fields.
    • Impact: 37% ingest reduction, +22% detection precision, 2× faster hunts due to consistent entity mapping.
    • Lesson: Design sources to serve detections, not the other way around.

    5) Action Framework — Prevent → Detect → Respond

    Prevent (Design & Cost Control)

    • Define top 15–20 detections first; list required fields (IP, User, Device, App, Action, Result, Timestamp TZ).
    • Prefer native connectors + ASIM; only custom when absolutely necessary.
    • Build ingestion policies: include tables, exclude noise (facility/level filters, sampling for flows).
    • Implement tiered retention:
      • Hot (30–60 days): detection & investigation.
      • Cold/Archive (6–12 months+): compliance, rare hunts (use ADX/Blob).
    • Prevent duplicates: one authoritative pipeline per source; document routing.

    Detect (Quality & Coverage)

    • For each table, create at least one analytic rule and one scheduled query that uses it.
    • Enforce schema validation in parsing functions; normalize to ASIM.
    • Track signal health KPIs: daily event count deltas, null critical fields, late arrivals (>10 min), duplication rate.
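The daily-delta KPI in that last bullet can come straight from the Usage table. A minimal sketch that flags tables whose ingest dropped more than 40% day-over-day (a common symptom of a silently failing source):

// Day-over-day ingest drop per table
Usage
| where TimeGenerated > ago(3d)
| summarize DailyMB = sum(Quantity) by DataType, Day = bin(TimeGenerated, 1d)
| sort by DataType asc, Day asc
| extend PrevMB = iif(prev(DataType) == DataType, prev(DailyMB), real(null))
| extend DropPct = round(100.0 * (PrevMB - DailyMB) / PrevMB, 1)
| where DropPct > 40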

    Respond (Operate & Improve)

    • Build a workbook: cost by table, events by connector, rule hits by source.
    • Automate feedback loops: when an analytic fires with low confidence, refine source fields/filters.
    • Quarterly table review: drop unused sources, move low-value logs to archive, merge pipelines.

    6) Key Differences to Keep in Mind

    1. Native vs Custom Ingest — Native brings schemas/content; custom brings flexibility & maintenance.
      • Scenario: Choose native for popular firewalls; custom only when niche vendor lacks support.
    2. Hot vs Cold Retention — Hot is for speed; cold is for savings.
      • Scenario: Keep 30 days hot for IR (Incident Response); move month 2–12 to archive.
    3. Field Completeness vs Volume — Fewer, richer events beat many shallow events.
      • Scenario: Keep DNS with query, response, client IP; drop verbose debug flags.
    4. One Pipeline vs Many — Single route is traceable; multiple routes multiply duplicates.
      • Scenario: Consolidate to AMA; retire MMA and third-party forwarders.
    5. Use-Case vs Curiosity — Detections drive data; curiosity drives cost.
      • Scenario: Only ingest proxy categories needed for DLP (Data Loss Prevention) alerts.

    7) Summary Table

    Concept | Definition | Everyday Example | Technical Example
    Collect-everything ingestion | Ingest all signals without scoping/filters | Subscribing to every streaming service | EDR + WEF + flow logs all verbose to Sentinel
    Unused logs | Data with no rules/queries/workbooks | Paying for a gym you don’t use | DNS ingested but no KQL uses it
    No retention strategy | One-size retention; no hot/cold | Keeping all photos on phone forever | 180 days on Syslog with no archive
    Custom over native | DIY ingestion instead of connectors | Cooking from scratch vs meal kit | Custom Palo Alto parsing vs native + ASIM
    Duplicate telemetry | Same events via multiple routes | Bank alerts by SMS/email/app/phone | AMA + MMA + syslog duplicating Windows events
    No validation | No checks for schema/time/fields | Accepting packages uninspected | Local-time timestamps; missing hostname

    8) ASCII Diagram (Signal Health Funnel)

    [Sources] --(validated, deduped)--> [Normalization/ASIM]
                 \--x duplicates drop--/          |
                                                  v
                                       [Analytic Rules & Hunts]
                                                  |
                                                  v
                                        [Incidents & Response]
                                                  |
                                                  v
                               [Retention: Hot 30-60d | Archive 6-12m+]
    

    9) What’s Next

    Next in this series: “Designing a Use-Case-First Log Strategy for Sentinel: From Detections to Data Contracts.” We’ll publish a field-tested worksheet to map detections → fields → connectors → retention.


    🌞 The Last Sun Rays…

    Hook answers:

    • Sentinel isn’t a dump truck for logs; it’s a tuned sensor grid.
    • More light (data) isn’t better if it blinds you; focused beams (use-cases) win.

    Your move:
    What one log source would you drop, filter, or archive tomorrow to improve both signal quality and cost—and what detection would stay intact after that change?

  • Chapter 7 – How Your Platform Health Suite Protects Outcomes, Not Just Logs

    Turning “Sentinel Noise” into an Executive Radar: How Your Platform Health Suite Protects Outcomes, Not Just Logs

    • Think of your platform like an airport: collectors are runways, AMA agents are ground crews, DCRs are flight plans, and analytic rules are the control tower. If any one falters, flights (events) stack up or vanish.
    • Or like a hospital: ingestion is triage, tables are departments, rules are diagnostic protocols, and audit is infection control. A delay at triage hides the real emergencies.
    • Or like a supply chain: collectors are loading docks, EPS is trucks-per-minute, DCRs are routing labels, and rules are QA checkpoints. Mislabel one box and the whole chain gets blind spots.

    This session shows executives how your components form one radar that tells them: Are we safe, is the telemetry flowing, and will detections fire when it matters?


    Why It’s Needed (Context)

    Security leaders don’t buy features; they buy assurance. Modern threats exploit blind spots: missing telemetry, delayed ingestion, disabled rules, or misconfigured agents. Your suite closes those gaps by:

    • Quantifying telemetry health (EPS, size, latency)
    • Surfacing blind spots (non-reporting devices, tables not ingesting)
    • Protecting detection integrity (analytic rule tampering, disabled rules)
    • Assuring platform reliability (Sentinel health, audit, connectors)

    Value to execs: fewer surprises, faster incident confidence, and measurable resilience. In plain terms: “Are we seeing what matters, fast enough, with rules that still work?”


    Core Concepts Explained Simply

    Below, each concept has: Technical Definition → Everyday Example → Technical Example (we’ll reuse the airport analogy).

1. All Log Collector Ingestion & EPS (events per second)
• Technical: Measures event throughput per collector to spot saturation/backpressure.
• Everyday: Runway landings per minute—too few or too many means trouble.
• Technical ex: 8K EPS baseline; spike to 20K EPS triggers auto-scale review.
2. Log Size by Log Collector
• Technical: Tracks daily/rolling log volume per collector for anomalies.
• Everyday: Cargo tonnage per runway day-over-day.
• Technical ex: 40% drop on Collector-03 flags upstream firewall change.
3. Abnormal Workspace Spikes & Dips
• Technical: Detects ingestion anomalies at the workspace level.
• Everyday: Airport sees an unexpected lull or surge in flights.
• Technical ex: z-score/seasonality anomaly on _LogManagement table.
4. Ingestion Delays
• Technical: Measures latency from source timestamp to ingestion time.
• Everyday: Planes circling because runways are jammed.
• Technical ex: P95 delay > 15 minutes = raise incident sev-2.
5. OOTB (out-of-the-box) Data Connector Monitor
• Technical: Checks health/config for native connectors.
• Everyday: Prebuilt jetways—are they powered and attached?
• Technical ex: Office 365 connector shows auth failure after token expiry.
6. Identify Warnings (Incident, Workspace)
• Technical: Aggregates SOC warnings across incidents and workspace health.
• Everyday: Tower alerts: “Runway lights flickering, storm inbound.”
• Technical ex: Sentinel “Data Collection partially degraded” alert surfaces.
7. Critical Devices Monitoring
• Technical: Ensures top-tier assets continuously report (DCs, firewalls, crown jewels).
• Everyday: VIP planes (organ transplants, heads of state) tracked end-to-end.
• Technical ex: Domain controller event gap >10 min triggers page.
8. Devices Not Reporting (Windows/Linux/Network)
• Technical: Detects endpoints missing expected heartbeat/events.
• Everyday: A parked plane went silent on the tarmac.
• Technical ex: Syslog source silent for 60 min → create problem record (a starter heartbeat query follows this list).
9. Sentinel Health & Audit Monitoring
• Technical: Internal service checks, API limits, configuration drift, audit events.
• Everyday: Airport power, radios, and control systems diagnostics.
• Technical ex: Audit log shows permission changes to Analytics blade.
10. Unhealthy AMA (Azure Monitor Agent) Agents
• Technical: Flags agent install/health/config failures.
• Everyday: Ground crew short-staffed or missing tools.
• Technical ex: AMA heartbeat present but data channels failing (DCR mismatch).
11. Data Collection Rule (DCR) Monitoring
• Technical: Validates DCR consistency and scope across resources.
• Everyday: Flight plans correctly applied to the right aircraft.
• Technical ex: New subnet lacks DCR mapping → 0 logs from that segment.
12. Unauthorized Modification of Use Case
• Technical: Detects tampering with detection rules (query edits, schedules).
• Everyday: Someone rewrote tower procedures without approval.
• Technical ex: KQL (Kusto Query Language) diff shows removed join on identity table.
13. New Log Collector Out of Intended List
• Technical: Flags newly registered collectors not in the approved inventory.
• Everyday: An unlisted aircraft lands without a flight plan.
• Technical ex: Unknown syslog IP starts forwarding—validate source & owner.
14. Collector Health (No Heartbeat)
• Technical: Collector process/host unavailable.
• Everyday: Runway lights off—no signals.
• Technical ex: VM down event correlates with EPS collapse.
15. Collector Health (No Logs)
• Technical: Collector up but not sending logs.
• Everyday: Runway open, but no planes using it.
• Technical ex: Ingest pipeline credentials expired; heartbeat OK, EPS 0.
16. Tables Not Ingesting Logs
• Technical: Detects schema-level gaps (e.g., SecurityEvent empty).
• Everyday: A department with no patients for hours—impossible.
• Technical ex: FirewallLogs table flatline after parser update.
17. Analytic Rule Disabled/Deleted
• Technical: Ensures detections remain active & intact.
• Everyday: Tower turned off weather radar.
• Technical ex: High-severity rule disabled by change window w/o approval.
18. Sentinel Health & Audit (Platform performance)
• Technical: (Aggregated view) Platform performance, limits, and governance.
• Everyday: Airport operations dashboard for execs.
• Technical ex: API throttle near limits during IR surge; scale-out advised.
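
To make one of these concrete: the “Devices Not Reporting” monitor can start as a heartbeat-gap query. A minimal sketch, assuming agents write the built-in Heartbeat table and that one hour of silence is your paging threshold:

    // Hosts whose agents have gone quiet for more than an hour.
    Heartbeat
    | summarize LastSeen = max(TimeGenerated) by Computer, OSType
    | where LastSeen < ago(1h)
    | order by LastSeen asc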

    Real-World Case Study

    Failure Case — Ransomware Quietly Gained Time

• Situation: EPS looked “normal” in daily totals, but P95 ingestion delay grew to 35–40 minutes after a network change. Analytic rules were intact, but they fired late.
• Impact: SOC saw lateral movement an hour after the fact. Restores needed; downtime cost and reputational hit.
• Lesson: Volume ≠ timeliness. Latency is a first-class SLO (service-level objective). The “Ingestion Delays” and “Abnormal Spikes & Dips” monitors would’ve caught this earlier.

    Success Case — Misconfigured DCR Contained in Minutes

    • Situation: A new Linux subnet went live; DCR Monitoring flagged no mapping. Devices Not Reporting confirmed silent hosts; Tables Not Ingesting showed flat SecurityEventLinux table.
    • Impact: Fix within 20 minutes; coverage gap avoided during a vendor compromise alert wave.
    • Lesson: Layered monitors (DCR + device heartbeat + table health) shrink MTTR (mean time to repair).

    Action Framework — Prevent → Detect → Respond

    Prevent

    • Standardize collector baselines (EPS, log size).
    • Enforce DCR-as-code with approvals; alert on drift.
    • Lock analytic rules (change control + audit).
    • Maintain approved collector inventory; block unknowns.

    Detect

• SLOs: EPS within ±25% of baseline; P95 delay < 15 min; table freshness < 10 min (an anomaly-scoring sketch follows this list).
    • Triangulate: device heartbeat + table freshness + rule health.
    • Prioritize critical devices and OOTB connectors with higher alert sensitivity.
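
The spikes/dips SLO maps naturally onto KQL’s series functions. A sketch, using Syslog volume as the example signal; the 1.5 anomaly threshold is an assumed starting point to tune:

    // Hourly volume as a time series, scored for spikes and dips.
    Syslog
    | make-series Events = count() default = 0
        on TimeGenerated from ago(14d) to now() step 1h
    | extend (Anomalies, Score, Baseline) = series_decompose_anomalies(Events, 1.5)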

    Respond

    • Playbooks:
      1. No Heartbeat: auto-scale or restart collector VM; reroute sources.
      2. No Logs: token refresh, pipeline test, sample event injection.
      3. DCR Drift: auto-rollback via Git; notify change owner.
      4. Rule Tampering: revert from versioned store; open P1; audit who/when.
    • Dashboards for execs: “Are we blind anywhere?” + “How fast are we seeing?” (MTTD/MTTR for telemetry issues).
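
For the “are we blind anywhere?” view, the platform’s own health events are a reasonable feed. A hedged sketch, assuming Sentinel health monitoring is enabled so the SentinelHealth table is populated (verify the schema in your workspace before relying on these columns):

    // Recent non-successful health events, grouped by Sentinel resource.
    SentinelHealth
    | where TimeGenerated > ago(1d)
    | where Status != "Success"
    | summarize Failures = count() by SentinelResourceName, SentinelResourceType, Status
    | order by Failures desc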

    Key Differences to Keep in Mind

    1. No Heartbeat vs No Logs — Down host vs live host with broken pipeline.
      Scenario: Heartbeat OK but EPS 0 → investigate tokens/parsers, not VM.
    2. Volume vs Timeliness — Total GB ≠ real-time visibility.
      Scenario: Daily volume steady but P95 delay 30 min → IR effectiveness drops.
    3. Device Health vs Table Freshness — Endpoints alive doesn’t mean data landed.
      Scenario: AMA OK; SecurityEvent table flat → DCR scope missing.
    4. Anomaly vs Planned Change — Spikes can be normal during patch night.
      Scenario: Annotate change windows to suppress false noise.
    5. Connector Status vs Detection Integrity — Data present doesn’t ensure rules run.
      Scenario: Rule disabled after tuning—coverage illusion.
    6. Approved vs Rogue Collectors — Inventory matters.
      Scenario: New IP starts sending logs → verify ownership before trusting data.

    (ASCII) Executive Dashboard Sketch

+---------------- Sentinel Health & Telemetry Radar -----------------+
| Ingestion SLOs       | Collectors           | Coverage             |
| EPS: 7.8k (↔)        | HB: 12/12 (✓)        | Critical: 48/50 (⚠)  |
| P95 Delay: 7m (✓)    | No Logs: 1 (⚠)       | Tables Fresh: 96%    |
| Spikes/Dips: OK      | Unknown: 0 (✓)       | DCR Drift: 0 (✓)     |
+----------------------------------------------+---------------------+
| Detection Integrity                          | Audit & Changes     |
| Rules Disabled: 0 (✓)  Tamper Attempts: 1 (⚠)| Priv Changes: 2     |
+----------------------------------------------+---------------------+
| Top Alerts: Ingestion Delay @ COL-03 (sev-2) | ETA fix: 10m        |
+--------------------------------------------------------------------+
    

    Summary Table

Concept | Definition | Everyday Example | Technical Example
--- | --- | --- | ---
All Log Collector Ingestion & EPS | Event throughput per collector | Landings/minute on a runway | Baseline 8K EPS; sustained 20K triggers scale
Log Size by Log Collector | Daily/rolling volume per collector | Cargo tonnage per runway | 40% drop on COL-03 after firewall change
Abnormal Workspace Spikes & Dips | Workspace-level ingestion anomalies | Airport-wide lull/surge | z-score anomaly on _LogManagement
Ingestion Delays | Source-to-ingest latency | Planes circling | P95 delay >15m = sev-2
OOTB Data Connector Monitor | Health of native connectors | Prebuilt jetways attached | O365 token expiry alarm
Identify Warnings (Incident, Workspace) | Aggregated SOC/platform warnings | Tower status alerts | Sentinel “partial degradation” surfaced
Critical Devices Monitoring | Coverage for crown-jewel assets | VIP flights tracked | DC event gap >10m page
Devices Not Reporting | Missing endpoint telemetry | Silent plane on tarmac | Syslog source 60m silent
Sentinel Health & Audit Monitoring | Internal checks & audit | Airport systems diagnostics | Permission change on Analytics blade
Unhealthy AMA Agents | Agent failure/misconfig | Ground crew missing tools | Heartbeat OK; channel fail
Data Collection Rule Monitoring | DCR consistency & scope | Correct flight plans | New subnet lacks DCR
Unauthorized Modification of Use Case | Rule tampering detection | Unapproved tower procedure | KQL diff shows removed join
New Log Collector Out of Intended List | Unapproved collectors | Unlisted aircraft lands | Unknown syslog IP sending
Collector Health (No Heartbeat) | Collector host down | Runway lights off | VM down + EPS collapse
Collector Health (No Logs) | Host up, no events sent | Open runway, no planes | Token/parsers expired
Tables Not Ingesting Logs | Schema/table freshness gap | Department with no patients | FirewallLogs flatline
Analytic Rule Disabled/Deleted | Detections turned off/removed | Weather radar off | High-sev rule disabled
Sentinel Health & Audit (Performance) | Aggregated platform performance | Airport ops dashboard | Near API throttle during IR

    What’s Next

    In the next post of this mini-series, we’ll go from health to outcomes: mapping telemetry SLOs to detection KPIs (true positive rate, MTTD/MTTR for security events), and how to automate executive scorecards that tie telemetry health → detection fidelity → business risk.


    🌞 The Last Sun Rays…

    Hook answers:

    • Airports, hospitals, supply chains—all fail the same way: silent delays and hidden blind spots. Your suite exposes both in real time and proves the control tower (Sentinel) is awake.
    • Executives get a single radar: Are we seeing the right data, fast enough, with rules that still work—and will tomorrow?

    Your move: If you could add one metric to your exec dashboard tomorrow, which would it be—P95 ingestion delay, critical device freshness, or rules integrity drift?

  • Chapter 1 — How NOT to Plan a Sentinel Deployment

    (Where security programs quietly fail before day one)


    1) Title + Hook

    Before we talk Sentinel, picture these everyday slip-ups that create invisible risk:

    • Analogy 1: Moving into a new house without labeling boxes.
      Everything’s technically there… but you can’t find what matters.
      Urgency becomes guesswork.
    • Analogy 2: Installing a home security system but forgetting which door each sensor protects.
      Your phone keeps saying “Sensor triggered!”
      Useful? Not if you don’t know where.
    • Analogy 3: Turning on notifications for every app on your phone.
      The constant pinging forces you to ignore everything — including the important ones.

    Security fails in the same quiet way: not dramatically, but by missing clarity, ownership, and context when you need them most.


    2) Why It’s Needed (Context)

    Most Sentinel deployments fail long before the first alert.

    Why? Because security tools don’t create security — structure and intent do.

    When planning lacks:

    • visibility into what exists,
    • clarity on who owns decisions,
    • defined purpose for each log collected,
    • disciplined priority-setting,
    • and least-privilege boundaries,

    …then Sentinel becomes a sophisticated storage device instead of a decision platform.

    The team gets data.
    But not direction.

    The result?
    A system that looks operational yet struggles to protect anything that matters.


    3) Core Concepts — Explained Simply Through “How NOT to Plan Sentinel”

Each anti-pattern includes:

• What fails
• A relatable everyday example
• A simple technical example

Each one subtly reinforces principles like classification, ownership, least privilege, and business alignment.

    A. No Asset Inventory → “Protect everything, understand nothing.”

    • What fails:
      No clear picture of systems, data, owners, or importance levels.
      Without classification, everything becomes equally urgent — and equally neglected.
    • Everyday Example:
      Trying to account for people after a fire drill without a guest list.
      “Is everyone out?” → “We think so.”
    • Technical Example:
      Sentinel fires an alert on Server-07, but:
      • Is it a payroll server?
      • A lab VM?
      • An abandoned machine from last year?
        No one knows.
        Triage becomes guesswork.

    B. No Operating Model → “Sentinel is live, but no one knows what to do.”

    • What fails:
      No defined responsibilities, escalation criteria, or decision authority.
      When everyone owns alerts, no one owns the outcome.
    • Everyday Example:
      “Back to office Monday.”
      No seating plan. No timings. No team norms.
      Everyone arrives, but no one functions well.
    • Technical Example:
      A high-severity incident lands in Sentinel.
      No assignment logic.
      Five analysts assume someone else will handle it.
      The clock keeps ticking.

    C. No Purpose Behind Data Collection → “Logs as hoarding, not protection.”

    • What fails:
      Logs are onboarded “just because,” not tied to risks, outcomes, or decisions.
      You get volume instead of value.
    • Everyday Example:
      Installing CCTV cameras all over the house but never mapping what each one covers.
      When something happens, the footage exists but insight doesn’t.
    • Technical Example:
      You ingest massive firewall logs (hundreds of GB/day)
      but have zero detections built on them.
      You’re buying storage, not reducing risk.

    D. Cost Cutting Without Classification → “Saving money by removing emergency exits.”

    • What fails:
      Budget reviews remove logs based on cost instead of importance.
      Critical data disappears because no one defined which logs protect essential functions.
    • Everyday Example:
      Removing emergency exit signs in a building to lower electricity bills.
      Looks fine until it matters.
    • Technical Example:
      To save money, identity logs are disabled.
      Result:
      Account takeover goes undetected — attackers blend in as normal users.

    E. No Standard RBAC → “Everyone gets the master key.”

    • What fails:
      Access is granted ad-hoc.
      High-risk permissions spread unintentionally.
      Integrity of the system erodes.
    • Everyday Example:
      A shared Google Sheet where everyone can edit.
      By week’s end, formulas break, data changes, and no one knows how.
    • Technical Example:
      An analyst modifies a rule meant for engineers.
      Suppression breaks.
      An alert storm floods the SOC for days.

    F. No Business Risk Mapping → “Treating all systems as equal.”

    • What fails:
      Machines are prioritized by technical severity, not business impact.
      You lose sight of what truly matters.
    • Everyday Example:
      Treating a broken coffee machine and a payroll outage as identical problems.
      Both create noise — only one creates consequences.
    • Technical Example:
      A medium-severity alert on the payroll server should outrank a high-severity alert on a test VM.
      Without mapping, Sentinel can’t tell the difference.

    4) Real-World Case Study

    Failure — “The Breach Hidden by Cost Savings”

    • Situation:
      A fintech ingested everything early on.
      When bills grew, they removed logs based purely on cost.
      Identity logs went first.
    • Impact:
      OAuth abuse persisted quietly for 11 days.
      No user-based anomalies, no geographic risk flags, no token alerts.
    • Governance Lesson (subtle):
      Decisions made without classification or purpose create blind spots attackers love.

    Success — “The Alert That Knew Its Importance”

    • Situation:
      A healthcare organization:
      • Classified assets
      • Mapped business services
      • Defined clear roles
      • Enforced least privilege
    • Impact:
      A lateral movement alert auto-tagged as reaching an EHR cluster (highest tier).
      Instantly escalated to P1.
      The right team responded. Containment in 30 minutes.
    • Governance Lesson (subtle):
      Clarity enables the right response at the right time.

    5) Action Framework — Prevent → Detect → Respond

    Prevent (Before Sentinel ingests anything)

    • Build a real asset list → with ownership and criticality.
    • Define roles, authority, and escalation logic.
    • Create data purpose contracts for every log source.
    • Classify telemetry into tiers (non-negotiable → optional).
    • Set least-privilege access upfront.
    • Tag systems with business impact levels.

    Detect (Once Sentinel is receiving data)

    • Write detections tied to risks and outcomes, not logs.
• Enrich each alert with context: owner, criticality, business service (a watchlist-join sketch follows this list).
    • Prioritize incidents using business impact × technical severity.
    • Build dashboards that reveal gaps:
      • Unmonitored key assets
      • Tier-0 coverage
      • Alert quality indicators
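
One way to wire in that context is a Sentinel watchlist acting as the asset truth layer. A sketch: _GetWatchlist and SearchKey are the real built-ins, while the watchlist name 'CrownJewels' and its Tier/Owner columns are assumptions for illustration:

    // Enrich failed logons with business criticality from a watchlist.
    SecurityEvent
    | where EventID == 4625
    | join kind=inner (
        _GetWatchlist('CrownJewels')
        | project HostName = tostring(SearchKey), Tier, Owner
      ) on $left.Computer == $right.HostName
    | project TimeGenerated, Computer, Account, Tier, Owner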

    Respond

    • Automate responses for critical systems.
    • Auto-route based on business impact and role.
    • After each incident, refine the inventory, rules, and purpose contracts.
    • Track metrics that matter (time to detect, time to respond, false-positive trend, cost vs value).
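
Those response metrics can start from the built-in SecurityIncident table. A minimal sketch of time-to-close, deduping to each incident’s latest record:

    // Average and P90 hours from incident creation to closure, last 30 days.
    SecurityIncident
    | where TimeGenerated > ago(30d)
    | summarize arg_max(LastModifiedTime, Status, CreatedTime, ClosedTime) by IncidentNumber
    | where Status == "Closed"
    | extend HoursToClose = (ClosedTime - CreatedTime) / 1h
    | summarize AvgHours = round(avg(HoursToClose), 1), P90Hours = round(percentile(HoursToClose, 90), 1)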

    ASCII Flow

    [Classified Assets] 
          ↓
    [Purpose-Based Data Collection]
          ↓
    [Clear Roles + Least Privilege]
          ↓
          Sentinel
          ↓
    [Context-Enriched, Business-Aware Alerts]
          ↓
    [Correct Team → Fast Response → Reduced Risk]
    

    6) Key Differences to Keep in Mind

    1. Severity vs Priority
      Severity = how loud the alert looks
      Priority = how important the underlying asset is
      Scenario:
      High-severity alert on a lab VM < Medium-severity alert on payroll server.
    2. Logs vs Insight
      More logs don’t equal more protection — relevance does.
      Scenario:
      1 TB of firewall logs with no detections is noise, not security.
    3. Equal Coverage vs Informed Coverage
      You cannot treat every machine the same.
      Classification drives protection.
      Scenario:
      Crown-jewel systems get 24×7 watch.
      Test environments get baseline visibility.

    7) Summary Table

Concept (Failure Pattern) | What It Means | Everyday Example | Simple Technical Example
--- | --- | --- | ---
No Asset Inventory | No clarity on what’s important | Fire drill without guest list | Alerts lack owner/criticality
No Operating Model | Undefined roles & escalation | “Office Monday” with no plan | High-severity incidents unassigned
Purpose-Free Logs | Collecting data without outcome | CCTV installed everywhere | Logs ingested but unused
Blind Cost Cutting | Removing logs that matter | Removing exit signs | Disabling identity logs
No Standard RBAC | Over-permissioning | Shared sheet with full access | Analysts editing detection rules
No Risk Mapping | All systems treated equal | Coffee vs payroll outage | Test VM alert = payroll alert

    8) What’s Next

    Chapter 2 — Building the Asset Truth Layer: The Security Foundation Sentinel Depends On
    We’ll design classification, ownership, metadata enrichment, and automated freshness — the bedrock of meaningful detection.

    9) 🌞 The Last Sun Rays…

    Remember our everyday mistakes?

    • Labeled boxes turn chaos into clarity.
    • Door sensors only help when you know which door they belong to.
    • Notifications are useful only when prioritized.

    Security works the same way:
    Clarity, classification, and accountability reduce risk long before technology does.

  • IAM Blog Series – Part 7: AuthN vs AuthZ on the Internal Network

    Hook: Picture your network as an airport. What guards it: boarding passes, security lanes, or staff-only doors?

    • Kerberos = boarding pass system (one pass, many gates).
    • RADIUS = passenger security lane (get into the secure area).
    • TACACS+ = staff-only doors (crew actions checked and recorded).

    Why It’s Needed (Context)

    Modern networks are crowded airports: many people (users), many gates (apps), and busy back rooms (devices).
    AAA—Authentication, Authorization, Accounting—keeps order: who gets in, what they can do, and what gets logged. Strong AAA stops intruders, limits damage, and proves what happened.


    Core Concepts Explained Simply

    Kerberos — SSO + Tickets + KDC

    • Technical definition: Ticket-based login managed by a KDC (Key Distribution Center). You sign in once, get a TGT (Ticket-Granting Ticket), then request service tickets for each app—no more passwords.
    • Airport example: Check in at the airline desk, get a boarding pass, use it at multiple gates and lounges.
    • Technical example: User logs into Active Directory, then reaches file shares and databases using tickets—no extra prompts.

    RADIUS — Network Access + UDP + Harden with TLS

    • Technical definition: Central AAA for VPN/Wi-Fi/802.1X. Usually over UDP/1812–1813. Legacy RADIUS only hides the password; fix this with EAP-TLS (certificates) and/or RadSec (RADIUS over TLS). Avoid MSCHAPv2.
    • Airport example: Passenger security lane—fast check to enter the secure side.
    • Technical example: VPN device asks RADIUS to verify a user’s certificate (EAP-TLS) and assign policy (e.g., VLAN).

    TACACS+ — Device Admin + TCP + Full Encryption

    • Technical definition: AAA for router/switch/firewall admin over TCP/49 with full message encryption and per-command authorization + logging.
    • Airport example: Staff-only doors—every entry is checked; tasks allowed by role; all actions recorded.
• Technical example: Engineer SSHs to a switch; TACACS+ approves the identity and authorizes each command (allow show, deny conf t), logging everything.

    Real-World Case Study

    Failure (RADIUS used for admin):

    • Situation: Company used legacy RADIUS (no TLS, shared secrets reused) for Wi-Fi and device admin.
• Impact: An attacker inside the network observed RADIUS details and reached management networks. No per-command logs existed.
    • Lesson: Keep RADIUS for access (VPN/Wi-Fi) and harden it (EAP-TLS/RadSec). Use TACACS+ for admin.

    Success (right tool, right zone):

    • Setup: Kerberos for app SSO; RADIUS + EAP-TLS (or RadSec) for Wi-Fi/VPN; TACACS+ for device admin. Logs to SIEM.
    • Result: Stolen helpdesk login triggered TACACS+ command denies and clear audit. Fast containment.
    • Lesson: Split duties: Kerberos (apps), RADIUS (access), TACACS+ (admin).

    Action Framework — Prevent → Detect → Respond

    Prevent

    • Kerberos: Use AES; disable RC4; NTP time sync; short ticket lifetimes; clean SPNs.
    • RADIUS: Enforce EAP-TLS; prefer RadSec (or IPsec/DTLS); unique shared secrets; allow-list NAS clients.
    • TACACS+: Put on management network; require MFA; define roles; per-command policies; send logs to SIEM.

    Detect

• Kerberos: Spikes in TGT/TGS failures; weird SPN requests; time-skew errors (a starter query follows this list).
    • RADIUS: Access-Reject storms; unknown NAS; EAP or TLS (RadSec) errors.
    • TACACS+: Command-deny spikes; sudden privilege jumps; commands outside change windows.
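
If your domain controllers feed a SIEM, the Kerberos item above can start as one query. A sketch over the standard SecurityEvent table (event 4771 = Kerberos pre-authentication failed; the threshold is an assumed starting point):

    // Accounts with bursts of Kerberos pre-auth failures in the last hour.
    SecurityEvent
    | where TimeGenerated > ago(1h)
    | where EventID == 4771
    | summarize Failures = count() by TargetUserName, Computer
    | where Failures > 20
    | order by Failures desc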

    Respond

    • Kerberos: Purge tickets; disable accounts; fix SPNs/time; review delegation.
    • RADIUS: Quarantine bad NAS; rotate secrets; enforce EAP-TLS/RadSec.
    • TACACS+: Freeze risky roles; pull command logs; revert configs; review with change control.

    Key Differences to Keep in Mind

    1. Where used: Kerberos = gates/apps; RADIUS = entering airport; TACACS+ = staff doors.
    2. Transport: RADIUS = UDP/1812–1813 (optionally RadSec/TLS); TACACS+ = TCP/49; Kerberos = ticket exchanges.
    3. Encryption: Kerberos = tickets protected; RADIUS = password only unless EAP-TLS/RadSec; TACACS+ = full payload.
    4. Authorization: Kerberos = app decides; RADIUS = session attributes; TACACS+ = per-command.
    5. Common pitfalls: Kerberos = clock/SPN issues; RADIUS = MSCHAPv2, reused secrets, no TLS; TACACS+ = flat “admin-all” roles, missing logs.

    Summary Table

Concept | Definition | Airport Example | Technical Example
--- | --- | --- | ---
Kerberos | Ticket-based SSO via KDC; TGT + service tickets. | One boarding pass, many gates. | AD login → tickets to SMB/SQL.
RADIUS | AAA for VPN/Wi-Fi over UDP; use EAP-TLS/RadSec; avoid MSCHAPv2. | Passenger security lane. | VPN checks cert with RADIUS; policy assigned.
TACACS+ | AAA for device admin over TCP/49; full encryption; per-command control. | Staff-only doors with action logs. | Switch allows show, denies conf t, logs all.

    Visual: Airport Decision Tree

                     What are you securing?
                          /             \
                End-user/App SSO     Network & Device
                      |                 /         \
                  KERBEROS        Access (VPN/Wi-Fi)   Admin (CLI)
                                      RADIUS           TACACS+
                                 UDP/1812–1813 + TLS     TCP/49
    

    What’s Next

    “802.1X Made Simple: Rolling Out EAP-TLS (and RadSec) Without Drama.”
    We’ll cover cert automation, common supplicant issues, and clean controller configs.


    🌞 The Last Sun Rays…

    • Boarding passes moving you between gates = Kerberos.
    • Security lanes letting you into the airside = RADIUS (use EAP-TLS/RadSec).
    • Staff-only doors with full checks = TACACS+.

KPI quick targets: Kerberos TGS failures < 0.5%; RADIUS reject-rate alert at > 5% per 15 min; TACACS+ command denies baselined per role (alert on deviations from baseline).

    Reflection: Which single metric would make you catch trouble fastest tomorrow—Kerberos failures, RADIUS rejects, or TACACS+ command denies?

  • IAM Blog Series – Part 6: AuthN vs AuthZ on the Internet


    1) Title + Hook

    How “Sign in with Google” Works: The Airport Badge Way

    • Ever clicked “Sign in with Google” and wondered what’s happening?
    • Imagine every app is a locked door at the airport. You don’t want a new badge for every door!
    • Let’s see how Google helps you get in, fast and safe.

    2) Why It’s Needed (Context)

    At a big airport, showing your ID at every single door is slow and tiring.
    It’s much better to have one trusted badge that lets you into the rooms you need.
    Apps want the same thing: they want to make sure it’s really you, but don’t want to store your password.
    That’s why they trust Google to give you a “badge” to get you in.


    3) Core Concepts Explained Simply

    SSO / FIM (Single Sign-On / Federated Identity)

    • What it means: Use one badge to open many doors.
    • Airport: Your airport badge from security lets you into the café, baggage room, and lounge.
    • Apps: Google gives you a badge. Canva, Spotify, and others let you in because they trust Google’s badge.

    SAML – The “Paper Note” World

    • What it means: Get a paper note with a stamp.
    • Airport: Security writes a note, stamps it, and gives it to you. The door guard lets you in if the note is stamped.
    • Apps: Google gives a digital “letter” (SAML assertion) to the app. The app checks the stamp (signature) and lets you in.

    OAuth – The “Valet Pass” World

    • What it means: Get a special pass for one room.
    • Airport: You get a pass to go into just the cafeteria—not everywhere else.
    • Apps: Canva asks Google for a pass to see your Drive files. The pass only works for those files.

    OIDC – The “Photo Badge” World

    • What it means: Get a badge with your photo and name.
    • Airport: Security gives you a badge that shows your face and name, and what places you’re allowed.
    • Apps: Spotify asks Google for a badge with your info. The app knows it’s you and what you can do.

    Visual: Airport Badge Stack

             [SSO]    One badge for many doors
               |
            [SAML]   Paper note with stamp
               |
            [OAuth]  Special pass for one room
               |
            [OIDC]   Photo badge with name
    

    4) Real-World Case Study

    Bad Example:

    • An airport let anyone in if they had a paper note, but they didn’t check the name or number.
    • Someone copied a note and got into places they shouldn’t.
    • Lesson: Always check the photo, name, and where the badge is allowed!

    Good Example:

    • Another airport used photo badges and checked them at every door.
    • If someone lost a badge, security turned it off fast.
    • Lesson: Photo badges with checks keep things safe.

    5) What To Do: Prevent → Detect → Respond

    • Prevent:
      • Use photo badges, not paper notes.
      • Only give out passes for what people really need.
      • Always check names and photos.
    • Detect:
• Watch for people trying old or fake badges (see the small query after this list).
      • Get alerts if someone tries to open the wrong door.
    • Respond:
      • Turn off lost or fake badges right away.
      • Tell all guards if something weird happens.
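
If your “airport” happens to be Microsoft Entra ID feeding a SIEM, watching for bad badges can start as one small query. A sketch over the standard SigninLogs table (ResultType "0" means success; the threshold of 10 is just a made-up starting point):

    // People whose badges keep getting rejected at the doors (failed sign-ins).
    SigninLogs
    | where TimeGenerated > ago(1d)
    | where ResultType != "0"
    | summarize FailedTries = count() by UserPrincipalName, AppDisplayName
    | where FailedTries > 10
    | order by FailedTries desc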

    6) Key Differences To Remember

Badge Type | What It Does | Airport Example | Best For
--- | --- | --- | ---
SAML | Lets you in with a note | Stamped paper note | Old systems
OAuth | Lets you in for one room | Special pass | Reading files, APIs
OIDC | Shows who you are + access | Photo badge with name | Logging in, web & mobile

    7) Quick Summary Table

Concept | Simple Meaning | Airport Example | When To Use
--- | --- | --- | ---
SSO | One badge, many doors | Airport badge | Login everywhere
SAML | Paper note, stamped | Stamped note | Older apps
OAuth | Special pass, one room | Pass for cafeteria | Apps reading data
OIDC | Photo badge, name | Badge with photo | Logging in as you

    8) What’s Next

    Next: What’s inside your photo badge? How do apps check if your badge is real or fake?


    9) 🌞 The Last Sun Rays…

    “Sign in with Google” is like getting a badge from airport security.
    Apps (doors) trust Google’s badge—not your password.
    Some badges let you in, others also show who you are.
    If you can explain that, you’re ready for anything!

    Your turn:
    If you ran the airport, what’s the first rule you’d give to your guards about checking badges?