Chapter 11.10
Cyber-Physical & Destructive Attacks on OT/Facility Systems
The control plane that keeps an AI factory alive — BMS, EPMS, cooling controllers, the power-cap and firmware layers — is also the shortest path to destroying it, because a single forged setpoint or synchronized load step can do in seconds what a kinetic strike needs explosives to do.
What you'll decide here
- Whether your OT estate (BMS, EPMS, SCADA, CDU controllers, BESS management, the power-cap/firmware planes) is genuinely segmented from IT per IEC 62443 zones-and-conduits, or merely firewalled on paper while exposed to the internet through a vendor jump-host nobody owns.
- Which destructive actions are physically possible from the control plane — a forced synchronized load step, CDU disablement into thermal runaway, BESS-induced runaway, GPU-bricking firmware — and which of those a compromised controller can actually command versus which a hardwired interlock vetoes.
- Where you draw the safety-instrumented-system (SIS) boundary: which trips are independent, hardwired, and physically incapable of being overridden by software, and which are 'soft' and therefore in the attacker's reach.
- Whether each Appendix-F failure mode is treated as dual-use — designed against the random fault AND the attacker who induces that exact fault deliberately, at the worst possible moment, across many units at once.
- Who owns OT detection and response — a converged cyber-physical SOC with the escalation trigger wired in, or two teams (facilities and security) who each assume the other is watching the controllers.
Most of Part 11 defends the information in the building: model weights, tenant data, credentials. This chapter defends the building itself. An AI factory is a cyber-physical system: tens of megawatts of synchronized silicon sitting on a control plane of programmable logic controllers, building-management and electrical-power-management systems, coolant-distribution-unit firmware, battery-management systems, and a power-capping layer that the GPUs themselves obey. Every one of those is a computer, and every one is reachable by some path. Unlike a data exfiltration, which is survivable in the sense that the facility keeps running, an attack on the OT plane can physically destroy the asset: melt cold plates, trip the grid interconnection, vent a battery rack, or brick a fleet of accelerators with a firmware push. The blast radius is the whole campus, and the recovery time is measured in months of long-lead equipment, not hours of restore-from-backup.
This chapter names the OT assets and their security levels; walks the specific destructive primitives the control plane exposes and what each one costs; draws the segmentation boundary (the Purdue model and IEC 62443 zones-and-conduits); and makes the architectural case that the only thing standing between a compromised control plane and a destroyed facility is a safety-instrumented system that software cannot override. Each Appendix-F failure mode is dual-use: the same transient you engineer against as a random fault is also the attacker's payload, and the attacker picks the timing.
The OT/ICS threat model: what is actually reachable
Start by enumerating the control planes, because each is a distinct attack surface with a distinct destructive payload. The BMS (building-management system) owns cooling setpoints, valve and pump commands, air handlers, leak detection, and fire/EPO interlocks. The EPMS (electrical-power-management system) owns breakers, transfer switches, generator start/stop, UPS modes, and metering. SCADA/PLC layers sit under both, executing the actual I/O. CDU controllers regulate the technology-cooling loop — flow, inlet temperature, secondary-loop pressure — that keeps a 130-kW liquid-cooled rack from cooking itself. BESS management (battery-management systems and their EMS) governs charge/discharge, cell balancing, and the thermal-runaway protections on grid-scale storage. And the power-cap and firmware planes — the GPU/BMC layer that throttles and updates accelerators at fleet scale — are the newest and least-defended of all, because they were built for performance management, not as a weapon.
The uncomfortable empirical baseline: this estate is far more exposed than operators believe. Claroty's Team82 analyzed roughly 467,000 building-management devices across 529 organizations in 2025 and found that 75% of organizations ran BMS devices affected by known-exploited vulnerabilities (KEVs), with 51% running devices that carry ransomware-linked KEVs and are also insecurely connected to the internet (Claroty Team82, State of CPS Security 2025: BMS Exposures). Data centers were called out by name as the asset-intensive worst case. The reach is not theoretical: in a typical organization, the controllers governing cooling and power are internet-adjacent and exploitable today.
The power-cap and firmware planes as a weapon
This is the part of the threat model unique to AI factories, and it is where the destructive primitives are most violent. Four are worth naming explicitly, each with its downstream cost; the summary table that follows them also lists a fifth, EPO/breaker mis-operation.
The forced synchronized load step → grid trip. A gigawatt-class training cluster already swings load by hundreds of megawatts when a job starts, stops, or stalls — that is the random-fault problem Chapter 4.5 engineers ride-through against. Now make it the payload. An attacker who controls the power-cap plane (or simply the scheduler) can command thousands of racks to drop or pick up load simultaneously, deliberately, at the moment of peak grid stress. NERC has already flagged the non-malicious version as a reliability crisis: in the July 2024 Northern Virginia incident, a 230-kV fault and reclosing sequence dropped roughly 1,500 MW of data-center load near-simultaneously (about 1,260 MW stayed off) over an ~82-second window — serious enough to prompt NERC's rare Level 3 alert (NERC; Utility Dive). Weaponized, the same primitive is a tool to trip the interconnection or destabilize the local grid, and the facility's own protection relays may disconnect it — taking down the campus to save the grid. The downstream cost is not a reboot; it is a black-start and a damaged relationship with the utility that holds your interconnection.
CDU disablement → thermal runaway. A GB200 NVL72 rack rejects roughly 115 kW into its liquid loop and throttles — or trips on over-temperature — within seconds to tens of seconds of losing flow, because there is almost no chilled-water thermal inertia left in a direct-to-chip design (Chapter 5.12). An attacker who stops the CDU pumps, closes a facility-water valve, or falsifies the inlet-temperature reading so the controller never opens the valve, drives the rack straight into thermal runaway. Done across a hall, that takes cold plates and silicon past their thermal limits, with a recovery bounded by component lead times.
BESS-induced runaway. Grid-scale and UPS-class battery storage protects against thermal runaway through the BMS: cell balancing, over-temperature trips, and isolation. Compromise that management layer and you can disable the very protection that prevents a runaway, force an over-charge or over-discharge, or suppress the thermal alarm — turning the energy-storage asset into an ignition source inside the building.
GPU-bricking via malicious firmware. The fleet firmware-update path (BMC, GPU VBIOS, NVLink-switch firmware) is a fleet-scale destructive primitive: a single malicious image, pushed through the legitimate update channel, can render thousands of accelerators non-functional or subtly misbehaving. This is why firmware integrity (signed images, a hardware root of trust, measured boot) is not a 'nice to have' but the gate on the most expensive single payload in the building. The canonical treatment of that integrity chain is in Chapter 11.4; here it is one of four ways the control plane can destroy the asset.
| Destructive primitive | Control plane abused | Physical consequence | Recovery scale | Independent veto (SIS / hardwired) |
|---|---|---|---|---|
| Forced synchronized load step | Power-cap plane / scheduler / EPMS | Grid trip; interconnection disconnect; possible black-start | Hours to days (grid + black-start) | Utility protection relays; on-site ride-through (BBU/supercap) sized for the swing |
| CDU disablement / falsified inlet temp | CDU controller / BMS valve & pump I/O | Thermal runaway of liquid-cooled racks; silicon damage | Weeks to months (cold plates, GPUs) | Hardwired high-temp trip and flow-loss interlock independent of the BMS |
| BESS over-charge / alarm suppression | BESS BMS / energy-management system | Battery thermal runaway; fire; structural loss | Months (rack + structure + remediation) | Independent cell-level protection and gas/thermal detection on a separate logic solver |
| Malicious firmware push | BMC / GPU VBIOS / switch firmware plane | Fleet-scale bricking or covert mis-operation | Weeks to months (re-image or RMA fleet) | Hardware root of trust; signed measured boot; firmware governance (11.4) |
| EPO / breaker mis-operation | EPMS / SCADA | Unplanned full or partial campus shutdown | Hours (restart) to days (equipment stress) | Hardwired EPO logic; mechanical breaker interlocks; key-locked overrides |
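To make the 'signed measured boot' veto in the malicious-firmware row concrete, the sketch below shows the hash-chain idea behind a measured boot: each stage's image is hashed into a running register, so any altered image changes the final value. It uses the TPM-style extend rule, new = SHA-256(old || SHA-256(image)). The image bytes and names are placeholders, and the sketch deliberately omits signature verification and the hardware root of trust that Chapter 11.4 covers.

```python
import hashlib


def extend(register: bytes, image: bytes) -> bytes:
    """TPM-style extend: new = SHA-256(old || SHA-256(image))."""
    return hashlib.sha256(register + hashlib.sha256(image).digest()).digest()


def measure(boot_chain) -> bytes:
    """Fold every image in boot order into one 32-byte measurement."""
    register = bytes(32)  # the register starts as all zeros
    for image in boot_chain:
        register = extend(register, image)
    return register


# Placeholder image contents (hypothetical), in boot order.
GOOD_CHAIN = [b"bmc-fw-v1", b"gpu-vbios-v1", b"nvswitch-fw-v1"]
expected = measure(GOOD_CHAIN)

# One swapped image anywhere in the chain changes the final measurement.
tampered = measure([GOOD_CHAIN[0], b"gpu-vbios-evil", GOOD_CHAIN[2]])

print(measure(GOOD_CHAIN) == expected)  # True
print(tampered == expected)             # False
```

A verifier that holds `expected` out of the update channel's reach can refuse to admit a node whose reported measurement differs, which is the property the table's veto column depends on.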
OT/IT segmentation: the Purdue model and IEC 62443
The single highest-leverage control against all of the above is the oldest one: keep the control plane off the corporate network and off the internet. The Purdue model stratifies the estate into levels: physical process (0), basic control / PLCs (1), supervisory / SCADA-HMI (2), operations / MES (3), then the DMZ (3.5) and enterprise IT (4–5). Its principle is that traffic crosses level boundaries only through controlled, inspected conduits, never directly. IEC 62443 operationalizes this as zones and conduits: you partition assets into security zones, define every conduit between them, and assign each zone a target Security Level. The standard's SL scale is defined by attacker capability: SL1 casual/coincidental, SL2 intentional with simple means, SL3 sophisticated with moderate resources, SL4 sophisticated with extended resources (i.e. a nation-state) (ISA/IEC 62443). The destructive primitives above are SL3–SL4 work; a BMS reachable from the internet cannot credibly claim even SL2.
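The zones-and-conduits idea can be sketched as a small policy check. This is a minimal illustration, not an IEC 62443 tool: the zone names, the `Conduit.brokered` flag, and the two rules (no conduit may straddle the DMZ, and every conduit touching the DMZ must be brokered) are assumptions chosen to mirror the paragraph above.

```python
from dataclasses import dataclass

DMZ_LEVEL = 3.5  # Purdue level of the OT/IT DMZ, as in the text


@dataclass(frozen=True)
class Zone:
    name: str
    level: float    # Purdue level (0-5; 3.5 is the DMZ)
    target_sl: int  # IEC 62443 target SL, 1-4 (recorded, not checked here)


@dataclass(frozen=True)
class Conduit:
    src: str
    dst: str
    brokered: bool  # True if traffic passes an inspected, recorded broker


def conduit_violations(zones, conduits):
    """Return findings for conduits that break the two zoning rules."""
    by_name = {z.name: z for z in zones}
    findings = []
    for c in conduits:
        low, high = sorted((by_name[c.src].level, by_name[c.dst].level))
        if low < DMZ_LEVEL < high:
            # One end in OT, the other in IT: the conduit skips the DMZ.
            findings.append(f"{c.src} -> {c.dst}: bypasses the DMZ")
        elif DMZ_LEVEL in (low, high) and low != high and not c.brokered:
            # Anything entering or leaving the DMZ must use the broker.
            findings.append(f"{c.src} -> {c.dst}: unbrokered DMZ conduit")
    return findings


ZONES = [
    Zone("cdu_controllers", 1, 3),
    Zone("scada_hmi", 2, 3),
    Zone("ot_dmz", 3.5, 3),
    Zone("enterprise_it", 4, 2),
]
CONDUITS = [
    Conduit("scada_hmi", "cdu_controllers", brokered=False),  # inside OT
    Conduit("ot_dmz", "scada_hmi", brokered=True),
    Conduit("enterprise_it", "ot_dmz", brokered=True),
    Conduit("enterprise_it", "cdu_controllers", brokered=False),  # paper firewall
]

for finding in conduit_violations(ZONES, CONDUITS):
    print(finding)  # enterprise_it -> cdu_controllers: bypasses the DMZ
```

Running a check like this against the real conduit inventory, not the intended one, is what separates a segmented estate from one that is only firewalled on paper.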
The fork here is real and recurring. Convergence — running OT on the same IP fabric as IT, with shared identity and shared monitoring — is operationally attractive: one network team, one observability stack, remote vendor access for the cooling and electrical contractors who actually maintain the gear. Isolation — an air-gapped or diode-separated OT network with its own identity, jump-hosts, and one-way telemetry export — is far harder to operate and far harder to compromise. The honest answer for an AI campus is neither extreme: a hardened DMZ with brokered, recorded, time-boxed vendor access and unidirectional telemetry out, sized so that an IT compromise cannot reach a Level-1 controller and a controller cannot reach the internet. The microsegmentation and zero-trust mechanics that enforce this are the subject of Chapter 11.7; this chapter's contribution is the consequence of getting it wrong, which is the destructive-primitive table above.
| Dimension | Full convergence (OT on IT fabric) | Brokered DMZ (recommended) | Hard isolation (air-gap / diode) |
|---|---|---|---|
| Attack surface | Largest — IT compromise reaches PLCs | Bounded — conduit + broker only | Smallest — no inbound path |
| Vendor/remote maintenance | Trivial but dangerous | Recorded, time-boxed, brokered | On-site or sneakernet only |
| Telemetry / observability | Native, unified SOC view | Unidirectional export to SOC | Manual or one-way diode export |
| Operational cost | Lowest | Moderate | Highest |
| IEC 62443 SL ceiling realistically achievable | SL1–SL2 | SL3 | SL3–SL4 |
Safety-instrumented systems: the control software cannot be allowed to win
Here is the architectural thesis of the chapter, and the one design rule that survives a fully compromised control plane: the last line of defense against physical destruction must be independent of, and un-overridable by, the system that can be hacked. A safety-instrumented system (SIS) is a separate logic solver with its own sensors and its own final elements, whose only job is to drive the process to a safe state when a critical parameter crosses a hardwired limit — coolant flow lost, rack temperature past the trip point, battery cell over-temperature, ground fault. It does not ask the BMS for permission. A correctly designed SIS is physically incapable of being told to stand down by software, because its trip logic is hardwired and its overrides are mechanical and key-locked.
The 2017 Triton/Trisis attack is why this matters and why it is hard. That campaign specifically targeted a Triconex SIS — the safety layer of last resort at a petrochemical plant — attempting to reprogram it so a subsequent unsafe process state would not trip. It only failed (and caused a shutdown rather than a catastrophe) because of a flaw in the attacker's payload (Dragos; FireEye/Mandiant). The lesson translated to an AI factory: if your high-temperature CDU trip, your BESS thermal-runaway isolation, or your EPO is implemented in the BMS/PLC that the attacker now owns, you have no last line of defense at all. The trip must live on independent hardware, on an isolated conduit (or no network at all), with its setpoints in firmware that the operational control plane cannot write. The cost of this independence is real — duplicated sensors, a separate logic solver, periodic proof-testing — and it is the cheapest insurance in the building relative to the assets it protects.
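One way to apply this during design review is to inventory every destruction-class trip and test it against the independence criteria above. The sketch below is illustrative only: the field names, the `HACKABLE_PLANES` labels, and the example trips are assumptions, not a schema from any standard or vendor tool.

```python
from dataclasses import dataclass

# Planes the chapter treats as reachable by an attacker (assumed labels).
HACKABLE_PLANES = frozenset({"bms", "epms", "plc", "scada"})


@dataclass(frozen=True)
class Trip:
    name: str
    logic_solver: str            # device that evaluates the trip condition
    dedicated_sensors: bool      # True if sensors are not shared with control
    setpoint_writers: frozenset  # planes able to change the trip setpoint
    override: str                # "key-locked" or "software"


def is_independent(trip: Trip) -> bool:
    """True only if no hackable plane can evaluate, retune, or bypass the trip."""
    return (
        trip.logic_solver not in HACKABLE_PLANES
        and trip.dedicated_sensors
        and not (trip.setpoint_writers & HACKABLE_PLANES)
        and trip.override == "key-locked"
    )


# Hypothetical inventory entries for illustration.
TRIPS = [
    Trip("cdu_high_temp", "sis_solver", True, frozenset(), "key-locked"),
    Trip("cdu_flow_loss", "plc", False, frozenset({"bms"}), "software"),
    Trip("bess_isolation", "sis_solver", True, frozenset({"bms"}), "key-locked"),
]

soft_trips = [t.name for t in TRIPS if not is_independent(t)]
print(soft_trips)  # ['cdu_flow_loss', 'bess_isolation']
```

Note that `bess_isolation` fails even though it runs on a separate logic solver: a setpoint the BMS can rewrite is still within the attacker's reach.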
Every Appendix-F failure mode is dual-use
The reframing that should run through your entire FMEA: every failure mode in the Appendix-F catalog has two authors. One is randomness — a pump fails, a valve sticks, a relay chatters, a load step happens because a job crashed. The other is an attacker who induces that exact failure deliberately, with three advantages the random fault does not have: timing (at peak grid stress, during a maintenance window, when redundancy is degraded), correlation (across many units at once, defeating the N+1 that assumed independent failures), and persistence (re-triggering as fast as you recover).
This breaks a core assumption of classical reliability engineering. Redundancy math (Chapter 0.5, and quantified in Part 12) rests on failures being statistically independent — that is what makes N+1 meaningful. A cyber-physical attacker deliberately violates independence: a single compromised controller can trip every CDU in a hall at once, and the 'redundant' second pump on the same compromised PLC is no redundancy at all. The design response is that fault domains must also be security domains: the controller that can fail unit A must not be able to reach unit B, or your redundancy is a single shared point of compromise.
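The rule that fault domains must also be security domains can be checked mechanically. The sketch below assumes a hypothetical inventory of which controller can command which unit and flags any controller that reaches more than one member of the same redundancy group; all names are illustrative.

```python
def shared_points_of_compromise(controls, redundancy_groups):
    """Find controllers that can command more than one member of a group.

    controls: controller name -> set of units it can command
    redundancy_groups: group name -> set of units meant to back each other up
    """
    findings = {}
    for group, members in redundancy_groups.items():
        for controller, units in controls.items():
            reached = members & units
            if len(reached) > 1:
                findings[(group, controller)] = sorted(reached)
    return findings


# Hypothetical hall: CDU A has both pumps on one PLC, CDU B splits them.
CONTROLS = {
    "plc-1": {"cdu-a/pump-1", "cdu-a/pump-2"},
    "plc-2": {"cdu-b/pump-1"},
    "plc-3": {"cdu-b/pump-2"},
}
GROUPS = {
    "cdu-a pumps": {"cdu-a/pump-1", "cdu-a/pump-2"},
    "cdu-b pumps": {"cdu-b/pump-1", "cdu-b/pump-2"},
}

print(shared_points_of_compromise(CONTROLS, GROUPS))
# {('cdu-a pumps', 'plc-1'): ['cdu-a/pump-1', 'cdu-a/pump-2']}
```

An empty result does not prove independence, since it only covers the command paths listed in the inventory, but a non-empty one identifies redundancy that a single compromise defeats.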
Deep dive: worked example — the worst-timed attack on a 100 MW liquid-cooled hall
Walk the chain an SL3+ adversary would actually run, to see where each control either holds or fails.
Step 1: access. The attacker reaches the OT DMZ through a contractor's jump-host left standing after a cooling-vendor maintenance window (the FrostyGoop pattern: a forgotten edge device, not a zero-day). If the DMZ is brokered and time-boxed, the jump-host is gone and this step fails; if OT is converged with IT, an earlier IT phishing foothold already put them here.
Step 2: reconnaissance. Modbus and classic BACnet carry no authentication, so the attacker maps the CDU controllers, chiller PLCs, and EPMS without exploiting anything; they simply read.
Step 3: the payload, timed. They wait for a maintenance window when the hall is on N (no spare cooling unit), then simultaneously falsify inlet-temperature readings on every CDU so the controllers hold their valves closed, and stop the pumps. With almost no chilled-water inertia, racks cross their thermal trip in seconds. This is where the SIS decides the outcome. If the high-temperature and flow-loss trips live in the same PLCs the attacker now owns, nothing fires and the hall cooks: months of recovery. If the trips are on an independent hardwired SIS the attacker cannot write to, every rack drops to a safe state and the hall goes down hard (lost goodput, a bad day), but the silicon survives: hours of recovery, not months.
Step 4: the multiplier. At the same time they command a synchronized load drop to stress the interconnection, so the facility is fighting a grid event and a thermal event at once, degrading the human response.
The defenses that turned a catastrophe into a bad day were all decided at design time: segmentation that bounded step 1, an independent SIS that bounded step 3, and ride-through sized for step 4 (Chapter 4.5). None could be added in the moment.
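The worked example can be condensed into a small decision function that shows which design-time choice bounds which step. This is a simplification under stated assumptions: the three boolean flags and the outcome strings are illustrative, and step 4 (the grid multiplier) is left out because it changes the severity of the response, not the branch taken.

```python
from itertools import product


def outcome(dmz_brokered: bool, ot_converged: bool, sis_independent: bool) -> str:
    # Step 1: a brokered, time-boxed DMZ removes the stale jump-host, but a
    # converged network gives the attacker a path from an IT foothold anyway.
    if dmz_brokered and not ot_converged:
        return "blocked at access"
    # Steps 2-3: reconnaissance needs no exploit; the SIS decides the result.
    if sis_independent:
        return "hard shutdown, silicon survives (hours)"
    return "hall cooks (months)"


# Enumerate every combination of the three design-time decisions.
for flags in product([True, False], repeat=3):
    print(flags, "->", outcome(*flags))
```

Only one of the eight combinations stops the attack at access, and half of them end with the hall destroyed, which is the argument for treating the SIS boundary as the decision that matters most.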
Detection, response, and the converged escalation trigger
Prevention is the priority, but the OT plane also needs detection that a converged SOC can act on. The hard problem is that OT telemetry and IT telemetry live in different worlds: a falsified Modbus write looks like a legitimate engineering command, and the signal that distinguishes attack from operation is often physical — a setpoint that contradicts the measured process state, a valve commanded closed while temperature climbs, a load step with no scheduling event behind it. That is why the detection that matters here is cyber-physical correlation: cross-checking commanded state against sensed state against expected workload, and treating divergence as an incident.
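A minimal sketch of that cross-check follows, assuming two consecutive telemetry samples per CDU. The `Sample` fields and both thresholds (`temp_rise_c`, `load_tol_kw`) are hypothetical values chosen for illustration; a real deployment would tune them to the loop and take the temperature from a sensor the control plane cannot falsify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    valve_cmd_open: bool       # commanded state from the controller
    inlet_temp_c: float        # sensed process state
    rack_power_kw: float       # sensed load
    scheduled_power_kw: float  # load the scheduler expects


def divergences(prev: Sample, cur: Sample, temp_rise_c=2.0, load_tol_kw=20.0):
    """Compare commanded, sensed, and expected state; return what disagrees."""
    found = []
    # Commanded vs sensed: valve held closed while the loop heats up.
    if not cur.valve_cmd_open and cur.inlet_temp_c - prev.inlet_temp_c >= temp_rise_c:
        found.append("valve commanded closed while temperature climbs")
    # Sensed vs expected: load moved but the scheduler did not ask for it.
    load_step = abs(cur.rack_power_kw - prev.rack_power_kw)
    sched_step = abs(cur.scheduled_power_kw - prev.scheduled_power_kw)
    if load_step > load_tol_kw and sched_step <= load_tol_kw:
        found.append("load step with no scheduling event behind it")
    return found


before = Sample(True, 30.0, 115.0, 115.0)
after = Sample(False, 33.5, 40.0, 115.0)
print(divergences(before, after))
# ['valve commanded closed while temperature climbs',
#  'load step with no scheduling event behind it']
```

Neither finding requires knowing whether the write was malicious, so the check still works when the attacker uses legitimate engineering commands.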
The organizational fork is whether one converged function owns OT detection or whether facilities and security each assume the other is watching the controllers, which is the gap OT incidents walk through. The answer is a converged cyber-physical escalation trigger: a single condition (anomalous OT command + adverse physical trend) that pages both the SOC and facilities engineering simultaneously and invokes a joint playbook, because neither team can diagnose a CDU-disablement-into-thermal-runaway alone. The unified incident-command model that this escalation feeds into is canonical in Chapter 14.11, and the OT/cyber-physical IR playbook itself is built out in Chapter 11.12; this chapter's job is to insist the trigger exists and is wired to the physics, not just to the logs.
What to decide, and when
The decisions in this chapter sort sharply into irreversible-at-design-time and operational. The irreversible ones — where the SIS boundary sits, whether destruction-class trips are hardwired and independent, how the OT network is segmented from IT, whether fault domains are also security domains — must be made before you commission, because retrofitting an independent safety layer or re-segmenting a live OT network mid-life is brutal and sometimes impossible. The operational ones — vendor access brokering, firmware governance cadence, OT detection content, the converged escalation drill — are ongoing and improvable. The trap, exactly as elsewhere in this guide, is treating an irreversible decision as if it were operational: assuming you can 'add a safety system later' to a hall whose trips already live inside the hackable BMS. You cannot, and the day you discover that is the day an attacker discovers it for you.