Appendix F

Failure-Mode / FMEA Catalog

Every failure in an AI factory is dual-use — the same coolant-leak cascade, grid trip, or thermal runaway can be a random fault or an attacker's objective — so this catalog gives each mode a single uniform record (trigger, propagation, detection, blast radius, mitigation, recovery) that feeds the availability model in Chapter 12.5 and the cyber-physical analysis in Chapter 11.10 from one consistent source of truth.

What you'll decide here

Use this appendix as a lookup, not a narrative: find your failure mode in the master table, then jump to the owning chapter (last column) for the engineering derivation. This catalog summarizes the canonical treatment; it does not replace it.
Read every mode as dual-use. The trigger column lists the random cause; the same end-state is reachable by an attacker via the path in Chapter 11.10 (forced load step, CDU disablement, BMS spoof). If a mitigation only defends the random path, it is incomplete.
Treat the blast-radius column as the input to your fault-domain and RBD work (Chapter 12.1 / 12.5): a mode that strands one rack and a mode that trips the campus POI sit in different availability tiers and deserve different redundancy spend.
Walk the propagation column for cascade interlocks. Most catastrophic outages here are not the first fault — they are a single fault that defeated a shared mitigation (one CDU, one fuel header, one protection setting) and took an entire fault domain with it.
Pair detection latency against propagation speed for each mode. Where the fault propagates faster than your detection-plus-actuation loop (load-step→grid trip, HBM runaway), the only viable mitigation is preventive or inertial, never reactive — design accordingly.

This appendix consolidates the failure modes scattered across the engineering chapters into one uniform FMEA register. It is referenced from the resilience-standards chapter (Chapter 12.1), the reliability rethink (Chapter 12.2), the component-failure-rate chapter (Chapter 14.3), and the integrated-systems-test chapter (Chapter 13.6); and it is consumed directly by the quantitative availability model in Chapter 12.5, which draws its top events and common-cause couplings from the rows below. The canonical engineering of each mode lives in the chapter named in the right-most column — this catalog is the index and the cross-walk, deliberately dense and scannable, not the derivation.

The organizing discipline is that every mode is treated as dual-use: a coolant leak, a synchronized load step, a BESS cell vent, a fiber cut — each is reachable both as a random fault and as an attacker's objective. The two paths share an end-state, so they share a row; where they diverge is in the trigger and in which mitigation defeats them. A defense that addresses only the stochastic path (a redundant pump) but not the induced path (a malicious firmware load that disables both pumps in lockstep) is, for FMEA purposes, an incomplete mitigation. The cyber-physical attack tree that maps these induced paths is Chapter 11.10; this appendix flags the coupling but does not re-derive it.

How to read a row

Each failure mode is recorded against six fields, applied identically across every table so the catalog is sortable and comparable:

Trigger — the initiating event (random cause first; the induced/attacker path is noted where it differs materially).
Propagation path — how the fault spreads, and critically which shared mitigation it defeats to escalate from a local fault to a cascade.
Detection — the sensing modality and its characteristic latency relative to propagation speed (the decisive ratio).
Blast radius — the fault domain affected at full propagation: node, rack, row, hall, or campus/POI.
Mitigation — the preventive or containing control, classified as preventive (stops the trigger), inertial (buys ride-through time), or reactive (acts after detection).
Recovery — the path back to service and its characteristic time-to-restore (ETTR).

Where propagation outruns detection-plus-actuation, the only effective controls are preventive or inertial. The ratio of detection-plus-actuation time to propagation time is called out per mode in the notes because it tells you whether a mitigation must exist before the fault or can be deployed after it.
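The rule above can be sketched as a small check. This is a minimal illustration, assuming each mode can be summarized by three characteristic times; the class names, field names, and timings are hypothetical and are not values from the catalog.

```python
from dataclasses import dataclass
from enum import Enum


class MitigationClass(Enum):
    PREVENTIVE = "preventive"  # stops the trigger
    INERTIAL = "inertial"      # buys ride-through time
    REACTIVE = "reactive"      # acts after detection


@dataclass(frozen=True)
class FailureMode:
    name: str
    propagation_s: float  # time for the fault to reach its full blast radius
    detection_s: float    # sensing latency
    actuation_s: float    # time for the control to take effect once detected

    def viable_mitigations(self) -> set:
        """Reactive controls count only if the loop closes before propagation ends."""
        viable = {MitigationClass.PREVENTIVE, MitigationClass.INERTIAL}
        if self.detection_s + self.actuation_s < self.propagation_s:
            viable.add(MitigationClass.REACTIVE)
        return viable


# Illustrative timings only (assumed for the sketch, not measured).
load_step = FailureMode("load-step grid trip", propagation_s=0.005,
                        detection_s=0.02, actuation_s=0.1)
slow_weep = FailureMode("slow coolant weep", propagation_s=600.0,
                        detection_s=30.0, actuation_s=5.0)

for mode in (load_step, slow_weep):
    names = sorted(m.value for m in mode.viable_mitigations())
    print(f"{mode.name}: {', '.join(names)}")
```

With these assumed timings the load-step mode admits only preventive and inertial controls, while the slow weep also admits a reactive one.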

The cascade is rarely the first fault

Read the propagation column with one question: what shared resource did the first fault take with it? A single leaking quick-disconnect is a maintenance ticket; a leaking QD that shorts a shared CDU controller and de-rates an entire row is an outage. A single de-loading job is invisible; ten thousand GPUs de-loading in lockstep is a 1,000+ MW/s ramp that can trip the point of interconnection. In almost every catastrophic row below, the severity comes not from the trigger but from a defeated common mitigation — one CDU, one fuel header, one protection setting, one BMS. Design your fault domains so that the common-cause couplings in the propagation column are broken, and most of these cascades degrade to single-domain faults the availability model in Chapter 12.5 can tolerate.

Master FMEA catalog — thermal & mechanical (cooling) modes

Cooling-system failure modes

Failure mode	Trigger	Propagation path	Detection	Blast radius	Mitigation	Recovery	Owner
Coolant-leak cascade	QD/manifold/cold-plate breach, hose chafe, gasket creep; induced: spoofed dew-point setpoint forcing condensation	Local drip → conductive coolant on busbar/PDU → arc/short → de-rate or trip of the powered branch; if it reaches a shared CDU controller, the whole CDU loop and its row drop together	Floor/leak-rope sensors + CDU flow/pressure-decay; latency seconds-to-minutes; propagation can outrun it on a high-flow breach	Rack → row (if the CDU is shared); hall if isolation valves are absent	Negative-pressure loops (leak draws air in, not coolant out); dripless UQDs; zoned isolation valves; per-rack leak detection; N+1 CDU with independent controllers (preventive + reactive)	Isolate the branch, drain/flush the loop, replace the failed coupling, re-pressure-test, re-fill, re-commission worst-case branch; ETTR hours per rack	5.11
CDU / pump failure	Pump bearing/VFD failure, seal loss, filter blockage, control-board fault; induced: malicious VFD firmware or controller DoS	Single-pump loss halves flow → cold-plate ΔT rises → GPU thermal throttle (up to ~50%) → if the CDU has no inertia and no standby pump, the served racks ride the thermal cliff in seconds (no chilled-water mass to coast on)	Pump tach, ΔP across the CDU, coolant supply temp; fast (sub-second on flow), but there is no thermal inertia to absorb the gap	Rack(s) on the affected CDU → row	N+1 pumps inside the CDU; UPS/BBU-backed pump power for ride-through; redundant independent CDUs per row; flow-failure auto-throttle of GPUs as a graceful-degrade floor (inertial + reactive)	Fail over to standby pump/CDU (sub-second to seconds if hot-standby), then RMA the failed pump off the critical path; ETTR minutes if N+1, hours if not	5.11
Cooling-controls transient excursion	Synchronized GPU load drop (job ends / checkpoint pause) → loop heat input collapses faster than valves/VFDs can slew; setpoint hunt / control-loop oscillation	On a rapid load drop, supply-coolant temp overshoots downward → transient dew-point excursion → condensation risk on cold surfaces; or anti-hunting failure drives sustained oscillation that fatigues actuators and destabilizes neighboring loops	Coolant supply-temp rate-of-change, dew-point margin sensor, valve-position hunting; detectable but the excursion window is brief	Row → hall (controls coupling); condensation risk is local to cold surfaces	Slew-rate limits on control valves and pump VFDs; anti-hunting tuning; dew-point margin floor; coordinate cooling setpoints with the rack BBU/BESS load-smoothing spine (preventive)	Re-tune control loops, restore dew-point margin, dry/inspect any condensation; ETTR minutes, no hardware loss if caught	5.12
HBM thermal runaway	Cold-plate contact loss, TIM pump-out, local flow starvation, or sustained over-temp on a stacked-DRAM site; induced: CDU disablement holding flow at zero	HBM junction temp climbs → ECC error rate rises → uncorrectable error / package damage; on a tightly-coupled training step the failed device stalls the synchronous collective and the whole job stalls behind the straggler	Per-die thermal telemetry, ECC/CE rate trend, GPU throttle flags; trend-detectable early, but runaway is fast once contact is lost	Node (the GPU/HBM package) → job (synchronous training stalls on the straggler)	Thermal screening/burn-in pre-deployment; ECC-rate alarming with proactive drain; flow-failure throttle floor; hot-spare nodes so the scheduler evicts and replaces the straggler (preventive + reactive)	Evict the node, fail the job over to a hot spare, RMA the package; training resumes from last checkpoint (ETTR = checkpoint interval + restart)	14.3

Coolant chemistry/flow envelope per ASHRAE TC 9.9 (5th ed.) liquid-cooling guidelines and OCP Liquid Cooling white papers; CDU/QD practice per Vertiv/nVent/Equinix. Owning chapters in last column. Dual-use induced path per Chapter 11.10.

The cooling modes share a defining property absent from legacy air-cooled halls: direct-to-chip liquid loops have almost no thermal inertia. A chilled-water plant coasts for minutes on the mass of water in the system; a DLC technology-cooling loop sized to a tight delta-T coasts for seconds. That removes the operator's reaction window for the pump/CDU and HBM modes — detection latency that would be acceptable on air is fatal on liquid. This is why every mitigation in the table above is either preventive (screen it out before deployment) or inertial (BBU-backed pumps, a flow-failure throttle floor) rather than reactive. The disappearance of chilled-water inertia is treated as a first-class reliability problem in Chapter 12.2 and engineered in Chapter 5.11.
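The minutes-versus-seconds contrast follows from a simple energy balance: coast time is roughly coolant mass times specific heat times allowable temperature rise, divided by heat load. The sketch below applies that balance; the coolant masses, allowable rise, and heat load are hypothetical round numbers chosen only to show the order of magnitude, and the model ignores pipe and equipment thermal mass.

```python
WATER_CP_KJ_PER_KG_K = 4.18  # specific heat of water; glycol mixes are somewhat lower


def coast_time_s(coolant_mass_kg: float, allowable_rise_k: float,
                 heat_load_kw: float,
                 cp_kj_per_kg_k: float = WATER_CP_KJ_PER_KG_K) -> float:
    """Seconds until the coolant warms by allowable_rise_k with no heat rejection.

    Energy balance: t = m * c * dT / Q, with kJ / kW = s.
    """
    return coolant_mass_kg * cp_kj_per_kg_k * allowable_rise_k / heat_load_kw


# Hypothetical inputs: the same 1 MW heat load and 5 K of margin in both cases.
chilled_water_plant = coast_time_s(coolant_mass_kg=50_000, allowable_rise_k=5,
                                   heat_load_kw=1_000)
dlc_loop = coast_time_s(coolant_mass_kg=200, allowable_rise_k=5,
                        heat_load_kw=1_000)

print(f"chilled-water plant: {chilled_water_plant / 60:.1f} min")
print(f"DLC technology loop: {dlc_loop:.1f} s")
```

Under these assumptions the large plant coasts for many minutes and the small loop for a few seconds, which is why the pump and CDU mitigations have to be in place before the fault.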

Master FMEA catalog — electrical & power modes

Power-chain & grid-interface failure modes

Failure mode	Trigger	Propagation path	Detection	Blast radius	Mitigation	Recovery	Owner
Simultaneous-GPU-load-step grid trip	Thousands of GPUs ramp in lockstep at job start/stop/checkpoint; di/dt event on every step; induced: malicious power-cap firmware forcing a synchronized step	Aggregate ramp >1,000 MW/s presented to the POI → voltage/frequency disturbance → if the load-smoothing spine is absent, upstream protection or generators see a step they cannot follow → trip	Power-quality metering at the POI, PMU/PQM; fast, but the di/dt event is faster than any reactive control — milliseconds	Campus (POI) → contributes to wide-area grid disturbance	The chip→BBU→BESS smoothing spine (on-package capacitance → rack BBU → facility BESS); software ramp-rate limits and regulated wind-downs; grid-forming inverters (preventive + inertial)	Re-energize per utility ride-through procedure; restore smoothing controls; no hardware loss if the spine held; ETTR minutes if ride-through succeeded	4.5
Utility ride-through / voltage-disturbance event	Grid fault (e.g. 230 kV line fault) causes a voltage sag at the POI; sensitive customer-side protection drops the load	Undervoltage trip of the facility load → instantaneous multi-MW load loss (~1,500 MW seen on a single fault) → the loss itself destabilizes the grid, a self-reinforcing reliability problem NERC flagged at Level 3	POI relays, undervoltage/under-frequency elements, PMU; the disturbance is sub-cycle to cycles	Campus (full load drop) → wide-area grid	Fault-ride-through settings tuned to stay online through the sag (SEL/relay, UPS, undervoltage-load-retention); reactive/voltage support toward the POI; compliance with TPL-001/PRC ride-through standards (preventive)	Auto-recover as the sag clears if ride-through held; if tripped, sequenced re-energization and load ramp; ETTR minutes	4.10
BESS thermal runaway	Cell defect, overcharge, internal short, or cooling loss in an LFP facility battery; induced: BMS spoof disabling cell balancing/thermal protection	Single cell vents → exothermic chain to adjacent cells → module-level runaway → fire/off-gas if pack-level isolation and venting fail; loss of the BESS also removes the ride-through and load-smoothing it was providing	Cell voltage/temp telemetry, off-gas (H2/CO) detection, BMS fault flags; early-warning gas detection precedes thermal runaway by a useful margin	BESS enclosure → adjacent enclosures/room if propagation isolation fails	LFP chemistry (higher thermal-runaway threshold than NMC); module-level thermal isolation and dedicated venting/deflagration paths; off-gas detection with pre-emptive isolation; physical separation of BESS from IT (preventive + reactive)	Isolate and let the affected module burn out safely within its enclosure; replace module/pack; re-commission; ETTR hours-to-days; ride-through reverts to UPS/BBU meanwhile	4.5
Fuel-supply interruption (on-site generation)	Firm pipeline curtailment (correlated cold-snap), valve/compressor failure, or fuel-quality (Wobbe/dew-point) excursion; LNG/CNG storage depletion	Loss of fuel → on-site turbines/engines de-rate or trip → if the site is islanded or grid-import is constrained, generation cannot meet IT load → controlled load-shed or outage	Fuel header pressure, Wobbe-index/dew-point analyzers, tank level, generator load; minutes of warning on slow depletion, immediate on a hard cut	Campus (islanded sites) → partial if grid-import backstops	'Synthetic-firm' fuel structure (multiple pipelines + interruptible + on-site LNG/CNG storage); dual-fuel switching; fuel conditioning to spec; sized on-site storage for correlated-curtailment duration (preventive)	Switch fuel source / draw down on-site storage, restore generation; coordinate curtailment with curtailable-load agreement; ETTR depends on storage sizing vs outage duration	4.9

Load-step magnitudes: synchronized GPU draw can swing 30%→100% in milliseconds, aggregating to >1,000 MW/s at GW scale (NVIDIA/Microsoft/OpenAI joint findings, 2025). The ~1,500 MW instantaneous load-loss on a 230 kV fault is the NERC Level-3 alert motivating case. Owners in last column.
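The ramp figure is simple arithmetic: the power swing divided by the time it takes. The sketch below uses the 30%→100% swing quoted above; the 1,000 MW site size and the 0.5 s aggregate window are assumptions chosen for illustration, and a shorter window only makes the ramp steeper.

```python
def ramp_rate_mw_per_s(site_load_mw: float, low_fraction: float,
                       high_fraction: float, window_s: float) -> float:
    """Aggregate ramp presented to the POI when the whole site steps together."""
    swing_mw = site_load_mw * (high_fraction - low_fraction)
    return swing_mw / window_s


def window_for_limit_s(site_load_mw: float, low_fraction: float,
                       high_fraction: float, limit_mw_per_s: float) -> float:
    """Shortest ramp window that keeps the aggregate ramp at or under a limit."""
    swing_mw = site_load_mw * (high_fraction - low_fraction)
    return swing_mw / limit_mw_per_s


# Assumed: 1,000 MW site, synchronized 30% -> 100% step spread over 0.5 s.
ramp = ramp_rate_mw_per_s(1_000, 0.30, 1.00, window_s=0.5)
stretch = window_for_limit_s(1_000, 0.30, 1.00, limit_mw_per_s=1_000)

print(f"ramp: {ramp:.0f} MW/s")
print(f"window needed to stay at 1,000 MW/s: {stretch:.2f} s")
```

The second function hints at what software ramp-rate limits have to achieve: stretching the same swing over a longer window.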

30%→100% in ms: synchronized GPU power swing per load step; aggregates to a >1,000 MW/s ramp at GW scale. Source (2025): NVIDIA/Microsoft/OpenAI joint findings via Oracle OCI; arXiv AI Load Dynamics.

~1,500 MW: instantaneous data-center load loss on a single 230 kV fault, the NERC Level-3 alert motivating case. Source (2026): NERC Level-3 Essential Actions Alert; Utility Dive / Data Center Frontier.

1 / 3 hr: mean interruption rate on Meta's 16,384-GPU H100 Llama 3 cluster; 466 interruptions (419 unexpected) over a 54-day window. Source (2024): Meta, The Llama 3 Herd of Models.

30.1% / 17.2%: share of training interruptions from faulty GPUs / HBM3 memory; network switch+cable 8.4%; >90% effective training time maintained. Source (2024): Meta Llama 3 paper; Tom's Hardware / DCD analysis.

up to ~50%: GPU throttle when DLC coolant flow/temp leaves the envelope (sub-25 °C inlet, ~20 L/min, <10 °C rise across cold plates). Source (2025): NVIDIA GB200 NVL72 thermal envelope; Vertiv 360AI reference design.

~1.8–1.9 L/kWh: industry-avg evaporative WUE (best-in-class 0.1–0.7; Microsoft FY2025 0.30); curtailment risk per ERCOT SB6 75 MW kill-switch regime. Source (2025): Microsoft FY2025 sustainability; ERCOT SB6 / NPRR.

Master FMEA catalog — connectivity, compute & data-integrity modes

Network, optics, compute & data-integrity failure modes

Failure mode	Trigger	Propagation path	Detection	Blast radius	Mitigation	Recovery	Owner
Fiber cut (inter-/intra-DC)	Backhoe/construction strike, conduit failure, or DCI route cut; induced: deliberate cut of an un-diverse route	Loss of a fiber path → if scale-across/DCI is single-routed, the affected campus or training partition is severed → distributed training stalls or a metro site loses connectivity	Optical LOS/LOF alarms, OTDR, BER collapse; immediate at the physical layer	Link → partition/campus (distributed training) or metro site reach (inference)	Physically diverse, geographically separated fiber routes; protected DCI (ZR/ZR+ with restoration); for training, topology that degrades gracefully on a partition loss (preventive + reactive)	Restore over the diverse path automatically; physical splice repair on the cut route (ETTR hours-to-days for the splice, seconds for protected failover)	3.6
Optics flap storm	Marginal transceiver, dirty/over-bent connector, thermal cycling, or firmware bug causing repeated link up/down; induced: thermal attack on the optics environment	One flapping link → routing reconvergence churn / ECMP rehashing → packet loss and tail-latency spikes propagate across the fabric → on a tightly-coupled collective, the flapping link gates the whole all-reduce and tanks MFU	Per-port link-flap counters, BER/FEC-error trend, CRC errors; trend-detectable but a storm builds in seconds-to-minutes	Link → fabric pod (reconvergence churn) → job (collective stalls)	Pre-install optics burn-in / BER screening off the critical path; FEC margin headroom; auto-quarantine of flapping ports; CPO to remove pluggable failure points where deployed (preventive + reactive)	Quarantine and replace the offending transceiver; let routing reconverge; ETTR minutes to swap, plus reconvergence	8.9
SDC corruption event (silent data corruption)	Marginal/defective silicon (a 'mercurial core'), aging, voltage/thermal margin loss producing a wrong result with no error flag; induced: fault-injection on a known-marginal device	A miscomputed value flows silently into gradients/activations → corrupts the model state or an inference result → undetected for hours-to-days, potentially poisoning a checkpoint and forcing a roll-back of all work since the last clean checkpoint	No native hardware flag — requires a dedicated detection program (Fleetscanner periodic, Ripple in-fleet, Hardware Sentinel runtime); latency hours unless instrumented	Node (the mercurial core) → job/model (silent corruption of state) → potentially every downstream consumer of a poisoned checkpoint	Proactive SDC-hunting at scale; redundant/checksummed compute on critical paths; PVF-aware placement; roll-back to a verified-clean checkpoint; quarantine the device (preventive + reactive)	Identify and quarantine the mercurial core, roll back to last verified-clean checkpoint, re-run; ETTR = work since clean checkpoint (the reason hyper-frequent checkpointing pays off)	14.3
Water-curtailment event	ISO/utility curtailment order (ERCOT SB6 kill-switch class), drought/basin restriction, or reclaimed-water supply interruption on an evaporative-cooled site	Loss/limit of make-up water → evaporative/cooling-tower capacity drops → heat-rejection ceiling falls below IT load → forced de-rate of compute or a curtailment-driven load-shed	Make-up water flow/level, WUE telemetry, basin level, curtailment-order receipt; advance notice on scheduled curtailment, immediate on a hard order	Hall → campus (heat-rejection limited)	Closed-loop / zero-evaporation cooling design (designs water out of the risk); reclaimed/non-potable sourcing; on-site water storage; curtailment-tolerant workload scheduling (batch defers); thermal-storage buffer (preventive)	Shift heat rejection to the closed-loop/dry path or draw down water storage; defer curtailable batch load; ETTR = curtailment duration; structural fix is closed-loop conversion	3.7

SDC detection program per Meta (Fleetscanner/Ripple/Hardware Sentinel) and OCP SDC-in-AI white paper; optics reliability per SemiAnalysis/NVIDIA CPO analyses; fiber/latency per Chapter 3.6. Owners in last column.

Three of these four modes (SDC corruption, optics flap storms, and fiber cuts) are distinguished by a long detection latency relative to a slow-burning blast radius, the inverse of the cooling and load-step modes. An SDC event can poison a checkpoint and sit undetected for days; an optics flap can quietly erode MFU before anyone correlates the tail latency with a single marginal transceiver; a fiber cut on a poorly instrumented diverse path can go unnoticed until the protect path also fails. For this class the decisive investment is detection, not faster actuation: a dedicated SDC-hunting program, per-port flap/FEC telemetry, and verified-clean checkpointing convert an invisible, unbounded-blast-radius fault into a bounded, recoverable one. The empirical fault taxonomy and the AFRs that feed these rows into the availability model are the canonical content of Chapter 14.3; the SDC detection program is treated there and demonstrated at commissioning in Chapter 13.6.
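As one illustration of detection-first investment, the per-port flap telemetry and auto-quarantine named in the optics row can be sketched as a sliding-window counter. This is a minimal sketch, not a vendor feature: the class name, thresholds, and port labels are hypothetical, and a real fabric would feed it from switch link-state events.

```python
from collections import deque


class FlapQuarantine:
    """Quarantine a port once it flaps too often inside a sliding time window."""

    def __init__(self, max_flaps: int = 5, window_s: float = 60.0):
        self.max_flaps = max_flaps
        self.window_s = window_s
        self._events = {}          # port -> deque of flap timestamps (seconds)
        self.quarantined = set()

    def record_flap(self, port: str, t_s: float) -> bool:
        """Record one link down/up transition; return True if port is quarantined."""
        events = self._events.setdefault(port, deque())
        events.append(t_s)
        # Drop flaps that have aged out of the window.
        while events and t_s - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.max_flaps:
            self.quarantined.add(port)
        return port in self.quarantined


# Hypothetical event stream: one port flaps rarely, another flaps in a burst.
monitor = FlapQuarantine(max_flaps=3, window_s=10.0)
for t in (0.0, 20.0, 40.0):
    monitor.record_flap("leaf1/p7", t)
for t in (0.0, 1.0, 2.0):
    monitor.record_flap("leaf2/p3", t)

print(sorted(monitor.quarantined))
```

The design choice worth noting is that quarantine is sticky: once a port trips the threshold it stays out of service until someone replaces or clears it, so a marginal transceiver cannot keep re-entering the fabric.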

Common-cause couplings & cascade interlocks

The most useful output of this catalog for the availability model is not the per-mode rows but the couplings between them: the shared resources whose failure makes two independent-looking modes fail together. These are the common-cause terms that an RBD or fault tree must capture or it will badly over-state availability. The table below names the cross-mode interlocks worth modeling explicitly.

Cross-mode common-cause couplings (model these as shared events)

Shared resource	Modes it couples	Why the coupling matters	Design break
A single CDU / coolant loop	Coolant-leak cascade + CDU/pump failure + HBM runaway	One CDU serves a row; its loss simultaneously removes flow (HBM runaway) and can be the leak's escalation path — a single point that takes the whole row	Independent N+1 CDUs per row with separate controllers and power feeds; zoned isolation valves
The load-smoothing spine (BBU + BESS)	Load-step grid trip + BESS thermal runaway + ride-through event	The BESS absorbs the load step AND provides ride-through; a BESS runaway removes both defenses at once, so a grid disturbance during a BESS event is unmitigated	Don't make the same BESS the sole provider of smoothing and ride-through; layer UPS/BBU; physically separate BESS
A single fuel header / pipeline	Fuel-supply interruption + load-step / ride-through (generation can't follow)	A correlated cold-snap curtailment removes generation just when grid stress is highest — the failures are positively correlated, not independent	Synthetic-firm fuel (multiple pipelines + storage + dual-fuel); size storage to correlated-event duration
A single fiber route / DCI path	Fiber cut + optics flap storm (no diverse path to fail over to)	If diversity is nominal but the routes share a conduit/right-of-way, a single cut defeats the working path and the 'redundant' protect path together	True physical + geographic route diversity; verify conduit separation, not just two strands
A verified-clean checkpoint cadence	SDC corruption + HBM runaway + any job-restart mode	Every restart-based recovery (HBM eviction, node failure) and every SDC roll-back is bounded by the checkpoint interval — a stale cadence inflates the ETTR of many modes at once	Hyper-frequent / asynchronous checkpointing with integrity verification; this is the universal ETTR floor
Shared facility controls (BMS/DCIM/SCADA)	Cooling-controls excursion + CDU disablement + induced (Ch 11.10) variants of every mode	The controls plane is the common attacker target and a common random single-point: a BMS fault or spoof can trigger the dew-point, flow, and load-step modes simultaneously	Segmented OT networks, signed setpoints, out-of-band safety interlocks independent of the BMS

Each row is a single shared resource whose loss couples otherwise-independent failure modes. Feed these as common-cause / shared events into the RBD/FTA of Chapter 12.5 — independent-failure assumptions over-state availability where these exist.
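How much an independence assumption over-states availability can be seen with the standard beta-factor approximation for a redundant pair. The sketch below is illustrative only: the unit unavailability of 0.01 and the β of 0.1 are assumed numbers, not values from this catalog or from Chapter 14.3.

```python
def pair_unavailability(q: float, beta: float = 0.0) -> float:
    """Unavailability of a 1-out-of-2 redundant pair under the beta-factor model.

    q    : unavailability of a single unit
    beta : fraction of each unit's failures that are common-cause (shared)
    """
    independent_part = ((1.0 - beta) * q) ** 2  # both units fail separately
    common_part = beta * q                      # one shared event fails both
    return independent_part + common_part


# Assumed values for two "redundant" CDUs that share a controller.
q_unit = 0.01
naive = pair_unavailability(q_unit)               # independence assumed
coupled = pair_unavailability(q_unit, beta=0.1)   # 10% of failures shared

print(f"independent model: {naive:.6f}")
print(f"with common cause: {coupled:.6f}")
print(f"ratio: {coupled / naive:.1f}x")
```

Even a modest shared fraction dominates the result, because the common-cause term scales with q while the independent term scales with q squared.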

Using this catalog in the availability model

The rows above are written to be lifted directly into the reliability work. The mechanical translation is:

Top events for the FTA come from the high-blast-radius rows — campus/POI-level modes (load-step grid trip, ride-through, fuel interruption, water curtailment) are the system-level top events the fault tree resolves down to component basic-events.
Basic-event rates come from the component AFRs in Chapter 14.3 (GPU, HBM, optics, pump, cell, transceiver) — this appendix names the mode; that chapter quantifies how often it fires.
Common-cause β-factors come from the couplings table — every shared-resource row is a common-cause term that breaks the independence assumption and must be modeled as a shared event in the RBD/Monte-Carlo of Chapter 12.5.
ETTR distributions come from the recovery column — and note that the checkpoint cadence is the dominant ETTR term across the largest number of modes, which is why goodput-centric reliability (Chapter 12.2) prioritizes checkpoint speed over facility nines for training.
IST demonstration cases come from the propagation column — Level-5 integrated systems testing (Chapter 13.6) should demonstrate the failover for the worst cascade rows (CDU loss, BESS event, load-step ride-through) before the factory carries production load.
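The claim that checkpoint cadence dominates ETTR can be made concrete with a first-order estimate. The sketch below takes the interruption count and window from the Llama 3 figures cited earlier in this appendix; the checkpoint intervals and the restart time are hypothetical, and the model ignores the cost of writing checkpoints.

```python
def mean_time_between_interruptions_h(window_days: float,
                                      interruptions: int) -> float:
    """Average hours between interruptions over an observation window."""
    return window_days * 24.0 / interruptions


def lost_fraction(checkpoint_interval_h: float, restart_h: float,
                  mtbi_h: float) -> float:
    """Expected share of wall-clock time lost per interruption cycle.

    Assumes an interruption lands uniformly inside a checkpoint interval,
    so half an interval of work is redone on average, plus the restart.
    """
    return (checkpoint_interval_h / 2.0 + restart_h) / mtbi_h


# 466 interruptions over 54 days, as cited in this appendix.
mtbi = mean_time_between_interruptions_h(54, 466)

# Hypothetical cadences with the same assumed 6-minute restart.
slow_cadence = lost_fraction(checkpoint_interval_h=0.5, restart_h=0.1, mtbi_h=mtbi)
fast_cadence = lost_fraction(checkpoint_interval_h=0.05, restart_h=0.1, mtbi_h=mtbi)

print(f"mean time between interruptions: {mtbi:.2f} h")
print(f"lost at 30-min checkpoints: {slow_cadence:.1%}")
print(f"lost at 3-min checkpoints:  {fast_cadence:.1%}")
```

Under these assumptions, shortening the checkpoint interval recovers more training time than any plausible improvement in facility availability would, which is the argument the ETTR bullet makes.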

Worked example: tracing the coolant-leak cascade as a one-fault-to-row outage

Take the first row and walk it as the availability model would. Trigger: a quick-disconnect gasket creeps after thermal cycling and weeps coolant — by itself a single-rack maintenance event with an ETTR of an hour. The escalation: whether this stays a one-rack event or becomes a row outage is decided entirely by the propagation path. If conductive coolant reaches a powered busbar before the leak-rope sensor trips, it arcs and de-rates the branch; if the affected CDU controller is shared across the row and the short reaches it, the whole row's flow drops and you are now also in the CDU/pump-failure row and, seconds later, the HBM-runaway row — three catalog rows from one gasket, because a single common resource (the shared CDU) was defeated.

The break: negative-pressure loops mean a breach draws air in rather than pushing coolant onto live electrical gear, removing the arc path; per-rack leak detection with zoned isolation valves contains the weep to one rack; and independent N+1 CDUs per row mean the shared-controller escalation cannot happen. With those three breaks, the model sees a single-rack basic event with a bounded ETTR — not a row-level top event with a common-cause coupling to two other modes. The dual-use note: the same row outage is reachable by spoofing the dew-point setpoint to force condensation onto the busbar (no gasket required), which is why the OT-segmentation mitigation in Chapter 11.10 is part of this row's complete defense, not an optional extra. Canonical engineering of the leak-detection and isolation design is Chapter 5.11.

This catalog is the index; the derivations live in their owning chapters. Cooling modes: leak/CDU/pump in Chapter 5.11, controls transients in Chapter 5.12, the density/thermal envelope in Chapter 5.1. Electrical modes: load-step smoothing in Chapter 4.5, grid ride-through in Chapter 4.10, fuel-supply engineering in Chapter 4.9. Connectivity/compute: fiber/latency in Chapter 3.6, optics reliability in Chapter 8.9, the SDC/hard/transient fault taxonomy and AFRs in Chapter 14.3. Water curtailment is treated as a siting gate in Chapter 3.7. The dual-use induced paths for every mode are in Chapter 11.10. The catalog feeds the redundancy/fault-domain framework in Chapter 12.1, the goodput-vs-availability rethink in Chapter 12.2, the quantitative RBD/FTA/Monte-Carlo model in Chapter 12.5, and the failure-mode demonstration in Chapter 13.6.