The Definitive Guide to AI Data Centers

Appendix F

Failure-Mode / FMEA Catalog

Every failure in an AI factory is dual-use — the same coolant-leak cascade, grid trip, or thermal runaway can be a random fault or an attacker's objective — so this catalog gives each mode a single uniform record (trigger, propagation, detection, blast radius, mitigation, recovery) that feeds the availability model in Chapter 12.5 and the cyber-physical analysis in Chapter 11.10 from one consistent source of truth.

What you'll decide here

  1. Use this appendix as a lookup, not a narrative: find your failure mode in the master table, then jump to the owning chapter (last column) for the engineering derivation — this catalog summarizes, it does not replace, the canonical treatment.
  2. Read every mode as dual-use. The trigger column lists the random cause; the same end-state is reachable by an attacker via the path in Chapter 11.10 (forced load step, CDU disablement, BMS spoof). If a mitigation only defends the random path, it is incomplete.
  3. Treat the blast-radius column as the input to your fault-domain and RBD work (Chapter 12.1 / 12.5): a mode that strands one rack and a mode that trips the campus POI sit in different availability tiers and deserve different redundancy spend.
  4. Walk the propagation column for cascade interlocks. Most catastrophic outages here are not the first fault — they are a single fault that defeated a shared mitigation (one CDU, one fuel header, one protection setting) and took an entire fault domain with it.
  5. Pair detection latency against propagation speed for each mode. Where the fault propagates faster than your detection-plus-actuation loop (load-step→grid trip, HBM runaway), the only viable mitigation is preventive or inertial, never reactive — design accordingly.

This appendix consolidates the failure modes scattered across the engineering chapters into one uniform FMEA register. It is referenced from the resilience-standards chapter (Chapter 12.1), the reliability rethink (Chapter 12.2), the component-failure-rate chapter (Chapter 14.3), and the integrated-systems-test chapter (Chapter 13.6); and it is consumed directly by the quantitative availability model in Chapter 12.5, which draws its top events and common-cause couplings from the rows below. The canonical engineering of each mode lives in the chapter named in the right-most column — this catalog is the index and the cross-walk, deliberately dense and scannable, not the derivation.

The organizing discipline is that every mode is treated as dual-use: a coolant leak, a synchronized load step, a BESS cell vent, a fiber cut — each is reachable both as a random fault and as an attacker's objective. The two paths share an end-state, so they share a row; where they diverge is in the trigger and in which mitigation defeats them. A defense that addresses only the stochastic path (a redundant pump) but not the induced path (a malicious firmware load that disables both pumps in lockstep) is, for FMEA purposes, an incomplete mitigation. The cyber-physical attack tree that maps these induced paths is Chapter 11.10; this appendix flags the coupling but does not re-derive it.

How to read a row

Each failure mode is recorded against six fields, applied identically across every table so the catalog is sortable and comparable:

  • Trigger — the initiating event (random cause first; the induced/attacker path is noted where it differs materially).
  • Propagation path — how the fault spreads, and critically which shared mitigation it defeats to escalate from a local fault to a cascade.
  • Detection — the sensing modality and its characteristic latency relative to propagation speed (the decisive ratio).
  • Blast radius — the fault domain affected at full propagation: node, rack, row, hall, or campus/POI.
  • Mitigation — the preventive or containing control, classified as preventive (stops the trigger), inertial (buys ride-through time), or reactive (acts after detection).
  • Recovery — the path back to service and its characteristic time-to-restore (ETTR).
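The six fields map naturally onto a typed record, which is what makes the catalog sortable. The following is a minimal sketch, not the Guide's own tooling: the names `FailureMode`, `BlastRadius`, `CATALOG`, and `by_blast_radius` are hypothetical, the three entries are heavily abbreviated, and collapsing each mode to a single worst-case fault domain is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class BlastRadius(IntEnum):
    """Fault domains from the blast-radius field, smallest to largest."""
    NODE = 1
    RACK = 2
    ROW = 3
    HALL = 4
    CAMPUS = 5


@dataclass(frozen=True)
class FailureMode:
    """One catalog row: the six fields plus the mode name and owning chapter."""
    name: str
    trigger: str
    propagation: str
    detection: str
    blast_radius: BlastRadius  # assumption: one worst-case domain per mode
    mitigation: str
    recovery: str
    owner: str                 # owning chapter number, e.g. "5.11"


# Abbreviated, illustrative entries; the text fields are shortened on purpose.
CATALOG = [
    FailureMode("HBM thermal runaway", "cold-plate contact loss",
                "ECC errors, then job stall", "per-die thermal telemetry",
                BlastRadius.NODE, "burn-in, hot spares",
                "evict node, resume from checkpoint", "14.3"),
    FailureMode("Simultaneous-GPU-load-step grid trip", "lockstep GPU ramp",
                "ramp at the POI, then protection trip", "PQ metering at the POI",
                BlastRadius.CAMPUS, "chip-to-BBU-to-BESS smoothing spine",
                "re-energize per utility procedure", "4.5"),
    FailureMode("CDU / pump failure", "pump bearing or VFD failure",
                "flow loss, then thermal throttle", "pump tach, CDU delta-P",
                BlastRadius.ROW, "N+1 pumps, BBU-backed pump power",
                "fail over to standby pump", "5.11"),
]


def by_blast_radius(modes):
    """Largest fault domain first, so campus-level modes surface at the top."""
    return sorted(modes, key=lambda m: m.blast_radius, reverse=True)


for mode in by_blast_radius(CATALOG):
    print(f"{mode.blast_radius.name:<6} {mode.name} (Chapter {mode.owner})")
```

Ordering by fault domain is only one useful sort; the same record could equally be keyed on owning chapter or on mitigation class.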

Where propagation outruns detection-plus-actuation, the only effective controls are preventive or inertial. That ratio is called out per mode in the notes because it tells you whether a mitigation must exist before the fault or can be deployed after it.
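The decisive ratio can be written down directly. The sketch below assumes each mode can be summarized by three characteristic times; the function name `viable_control_classes` and every number in the usage lines are hypothetical placeholders, not measured values.

```python
def viable_control_classes(propagation_s, detection_s, actuation_s):
    """Return (ratio, classes) for one failure mode.

    ratio = (detection + actuation) / propagation. A ratio of 1 or more
    means the fault wins the race, so a reactive control cannot be the
    primary defense and only preventive or inertial controls remain.
    """
    if propagation_s <= 0:
        raise ValueError("propagation_s must be positive")
    ratio = (detection_s + actuation_s) / propagation_s
    classes = ["preventive", "inertial"]
    if ratio < 1.0:
        classes.append("reactive")
    return ratio, classes


# Hypothetical timings: a millisecond-scale load step versus a slow fuel depletion.
fast_ratio, fast_classes = viable_control_classes(
    propagation_s=0.005, detection_s=0.02, actuation_s=0.1)
slow_ratio, slow_classes = viable_control_classes(
    propagation_s=600.0, detection_s=5.0, actuation_s=60.0)

print(f"load step:      ratio {fast_ratio:.1f} -> {fast_classes}")
print(f"fuel depletion: ratio {slow_ratio:.2f} -> {slow_classes}")
```

With these placeholder timings the load step leaves no room for a reactive control, while the slow fuel depletion does.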

Master FMEA catalog — thermal & mechanical (cooling) modes

Cooling-system failure modes
Failure mode | Trigger | Propagation path | Detection | Blast radius | Mitigation | Recovery | Owner
Coolant-leak cascade | QD/manifold/cold-plate breach, hose chafe, gasket creep; induced: spoofed dew-point setpoint forcing condensation | Local drip → conductive coolant on busbar/PDU → arc/short → de-rate or trip of the powered branch; if it reaches a shared CDU controller, the whole CDU loop and its row drop together | Floor/leak-rope sensors + CDU flow/pressure-decay; latency seconds-to-minutes; propagation can outrun it on a high-flow breach | Rack → row (if the CDU is shared); hall if isolation valves are absent | Negative-pressure loops (leak draws air in, not coolant out); dripless UQDs; zoned isolation valves; per-rack leak detection; N+1 CDU with independent controllers (preventive + reactive) | Isolate the branch, drain/flush the loop, replace the failed coupling, re-pressure-test, re-fill, re-commission worst-case branch; ETTR hours per rack | 5.11
CDU / pump failure | Pump bearing/VFD failure, seal loss, filter blockage, control-board fault; induced: malicious VFD firmware or controller DoS | Single-pump loss halves flow → cold-plate ΔT rises → GPU thermal throttle (up to ~50%) → if the CDU has no inertia and no standby pump, the served racks ride the thermal cliff in seconds (no chilled-water mass to coast on) | Pump tach, ΔP across the CDU, coolant supply temp; fast (sub-second on flow), but there is no thermal inertia to absorb the gap | Rack(s) on the affected CDU → row | N+1 pumps inside the CDU; UPS/BBU-backed pump power for ride-through; redundant independent CDUs per row; flow-failure auto-throttle of GPUs as a graceful-degrade floor (inertial + reactive) | Fail over to standby pump/CDU (sub-second to seconds if hot-standby), then RMA the failed pump off the critical path; ETTR minutes if N+1, hours if not | 5.11
Cooling-controls transient excursion | Synchronized GPU load drop (job ends / checkpoint pause) → loop heat input collapses faster than valves/VFDs can slew; setpoint hunt / control-loop oscillation | On a rapid load drop, supply-coolant temp overshoots downward → transient dew-point excursion → condensation risk on cold surfaces; or anti-hunting failure drives sustained oscillation that fatigues actuators and destabilizes neighboring loops | Coolant supply-temp rate-of-change, dew-point margin sensor, valve-position hunting; detectable but the excursion window is brief | Row → hall (controls coupling); condensation risk is local to cold surfaces | Slew-rate limits on control valves and pump VFDs; anti-hunting tuning; dew-point margin floor; coordinate cooling setpoints with the rack BBU/BESS load-smoothing spine (preventive) | Re-tune control loops, restore dew-point margin, dry/inspect any condensation; ETTR minutes, no hardware loss if caught | 5.12
HBM thermal runaway | Cold-plate contact loss, TIM pump-out, local flow starvation, or sustained over-temp on a stacked-DRAM site; induced: CDU disablement holding flow at zero | HBM junction temp climbs → ECC error rate rises → uncorrectable error / package damage; on a tightly-coupled training step the failed device stalls the synchronous collective and the whole job stalls behind the straggler | Per-die thermal telemetry, ECC/CE rate trend, GPU throttle flags; trend-detectable early, but runaway is fast once contact is lost | Node (the GPU/HBM package) → job (synchronous training stalls on the straggler) | Thermal screening/burn-in pre-deployment; ECC-rate alarming with proactive drain; flow-failure throttle floor; hot-spare nodes so the scheduler evicts and replaces the straggler (preventive + reactive) | Evict the node, fail the job over to a hot spare, RMA the package; training resumes from last checkpoint (ETTR = checkpoint interval + restart) | 14.3
Coolant chemistry/flow envelope per ASHRAE TC 9.9 (5th ed.) liquid-cooling guidelines and OCP Liquid Cooling white papers; CDU/QD practice per Vertiv/nVent/Equinix. Owning chapters in last column. Dual-use induced path per Chapter 11.10.

The cooling modes share a defining property absent from legacy air-cooled halls: direct-to-chip liquid loops have almost no thermal inertia. A chilled-water plant coasts for minutes on the mass of water in the system; a DLC technology-cooling loop sized to a tight delta-T coasts for seconds. That removes the operator's reaction window for the pump/CDU and HBM modes: detection latency that would be acceptable on air is fatal on liquid. This is why the mitigations for those modes lead with preventive controls (screen it out before deployment) or inertial ones (BBU-backed pumps, a flow-failure throttle floor), and use reactive controls only as a backstop rather than as the primary defense. The disappearance of chilled-water inertia is treated as a first-class reliability problem in Chapter 12.2 and engineered in Chapter 5.11.
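The minutes-versus-seconds contrast follows from a lumped heat balance, t = m · c_p · ΔT / P. The sketch below is a back-of-envelope illustration only: the coolant masses, allowable temperature rises, and 1 MW heat load are assumed round numbers, and the model ignores pipe and cold-plate thermal mass and assumes perfectly mixed coolant.

```python
WATER_CP_J_PER_KG_K = 4186.0  # specific heat of water; glycol mixes are somewhat lower


def coast_time_s(coolant_mass_kg, allowable_rise_k, heat_load_w,
                 cp=WATER_CP_J_PER_KG_K):
    """Seconds until the coolant warms by allowable_rise_k with heat rejection lost."""
    if heat_load_w <= 0:
        raise ValueError("heat_load_w must be positive")
    return coolant_mass_kg * cp * allowable_rise_k / heat_load_w


# Assumed: a DLC loop holding about 200 kg of coolant with 10 K of headroom.
dlc_s = coast_time_s(coolant_mass_kg=200.0, allowable_rise_k=10.0, heat_load_w=1.0e6)

# Assumed: a chilled-water plant holding about 100 t of water with 5 K of headroom.
chw_s = coast_time_s(coolant_mass_kg=100_000.0, allowable_rise_k=5.0, heat_load_w=1.0e6)

print(f"DLC loop coast time:      {dlc_s:.1f} s")
print(f"Chilled-water coast time: {chw_s / 60:.0f} min")
```

Under these assumptions the DLC loop coasts for roughly 8 seconds and the chilled-water plant for roughly 35 minutes at the same 1 MW load, which is the gap the preventive and inertial controls have to cover.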

Master FMEA catalog — electrical & power modes

Power-chain & grid-interface failure modes
Failure mode | Trigger | Propagation path | Detection | Blast radius | Mitigation | Recovery | Owner
Simultaneous-GPU-load-step grid trip | Thousands of GPUs ramp in lockstep at job start/stop/checkpoint; di/dt event on every step; induced: malicious power-cap firmware forcing a synchronized step | Aggregate ramp >1,000 MW/s presented to the POI → voltage/frequency disturbance → if the load-smoothing spine is absent, upstream protection or generators see a step they cannot follow → trip | Power-quality metering at the POI, PMU/PQM; fast, but the di/dt event (milliseconds) is faster than any reactive control | Campus (POI) → contributes to wide-area grid disturbance | The chip→BBU→BESS smoothing spine (on-package capacitance → rack BBU → facility BESS); software ramp-rate limits and regulated wind-downs; grid-forming inverters (preventive + inertial) | Re-energize per utility ride-through procedure; restore smoothing controls; no hardware loss if the spine held; ETTR minutes if ride-through succeeded | 4.5
Utility ride-through / voltage-disturbance event | Grid fault (e.g. 230 kV line fault) causes a voltage sag at the POI; sensitive customer-side protection drops the load | Undervoltage trip of the facility load → instantaneous multi-MW load loss (~1,500 MW seen on a single fault) → the loss itself destabilizes the grid, a self-reinforcing reliability problem NERC flagged at Level 3 | POI relays, undervoltage/under-frequency elements, PMU; the disturbance is sub-cycle to cycles | Campus (full load drop) → wide-area grid | Fault-ride-through settings tuned to stay online through the sag (SEL/relay, UPS, undervoltage-load-retention); reactive/voltage support toward the POI; compliance with TPL-001/PRC ride-through standards (preventive) | Auto-recover as the sag clears if ride-through held; if tripped, sequenced re-energization and load ramp; ETTR minutes | 4.10
BESS thermal runaway | Cell defect, overcharge, internal short, or cooling loss in an LFP facility battery; induced: BMS spoof disabling cell balancing/thermal protection | Single cell vents → exothermic chain to adjacent cells → module-level runaway → fire/off-gas if pack-level isolation and venting fail; loss of the BESS also removes the ride-through and load-smoothing it was providing | Cell voltage/temp telemetry, off-gas (H2/CO) detection, BMS fault flags; early-warning gas detection precedes thermal runaway by a useful margin | BESS enclosure → adjacent enclosures/room if propagation isolation fails | LFP chemistry (higher thermal-runaway threshold than NMC); module-level thermal isolation and dedicated venting/deflagration paths; off-gas detection with pre-emptive isolation; physical separation of BESS from IT (preventive + reactive) | Isolate and let the affected module burn out safely within its enclosure; replace module/pack; re-commission; ETTR hours-to-days; ride-through reverts to UPS/BBU meanwhile | 4.5
Fuel-supply interruption (on-site generation) | Firm pipeline curtailment (correlated cold-snap), valve/compressor failure, or fuel-quality (Wobbe/dew-point) excursion; LNG/CNG storage depletion | Loss of fuel → on-site turbines/engines de-rate or trip → if the site is islanded or grid-import is constrained, generation cannot meet IT load → controlled load-shed or outage | Fuel header pressure, Wobbe-index/dew-point analyzers, tank level, generator load; minutes of warning on slow depletion, immediate on a hard cut | Campus (islanded sites) → partial if grid-import backstops | 'Synthetic-firm' fuel structure (multiple pipelines + interruptible + on-site LNG/CNG storage); dual-fuel switching; fuel conditioning to spec; sized on-site storage for correlated-curtailment duration (preventive) | Switch fuel source / draw down on-site storage, restore generation; coordinate curtailment with curtailable-load agreement; ETTR depends on storage sizing vs outage duration | 4.9
Load-step magnitudes: synchronized GPU draw can swing 30%→100% in milliseconds, aggregating to >1,000 MW/s at GW scale (NVIDIA/Microsoft/OpenAI joint findings, 2025). The ~1,500 MW instantaneous load-loss on a 230 kV fault is the NERC Level-3 alert motivating case. Owners in last column.
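The ramp figure is simple arithmetic on the swing. As a hedged illustration, the sketch below takes the 30%→100% swing from the note above and assumes a 1,000 MW campus and a 100 ms aggregate swing; both of those values, and the function names, are hypothetical.

```python
def ramp_rate_mw_per_s(campus_mw, low_frac, high_frac, swing_s):
    """Aggregate ramp rate when the whole campus steps between two load fractions."""
    if swing_s <= 0:
        raise ValueError("swing_s must be positive")
    return campus_mw * (high_frac - low_frac) / swing_s


def longest_swing_reaching(campus_mw, low_frac, high_frac, threshold_mw_per_s):
    """Longest swing duration (s) that still produces the threshold ramp rate."""
    return campus_mw * (high_frac - low_frac) / threshold_mw_per_s


# Assumed: 1,000 MW campus, 30% -> 100% swing completing in 100 ms.
ramp = ramp_rate_mw_per_s(campus_mw=1000.0, low_frac=0.30, high_frac=1.00, swing_s=0.1)

# How slow could the same 700 MW step be and still reach 1,000 MW/s?
limit_s = longest_swing_reaching(1000.0, 0.30, 1.00, threshold_mw_per_s=1000.0)

print(f"ramp: {ramp:,.0f} MW/s; 1,000 MW/s is reached for any swing under {limit_s:.2f} s")
```

On these assumptions a GW-scale campus reaches the 1,000 MW/s mark whenever the synchronized step completes in under about 0.7 s, so a millisecond-scale swing clears it by a wide margin.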

  • 30%→100% in ms: synchronized GPU power swing per load step; aggregates to >1,000 MW/s ramp at GW scale (2025; NVIDIA/Microsoft/OpenAI joint findings via Oracle OCI; arXiv AI Load Dynamics)
  • ~1,500 MW: instantaneous data-center load loss on a single 230 kV fault, the NERC Level-3 alert motivating case (2026; NERC Level-3 Essential Actions Alert; Utility Dive / Data Center Frontier)
  • 1 / 3 hr: mean interruption rate on Meta's 16,384-GPU H100 Llama 3 cluster; 466 interruptions (419 unexpected) over a 54-day window (2024; Meta, The Llama 3 Herd of Models)
  • 30.1% / 17.2%: share of training interruptions from faulty GPUs / HBM3 memory; network switch+cable 8.4%; >90% effective training time maintained (2024; Meta Llama 3 paper; Tom's Hardware / DCD analysis)
  • up to ~50%: GPU throttle when DLC coolant flow/temp leaves the envelope (sub-25 °C inlet, ~20 L/min, <10 °C rise across cold plates) (2025; NVIDIA GB200 NVL72 thermal envelope; Vertiv 360AI reference design)
  • ~1.8–1.9 L/kWh: industry-avg evaporative WUE (best-in-class 0.1–0.7; Microsoft FY2025 0.30); curtailment risk per ERCOT SB6 75 MW kill-switch regime (2025; Microsoft FY2025 sustainability; ERCOT SB6 / NPRR)
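The "1 / 3 hr" interruption figure can be checked from the counts quoted with it. This is a small arithmetic sketch using only the 54-day window and the 466 / 419 interruption counts; the variable names are illustrative.

```python
WINDOW_DAYS = 54
TOTAL_INTERRUPTIONS = 466
UNEXPECTED_INTERRUPTIONS = 419

window_hours = WINDOW_DAYS * 24  # 1296 hours

# Mean time between interruptions, counting all interruptions or only unexpected ones.
mtbi_all_h = window_hours / TOTAL_INTERRUPTIONS
mtbi_unexpected_h = window_hours / UNEXPECTED_INTERRUPTIONS

print(f"window: {window_hours} h")
print(f"mean time between interruptions (all):        {mtbi_all_h:.2f} h")
print(f"mean time between interruptions (unexpected): {mtbi_unexpected_h:.2f} h")
```

Both ways of counting land close to one interruption every three hours (roughly 2.8 h for all interruptions and 3.1 h for unexpected ones), which is the rate the checkpoint cadence has to be sized against.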

Master FMEA catalog — connectivity, compute & data-integrity modes

Network, optics, compute & data-integrity failure modes
Failure mode | Trigger | Propagation path | Detection | Blast radius | Mitigation | Recovery | Owner
Fiber cut (inter-/intra-DC) | Backhoe/construction strike, conduit failure, or DCI route cut; induced: deliberate cut of an un-diverse route | Loss of a fiber path → if scale-across/DCI is single-routed, the affected campus or training partition is severed → distributed training stalls or a metro site loses connectivity | Optical LOS/LOF alarms, OTDR, BER collapse; immediate at the physical layer | Link → partition/campus (distributed training) or metro site reach (inference) | Physically diverse, geographically separated fiber routes; protected DCI (ZR/ZR+ with restoration); for training, topology that degrades gracefully on a partition loss (preventive + reactive) | Restore over the diverse path automatically; physical splice repair on the cut route (ETTR hours-to-days for the splice, seconds for protected failover) | 3.6
Optics flap storm | Marginal transceiver, dirty/over-bent connector, thermal cycling, or firmware bug causing repeated link up/down; induced: thermal attack on the optics environment | One flapping link → routing reconvergence churn / ECMP rehashing → packet loss and tail-latency spikes propagate across the fabric → on a tightly-coupled collective, the flapping link gates the whole all-reduce and tanks MFU | Per-port link-flap counters, BER/FEC-error trend, CRC errors; trend-detectable but a storm builds in seconds-to-minutes | Link → fabric pod (reconvergence churn) → job (collective stalls) | Pre-install optics burn-in / BER screening off the critical path; FEC margin headroom; auto-quarantine of flapping ports; CPO to remove pluggable failure points where deployed (preventive + reactive) | Quarantine and replace the offending transceiver; let routing reconverge; ETTR minutes to swap, plus reconvergence | 8.9
SDC corruption event (silent data corruption) | Marginal/defective silicon (a 'mercurial core'), aging, voltage/thermal margin loss producing a wrong result with no error flag; induced: fault-injection on a known-marginal device | A miscomputed value flows silently into gradients/activations → corrupts the model state or an inference result → undetected for hours-to-days, potentially poisoning a checkpoint and forcing a roll-back of all work since the last clean checkpoint | No native hardware flag; requires a dedicated detection program (Fleetscanner periodic, Ripple in-fleet, Hardware Sentinel runtime); latency hours unless instrumented | Node (the mercurial core) → job/model (silent corruption of state) → potentially every downstream consumer of a poisoned checkpoint | Proactive SDC-hunting at scale; redundant/checksummed compute on critical paths; PVF-aware placement; roll-back to a verified-clean checkpoint; quarantine the device (preventive + reactive) | Identify and quarantine the mercurial core, roll back to last verified-clean checkpoint, re-run; ETTR = work since clean checkpoint (the reason hyper-frequent checkpointing pays off) | 14.3
Water-curtailment event | ISO/utility curtailment order (ERCOT SB6 kill-switch class), drought/basin restriction, or reclaimed-water supply interruption on an evaporative-cooled site | Loss/limit of make-up water → evaporative/cooling-tower capacity drops → heat-rejection ceiling falls below IT load → forced de-rate of compute or a curtailment-driven load-shed | Make-up water flow/level, WUE telemetry, basin level, curtailment-order receipt; advance notice on scheduled curtailment, immediate on a hard order | Hall → campus (heat-rejection limited) | Closed-loop / zero-evaporation cooling design (designs water out of the risk); reclaimed/non-potable sourcing; on-site water storage; curtailment-tolerant workload scheduling (batch defers); thermal-storage buffer (preventive) | Shift heat rejection to the closed-loop/dry path or draw down water storage; defer curtailable batch load; ETTR = curtailment duration; structural fix is closed-loop conversion | 3.7
SDC detection program per Meta (Fleetscanner/Ripple/Hardware Sentinel) and OCP SDC-in-AI white paper; optics reliability per SemiAnalysis/NVIDIA CPO analyses; fiber/latency per Chapter 3.6. Owners in last column.

Three of these four modes are distinguished by a long detection latency relative to a slow-burning blast radius — the inverse of the cooling and load-step modes. An SDC event can poison a checkpoint and sit undetected for days; an optics flap can quietly erode MFU before anyone correlates the tail-latency to a single marginal transceiver; a fiber cut on a poorly-instrumented diverse path can go unnoticed until the protect path also fails. For this class the decisive investment is detection, not faster actuation: a dedicated SDC-hunting program, per-port flap/FEC telemetry, and verified-clean checkpointing convert an invisible, unbounded-blast-radius fault into a bounded, recoverable one. The empirical fault taxonomy and the AFRs that feed these rows into the availability model are the canonical content of Chapter 14.3; the SDC detection program is treated there and demonstrated at commissioning in Chapter 13.6.
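Per-port flap telemetry with auto-quarantine can be as simple as a sliding-window counter. The sketch below is one possible shape for it, not a vendor or fabric-manager API: the class `FlapDetector`, its thresholds, and the port names are all hypothetical.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Quarantine a port once it flaps max_flaps times within window_s seconds."""

    def __init__(self, max_flaps=5, window_s=60.0):
        self.max_flaps = max_flaps
        self.window_s = window_s
        self._events = defaultdict(deque)  # port -> timestamps of recent flaps
        self.quarantined = set()

    def record_flap(self, port, t_s):
        """Record one link up/down transition; return True if the port is quarantined."""
        if port in self.quarantined:
            return True
        events = self._events[port]
        events.append(t_s)
        # Drop flaps that have aged out of the sliding window.
        while events and t_s - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.max_flaps:
            self.quarantined.add(port)
        return port in self.quarantined


detector = FlapDetector(max_flaps=3, window_s=10.0)

# A marginal transceiver flapping three times in eight seconds is quarantined.
storm = [detector.record_flap("pod1/port7", t) for t in (0.0, 4.0, 8.0)]

# A port that flaps once every twenty seconds never fills the window.
sparse = [detector.record_flap("pod1/port9", t) for t in (0.0, 20.0, 40.0)]

print(storm, sparse, sorted(detector.quarantined))
```

The window and threshold trade false quarantines against storm build-up time; since the catalog notes a storm builds in seconds-to-minutes, the window would likely need to sit at the short end of that range.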

Common-cause couplings & cascade interlocks

The most useful output of this catalog for the availability model is not the per-mode rows but the couplings between them: the shared resources whose failure makes two independent-looking modes fail together. These are the common-cause terms that an RBD or fault tree must capture or it will badly over-state availability. The table below names the cross-mode interlocks worth modeling explicitly.

Cross-mode common-cause couplings (model these as shared events)
Shared resource | Modes it couples | Why the coupling matters | Design break
A single CDU / coolant loop | Coolant-leak cascade + CDU/pump failure + HBM runaway | One CDU serves a row; its loss simultaneously removes flow (HBM runaway) and can be the leak's escalation path, making it a single point that takes the whole row | Independent N+1 CDUs per row with separate controllers and power feeds; zoned isolation valves
The load-smoothing spine (BBU + BESS) | Load-step grid trip + BESS thermal runaway + ride-through event | The BESS absorbs the load step AND provides ride-through; a BESS runaway removes both defenses at once, so a grid disturbance during a BESS event is unmitigated | Don't make the same BESS the sole provider of smoothing and ride-through; layer UPS/BBU; physically separate BESS
A single fuel header / pipeline | Fuel-supply interruption + load-step / ride-through (generation can't follow) | A correlated cold-snap curtailment removes generation just when grid stress is highest; the failures are positively correlated, not independent | Synthetic-firm fuel (multiple pipelines + storage + dual-fuel); size storage to correlated-event duration
A single fiber route / DCI path | Fiber cut + optics flap storm (no diverse path to fail over to) | If diversity is nominal but the routes share a conduit/right-of-way, a single cut defeats the working path and the 'redundant' protect path together | True physical + geographic route diversity; verify conduit separation, not just two strands
A verified-clean checkpoint cadence | SDC corruption + HBM runaway + any job-restart mode | Every restart-based recovery (HBM eviction, node failure) and every SDC roll-back is bounded by the checkpoint interval; a stale cadence inflates the ETTR of many modes at once | Hyper-frequent / asynchronous checkpointing with integrity verification; this is the universal ETTR floor
Shared facility controls (BMS/DCIM/SCADA) | Cooling-controls excursion + CDU disablement + induced (Ch 11.10) variants of every mode | The controls plane is the common attacker target and a common random single-point: a BMS fault or spoof can trigger the dew-point, flow, and load-step modes simultaneously | Segmented OT networks, signed setpoints, out-of-band safety interlocks independent of the BMS
Each row is a single shared resource whose loss couples otherwise-independent failure modes. Feed these as common-cause / shared events into the RBD/FTA of Chapter 12.5 — independent-failure assumptions over-state availability where these exist.
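How badly an independence assumption over-states availability can be seen with the standard beta-factor approximation for a redundant pair, Q ≈ ((1 − β)·q)² + β·q. The sketch below is illustrative only: the per-unit unavailability `q` and the common-cause fraction `beta` are assumed values, not figures from this catalog or from Chapter 14.3.

```python
def redundant_pair_unavailability(q, beta):
    """Unavailability of a 1-out-of-2 pair under the beta-factor model.

    q    : unavailability of one unit (e.g. one CDU)
    beta : fraction of that unit's failures shared with its twin
           (shared controller, shared power feed, shared firmware)
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    independent_part = ((1.0 - beta) * q) ** 2  # both units fail separately
    common_cause_part = beta * q                # one shared event takes both
    return independent_part + common_cause_part


q = 1e-3  # assumed per-CDU unavailability
naive = redundant_pair_unavailability(q, beta=0.0)
coupled = redundant_pair_unavailability(q, beta=0.10)

print(f"independent assumption: {naive:.2e}")
print(f"with 10% common cause:  {coupled:.2e}")
print(f"over-statement factor:  {coupled / naive:.0f}x")
```

With these assumed inputs, a 10% common-cause fraction makes the pair about a hundred times less available than the independent calculation suggests, which is why each shared-resource row belongs in the model as its own shared event.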

Using this catalog in the availability model

The rows above are written to be lifted directly into the reliability work. The mechanical translation is:

  • Top events for the FTA come from the high-blast-radius rows — campus/POI-level modes (load-step grid trip, ride-through, fuel interruption, water curtailment) are the system-level top events the fault tree resolves down to component basic-events.
  • Basic-event rates come from the component AFRs in Chapter 14.3 (GPU, HBM, optics, pump, cell, transceiver) — this appendix names the mode; that chapter quantifies how often it fires.
  • Common-cause β-factors come from the couplings table — every shared-resource row is a common-cause term that breaks the independence assumption and must be modeled as a shared event in the RBD/Monte-Carlo of Chapter 12.5.
  • ETTR distributions come from the recovery column — and note that the checkpoint cadence is the dominant ETTR term across the largest number of modes, which is why goodput-centric reliability (Chapter 12.2) prioritizes checkpoint speed over facility nines for training.
  • IST demonstration cases come from the propagation column — Level-5 integrated systems testing (Chapter 13.6) should demonstrate the failover for the worst cascade rows (CDU loss, BESS event, load-step ride-through) before the factory carries production load.
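The dominance of the checkpoint term can be made concrete with Young's first-order approximation for the checkpoint interval, T ≈ √(2 · C · M), where C is the checkpoint write time and M the mean time between interruptions. This is a textbook approximation brought in for illustration, not the model of Chapter 12.5; the 60 s write time and 300 s restart time are assumed, and the 3-hour interruption interval is taken from the Llama 3 figure quoted earlier.

```python
import math


def young_interval_s(checkpoint_write_s, mtbi_s):
    """Young's approximation for the checkpoint interval that minimizes lost time."""
    return math.sqrt(2.0 * checkpoint_write_s * mtbi_s)


def lost_fraction(interval_s, checkpoint_write_s, restart_s, mtbi_s):
    """First-order fraction of wall-clock time not spent on useful training.

    Valid only while interval_s is much shorter than mtbi_s.
    """
    write_overhead = checkpoint_write_s / interval_s
    # On average an interruption lands halfway through an interval.
    rework = (interval_s / 2.0 + restart_s) / mtbi_s
    return write_overhead + rework


MTBI_S = 3 * 3600.0  # one interruption per ~3 h
WRITE_S = 60.0       # assumed synchronous checkpoint write time
RESTART_S = 300.0    # assumed job restart time

t_opt = young_interval_s(WRITE_S, MTBI_S)
loss_opt = lost_fraction(t_opt, WRITE_S, RESTART_S, MTBI_S)
loss_hourly = lost_fraction(3600.0, WRITE_S, RESTART_S, MTBI_S)

print(f"interval: {t_opt / 60:.0f} min; lost {loss_opt:.1%} vs {loss_hourly:.1%} hourly")
```

Under these assumptions the interval comes out near 19 minutes and loses roughly 13% of wall-clock time, against roughly 21% for hourly checkpoints. Shrinking the write time (for example with asynchronous checkpointing) shortens the best interval further.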

Worked example: tracing the coolant-leak cascade as a one-fault-to-row outage

Take the first row and walk it as the availability model would. Trigger: a quick-disconnect gasket creeps after thermal cycling and weeps coolant, by itself a single-rack maintenance event with an ETTR on the order of hours. The escalation: whether this stays a one-rack event or becomes a row outage is decided entirely by the propagation path. If conductive coolant reaches a powered busbar before the leak-rope sensor trips, it arcs and de-rates the branch; if the affected CDU controller is shared across the row and the short reaches it, the whole row's flow drops and you are now also in the CDU/pump-failure row and, seconds later, the HBM-runaway row. That is three catalog rows from one gasket, because a single common resource (the shared CDU) was defeated.

The break: negative-pressure loops mean a breach draws air in rather than pushing coolant onto live electrical gear, removing the arc path; per-rack leak detection with zoned isolation valves contains the weep to one rack; and independent N+1 CDUs per row mean the shared-controller escalation cannot happen. With those three breaks, the model sees a single-rack basic event with a bounded ETTR — not a row-level top event with a common-cause coupling to two other modes. The dual-use note: the same row outage is reachable by spoofing the dew-point setpoint to force condensation onto the busbar (no gasket required), which is why the OT-segmentation mitigation in Chapter 11.10 is part of this row's complete defense, not an optional extra. Canonical engineering of the leak-detection and isolation design is Chapter 5.11.
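The walk above reduces to a short decision chain. The sketch below encodes it deterministically as an illustration; the function `trace_leak` is hypothetical, and treating each break as fully effective is an assumption (a real model would attach a failure probability to each one).

```python
def trace_leak(negative_pressure, zoned_isolation, independent_cdus):
    """Return (blast_radius, catalog_rows_entered) for one weeping quick-disconnect."""
    rows = ["Coolant-leak cascade"]

    if negative_pressure:
        # The breach draws air in, so no coolant reaches live gear: no arc path.
        return "rack", rows
    if zoned_isolation:
        # Per-rack detection plus isolation valves contain the weep to one rack.
        return "rack", rows

    # Coolant reaches a powered busbar and the branch arcs and de-rates.
    if independent_cdus:
        # No shared controller for the short to reach; only the affected
        # rack's branch is lost.
        return "rack", rows

    # The shared CDU controller is defeated: row flow drops, then HBM runaway follows.
    rows += ["CDU / pump failure", "HBM thermal runaway"]
    return "row", rows


print(trace_leak(negative_pressure=False, zoned_isolation=False, independent_cdus=False))
print(trace_leak(negative_pressure=True, zoned_isolation=True, independent_cdus=True))
```

Only the case with no breaks in place escalates to a row-level event spanning three catalog rows; any one of the three breaks holds the fault at a single rack in this simplified chain. The induced dew-point-spoof path bypasses the gasket entirely, so it is not covered by this sketch.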

This catalog is the index; the derivations live in their owning chapters. Cooling modes: leak/CDU/pump in Chapter 5.11, controls transients in Chapter 5.12, the density/thermal envelope in Chapter 5.1. Electrical modes: load-step smoothing in Chapter 4.5, grid ride-through in Chapter 4.10, fuel-supply engineering in Chapter 4.9. Connectivity/compute: fiber/latency in Chapter 3.6, optics reliability in Chapter 8.9, the SDC/hard/transient fault taxonomy and AFRs in Chapter 14.3. Water curtailment as a siting gate is in Chapter 3.7. The dual-use induced paths for every mode are in Chapter 11.10. The catalog feeds the redundancy/fault-domain framework in Chapter 12.1, the goodput-vs-availability rethink in Chapter 12.2, the quantitative RBD/FTA/Monte-Carlo model in Chapter 12.5, and the failure-mode demonstration in Chapter 13.6.