The Definitive Guide to AI Data Centers

Chapter 1.7

The Requirements-and-Consequences Matrix

Once you have named the workload archetype, the rest of the facility is no longer a menu of options — it is a forced sequence of subsystem commitments, and this chapter is the lookup table that turns one requirement into a defensible, signed design basis.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

  1. The cooling modality each hall is plumbed for — air, rear-door, or direct-to-chip liquid — which the density target sets before steel is cut and which a retrofit cannot cheaply undo.
  2. The back-end fabric blocking ratio and the GPU:CPU / GPU:memory / GPU:storage ratios, which coupling and read-bandwidth demand fix per archetype — over-provision and you strand capex, under-provision and you starve the accelerators.
  3. The storage tier and its throughput floor (checkpoint write bandwidth, data-loader read bandwidth, KV-cache capacity), mapped to the archetype's tolerance for a stalled GPU.
  4. The redundancy tier — N, N+1, or 2N/Tier-IV — mapped to whether the workload survives a node loss via checkpoint-and-resume or loses revenue on every outage.
  5. Whether the site is scored power-first or latency-first, and the per-archetype reference design-basis sheet you freeze and sign before ordering long-lead equipment.

Chapter 1.1 established the master variable — the workload archetype — and walked the cascade qualitatively. This chapter is the engineering instrument that operationalizes it: a requirements-and-consequences matrix that takes one archetype and returns a concrete, numbered design basis for every subsystem that follows. Where 1.1 said "pre-training implies liquid cooling," this chapter says which inlet temperature, which flow rate, which floor-loading class, which blocking ratio, which storage throughput floor, and which redundancy tier — and names the downstream cost of each cell you fill in wrong.

The altitude here is lower than in 1.1. We move through four mappings in the order an engineer actually commits them — density to the cooling cliff, fabric and the GPU:CPU/memory/storage ratios, storage and redundancy against interruption tolerance, and siting as power-first versus latency-first — and close on the reference design-basis sheets that capture all four per archetype. Read this chapter with Chapter 1.1 open: this is the table its cascade was promising.

Mapping 1 — density to the cooling cliff

The first irreversible commitment is the cooling modality, and it is set entirely by one number: peak rack density. This is physics, not preference. Air saturates as a heat-removal medium at ~41 kW/rack under realistic containment; rear-door heat exchangers (RDHx) and air-assisted liquid push the ceiling to ~50–100 kW without facility water at the rack; past ~100 kW the only answer is direct-to-chip liquid (DLC). A GB200 NVL72 draws ~120–132 kW (at the top of that range, ~115 kW is removed by liquid and ~17 kW by residual air), which lands it firmly past the cliff. The next generation does not soften this: Rubin VR200-class racks are ~190–230 kW, and Rubin Ultra Kyber is on a ~600 kW / 800 VDC path. The density target therefore does not influence the cooling plant; it determines it.

The consequence of mis-reading the cliff is a discontinuity, not a slope. A hall scoped for 40 kW air-cooled racks cannot absorb a 132 kW DLC rack by tuning airflow — there is no containment scheme, no warmer supply air, no economizer setting that closes a ~90 kW gap. You are over the cliff, and crossing it in a retrofit costs ~$5–10M/MW while still stranding capacity, because the slab cannot bear ~3,000 lb wet racks, the plenum was never sized for liquid distribution, and facility water was never provisioned. The map below is the lookup; the engineering lives in Chapter 5.1 (the density wall) through Chapter 5.4 (DLC).

Density-to-cooling-cliff map
Rack density band | Cooling modality | Facility water at rack? | Floor / structural basis | Typical PUE band | Archetypes that land here
Up to ~41 kW | Air (containment, CRAH/in-row) | No | Standard raised floor / slab | 1.4–1.6 (legacy air) | Edge; modest-density batch inference
~41–100 kW | Rear-door HX / air-assisted liquid | No (door-level only) | Reinforced rows; brownfield-friendly | 1.2–1.4 | Online & batch inference; retrofit bridge
~100–200 kW | Direct-to-chip liquid (single-phase) | Yes: CDU + warm-water loop | Reinforced slab for ~3,000–5,000 lb wet racks | 1.05–1.15 | Pre-training; RL trainer; dense inference
~200–600 kW+ | DLC + 800 VDC; busbar-integrated liquid | Yes: high-flow, tight delta-T | Purpose-built; pipe-rack & knockout headroom | 1.05–1.10 | Frontier pre-training (Rubin / Kyber class)
Air ceiling per ASHRAE TC 9.9 / SemiAnalysis; rack-density figures are 2026-current NVIDIA-class reference points (see the key numbers for sources and vintages). PUE bands are design values, not annualized.
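As an illustration, the map can be read as a step function of peak rack density. The sketch below is one minimal way to encode it, assuming the band edges in the table (41, 100, and 200 kW) are treated as hard thresholds; the names `BAND_CEILINGS_KW`, `MODALITIES`, and `cooling_modality` are hypothetical and not part of any standard.

```python
from bisect import bisect_left

# Upper edge (kW/rack) of each density band in the map, taken from the table.
# Treating them as hard thresholds is a simplifying assumption.
BAND_CEILINGS_KW = [41, 100, 200]

# One modality per band; the last band is open-ended (200 kW and up).
MODALITIES = [
    "air",
    "rear-door HX / air-assisted liquid",
    "direct-to-chip liquid",
    "DLC + 800 VDC",
]


def cooling_modality(peak_rack_kw: float) -> str:
    """Return the cooling modality band for a peak rack density in kW."""
    if peak_rack_kw <= 0:
        raise ValueError("rack density must be positive")
    # bisect_left keeps a rack that sits exactly on a ceiling in the lower band.
    return MODALITIES[bisect_left(BAND_CEILINGS_KW, peak_rack_kw)]


if __name__ == "__main__":
    for kw in (40, 132, 230):
        print(kw, "kW ->", cooling_modality(kw))
```

Under these assumptions a 40 kW rack resolves to air and a 132 kW GB200 NVL72 to direct-to-chip liquid. The step shape is the point: there is no intermediate output between bands, which is why a mis-read density is a discontinuity rather than a tuning problem.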

Mapping 2 — fabric sizing and the system ratios

The second mapping is the network, and it has two parts: the blocking ratio of the back-end (scale-out) fabric, set by coupling, and the system composition ratios — GPU:CPU, GPU:memory, GPU:storage — set by the archetype's host-side and data-path demands. Both are decisions where the wrong answer wastes money in opposite directions: over-build and you pay for bandwidth that never carries traffic; under-build and you starve the accelerators you spent the most on.

Coupling sets the blocking ratio. A synchronous pre-training job spends a large fraction of every step in collectives (all-reduce, all-gather, reduce-scatter), so the back-end must be 1:1 non-blocking, typically an 8-rail-optimized fat-tree. Oversubscribe it and the all-reduce stalls, dragging model FLOPs utilization down across the whole job. Loosely coupled inference fits inside a node or a small scale-up domain, so 2:1–3:1 oversubscription is fine and cuts back-end cost by a reported ~31% at 2:1 (a single-source, contested figure); Meta has run 7:1 on a 24k-H100 inference fleet. Sizing a non-blocking fabric for an inference business is the cleanest example of a self-inflicted anti-pattern: bisection bandwidth the requests never use. → Chapter 8.5 (topology & oversubscription), Chapter 8.4 (protocols).
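To make the ratio concrete: at a leaf switch, oversubscription is the host-facing bandwidth divided by the spine-facing bandwidth. The sketch below computes it for a hypothetical 64-port leaf; the port splits, the 400 Gb/s default, and the function names are illustrative assumptions, not figures from this chapter.

```python
from fractions import Fraction


def oversubscription(down_ports: int, up_ports: int,
                     down_gbps: int = 400, up_gbps: int = 400) -> Fraction:
    """Leaf oversubscription = host-facing bandwidth / spine-facing bandwidth."""
    if down_ports <= 0 or up_ports <= 0:
        raise ValueError("port counts must be positive")
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


def is_non_blocking(ratio: Fraction) -> bool:
    # 1:1 or better: every host-facing bit has an uplink bit behind it.
    return ratio <= 1


if __name__ == "__main__":
    # Hypothetical 64-port leaf, split two ways.
    training_leaf = oversubscription(down_ports=32, up_ports=32)
    inference_leaf = oversubscription(down_ports=48, up_ports=16)
    print(f"training leaf  {training_leaf}:1, non-blocking={is_non_blocking(training_leaf)}")
    print(f"inference leaf {inference_leaf}:1, non-blocking={is_non_blocking(inference_leaf)}")
```

The same switch serves 32 hosts at 1:1 or 48 hosts at 3:1, which is where the back-end saving for loosely coupled workloads comes from: fewer uplinks, spine ports, and optics per attached GPU.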

The composition ratios are archetype-specific and shifting. Training historically ran ~8 GPU:1 CPU; agentic inference — with host-side sandbox execution, retrieval, tool calls, and RL rollouts — is pulling that toward ~4–8 GPU:1 CPU and lower, which changes the host BOM and the node power budget. GPU:memory is set by per-GPU HBM (H100 80 GB → B200 192 GB → B300 288 GB → Rubin Ultra ~1 TB) plus host RAM, and inference is increasingly KV-cache-bound rather than weight-bound. GPU:storage is a bandwidth ratio, not a capacity one: it is fixed by checkpoint write speed for training and data-loader read speed for both — which is exactly where Mapping 3 begins.

Fabric and system-ratio sizing by archetype
Archetype | Back-end blocking ratio | GPU:CPU (host) | Dominant memory pressure | Storage demand profile
Pre-training | 1:1 non-blocking, 8-rail fat-tree | ~8:1 (compute-dense host) | HBM for activations; host RAM for staging | Burst checkpoint writes; high sustained read for data loader
Post-training / RL | Disaggregated: tight trainer, tolerant rollout pool | Mixed: more CPU on rollout side | KV-cache on rollouts; HBM on trainer | Rollout reads + trainer checkpoints; staleness-tolerant
Online inference | 2:1–3:1 oversubscribed | ~4–8:1, falling (agentic host work) | KV-cache capacity & bandwidth | Model-weight load; KV-cache tiering to NVMe/CXL
Batch inference | Heavily oversubscribed; cost-optimized | Flexible | Throughput over latency; large batches | Throughput reads; no low-latency requirement
Edge inference | Minimal (single node / WAN backhaul) | Constrained by appliance | Single-model resident; small KV | Local model store; periodic sync
Blocking ratios and GPU:CPU norms per SemiAnalysis AI Neocloud Playbook and TrendForce; storage bandwidth bands are practitioner design floors. Figures are 2026-current reference points, not vendor minimums.

Key numbers
Figure | What it measures | Vintage | Source
~41 kW | Practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 100–200 kW+ | 2025 | ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
120–132 kW | Per GB200 NVL72 rack (~115 kW liquid + ~17 kW air); GB300 ~142 kW; Rubin Ultra Kyber ~600 kW | 2026 | NVIDIA OCP / SemiAnalysis roadmap
20–25 °C / ~80 L/min | GB200 NVL72 DLC inlet & flow; deviation can throttle GPUs up to ~50% | 2025 | NVIDIA OCP / Introl
1:1 vs 2:1–3:1 | Training non-blocking vs inference oversubscribed; 2:1 cuts back-end cost ~31% (contested; single-source); Meta ran 7:1 on 24k H100 | 2025 | SemiAnalysis AI Neocloud Playbook / Meta
~8:1 → 4–8:1 | GPU:CPU ratio shifting from training-era norm toward agentic-inference host demand | 2026 | TrendForce Insights; Introl
~$5–10M/MW | Full AI liquid retrofit cost crossing the cooling cliff; still strands capacity | 2026 | Introl / Vera Rubin deployment analysis
Tier III 99.982% / Tier IV 99.995% | ~1.6 hr/yr vs ~26 min/yr downtime; Tier IV ~20–40% capital premium | 2025 | Uptime Institute
~90% / ~96% | Goodput (effective training time): industry avg vs best-in-class; reliability overhead 6–21% of TCO | 2025 | SemiAnalysis ClusterMAX / CoreWeave
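The Tier III and Tier IV downtime figures follow directly from the availability percentages. A minimal worked check, assuming a non-leap 8,760-hour year (the function name is hypothetical):

```python
HOURS_PER_YEAR = 8760  # non-leap year; an assumption for this arithmetic


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected downtime per year implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    tier_iii = annual_downtime_hours(99.982)
    tier_iv = annual_downtime_hours(99.995)
    print(f"Tier III: {tier_iii:.2f} h/yr")
    print(f"Tier IV:  {tier_iv * 60:.1f} min/yr")
```

This gives about 1.58 hours per year for Tier III and about 26.3 minutes per year for Tier IV, matching the rounded figures quoted. The gap between them is roughly 68 minutes a year, which is the quantity the Tier IV capital premium actually buys.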

Mapping 3 — storage and redundancy against interruption tolerance

Storage and redundancy are two consequences of the same input — the archetype's tolerance for an interrupted GPU — and they are most defensible when designed together. The question storage answers is: when does a GPU stall waiting on data, and what does that stall cost? The question redundancy answers is: when a node or a power feed fails, does the workload restart cheaply or lose money?

Storage is sized by the throughput that keeps GPUs fed, not by capacity alone. For training, the two binding flows are checkpoint write bandwidth — because a synchronous job pauses all GPUs to write a checkpoint, and slow writes are pure goodput loss — and data-loader read bandwidth, because a starved loader idles the whole pipeline. A high-bandwidth parallel file system feeding GPUDirect Storage (CPU-bypass) is the training default; this is the link that turns a storage decision into a GPU-efficiency decision (Chapter 9.1, Chapter 9.3, Chapter 9.4). For online inference, the new pressure is the KV-cache: reasoning models emit long decode sequences, inflating per-request cache, so the hierarchy now tiers KV state across HBM, host memory, and NVMe/CXL (Chapter 9.7). Batch and edge are the relaxed cases — throughput reads with no low-latency floor.
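The KV-cache pressure can be sized with the standard per-token accounting: one key and one value vector per layer per KV head. The sketch below applies it to a hypothetical 80-layer model with 8 KV heads, a head dimension of 128, and 16-bit cache entries; those dimensions are illustrative assumptions, not figures from this chapter.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache added per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV-cache footprint in GiB for a single request of `tokens` tokens."""
    total = tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return total / 2**30


if __name__ == "__main__":
    # Hypothetical model shape (assumption): 80 layers, 8 KV heads, head_dim 128, fp16.
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    print(f"{per_token / 1024:.0f} KiB per token")
    for tokens in (4_096, 32_768):
        gib = kv_cache_gib(tokens, layers=80, kv_heads=8, head_dim=128)
        print(f"{tokens} tokens -> {gib:.2f} GiB per request")
```

Under these assumptions each token costs 320 KiB, so a single 32,768-token request holds 10 GiB of cache. A few dozen such requests exceed one GPU's HBM, which is the pressure that pushes KV state down into host memory and NVMe/CXL.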

Redundancy is set by interruption tolerance, and over-building it is a recognizable waste. A synchronous training job already restarts from a checkpoint when any node fails — at best-in-class operators, MTBF is ~7 days per 512 GPUs, and Meta's Llama 3 405B run logged ~one interruption every three hours on 16,384 H100s — so the rational posture is N or N+1 plus disciplined checkpointing, not 2N. Spending on Tier-IV facility power to prevent a restart the job already tolerates buys nines the workload does not value; that capital returns more as goodput — faster checkpoint storage, hot spares, more GPUs. An always-on inference business inverts this: an outage is lost revenue and a breached SLA, so 2N / Tier-IV-class power with N+1 cooling on standby is justified. → Chapter 12.1 (redundancy topologies), Chapter 12.2 (goodput vs availability), Chapter 12.4 (goodput SLAs).

Storage and redundancy mapped to interruption tolerance
Archetype | Interruption tolerance | Binding storage flow | Storage tier | Redundancy posture
Pre-training | High: checkpoint-and-resume | Checkpoint write + loader read bandwidth | Parallel FS + NVMe; GPUDirect Storage | N or N+1; spend on checkpointing, not 2N
Post-training / RL | High: staleness-tolerant, restartable | Rollout reads + trainer checkpoints | Tiered: fast trainer FS + rollout object store | N+1; disaggregated fault domains
Online inference | Low: outage = lost revenue + SLA breach | Weight load + KV-cache bandwidth | KV tiered HBM→host→NVMe/CXL | 2N / Tier-IV + N+1 cooling on standby
Batch inference | High: queue-and-retry | Throughput reads | Object store / capacity tier | N (interruption-tolerant)
Edge inference | Site-level: fleet geo-redundancy | Local model store + periodic sync | Local NVMe; minimal | Often N; resilience via fleet-of-sites
Redundancy tiers and storage profiles are design heuristics, not rules. Goodput/MTBF figures from SemiAnalysis and the Meta Llama 3 disclosure; see the key numbers and the reliability provenance entries.

Deep dive: why checkpoint bandwidth is a redundancy decision in disguise

It is tempting to file checkpoint storage under "storage" and redundancy under "electrical," and to size them in separate workstreams. For training, that separation hides the real trade. A synchronous job's resilience strategy is checkpoint-and-resume: every node failure is absorbed by reloading the last checkpoint and replaying. The cost of that strategy is two-fold — the goodput lost while all GPUs pause to write each checkpoint, and the work re-done since the last one. Both shrink as checkpoint write bandwidth rises: faster writes mean you can checkpoint more often (less re-done work) at lower per-checkpoint cost (less pause).

So the questions "how much should we spend on facility redundancy?" and "how fast must checkpoint storage be?" are the same question asked twice. At a top operator's ~7-day MTBF per 512 GPUs and Meta's observed ~one interruption per three hours at 16k-GPU scale, the dominant resilience lever is not 2N power; it is the storage and checkpointing path that makes each interruption cheap. Capital budgeted for Tier-IV redundancy on a checkpointable cluster almost always returns more as checkpoint bandwidth, hot spares, and autonomous fault recovery. The inference case flips: there is no checkpoint to resume to, so the spend belongs in 2N power and N+1 cooling. Interruption tolerance, read once, sets both columns. → Chapter 9.4, Chapter 12.2.
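The trade can be put in numbers with Young's first-order approximation for the checkpoint interval, a textbook model that this chapter does not itself prescribe. The sketch assumes failures are independent (so cluster MTBF scales inversely with GPU count from the ~7 days per 512 GPUs reference), that every checkpoint pauses all GPUs for the full write, and that each failure costs half an interval of rework on average. The 5 TB checkpoint size and the two write bandwidths are hypothetical.

```python
import math


def cluster_mtbf_hours(gpus: int, ref_mtbf_hours: float = 7 * 24,
                       ref_gpus: int = 512) -> float:
    # Assumption: independent failures, so MTBF shrinks inversely with GPU count.
    return ref_mtbf_hours * ref_gpus / gpus


def checkpoint_write_seconds(checkpoint_tb: float, write_gb_per_s: float) -> float:
    # Pause per checkpoint if all GPUs wait for the full synchronous write.
    return checkpoint_tb * 1000 / write_gb_per_s


def optimal_interval_seconds(write_s: float, mtbf_s: float) -> float:
    # Young's approximation: tau = sqrt(2 * C * MTBF), valid when C << MTBF.
    return math.sqrt(2 * write_s * mtbf_s)


def lost_fraction(write_s: float, interval_s: float, mtbf_s: float) -> float:
    # Pause cost per interval plus expected rework (half an interval per failure).
    return write_s / interval_s + interval_s / (2 * mtbf_s)


if __name__ == "__main__":
    mtbf_s = cluster_mtbf_hours(16_384) * 3600
    for bandwidth in (50, 500):  # GB/s, hypothetical storage tiers
        write_s = checkpoint_write_seconds(5, bandwidth)
        tau = optimal_interval_seconds(write_s, mtbf_s)
        loss = lost_fraction(write_s, tau, mtbf_s)
        print(f"{bandwidth} GB/s: checkpoint every {tau / 60:.0f} min, "
              f"~{loss:.1%} of training time lost")
```

With these inputs the implied cluster MTBF at 16,384 GPUs is 5.25 hours. Raising checkpoint write bandwidth tenfold cuts the modeled loss from roughly 10% to roughly 3% of training time. Facility redundancy does not enter the model at all; it would only matter to the extent that it lengthened the MTBF term.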

Mapping 4 — siting: power-first vs latency-first

The fourth mapping is the least reversible of all — you cannot move a slab — which is why it must be derived from the workload, never chosen first and rationalized after. Latency sensitivity is the discriminator. Pre-training and batch inference are indifferent to user proximity, so they are scored power-first: chase the cheapest firm (or curtailable) megawatts and the coldest free-cooling climate, accept that the site may be hours from any metro, and treat the grid-interconnection queue slot as the scarcest asset in the project. Online and edge inference are scored latency-first: chase sub-50 ms reach to users and accept power that can cost 2–4x more, distributing capacity for proximity rather than concentrating it for cost.

The 2026 context sharpens this fork. The binding constraint is power, not chips — US large-load interconnection waits run ~3–7+ years in the densest hubs — so a power-first archetype that mis-sites near expensive, constrained metro power burns both money and a queue slot it cannot recover. A latency-first archetype sited in a cheap-power exurb, conversely, may meet its energy budget and miss its SLO, which is the more expensive miss because it loses the revenue the building exists to earn. Water availability is a hard siting gate for any liquid-cooled hall regardless of archetype (Chapter 3.7). The reordered hierarchy and the speed-to-power race are engineered in Chapter 3.1 and Chapter 3.2; the fiber/latency screen in Chapter 3.6.
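One way to keep the fork honest is to encode it as a screen: hard gates first, then a ranking whose sort key depends on the archetype. The sketch below is illustrative only. The `Site` fields, the lexicographic sort keys, and the example values are assumptions; the 50 ms budget and the water gate for liquid-cooled halls come from the text.

```python
from dataclasses import dataclass

# Archetypes the chapter scores latency-first; everything else is power-first.
LATENCY_FIRST = {"online inference", "edge inference"}


@dataclass
class Site:
    name: str
    power_usd_per_mwh: float        # hypothetical input
    user_rtt_ms: float              # round-trip time to the user population
    interconnect_wait_years: float  # grid-interconnection queue estimate
    water_available: bool


def rank_sites(archetype: str, sites: list, liquid_cooled: bool,
               rtt_budget_ms: float = 50.0) -> list:
    """Apply hard gates, then rank by the archetype's siting class."""
    # Water is a hard gate for any liquid-cooled hall, regardless of archetype.
    candidates = [s for s in sites if s.water_available or not liquid_cooled]
    if archetype in LATENCY_FIRST:
        # Latency-first: a missed latency budget disqualifies; power cost breaks ties.
        candidates = [s for s in candidates if s.user_rtt_ms < rtt_budget_ms]
        return sorted(candidates, key=lambda s: (s.user_rtt_ms, s.power_usd_per_mwh))
    # Power-first: speed to power, then price (this ordering is an assumption).
    return sorted(candidates,
                  key=lambda s: (s.interconnect_wait_years, s.power_usd_per_mwh))


if __name__ == "__main__":
    sites = [
        Site("metro", 120.0, 10.0, 6.0, True),
        Site("remote", 40.0, 80.0, 2.0, True),
        Site("dry", 30.0, 90.0, 1.0, False),
    ]
    print([s.name for s in rank_sites("pre-training", sites, liquid_cooled=True)])
    print([s.name for s in rank_sites("online inference", sites, liquid_cooled=False)])
```

For the example inputs, the liquid-cooled pre-training case drops the dry site at the water gate and ranks the remote site ahead of the metro one, while the online-inference case keeps only the metro site because the others miss the latency budget.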

The reference design-basis sheet, per archetype

The four mappings converge into a single artifact: a reference design-basis sheet per archetype that freezes the inherited assumptions before any long-lead equipment is ordered. This is the deliverable 1.1 promised under "design-basis document," now filled in. Each sheet pins one row per subsystem — density tier, cooling modality, fabric blocking ratio, system ratios, storage tier and throughput floor, redundancy topology, voltage class, and siting class — plus a reversible-vs-irreversible register recording which assumptions are hedged and which are committed. The table below is the skeleton; a real sheet attaches the numbers (the ramp curve, the MVA sizing, the CDU capacity) and the signatures.

Reference design-basis sheets (skeleton) by archetype
Subsystem | Pre-training | Online inference | Batch inference | Edge inference
Density tier | 100–600 kW (DLC) | 30–100 kW (air/RDHx/DLC) | 30–60 kW (flexible) | few kW–~30 kW
Cooling modality | DLC, warm-water loop | Air→RDHx→DLC by density | Host hall's existing (air often fine) | Air / sealed modular
Fabric blocking | 1:1 non-blocking | 2:1–3:1 oversubscribed | Heavily oversubscribed | Minimal / WAN backhaul
Storage tier | Parallel FS + GPUDirect | KV-tiered HBM→NVMe/CXL | Object / capacity tier | Local NVMe
Redundancy | N / N+1 (checkpoint) | 2N / Tier-IV + N+1 cooling | N (queue-and-retry) | N + fleet geo-redundancy
Voltage class | 415/480 VAC → 800 VDC path | 415/480 VAC | 415/480 VAC | Local LV / appliance
Siting class | Power-first (cheap, cold, big queue) | Latency-first (sub-50 ms) | Cheapest / curtailable power | Proximity over cost
A starting template, not a prescription: every figure is sized to the specific ramp curve and generation. Voltage class follows density (800 VDC paths emerge above ~200 kW/rack). Cross-check each cell against the mapping tables above.

Deep dive: reading the matrix backwards to audit an existing facility

The matrix is written forward — archetype in, design basis out — but its most useful diagnostic mode is backward. Given a facility that already exists (a hall you are evaluating to lease, retrofit, or acquire), read its observable subsystems back up the cascade and infer the archetype it was actually built for, then compare that to the workload you intend to run.

A hall with 40 kW air-cooled racks, an oversubscribed Ethernet fabric, 2N power, and a metro location is an inference building — try to run synchronous pre-training in it and you will hit the cooling cliff, starve the all-reduce, and pay for redundancy the training job does not value. A campus with 132 kW DLC racks, a non-blocking InfiniBand fabric, N+1 power, and a remote cheap-power site is a training building — run latency-sensitive inference from it and you will miss every proximity SLO while paying for bisection bandwidth the requests never use. The mismatches the backward read exposes are exactly the three anti-patterns 1.1 named: training fabric for an inference business, retrofitting past the air-cooling cliff, and over-provisioned redundancy for checkpointable jobs. The matrix is therefore both a scoping tool and a due-diligence checklist — the same table, run in two directions. → Chapter 5.10 (retrofit limits), Chapter 1.6 (procurement diligence).
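The backward read can be sketched as a comparison between a facility's observable subsystems and a reduced copy of the reference sheets. Everything below is a hypothetical simplification: the `REFERENCE` rows collapse the skeleton table into coarse labels, only two archetypes are included, and a real audit would compare numbers rather than strings.

```python
# Reduced reference rows (coarse labels); min_rack_kw is the low end of each
# archetype's density tier in the skeleton sheets.
REFERENCE = {
    "pre-training": {
        "fabric": "non-blocking", "redundancy": "N/N+1",
        "siting": "power-first", "min_rack_kw": 100,
    },
    "online inference": {
        "fabric": "oversubscribed", "redundancy": "2N",
        "siting": "latency-first", "min_rack_kw": 30,
    },
}


def mismatches(facility: dict, intended: str) -> list[str]:
    """List the subsystems where the facility departs from the intended archetype."""
    ref = REFERENCE[intended]
    found = [k for k in ("fabric", "redundancy", "siting") if facility[k] != ref[k]]
    if facility["rack_kw"] < ref["min_rack_kw"]:
        found.append("rack_kw")  # hall cannot host the archetype's density tier
    return found


def infer_archetype(facility: dict) -> str:
    """Return the archetype whose reference row the facility departs from least."""
    return min(REFERENCE, key=lambda a: len(mismatches(facility, a)))


if __name__ == "__main__":
    hall = {"rack_kw": 40, "fabric": "oversubscribed",
            "redundancy": "2N", "siting": "latency-first"}
    print("built for:", infer_archetype(hall))
    print("gaps if used for pre-training:", mismatches(hall, "pre-training"))
```

For the 40 kW metro hall described above, the sketch infers an inference building and reports all four subsystems as gaps against pre-training. Run on the 132 kW DLC campus, it infers a training building and flags fabric, redundancy, and siting against online inference.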

This chapter is the lookup table for the cascade introduced in Chapter 1.1 and deepened per archetype in Chapter 1.2 (training), Chapter 1.3 (inference), Chapter 1.4 (post-training/RL), and Chapter 1.5 (edge); the procurement fork that pairs with siting is in Chapter 1.6, and the economics that score every design-basis sheet live in Chapter 1.8. The cooling cliff is engineered in Chapter 5.1 through Chapter 5.4, with CDUs in Chapter 5.6 and retrofit paths in Chapter 5.10; the fabric blocking decision in Chapter 8.4 and Chapter 8.5; the storage flows in Chapter 9.1, Chapter 9.3, Chapter 9.4, and Chapter 9.7; the redundancy rethink in Chapter 12.1, Chapter 12.2, and Chapter 12.4; and the siting hierarchy in Chapter 3.1, Chapter 3.2, Chapter 3.6, and Chapter 3.7.