Chapter 1.2
Training Data Centers: Synchronous, Dense, Checkpointable
A training data center is one synchronous supercomputer whose every step runs at the speed of its slowest GPU — so you design it to maximize goodput per megawatt, not availability, and you make the building dense, liquid-cooled, non-blocking, and checkpointable before you cut steel.
What you'll decide here
- The scale-up domain size (8 / 72 / 576 GPUs) you build to — because it sets your tensor- and expert-parallel ceiling, your in-rack copper budget, and how much traffic you keep off the expensive scale-out fabric.
- Whether the back-end fabric is truly 1:1 non-blocking or quietly oversubscribed — the single decision that determines whether all-reduce starves and MFU collapses on a synchronous job.
- The density tier you plumb for now (132 kW NVL72 → 600 kW Kyber) versus the IT you fit out today — reserve the irreversible substrate (floor, water, electrical headroom) for the ramp, defer the spend.
- The reliability posture: N or N+1 facility power plus disciplined checkpointing, NOT 2N — for a job that already restarts from a checkpoint, availability nines are capital spent against the wrong objective.
- Whether this is one campus or a multi-site / gigawatt run — which forces a choice between synchronous cross-DC fabric and asynchronous (DiLoCo-class) training, and rewrites the inter-campus fiber and power-resilience plan.
A pre-training cluster is not a fleet of independent servers that happen to share a building. It is one machine — a single tightly-coupled supercomputer running a single job, where thousands of accelerators advance in lock-step and the whole run moves at the speed of its slowest straggler. That one fact reorganizes every downstream decision. The objective is not facility availability; it is goodput — the fraction of wall-clock time the cluster spends doing useful gradient work — measured against the megawatts you can energize and the depreciation clock already running on the silicon. Everything in this chapter follows from optimizing goodput per megawatt on a job that cannot tolerate a slow link, a hot GPU, or an unsynchronized step.
This is the engineering treatment of the training-shaped side of the master fork introduced in Chapter 1.1. We take the four defining properties of the workload — it is synchronous, dense, collective-dominated, and checkpointable — and derive the building from them: the parallelism regime and the collectives it generates; the scale-up domain and scale-out fabric that must carry them; the power density and liquid-cooling mandate that density forces; the reliability philosophy that checkpointing makes rational; the storage system that feeds the GPUs and absorbs the checkpoints; and finally the multi-datacenter and gigawatt-campus regime where a single run outgrows a single building.
Pre-training as one tightly-coupled supercomputer
Training a frontier model is a single optimization loop run across an enormous machine. The model and its data are split four ways at once. Data parallelism replicates the model and shards the batch; every replica computes gradients on its slice and the replicas must agree before the next step — an all-reduce of the full gradient on every iteration. Tensor parallelism splits individual matrix multiplications across GPUs within a tightly-coupled group, generating all-reduce / all-gather traffic on the critical path of every layer. Pipeline parallelism splits the layers into stages across nodes, passing activations forward and gradients back, and lives or dies on how small the pipeline bubble stays. Expert parallelism, for mixture-of-experts models, routes each token to a subset of experts that live on different GPUs — an all-to-all shuffle that is now one of the heaviest collectives in modern training. The combination is called 4D (or 5D, adding context/sequence parallelism) parallelism, and it is why a pre-training cluster behaves as one organism rather than many.
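A minimal sketch of how a launcher might lay those dimensions onto a flat list of GPU ranks, assuming tensor parallelism varies fastest so that each tensor-parallel group lands on consecutive ranks and therefore inside one scale-up domain. The function names, the rank ordering, and the 1,024-GPU job shape are illustrative assumptions, not any framework's API; expert and context parallelism are left out for brevity.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel shard index


def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    """Map a flat rank to (dp, pp, tp) coordinates, with tp varying fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank is outside the dp * pp * tp world size")
    return Coord(dp=rank // (pp * tp), pp=(rank // tp) % pp, tp=rank % tp)


def tp_group(rank: int, tp: int) -> list[int]:
    """Ranks that share a tensor-parallel group with `rank`."""
    base = (rank // tp) * tp
    return list(range(base, base + tp))


# Hypothetical job: 16 replicas x 8 pipeline stages x 8-way tensor parallelism.
coord = rank_to_coord(75, dp=16, pp=8, tp=8)
group = tp_group(75, tp=8)
print(coord, group)
```

With 8 GPUs per node and ranks assigned node by node, the tensor-parallel group for rank 75 is ranks 72 through 79, all on one node, which is the placement the per-layer collectives need.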
The consequence that governs the whole design: the job is synchronous, so it advances at the speed of the slowest participant. A single GPU that runs 10% slow — a thermal throttle, a flaky NVLink, a degraded optic — does not slow its own work by 10%; it slows the entire cluster by 10%, because every other GPU waits at the next collective barrier for the straggler to arrive. This is the straggler tax, and it is the reason training facilities are engineered for uniformity and tail control rather than average performance. It is also why a single hardware failure does not degrade the job gracefully; it halts it, forcing a restart from the last checkpoint.
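The straggler tax can be sketched in a few lines, assuming the step ends only when the slowest GPU reaches the barrier. The 1.0-second nominal step time is a hypothetical figure chosen for easy arithmetic; the 16,384-GPU count mirrors the Llama 3 run cited later in this chapter.

```python
def sync_step_seconds(per_gpu_seconds: list[float]) -> float:
    """A synchronous step finishes when the slowest participant arrives."""
    return max(per_gpu_seconds)


def straggler_tax(n_gpus: int, nominal_s: float, slowdown: float) -> tuple[float, float, float]:
    """Return (step time, idle GPU-seconds per step, throughput loss) for one slow GPU."""
    times = [nominal_s] * n_gpus
    times[0] = nominal_s * (1.0 + slowdown)  # a single throttled GPU
    step = sync_step_seconds(times)
    idle_gpu_seconds = sum(step - t for t in times)  # everyone else waits at the barrier
    throughput_loss = 1.0 - nominal_s / step
    return step, idle_gpu_seconds, throughput_loss


step, idle, loss = straggler_tax(n_gpus=16_384, nominal_s=1.0, slowdown=0.10)
print(f"step {step:.2f} s, idle {idle:.0f} GPU-s per step, throughput loss {loss:.1%}")
```

One GPU running 10% slow stretches every step by 10% and leaves the other 16,383 GPUs idle for about 1,638 GPU-seconds per step in this example.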
Scale-up domain design and the scale-out fabric
AI clusters have two networks, and conflating them is the most expensive networking mistake in training. The scale-up fabric (NVLink and its NVSwitch fabric inside a node or rack) is the memory-coherent, ultra-high-bandwidth domain where GPUs talk as if they shared one address space. The scale-out fabric (InfiniBand or RoCEv2 Ethernet across the back-end) connects those domains into the full cluster. The per-GPU bandwidth gap between them is more than an order of magnitude: NVLink5 delivers 1.8 TB/s per GPU bidirectional, or 900 GB/s in each direction (a 72-GPU NVL72 rack aggregates ~130 TB/s), versus a ~400 Gb/s scale-out NIC, which is 50 GB/s in each direction, about an 18x difference. The first principle of training-fabric design follows directly: keep the heaviest collectives inside the scale-up domain, because every byte you push onto scale-out is an order of magnitude more expensive in bandwidth and latency.
That makes scale-up domain size a first-class workload decision, not a hardware accident. An 8-GPU HGX node, a 72-GPU NVL72 rack, and the coming 576-GPU Kyber-class domain are not just bigger boxes — each enlargement raises the ceiling on how much tensor and expert parallelism you can fit before spilling onto scale-out. A larger NVLink domain lets you fit a whole tensor-parallel group, or a wider set of MoE experts, inside the cheap fabric, which is exactly why the industry is racing domain size upward. The downside is blast radius and packaging: a 72-GPU rack is ~3,000 lb of wet hardware carrying 5,184 in-rack copper NVLink cables, and a single NVSwitch fault now degrades 72 GPUs instead of 8.
The scale-out fabric for a training cluster must be 1:1 non-blocking — a rail-optimized fat-tree (typically 8 rails matching 8 NICs per server at 8×400 Gb/s = 3,200 Gb/s per node) sized so that full bisection bandwidth is available to the all-reduce. This is the single line item where training and inference diverge most sharply. A synchronous job spends a large fraction of every step in collectives; oversubscribe its back-end fabric and you starve the all-reduce, the GPUs idle at the barrier, and MFU collapses. Inference, by contrast, fits inside a node or a small domain, so a 2:1 or 3:1 oversubscribed fabric is fine and cuts back-end cost ~31%. Build a non-blocking trainer fabric for an inference business and you have wasted that 31% on bisection bandwidth no request will ever use; oversubscribe a trainer and you have throttled a job worth tens of millions in GPU-hours to save a fraction of the network. → Chapter 8.5 (topology, sizing, oversubscription); Chapter 8.2 (scale-up fabric); Chapter 8.4 (InfiniBand vs RoCE).
| Scale-up domain | GPUs / domain | Intra-domain bandwidth | Parallelism it unlocks | Rack power | Facility requirement / blast radius |
|---|---|---|---|---|---|
| HGX node (8-GPU) | 8 | NVLink5 ~14.4 TB/s aggregate / node | TP up to 8; EP narrow | ~30–60 kW (B200 class) | Air or liquid; small blast radius |
| NVL72 rack | 72 | ~130 TB/s rack aggregate | TP + wide EP inside one rack | ~120–142 kW (GB200/GB300) | DLC mandatory; 72-GPU fault domain |
| Kyber NVL576 (roadmap) | 576 | Gen6 NVLink, rack-scale | TP + EP + more DP inside scale-up | ~600 kW on 800 VDC | 800 VDC + DLC; very large blast radius |
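To make the oversubscription penalty on the scale-out fabric concrete, here is a rough bandwidth-only estimate using the textbook ring all-reduce cost, 2(N-1)/N x S / B. It assumes the worst case, in which an oversubscribed tier divides each rank's usable bandwidth by the oversubscription ratio, and it ignores latency, compute overlap, and hierarchical collectives. The 10 GB gradient size and the 1,024 ranks are hypothetical.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    n_ranks: int,
    link_gbps: float,
    oversubscription: float = 1.0,
) -> float:
    """Bandwidth term of a ring all-reduce: 2 * (N - 1) / N * S / B."""
    if n_ranks < 2 or oversubscription < 1.0:
        raise ValueError("need at least 2 ranks and an oversubscription ratio >= 1")
    # Worst-case assumption: the oversubscribed tier divides usable bandwidth.
    bytes_per_s = link_gbps * 1e9 / 8.0 / oversubscription
    return 2.0 * (n_ranks - 1) / n_ranks * message_bytes / bytes_per_s


GRADIENT_BYTES = 10e9  # hypothetical gradient volume per rank
for ratio in (1.0, 2.0, 3.0):
    t = ring_allreduce_seconds(GRADIENT_BYTES, n_ranks=1024, link_gbps=400.0,
                               oversubscription=ratio)
    print(f"{ratio:.0f}:1 fabric -> {t:.2f} s per all-reduce")
```

Under these assumptions the all-reduce takes about 0.4 s on a non-blocking fabric and three times that at 3:1, and the extra time is paid on every step by every GPU in the job.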
Power density and the cooling mandate
Density in a training hall is set by the accelerator generation and the scale-up domain you chose, and it lands you on one side of a discontinuity. Air cooling saturates at ~41 kW per rack; rear-door heat exchangers and air-assisted liquid stretch that to ~50–100 kW; past ~100 kW the only answer is direct-to-chip liquid cooling (DLC). A GB200 NVL72 draws ~120–132 kW continuous, a GB300 ~142 kW, and the announced Rubin Ultra Kyber rack ~600 kW. There is no airflow scheme, containment trick, or warmer supply air that closes a ~90 kW gap between air's ceiling and an NVL72 rack. Choosing pre-training therefore chooses liquid cooling, and it does so before you order a single GPU, because the slab loading, the facility water loop, the pipe-rack space, and the heat-rejection plant all have to exist first. → Chapter 5.1 (the density wall); Chapter 5.4 (DLC, the 2026 default).
The DLC envelope for current dense racks is unforgiving and worth pinning down, because it constrains the whole facility water strategy. The NVL72 reference spec runs a coolant inlet of ~20–25 °C at roughly 80 L/min per rack, with about 200 L of coolant in the loop and ~2.4 MW of cooling capacity at the row-level CDU (a facility-side figure, not per-rack); of the ~132 kW per-rack heat load, roughly 115 kW is removed by liquid and ~17 kW by residual air. Deviation from that envelope throttles the GPUs by up to 50%, which on a synchronous job means the whole cluster halves its step rate behind the throttled rack. This is why training facilities run warm-water loops sized to a tight delta-T: warmer supply water enables free cooling for more of the year (driving PUE toward 1.05–1.15 versus 1.4–1.6 for legacy air), but it also shrinks the thermal margin, so the controls must hold setpoint tightly or pay the throttle tax. → Chapter 5.7 (warm-water loops); Chapter 5.6 (CDUs and the secondary loop).
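The flow and heat figures above imply a coolant temperature rise that can be checked with the sensible-heat relation Q = m_dot x c_p x delta-T. The sketch below assumes pure water (about 1 kg/L and 4,186 J/(kg·K)); real loops often run a glycol mix with a lower specific heat, which would raise the result somewhat. The function name is illustrative.

```python
def coolant_delta_t_k(
    heat_kw: float,
    flow_l_per_min: float,
    density_kg_per_l: float = 1.0,    # assumption: pure water
    cp_j_per_kg_k: float = 4186.0,    # assumption: pure water
) -> float:
    """Temperature rise across the rack for a given liquid-side heat load and flow."""
    mass_flow_kg_per_s = flow_l_per_min / 60.0 * density_kg_per_l
    return heat_kw * 1000.0 / (mass_flow_kg_per_s * cp_j_per_kg_k)


# Liquid-side load and flow from the NVL72 figures quoted above.
rise = coolant_delta_t_k(heat_kw=115.0, flow_l_per_min=80.0)
print(f"delta-T across the rack: {rise:.1f} K")
```

With 115 kW into 80 L/min the rise is about 20.6 K, so a 25 °C inlet returns at roughly 45 °C. Halving the flow doubles the rise, which shows how little slack the loop has before return temperatures leave the envelope.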
Reliability philosophy: checkpoint-and-resume, MTBF, and straggler economics
Here is where a training facility most violates data-center orthodoxy. The classical objective is availability — keep the load energized, measured in nines. A training cluster does not care about availability in that sense, because the job already tolerates losing a node: it checkpoints and resumes. When a GPU fails, the synchronous run does not limp along degraded; it stops, the orchestration layer evicts the dead node, swaps in a hot spare, and every GPU rolls back to the last checkpoint and re-computes the lost steps. The right objective is therefore goodput, not availability — and the two call for opposite spending.
The failure rates are not theoretical. Meta's Llama 3 405B run logged 419 unplanned interruptions over 54 days on 16,384 H100s — roughly one every three hours — with 78% hardware-caused and 58.7% GPU- or HBM3-related. A best-in-class mature H100 cluster still sees an MTBF on the order of 7 days per 512 GPUs; scale that to 100,000 GPUs and you are absorbing failures continuously, not occasionally. Alibaba's production study put the large-job failure rate near 43% with ~73% recoverable via restart. At these rates, the cluster is always healing — so the design question is not 'how do we prevent failures' (you cannot) but 'how do we make each failure cheap.'
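The scaling claim can be checked with simple arithmetic, assuming failures are independent and the per-GPU failure rate is constant, so that cluster MTBF falls in inverse proportion to GPU count. That is a simplification (real failures are correlated and vary with hardware age), and the function names are illustrative.

```python
def observed_mtbf_hours(run_days: float, interruptions: int) -> float:
    """Mean time between interruptions over an observed run."""
    return run_days * 24.0 / interruptions


def scaled_mtbf_hours(ref_mtbf_hours: float, ref_gpus: int, target_gpus: int) -> float:
    """Rescale MTBF to another cluster size, assuming independent failures."""
    return ref_mtbf_hours * ref_gpus / target_gpus


llama3 = observed_mtbf_hours(run_days=54, interruptions=419)
llama3_per_512 = scaled_mtbf_hours(llama3, ref_gpus=16_384, target_gpus=512)
best_at_100k = scaled_mtbf_hours(7 * 24.0, ref_gpus=512, target_gpus=100_000)

print(f"Llama 3 run: one interruption every {llama3:.1f} h")
print(f"Same rate per 512 GPUs: {llama3_per_512 / 24.0:.1f} days")
print(f"7 days per 512 GPUs at 100,000 GPUs: {best_at_100k * 60.0:.0f} min")
```

Under this assumption the Llama 3 figures correspond to roughly 4 days per 512 GPUs, and even the 7-day best case shrinks to under an hour between failures at 100,000 GPUs.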
That reframing drives the redundancy decision. Spending on 2N / Tier-IV facility power to prevent a restart on a job that already restarts from a checkpoint is largely wasted capital — you are buying availability nines the workload does not value. The rational posture is N or N+1 facility power plus disciplined checkpointing, hot spares, and fast fault detection. A dollar moved from facility redundancy into faster checkpoint storage, a deeper spare pool, or more aggressive straggler detection returns more goodput than the same dollar spent on a second power path. This is the cleanest anti-pattern in the building: over-provisioned redundancy for checkpointable jobs. → Chapter 12.2 (the goodput-vs-availability rethink); Chapter 14.4 (operational reliability for training); Chapter 10.7 (autonomous fault recovery).
The straggler economics close the loop. Because the job runs at the slowest GPU's pace, a partially-degraded node is often worse than a dead one — a dead node is evicted and replaced, but a silently-slow node taxes every step until it is detected. Mature operators therefore invest heavily in tail telemetry: per-GPU thermal and clock monitoring, NVLink and optic error counters, and collective-timing instrumentation that flags the straggler before it has bled hours of goodput. The facility's job is to give that telemetry nothing to find — uniform cooling, uniform power, no thermal hot spots — because every degree of thermal non-uniformity across the hall is a latent straggler. → Chapter 14.2 (DCIM and telemetry); Chapter 10.6 (GPU health observability).
| Decision axis | Training (checkpointable) | Inference (always-on) | Why they diverge |
|---|---|---|---|
| Primary objective | Goodput (useful FLOP-time) | Availability against a latency SLO | Job restarts vs lost revenue |
| Facility power | N or N+1 | 2N / Tier-IV-class | Restart is cheap; outage is lost revenue |
| Back-end fabric | 1:1 non-blocking | 2:1–3:1 oversubscribed | All-reduce vs node-local requests |
| Failure response | Evict, hot-spare, resume from checkpoint | Fail over, drain, no user-visible drop | Synchronous halt vs independent requests |
| Where extra $ goes | Faster checkpointing, spares, straggler detection | Redundant power/cooling, geo-distribution | Goodput nines vs availability nines |
Checkpointing as a training constraint (its bearing here)
The full optimal-interval mathematics — the Young/Daly result that sets the checkpoint cadence balancing checkpoint cost against expected lost work — is canonical and lives in Chapter 9.4. Here we cover only its bearing on the synchronous training building, which is twofold and concrete.
First, checkpointing is the mechanism that makes N/N+1 redundancy rational — it is the reason a training facility can refuse to pay for 2N. The cheaper and faster the checkpoint, the less work a failure costs, the more aggressive the cadence you can afford, and the lower the goodput penalty of any given MTBF. That makes checkpoint bandwidth a first-class facility requirement: the storage system must absorb a full-cluster checkpoint — terabytes of optimizer and model state — fast enough that the GPUs stall only briefly, because every GPU is idle during a synchronous checkpoint barrier. A slow checkpoint path quietly converts into lost goodput on every interval. → see storage, below, and Chapter 9.3 (GPUDirect Storage).
Second, the cadence interacts with the failure rate to size everything else. At one interruption every three hours (Llama 3 scale), a checkpoint cadence and a lost-work budget together determine how many hot spares you must keep warm and how fast the orchestration plane must detect, evict, and re-place a failed node to keep goodput near the 90–96% band. The facility decision that flows from this: provision the checkpoint storage tier and the spare-node pool as deliberately as you provision GPUs — they are not afterthoughts, they are the levers that convert a high failure rate into high goodput. → Chapter 10.7 (autonomous recovery); Chapter 14.6 (spares strategy).
Storage for training: checkpoint bandwidth, dataset streaming, and LOSF
A training cluster's storage exists to keep expensive GPUs fed and to absorb checkpoints without stalling them — and it has three distinct jobs with different performance shapes. Checkpoint write bandwidth is bursty and enormous: at a synchronous barrier the whole cluster writes its state at once, so the storage must sink terabytes in seconds to minimize the idle window. Dataset streaming (the data-loader path) is sustained, read-heavy, and latency-sensitive in the tail — if the loader cannot keep every GPU's input queue full, the GPUs starve and MFU drops, the same straggler logic applied to data instead of compute. Object/capacity storage holds the raw corpus and cold checkpoints. The parallel file system (Lustre, GPFS/Storage Scale, WEKA, VAST and kin) sits in front, and increasingly the data-loader and checkpoint paths use GPUDirect Storage to move bytes directly into GPU memory, bypassing the CPU bounce buffer. → Chapter 9.1 (why storage determines GPU efficiency); Chapter 9.2 (parallel file systems); Chapter 9.5 (the data-loader path).
LOSF — Lots Of Small Files — is the storage pathology specific to AI training, and it is worth naming because it ambushes teams that sized for bandwidth alone. Training corpora and tokenized shards are frequently millions of small objects, and small-file workloads are bound by metadata operations and IOPS, not by sequential throughput. A file system tuned for the big sequential reads of checkpoint restore can choke on the random small-file reads of dataset streaming, leaving GPUs starved while the bandwidth meter reads low. The facility consequence: the storage tier must be specified against the LOSF and checkpoint-burst profiles explicitly, not against an average GB/s number — and the metadata path (often the silent bottleneck) sized as deliberately as the data path. → Chapter 9.8 (sizing and data gravity); Chapter 9.9 (offline data-prep).
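A rough sizing sketch shows why an average GB/s figure hides the LOSF problem. Every input below is hypothetical (samples per GPU per second, one file per sample, 64 KiB files, two metadata operations per file such as a lookup and an open); the point is only the shape of the result, with modest bandwidth alongside very high operation rates.

```python
def loader_demand(
    n_gpus: int,
    samples_per_gpu_per_s: float,
    files_per_sample: float,
    avg_file_bytes: float,
    metadata_ops_per_file: float = 2.0,  # assumption: e.g. one lookup plus one open
) -> dict[str, float]:
    """Aggregate data-loader demand on the storage tier."""
    files_per_s = n_gpus * samples_per_gpu_per_s * files_per_sample
    return {
        "read_iops": files_per_s,
        "metadata_ops_per_s": files_per_s * metadata_ops_per_file,
        "bandwidth_gb_per_s": files_per_s * avg_file_bytes / 1e9,
    }


demand = loader_demand(
    n_gpus=16_384,
    samples_per_gpu_per_s=50.0,   # hypothetical
    files_per_sample=1.0,         # hypothetical
    avg_file_bytes=64 * 1024,     # hypothetical 64 KiB objects
)
for name, value in demand.items():
    print(f"{name}: {value:,.1f}")
```

In this example the loader needs only about 54 GB/s of bandwidth but more than 800,000 small reads and about 1.6 million metadata operations per second, so a tier specified on GB/s alone would pass its acceptance test and still starve the GPUs.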
Deep dive: why the checkpoint-storage tier is a goodput lever, not a cost center
Treat checkpoint storage as commodity capacity and you leave goodput on the floor at every interval. The chain is mechanical. A synchronous checkpoint is a stop-the-world event: every GPU in the run holds at a barrier while cluster state is flushed, so the wall-clock stall per checkpoint is state size ÷ effective write bandwidth, and the cost in idle GPU-hours is that stall multiplied by the number of GPUs in the run. Halve the write bandwidth and you double the idle window on every checkpoint, and you must checkpoint frequently because the failure rate is high. The lost work from a failure is, in expectation, half the checkpoint interval, so a slow checkpoint path forces a longer interval to amortize the stall, which in turn raises the expected lost work per failure. Slow storage thus costs goodput twice: once in the stall and again in the larger rollback.
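The double cost can be illustrated with the first-order Young approximation, in which the wasted fraction of wall-clock time is stall/interval + interval/(2 x MTBF) and the optimal interval is sqrt(2 x stall x MTBF). This is a sketch under stated assumptions, not the full treatment in Chapter 9.4: it ignores restart time and holds only when the stall is much shorter than the MTBF. The 10 TB state size and the two write bandwidths are hypothetical.

```python
import math


def checkpoint_stall_s(state_bytes: float, write_bytes_per_s: float) -> float:
    """Stop-the-world stall for one synchronous checkpoint."""
    return state_bytes / write_bytes_per_s


def optimal_interval_s(stall_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum: sqrt(2 * stall * MTBF)."""
    return math.sqrt(2.0 * stall_s * mtbf_s)


def wasted_fraction(stall_s: float, interval_s: float, mtbf_s: float) -> float:
    """Stall overhead plus expected rollback of half an interval per failure."""
    return stall_s / interval_s + interval_s / (2.0 * mtbf_s)


STATE_BYTES = 10e12   # hypothetical 10 TB of model and optimizer state
MTBF_S = 3 * 3600.0   # one interruption every three hours

for bandwidth in (1e12, 0.5e12):  # hypothetical effective write bandwidths
    stall = checkpoint_stall_s(STATE_BYTES, bandwidth)
    interval = optimal_interval_s(stall, MTBF_S)
    waste = wasted_fraction(stall, interval, MTBF_S)
    print(f"{bandwidth / 1e12:.1f} TB/s: stall {stall:.0f} s, "
          f"interval {interval:.0f} s, wasted {waste:.1%}")
```

Halving the write bandwidth doubles the stall, stretches the best interval by a factor of sqrt(2), and raises the wasted fraction from about 4% to about 6% in this example, even with the cadence re-optimized.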
The mitigations are all facility-and-stack decisions made at scoping time. Asynchronous / in-memory checkpointing stages state to host memory or NVMe and flushes in the background so GPUs resume almost immediately. Hierarchical checkpointing writes frequent local checkpoints (to node NVMe) and infrequent global ones (to the parallel file system), bounding both stall time and blast radius. GPUDirect Storage removes the CPU bounce buffer from the write path. Each of these trades a little complexity or local-NVMe capacity for goodput — and on a cluster where a point of goodput on a tens-of-thousands-of-GPU run is worth millions in GPU-hours, the trade is overwhelmingly favorable. The number to carry: provision checkpoint write bandwidth against the stop-the-world stall budget you can tolerate, not against steady-state throughput. → Chapter 9.4 (Young/Daly cadence math); Chapter 9.3 (GPUDirect).
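A minimal sketch of the asynchronous pattern, assuming the training state fits in host memory: the loop blocks only for an in-memory snapshot, and a background thread persists it. `copy.deepcopy` and `pickle` stand in for what would in practice be a device-to-pinned-host copy and a sharded parallel write; `async_checkpoint` and `load_checkpoint` are illustrative names, not a framework API.

```python
import copy
import pickle
import tempfile
import threading
from pathlib import Path


def async_checkpoint(state: dict, path: Path) -> threading.Thread:
    """Snapshot `state` in memory, then persist it on a background thread."""
    snapshot = copy.deepcopy(state)  # the only part the training loop waits for

    def _flush() -> None:
        tmp = path.with_name(path.name + ".tmp")
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        tmp.replace(path)  # rename last so a reader never sees a partial file

    writer = threading.Thread(target=_flush)
    writer.start()
    return writer


def load_checkpoint(path: Path) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = Path(tmpdir) / "step_100.pkl"
    state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
    writer = async_checkpoint(state, ckpt)
    state["step"] = 101   # training moves on while the flush runs
    writer.join()         # a real loop would join before taking the next snapshot
    restored = load_checkpoint(ckpt)
    print(restored["step"])  # prints 100: the snapshot, not the mutated state
```

The design choice is to pay for a second copy of the state in host memory in exchange for a stall that depends on memory bandwidth rather than storage bandwidth.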
Multi-datacenter and gigawatt-campus training
A single building has a ceiling (on the power it can energize, on land, on a coherent cooling plant), and frontier runs have already hit it. The binding constraint of 2026 is power, and a single campus increasingly cannot host enough of it for the largest runs, so the supercomputer is being stretched across multiple buildings and multiple campuses. Google has trained Gemini across multiple data centers, connecting TPU superpods over its intra-cluster and inter-cluster network with latency and bandwidth sufficient to preserve a synchronous training paradigm (model-parallel within a superpod, data-parallel across superpods), and explicitly cites resilience (a power event at one site does not kill the run) and the simple physics of space and power as the reasons to distribute. Gigawatt-scale campuses (the multi-site Ohio build summing toward ~1 GW) are the current expression of this. → Chapter 8.8 (scale-across: multi-campus and cross-region fabric).
This forces a real fork. Synchronous cross-DC training keeps the single-job, lock-step model and extends the non-blocking fabric across campuses with dedicated dark fiber — preserving model quality and simplicity, at the cost of needing enormous, low-latency, high-bandwidth inter-campus links and tolerating the speed-of-light latency floor between sites (which bounds how far apart they can sit before the collective stalls). Asynchronous / low-communication training (the DiLoCo family — Streaming DiLoCo, DiLoCoX, and async variants) lets each site take many local steps before exchanging compressed pseudo-gradients, slashing inter-site bandwidth by orders of magnitude (DiLoCo demonstrated comparable quality while communicating ~500x less) and tolerating wide-area links and stragglers — at the cost of algorithmic complexity, staleness management, and a model-quality regime that is still maturing. The choice rewrites the inter-campus fiber plan, the power-resilience design, and the orchestration plane. → Chapter 8.8; Chapter 10.8 (training frameworks).
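The low-communication idea can be sketched on a toy problem. Each site takes many local steps on its own data, and the only cross-site exchange per round is the averaged pseudo-gradient (starting parameters minus locally trained parameters). This is a deliberately simplified illustration: one scalar parameter, a quadratic loss per site, plain SGD for both the inner and outer updates (a swapped-in simplification, not the optimizers a real run would use), and hypothetical learning rates and targets.

```python
def diloco_round(
    theta: float,
    site_targets: list[float],
    inner_steps: int,
    inner_lr: float,
    outer_lr: float,
) -> float:
    """One outer round: local training per site, then one averaged exchange."""
    pseudo_grads = []
    for target in site_targets:  # each site's data pulls toward its own optimum
        local = theta
        for _ in range(inner_steps):  # no cross-site traffic inside this loop
            grad = 2.0 * (local - target)  # gradient of (local - target) ** 2
            local -= inner_lr * grad
        pseudo_grads.append(theta - local)
    outer_grad = sum(pseudo_grads) / len(pseudo_grads)  # the single cross-site exchange
    return theta - outer_lr * outer_grad


theta = 0.0
for _ in range(3):
    theta = diloco_round(theta, site_targets=[1.0, 3.0], inner_steps=500,
                         inner_lr=0.01, outer_lr=1.0)
print(f"theta after 3 rounds: {theta:.4f}")
```

With `inner_steps=500` the sites exchange parameters once per 500 local steps instead of once per step, which is where a communication reduction of that order comes from; the toy parameter settles near 2.0, the average of the two site optima.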
Deep dive: the power-resilience case for distributing a single run
Distributing a training run across campuses is usually read as a capacity story: no single site has the megawatts. But there is a second, subtler driver that matters as runs reach gigawatt scale, and that is power resilience. A synchronous job on a single campus is hostage to that campus's power: a grid fault, a generator trip, or a ride-through failure can drop the entire run, and large data-center loads have demonstrably caused gigawatt-scale instantaneous load-loss events that stress the grid (one ~1.5 GW loss on a single fault; 1.5 GW dropped in 82 seconds in a 2024 Virginia event that triggered a rare NERC Level 3 alert). Spreading the run across sites means a power event at one campus degrades rather than kills the job: the surviving campuses checkpoint and continue, and the affected campus rejoins after recovery, especially under an asynchronous regime that already tolerates stragglers and staleness.
The consequence for the building program: at gigawatt scale, the multi-datacenter decision is no longer purely about fitting the load — it is also a reliability-engineering decision that trades inter-campus fiber and orchestration complexity for independence from any single point of grid failure. That reframing pulls the energy-supply and ride-through strategy (→ Chapter 3.4, Chapter 7.1) and the cross-region fabric (→ Chapter 8.8) into the same design conversation as the parallelism strategy. The largest runs of 2026 are being scoped by people who hold all three at once.
Anti-patterns specific to training builds
The recurring training mis-scopes all share a root: reasoning from the equipment or the building instead of from the synchronous-dense-checkpointable nature of the job. Four are worth naming.
- Quietly oversubscribed back-end fabric. Sizing the scale-out fabric at 2:1 or 3:1 to save ~31% on networking, on a job whose every step is a collective. The all-reduce starves, MFU collapses, and the savings are dwarfed by the lost GPU-hours. Training east-west must be non-blocking. → Chapter 8.5.
- 2N redundancy for a checkpointable job. Commissioning Tier-IV facility power for a run that already restarts from a checkpoint — buying availability nines the workload does not value, when the same capital returns more as faster checkpointing, hot spares, or more GPUs. → Chapter 12.2.
- Designing to today's density. Pouring a slab and water plant for the current generation, then being unable to absorb the next density step without re-pouring concrete. Reserve the irreversible substrate for the ramp. → Chapter 5.10.
- Bandwidth-only storage spec. Sizing the storage tier against an average GB/s number while ignoring the LOSF metadata profile and the stop-the-world checkpoint burst — starving GPUs on the data-loader path or stalling them on every checkpoint. → Chapter 9.1.