Chapter 5.11
Thermal Design, Reliability, Leak Detection & Commissioning
A liquid-cooled AI hall lives or dies on three things the brochure never shows you: the thermal budget that must close end-to-end at the worst-case branch, the leak-and-loss-of-cooling failure modes that can trip a 1 kW GPU in seconds, and a commissioning sequence rigorous enough to find those failures before the GPUs do.
What you'll decide here
- Where in the chain you spend your thermal margin — at the cold plate, the CDU approach, or the facility loop — because a budget that closes at the average branch but not the worst-case branch is a budget that throttles GPUs in production.
- How much cooling redundancy and ride-through to commission (N, N+1, 2N pumps; UPS-backed pumping; chilled-water thermal flywheel vs none) given that DLC has almost no thermal inertia and a checkpointable training job values goodput over nines.
- The leak-detection and containment architecture — sensing modality, zoning granularity, and the auto-isolation response — and whether a detected leak pauses a rack, a row, or the job.
- The serviceability contract: dripless quick-disconnects, blind-mate trays, and the MTTR target that decides whether a failed cold plate is a 20-minute hot-swap or a multi-hour drain-and-fill.
- The commissioning depth (L1 factory through L5 integrated systems test) and the ongoing fluid-analysis cadence you will fund — the difference between a hall that holds spec for five years and one that fouls, corrodes, and de-rates.
Everything earlier in Part 5 chose a cooling technology and sized its pieces. This chapter is where those pieces are made to work together reliably and to survive the day something fails. It is the engineering that separates a rack that runs at rated TDP for five years from one that quietly throttles every afternoon because the thermal budget never actually closed at the worst rack in the row. The discipline here is unforgiving in a way air cooling never was: when you put water (or PG25) inside the rack at 200+ kW, a single open budget, a single undetected leak, or a single unrehearsed loss-of-cooling event does not degrade gracefully. It trips a $3M rack of accelerators in seconds and, on a synchronous job, restarts thousands of GPUs from the last checkpoint.
We work four problems in order, because each constrains the next. First, the end-to-end thermal budget — a worked example that walks the temperature from junction to wet-bulb and proves the loop closes at the worst-case branch, not the average. Second, the reliability architecture — redundancy topology, UPS-backed pumps, and the fact that DLC has almost no thermal ride-through. Third, leak detection and containment — sensing, zoning, and the auto-isolation response. Fourth, serviceability and commissioning — quick-disconnects, blind-mate, MTTR, the L1–L5 commissioning ladder, and the fluid-analysis program that keeps the loop in spec for life. The cooling-controls transient problem — anti-hunting, setpoint stability, dew-point excursions under a synchronized load slam — is the thermal twin of the electrical transient and lives in its own chapter (Chapter 5.12); the consolidated failure-mode catalog lives in Appendix F.
The end-to-end thermal budget: closing the loop at the worst branch
The single most useful artifact in a liquid-cooled design is a temperature ladder that starts at the silicon junction and ends at the outdoor wet-bulb, with every thermal resistance and every approach temperature on it. If that ladder closes — if the sum of the rises still leaves the junction under its throttle limit when the outdoor air is at design wet-bulb and the worst rack in the hall is at full load — you have a working facility. If it closes on a spreadsheet for the average rack but not the rack at the end of the longest manifold run, you have a facility that throttles, and you will spend the first year of operations chasing a goodput leak that was designed in.
Walk it forward. A Blackwell-class GPU dissipating on the order of 1–1.4 kW presents a die heat flux in the 500–600 W/cm² range — roughly 40–100x air's practical limit, which is the whole reason we are here. The junction-to-coolant path runs through the silicon, the TIM, the cold-plate base, and the convective film inside the cold-plate microchannels. Each is a thermal resistance in K/W; the sum, multiplied by the chip power, is the rise from coolant to junction. Then the coolant itself heats as it crosses the cold plate — that is your cold-plate delta-T, governed by flow rate and fluid heat capacity. The warmed coolant returns through the rack manifold (a pressure-drop and mixing penalty), crosses the CDU heat exchanger (an approach-temperature penalty of typically 3–5 °C between the technology-cooling loop and the facility water loop), travels the facility loop to the heat-rejection plant, and finally rejects to ambient across a dry cooler or tower (another approach to dry-bulb or wet-bulb). Each handoff costs you degrees. The budget is the bookkeeping of those degrees against the one number that matters: the GPU's throttle threshold.
| Stage | What it is | Typical rise / approach | Running temperature |
|---|---|---|---|
| Outdoor design wet-bulb | The climate floor you reject against | — | ~24 °C (design WB) |
| Heat-rejection approach | Tower/dry-cooler approach to WB/DB | +4–8 °C | ~30 °C facility supply |
| Facility loop + CDU HX approach | FWS→TCS handoff across the CDU exchanger | +3–5 °C | ~34 °C TCS supply (≈ W40 band) |
| Manifold + worst-branch penalty | Distribution loss to the farthest cold plate | +1–3 °C | ~36 °C at the worst cold-plate inlet |
| Cold-plate coolant delta-T | Coolant heating across the plate at rated flow | +8–10 °C | ~45 °C coolant leaving the plate |
| Junction-to-coolant resistance | Silicon + TIM + plate + film, × chip power | +25–35 °C | ~70–80 °C junction |
| Margin to throttle | Headroom before clock/voltage de-rate | throttle ~ 85–90 °C | ~10–15 °C margin (worst branch) |
The last two rows are where the money is. The margin to throttle on the worst branch at design wet-bulb is the only margin that counts. The instinct of an inexperienced designer is to chase a colder plant: drop the facility supply and buy back degrees with a chiller. That works, and it is exactly the wrong default in 2026. Warm-water operation (the W40/W45 classes, with maximum supply temperatures of 40 °C and 45 °C) is the design point precisely because it lets the plant run chiller-less on free cooling nearly year-round, and GB200- and Rubin-class racks were specified to tolerate it. The right place to recover margin is almost always upstream of the plant, on the chip side of the heat path: a tighter CDU approach, a better TIM, a higher per-chip flow to shrink the cold-plate delta-T, or a balanced manifold so the worst branch is not 3 °C hotter than the best. Spend a degree where it is cheap. → fundamentals and the metric definitions are in Chapter 5.1; the cold-plate and flow design in Chapter 5.4; the warm-water loop in Chapter 5.7; heat rejection in Chapter 5.8.
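The ladder above is simple enough to keep as a few lines of code next to the design spreadsheet. The sketch below is illustrative only: the `Branch` type, the 1,200 W chip power, the 0.025 K/W junction-to-coolant resistance, the per-plate flows, and the 85 °C throttle limit are assumed values chosen to land inside the table's ranges, and the coolant is treated as water-like. Substitute measured or datasheet values before drawing conclusions.

```python
from dataclasses import dataclass

# Assumed water-like coolant properties. PG25 differs somewhat, so use the
# fluid datasheet values for a real budget.
RHO_KG_M3 = 1000.0
CP_J_KGK = 4186.0


@dataclass
class Branch:
    name: str
    manifold_penalty_c: float  # inlet warming from distribution loss, in °C
    plate_flow_lpm: float      # coolant flow through one cold plate, in L/min


def cold_plate_delta_t(chip_power_w: float, flow_lpm: float) -> float:
    """Coolant temperature rise across the plate: P / (m_dot * cp)."""
    m_dot_kg_s = RHO_KG_M3 * (flow_lpm / 1000.0) / 60.0
    return chip_power_w / (m_dot_kg_s * CP_J_KGK)


def junction_temp_c(
    branch: Branch,
    wet_bulb_c: float = 24.0,            # design wet-bulb from the table
    rejection_approach_c: float = 6.0,   # assumed midpoint of +4-8 °C
    cdu_approach_c: float = 4.0,         # assumed midpoint of +3-5 °C
    chip_power_w: float = 1200.0,        # assumed chip power
    r_junction_coolant_k_w: float = 0.025,  # assumed total resistance, K/W
) -> float:
    """Walk the ladder from wet-bulb up to the junction for one branch."""
    tcs_supply = wet_bulb_c + rejection_approach_c + cdu_approach_c
    plate_inlet = tcs_supply + branch.manifold_penalty_c
    plate_outlet = plate_inlet + cold_plate_delta_t(chip_power_w, branch.plate_flow_lpm)
    # Like the table, the junction rise is stacked on the coolant leaving the
    # plate, which is the conservative reference.
    return plate_outlet + r_junction_coolant_k_w * chip_power_w


def margin_to_throttle_c(branch: Branch, throttle_c: float = 85.0) -> float:
    """Headroom between the junction and an assumed throttle limit."""
    return throttle_c - junction_temp_c(branch)


average = Branch("average branch", manifold_penalty_c=1.0, plate_flow_lpm=2.0)
worst = Branch("worst branch", manifold_penalty_c=3.0, plate_flow_lpm=1.7)

for b in (average, worst):
    print(f"{b.name}: junction {junction_temp_c(b):.1f} °C, "
          f"margin {margin_to_throttle_c(b):.1f} °C")
```

With these assumed inputs the average branch keeps roughly 11 °C of margin while the worst branch, hotter at the inlet and slightly starved of flow, keeps roughly 8 °C. The point of the exercise is the gap between the two rows, not the specific numbers.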
Reliability architecture: redundancy, UPS-backed pumps, and the ride-through problem
Here is the fact that reorders everything you learned about air-cooled redundancy: a direct-to-chip loop has almost no thermal inertia. An air-cooled hall coasts. The thermal mass of the room, the raised-floor plenum, and the chilled-water volume buy you minutes after a cooling failure: time for a generator to start, a pump to fail over, an operator to react. A DLC rack at 1+ kW per GPU has none of that. The water volume in contact with the silicon is tiny, the heat flux is enormous, and the junction can climb to its throttle limit, and then to its trip point, within seconds of flow stopping. The consequence is blunt: because 1 kW+ GPUs can thermal-trip within seconds of a loss-of-cooling event, UPS-backed pumps and N+1/2N heat rejection are not a tier upgrade. They are the entry ticket.
That single fact splits the redundancy question into two layers that fail on completely different timescales. The pumping layer (CDU pumps, secondary-loop circulation) must never stop, because stopping it trips the rack in seconds — so the pumps go on the UPS, on the same protected bus as the IT load, and the CDU carries redundant pump circuits with independent power feeds. The heat-rejection layer (chillers, dry coolers, towers, the facility loop) has slightly more grace because the loop volume gives a few tens of seconds of ride-through, but on a warm-water plant with no chilled-water flywheel even that is thin. The design choice is whether to buy back ride-through with thermal storage — a buffer tank, a chilled-water volume, a deliberately oversized loop — or to accept the thin margin and lean on fast, reliable failover. → the electrical spine that powers all of this, and the BBU/BESS layering that backs it, is in Chapter 4.5; CDU sizing and N+1 pump topology in Chapter 5.6; heat-rejection redundancy and the lack of chilled-water inertia in DLC in Chapter 5.8.
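The ride-through available on the heat-rejection layer can be bounded with a lumped energy balance: with the pumps still circulating but no heat leaving the loop, the stored energy is mass times specific heat times the temperature rise you can tolerate. The sketch below assumes a perfectly mixed, water-like loop and ignores pipe and equipment thermal mass; the function names and example figures are hypothetical.

```python
# Assumed water-like coolant properties; replace with datasheet values.
RHO_KG_M3 = 1000.0
CP_J_KGK = 4186.0


def ride_through_s(loop_volume_l: float, heat_load_kw: float,
                   allowable_rise_c: float) -> float:
    """Seconds until a well-mixed loop warms by allowable_rise_c
    when heat rejection is lost but circulation continues."""
    mass_kg = RHO_KG_M3 * loop_volume_l / 1000.0
    stored_j = mass_kg * CP_J_KGK * allowable_rise_c
    return stored_j / (heat_load_kw * 1000.0)


def volume_for_ride_through_l(target_s: float, heat_load_kw: float,
                              allowable_rise_c: float) -> float:
    """Loop-plus-buffer volume needed to hold target_s of ride-through."""
    energy_j = target_s * heat_load_kw * 1000.0
    mass_kg = energy_j / (CP_J_KGK * allowable_rise_c)
    return mass_kg / RHO_KG_M3 * 1000.0


# Hypothetical example: 2,000 L loop, 1 MW of IT heat, 5 °C of usable rise.
print(f"ride-through: {ride_through_s(2000.0, 1000.0, 5.0):.1f} s")
print(f"volume for 120 s: {volume_for_ride_through_l(120.0, 1000.0, 5.0):.0f} L")
```

Under these assumptions a 2,000 L loop carrying 1 MW buys about 42 seconds, and stretching that to two minutes takes a little under 6,000 L. That is the arithmetic behind the choice between a buffer tank and fast failover.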
| Posture | Pumping (CDU) | Heat rejection | Ride-through | Best fit |
|---|---|---|---|---|
| N | Single pump path | No spare plant | Seconds; trip on any failure | Never for production DLC — checkpoint-tolerant batch only |
| N+1 | Redundant pump in CDU, shared feed | One spare chiller/cooler | Seconds–tens of seconds | Checkpointable training; goodput-optimized |
| N+1 + UPS pumps | Redundant pumps, dual UPS-backed feeds | N+1 plant | Pumps ride the UPS; plant fails over | The practical 2026 default for dense DLC |
| 2N + thermal store | Fully duplicated pump paths | 2N plant + buffer-tank flywheel | Tens of seconds of stored cooling | Always-on inference; SLA-bound; no checkpoint to fall back on |
The posture is a goodput decision, and it is the cooling twin of the redundancy argument made for power in Part 1. A synchronous training cluster already restarts from a checkpoint when a node fails; spending on 2N cooling and a thermal flywheel to prevent a trip that costs you a checkpoint-resume is often capital better spent on faster checkpointing or more GPUs. An always-on inference business has no checkpoint to fall back to — a loss-of-cooling trip is dropped revenue and a breached SLA — so the flywheel and the 2N plant earn their keep. The anti-pattern, again, is buying nines the workload does not value. → the goodput-vs-availability reframing is in Chapter 12.2; the checkpoint math that makes training trip-tolerant in Chapter 9.4.
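One way to put a number on that trade for a training cluster is to estimate the work lost per loss-of-cooling trip. The sketch below is a rough model, not a costing method: it assumes the trip lands at a uniformly random point in the checkpoint interval (so half an interval is redone on average), that every GPU in the job is idle during the restart, and that the GPU count, trip rate, and price per GPU-hour are hypothetical inputs you supply.

```python
def gpu_hours_lost_per_trip(n_gpus: int, checkpoint_interval_min: float,
                            restart_min: float) -> float:
    """Expected rework per trip: restart time plus half a checkpoint interval."""
    lost_minutes = restart_min + checkpoint_interval_min / 2.0
    return n_gpus * lost_minutes / 60.0


def annual_trip_cost_usd(n_gpus: int, checkpoint_interval_min: float,
                         restart_min: float, trips_per_year: float,
                         usd_per_gpu_hour: float) -> float:
    """Yearly cost of trips, to compare against annualized redundancy spend."""
    per_trip = gpu_hours_lost_per_trip(n_gpus, checkpoint_interval_min, restart_min)
    return trips_per_year * per_trip * usd_per_gpu_hour


# Hypothetical job: 8,192 GPUs, 10-minute restart, two trips a year.
slow_ckpt = gpu_hours_lost_per_trip(8192, checkpoint_interval_min=30.0, restart_min=10.0)
fast_ckpt = gpu_hours_lost_per_trip(8192, checkpoint_interval_min=10.0, restart_min=10.0)
print(f"30-min checkpoints: {slow_ckpt:.0f} GPU-hours per trip")
print(f"10-min checkpoints: {fast_ckpt:.0f} GPU-hours per trip")
```

If the annualized cost of the extra plant exceeds the figure from `annual_trip_cost_usd`, the money is probably better spent elsewhere. The same model does not apply to always-on inference, where the loss is dropped requests rather than redone work.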
Leak detection and containment: the cascade you must never start
Water and energized electronics are an old enemy, and the industry's tolerance for leaks in a 200 kW rack is effectively zero. The containment philosophy is layered defense: keep the fluid in the loop; if it escapes, detect it within seconds; if it is detected, isolate it before it reaches a busbar; and at every stage, make the response proportional so a drip pauses a rack rather than a job. The connectors do most of the work — modern in-rack plumbing is built on dripless quick-disconnects (UQD/UQDB-class couplings) and blind-mate floating-tray connections that self-align and seal on insertion, so that the act of servicing a node does not itself create a spill. But connectors fail, hoses chafe, cold plates corrode, and gaskets age, so detection and isolation are the backstop.
Detection runs on three complementary modalities, and serious designs use more than one. Point sensors — float switches and conductive pads in drip trays and at low points — are cheap and certain but only see fluid that has already pooled where you guessed it would. Rope/cable leak-detection snakes along manifolds, under racks, and through cable trays, localizing a leak to a length of cable rather than a point. Pressure and flow telemetry on the loop catches the leak that never reaches a sensor at all — a slow drop in loop pressure or a mismatch between supply and return flow betrays a leak upstream of any pooling. The 2026 frontier adds ML/IoT leak forecasting: correlating pressure, flow, makeup-water consumption, and acoustic signatures to flag a degrading connector days before it weeps. The response must be zoned: a leak signal should auto-isolate the affected rack or row (close the manifold valves, depressurize the branch) and alarm — not silently dump the whole hall's cooling, which would convert a contained drip into a hall-wide loss-of-cooling trip.
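The zoned response described above can be sketched as a small decision function. This is a minimal illustration, not a controls specification: the `LeakSignal` and `Action` types, the source labels, and the 2% flow-mismatch tolerance are all hypothetical, and a real system would add debouncing, sensor voting, and a manual override.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    ISOLATE_RACK = "isolate_rack"
    ISOLATE_ROW = "isolate_row"
    # Deliberately no hall-wide action: a leak signal never dumps all cooling.


@dataclass(frozen=True)
class LeakSignal:
    source: str                 # e.g. "point", "rope", "flow_mismatch"
    row: str
    rack: Optional[str] = None  # None when the leak cannot be localized


def flow_mismatch(supply_lpm: float, return_lpm: float,
                  tolerance: float = 0.02) -> bool:
    """True when return flow falls short of supply by more than tolerance."""
    if supply_lpm <= 0.0:
        return False
    return (supply_lpm - return_lpm) / supply_lpm > tolerance


def respond(signal: LeakSignal) -> Tuple[Action, str]:
    """Pick the smallest zone that contains the leak; always alarm as well."""
    if signal.rack is not None:
        return Action.ISOLATE_RACK, f"{signal.row}/{signal.rack}"
    return Action.ISOLATE_ROW, signal.row


print(respond(LeakSignal("point", row="R3", rack="rack07")))
if flow_mismatch(supply_lpm=100.0, return_lpm=97.0):
    print(respond(LeakSignal("flow_mismatch", row="R3")))
```

The design choice worth noticing is that the action set has no hall-level member, so no combination of signals can escalate a drip into a hall-wide loss of cooling.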
Deep dive: the coolant-leak cascade and the thermal-runaway twin (the two failure modes that define DLC risk)
Two cascades dominate the DLC failure catalog, and they are mirror images. The coolant-leak cascade starts with a breach — a chafed hose, a failed gasket, a cross-threaded quick-disconnect, a corroded cold-plate channel. Fluid escapes onto energized electronics; if PG25 reaches a busbar or PSU it can flash a short, and if loop pressure drops far enough the remaining racks on that branch lose flow. The contained version stops at the drip tray with a rack isolated and a service ticket. The uncontained version is a short, a fire-risk event, and a multi-rack loss-of-cooling trip in the same minute. The entire detection-and-isolation architecture above exists to keep this cascade in its contained form. The insurability and FM Global gating that shapes fluid choice (and that stalled two-phase immersion) is the financial expression of this same risk — see Chapter 5.5.
The thermal-runaway cascade is the leak's twin: instead of fluid leaving the loop, heat fails to leave the silicon. A blocked filter, a fouled cold plate, a seized pump, a closed valve, or a controls hang stops or starves flow with the loop still full. The junction climbs through throttle to trip in seconds (the ride-through problem above). On a synchronous job, one rack's trip cascades into a job-wide checkpoint restart across thousands of GPUs. The defenses differ from the leak case (flow/pressure health telemetry, redundant pumping, fast power-capping, and thermal storage), but the lesson is the same: the failure is fast, so the detection and response must be faster than the thermal time constant. Both cascades are consolidated in the FMEA catalog in Appendix F, where each is treated explicitly as dual-cause, either a random fault or an attacker-induced one, because a maliciously closed valve looks identical to a seized one. The transient-controls failure that can cause a thermal-runaway trip, such as a hunting valve or a setpoint oscillation, is in Chapter 5.12.
Serviceability and MTTR: the hot-swap contract
A liquid-cooled rack will need service — a cold plate fouls, a hose ages out, a node fails — and the question that decides your fleet availability is not whether but how long. Mean-time-to-repair is a design parameter you set at procurement, not a number you discover in operations. The two architectures that bracket it: a rack built on dripless quick-disconnects and blind-mate trays lets a technician pull a node, swap a cold plate, and re-seat it in minutes with no tools and no spill — the connectors break and seal cleanly, and the rest of the rack keeps running. A rack plumbed with conventional fittings and shared manifold segments that cannot isolate forces a drain-and-fill of a whole branch to touch one node — hours of downtime, a re-bleed to purge air, and a re-commission of that branch before it carries load again.
The serviceability contract is therefore a set of concrete commitments to extract at procurement: every fluid connection is a dripless quick-disconnect rated for the cycle count of expected service; every node is blind-mate so insertion seats both power and coolant without manual alignment; every branch has isolation valves so a single cold plate can be isolated without draining the row; and the CDU carries side-stream filtration (Deschutes-class designs run 0.2-micron filtration) and redundant pumps so filter and pump service never stop cooling. OCP is standardizing exactly these interfaces (rack/manifold geometry, UQD/UQDB couplings, leak-detection practice) precisely so a hall is not locked to one vendor's serviceability story. That work is not finished: cross-vendor interoperability of couplings, manifolds, and coolant chemistry remains an open standardization gap and a real fragmentation risk to weigh in procurement.
Commissioning: the L1–L5 ladder for a wet hall
Commissioning is where the design meets reality, and for a liquid-cooled hall it is the most consequential and least skippable phase in the build. The industry runs a five-level ladder, and a wet hall adds steps to every rung.
- Level 1, factory acceptance: CDUs, manifolds, and cold-plate assemblies are tested and pressure-checked at the manufacturer before they ship, because finding a weeping weld on a loading dock is cheap and finding it in a live hall is not.
- Level 2, installation verification: confirm the pipe runs, manifolds, sensors, and rack interfaces were installed to drawing, with correct slopes, correct isolation valves, and leak-detection cable routed where leaks will actually go.
- Level 3, startup: the wet-hall-specific work happens here, and it is multi-stage. The loop is flushed (often multiple passes) to remove manufacturing debris, flux, and biofilm precursors before a single drop touches a cold plate; it is filled and bled to purge air that would otherwise hot-spot a plate; it is pressure-tested to prove integrity; and every leak sensor, flow meter, and temperature probe is calibrated.
- Level 4, functional/fault: each subsystem is run at part-load, full-load, and through failure scenarios (kill a pump, simulate a leak signal, drop a chiller) to prove redundancy and control sequences actually behave.
- Level 5, integrated systems test (IST): the whole facility runs together under simulated full IT load, including the worst case that the brochure never tests: the worst-case branch at full load with the design-day plant, plus a pull-the-plug ride-through demonstration that proves the UPS-backed pumps and failover hold the racks through a power and a plant event.
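Because each rung depends on the one below it, the ladder behaves like a gate: a pass at Level 4 means nothing if Level 3 is still open. A minimal sketch of that gating logic follows; the `LADDER` list, the result dictionary, and the function names are hypothetical bookkeeping, not an industry data format.

```python
from typing import Dict, Optional

LADDER = ["L1", "L2", "L3", "L4", "L5"]


def highest_signed_off(results: Dict[str, bool]) -> Optional[str]:
    """Return the highest level for which it and every lower level passed."""
    last = None
    for level in LADDER:
        if not results.get(level, False):
            break
        last = level
    return last


def ready_for_it_load(results: Dict[str, bool]) -> bool:
    """A branch carries production load only after the full ladder closes."""
    return highest_signed_off(results) == "L5"


# Hypothetical status: L4 was run early, but the L3 pressure test is open.
status = {"L1": True, "L2": True, "L3": False, "L4": True}
print(highest_signed_off(status))  # the early L4 pass does not count
print(ready_for_it_load(status))
```

The same gate applies after a drain-and-fill repair: the affected branch drops back to Level 3 and climbs again before it carries load.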
Ongoing fluid analysis: keeping the loop in spec for life
Commissioning hands you a clean, balanced, leak-tight loop. Operations has to keep it that way for five-plus years, and the loop degrades whether you watch it or not. PG25 and similar glycol coolants oxidize and lose inhibitor; dissimilar metals in cold plates, manifolds, and the CDU set up galvanic corrosion; biofilm grows wherever the biocide thins; and particulate sheds from every wetted surface to foul the microchannels that close your thermal budget. A fouled cold plate raises junction-to-coolant resistance — the largest term in the thermal ladder — and quietly eats the margin you commissioned. The countermeasure is a fluid-analysis program: scheduled sampling for pH, inhibitor concentration, conductivity, dissolved metals (the corrosion signature), and biological activity, with side-stream filtration (sub-micron on hyperscale CDUs) and biocide dosing tuned to the results. The cadence is a real cost and a real decision — quarterly sampling and an annual filter program is a defensible baseline — and it is cheaper than the alternative, which is discovering corrosion by the leak it eventually causes.
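A fluid-analysis program reduces to comparing each lab sample against an agreed band for every parameter and flagging anything outside it or not measured. The sketch below shows only that comparison. The parameter names and numeric bands are placeholders invented for illustration, not vendor or standards limits; the real bands come from the coolant supplier and the cold-plate and CDU manufacturers.

```python
from typing import Dict, List, Tuple

# Placeholder bands for illustration only. Replace with supplier limits.
EXAMPLE_LIMITS: Dict[str, Tuple[float, float]] = {
    "ph": (8.0, 10.0),
    "inhibitor_pct_of_fresh": (80.0, 120.0),
    "dissolved_copper_ppm": (0.0, 1.0),
}


def review_sample(sample: Dict[str, float],
                  limits: Dict[str, Tuple[float, float]]) -> List[str]:
    """List every parameter that is missing or outside its agreed band."""
    findings = []
    for name, (low, high) in limits.items():
        value = sample.get(name)
        if value is None:
            findings.append(f"{name}: not measured")
        elif not (low <= value <= high):
            findings.append(f"{name}: {value} outside {low}-{high}")
    return findings


# Hypothetical quarterly sample with a low pH and a skipped metals test.
quarterly = {"ph": 7.4, "inhibitor_pct_of_fresh": 91.0}
for finding in review_sample(quarterly, EXAMPLE_LIMITS):
    print(finding)
```

Treating a missing measurement as a finding is deliberate: a skipped dissolved-metals test hides exactly the corrosion signature the program exists to catch.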
This closes the loop, literally and figuratively, on Part 5's mechanical content. The coolant chemistry and material-compatibility envelope is set in Chapter 5.4 (coolant selection) and Chapter 5.6 (CDU filtration and chemistry); the facility-water treatment and biocide/Legionella program for the heat-rejection side is in Chapter 5.7 and Chapter 5.8; the pressure-system mechanical engineering — pipe code, water-hammer, NDE, and hydrostatic acceptance that underwrites the L3 pressure test — is in Chapter 5.13; and the flush/fill/pressure acceptance procedures referenced here are detailed in the construction-execution path at Chapter 13.5.