Chapter 5.11

Thermal Design, Reliability, Leak Detection & Commissioning

A liquid-cooled AI hall lives or dies on three things the brochure never shows you: the thermal budget that must close end-to-end at the worst-case branch, the leak-and-loss-of-cooling failure modes that can trip a 1 kW GPU in seconds, and a commissioning sequence rigorous enough to find those failures before the GPUs do.

What you'll decide here

Where in the chain you spend your thermal margin — at the cold plate, the CDU approach, or the facility loop — because a budget that closes at the average branch but not the worst-case branch is a budget that throttles GPUs in production.
How much cooling redundancy and ride-through to commission (N, N+1, 2N pumps; UPS-backed pumping; chilled-water thermal flywheel vs none) given that DLC has almost no thermal inertia and a checkpointable training job values goodput over nines.
The leak-detection and containment architecture — sensing modality, zoning granularity, and the auto-isolation response — and whether a detected leak pauses a rack, a row, or the job.
The serviceability contract: dripless quick-disconnects, blind-mate trays, and the MTTR target that decides whether a failed cold plate is a 20-minute hot-swap or a multi-hour drain-and-fill.
The commissioning depth (L1 factory through L5 integrated systems test) and the ongoing fluid-analysis cadence you will fund — the difference between a hall that holds spec for five years and one that fouls, corrodes, and de-rates.

Figure: The chip-to-sky heat path. The four delta-Ts stack, which is why warmer supply water lets you free-cool year-round.

Everything earlier in Part 5 chose a cooling technology and sized its pieces. This chapter is where those pieces are made to work together reliably and to survive the day something fails. It is the engineering that separates a rack that runs at rated TDP for five years from one that quietly throttles every afternoon because the thermal budget never actually closed at the worst rack in the row. The discipline here is unforgiving in a way air cooling never was: when you put water (or PG25) inside the rack at 200+ kW, a single open budget, a single undetected leak, or a single unrehearsed loss-of-cooling event does not degrade gracefully. It trips a $3M rack of accelerators in seconds and, on a synchronous job, restarts thousands of GPUs from the last checkpoint.

We work four problems in order, because each constrains the next. First, the end-to-end thermal budget — a worked example that walks the temperature from junction to wet-bulb and proves the loop closes at the worst-case branch, not the average. Second, the reliability architecture — redundancy topology, UPS-backed pumps, and the fact that DLC has almost no thermal ride-through. Third, leak detection and containment — sensing, zoning, and the auto-isolation response. Fourth, serviceability and commissioning — quick-disconnects, blind-mate, MTTR, the L1–L5 commissioning ladder, and the fluid-analysis program that keeps the loop in spec for life. The cooling-controls transient problem — anti-hunting, setpoint stability, dew-point excursions under a synchronized load slam — is the thermal twin of the electrical transient and lives in its own chapter (Chapter 5.12); the consolidated failure-mode catalog lives in Appendix F.

The end-to-end thermal budget: closing the loop at the worst branch

The single most useful artifact in a liquid-cooled design is a temperature ladder that starts at the silicon junction and ends at the outdoor wet-bulb, with every thermal resistance and every approach temperature on it. If that ladder closes — if the sum of the rises still leaves the junction under its throttle limit when the outdoor air is at design wet-bulb and the worst rack in the hall is at full load — you have a working facility. If it closes on a spreadsheet for the average rack but not the rack at the end of the longest manifold run, you have a facility that throttles, and you will spend the first year of operations chasing a goodput leak that was designed in.

Walk it forward. A Blackwell-class GPU dissipating on the order of 1–1.4 kW presents a die heat flux in the 500–600 W/cm² range — roughly 40–100x air's practical limit, which is the whole reason we are here. The junction-to-coolant path runs through the silicon, the TIM, the cold-plate base, and the convective film inside the cold-plate microchannels. Each is a thermal resistance in K/W; the sum, multiplied by the chip power, is the rise from coolant to junction. Then the coolant itself heats as it crosses the cold plate — that is your cold-plate delta-T, governed by flow rate and fluid heat capacity. The warmed coolant returns through the rack manifold (a pressure-drop and mixing penalty), crosses the CDU heat exchanger (an approach-temperature penalty of typically 3–5 °C between the technology-cooling loop and the facility water loop), travels the facility loop to the heat-rejection plant, and finally rejects to ambient across a dry cooler or tower (another approach to dry-bulb or wet-bulb). Each handoff costs you degrees. The budget is the bookkeeping of those degrees against the one number that matters: the GPU's throttle threshold.
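The two rises described above reduce to a few lines of arithmetic, sketched here in Python. The junction-to-coolant rise is chip power times the sum of the series resistances, and the cold-plate delta-T follows from the energy balance Q = m_dot * cp * dT. The PG25 density and specific heat are assumed approximate values, not figures from this guide, and the function names are illustrative; take real properties and per-layer resistances from your coolant and cold-plate suppliers.

```python
from __future__ import annotations

# Assumed approximate PG25 properties near 40 °C (illustrative, not from this guide).
PG25_DENSITY_KG_PER_L = 1.02
PG25_CP_J_PER_KG_K = 3900.0


def coolant_delta_t(power_w: float, flow_l_per_min: float,
                    density_kg_per_l: float = PG25_DENSITY_KG_PER_L,
                    cp_j_per_kg_k: float = PG25_CP_J_PER_KG_K) -> float:
    """Coolant temperature rise in K from the energy balance Q = m_dot * cp * dT."""
    if flow_l_per_min <= 0:
        raise ValueError("flow must be positive")
    mass_flow_kg_s = flow_l_per_min / 60.0 * density_kg_per_l
    return power_w / (mass_flow_kg_s * cp_j_per_kg_k)


def junction_rise(power_w: float, resistances_k_per_w: list[float]) -> float:
    """Coolant-to-junction rise in K: series resistances summed, times chip power."""
    return power_w * sum(resistances_k_per_w)


if __name__ == "__main__":
    # Flow expressed per kW of heat, so power is fixed at 1000 W.
    for flow_per_kw in (1.2, 1.5, 2.0):
        dt = coolant_delta_t(1000.0, flow_per_kw)
        print(f"{flow_per_kw:.1f} L/min/kW -> delta-T {dt:.1f} K")
```

With these assumed properties, 1.2 to 2.0 L/min/kW gives roughly 12.6 K down to 7.5 K of coolant rise, in line with the ~7.5–12 °C target this chapter quotes.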

Worked thermal ladder — junction to ambient, ~130 kW DLC rack (warm-water W40 plant)

Stage	What it is	Typical rise / approach	Running temperature
Outdoor design wet-bulb	The climate floor you reject against	—	~24 °C (design WB)
Heat-rejection approach	Tower/dry-cooler approach to WB/DB	+4–8 °C	~30 °C facility supply
Facility loop + CDU HX approach	FWS→TCS handoff across the CDU exchanger	+3–5 °C	~34 °C TCS supply (≈ W40 band)
Manifold + worst-branch penalty	Distribution loss to the farthest cold plate	+1–3 °C	~36 °C at the worst cold-plate inlet
Cold-plate coolant delta-T	Coolant heating across the plate at rated flow	+8–10 °C	~45 °C coolant leaving the plate
Junction-to-coolant resistance	Silicon + TIM + plate + film, x chip power	+25–35 °C	~70–80 °C junction
Margin to throttle	Headroom before clock/voltage de-rate	throttle ~ 85–90 °C	~10–15 °C margin (worst branch)

Illustrative single-phase D2C budget for a Blackwell-class rack on a warm-water loop. Figures are representative design points, not a vendor spec; flow, delta-T, and approach follow the sourced ranges used throughout this chapter (DLC ~1.2–2.0 L/min/kW, delta-T ~7.5–12 °C, CDU approach ~3–5 °C). The point is the bookkeeping, not the exact decimals.
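The ladder is simple enough to keep as code next to the design, so that it can be re-run per branch and per design day. The sketch below uses the midpoint of each range in the table; the `Stage` type, the function names, and the 87.5 °C throttle (the midpoint of the ~85–90 °C range) are illustrative assumptions, not vendor figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    rise_c: float  # degrees added at this handoff, in °C


def walk_ladder(ambient_c, stages):
    """Return [(stage name, running temperature in °C)] from ambient up to the junction."""
    running = ambient_c
    ladder = []
    for stage in stages:
        running += stage.rise_c
        ladder.append((stage.name, running))
    return ladder


def margin_to_throttle(ambient_c, stages, throttle_c):
    """Headroom in °C between the top rung (junction) and the throttle threshold."""
    if not stages:
        return throttle_c - ambient_c
    return throttle_c - walk_ladder(ambient_c, stages)[-1][1]


# Midpoints of the ranges in the worked table (worst branch, design wet-bulb).
WORST_BRANCH_MIDPOINTS = [
    Stage("heat-rejection approach", 6.0),
    Stage("CDU HX approach", 4.0),
    Stage("manifold + worst-branch penalty", 2.0),
    Stage("cold-plate coolant delta-T", 9.0),
    Stage("junction-to-coolant rise", 30.0),
]

if __name__ == "__main__":
    for name, temp_c in walk_ladder(24.0, WORST_BRANCH_MIDPOINTS):
        print(f"{name:34s} {temp_c:5.1f} °C")
    print("margin:", margin_to_throttle(24.0, WORST_BRANCH_MIDPOINTS, throttle_c=87.5))
```

With midpoints, the junction lands at 75 °C and the margin at 12.5 °C, consistent with the table. Swapping in the upper end of each range is a quick way to see how much of that margin depends on the stages not all hitting their worst value together.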

The last two rows are where the money is. The margin to throttle on the worst branch at design wet-bulb is the only margin that counts. The instinct of an inexperienced designer is to chase a colder plant — drop the facility supply, buy back degrees with a chiller. That works, and it is exactly the wrong default in 2026: warm-water operation (W40/W45, ~45 °C inlet bands) is the design point precisely because it lets the plant run chiller-less on free cooling nearly year-round, and GB200- and Rubin-class racks were specified to tolerate it. The right place to recover margin is almost always upstream of the plant: a tighter CDU approach, a better TIM, a higher per-chip flow to shrink the cold-plate delta-T, or balancing the manifold so the worst branch is not 3 °C hotter than the best. Spend a degree where it is cheap. → fundamentals and the metric definitions are in Chapter 5.1; the cold-plate and flow design in Chapter 5.4; the warm-water loop in Chapter 5.7; heat rejection in Chapter 5.8.

Reliability architecture: redundancy, UPS-backed pumps, and the ride-through problem

Here is the fact that reorders everything you learned about air-cooled redundancy: a direct-to-chip loop has almost no thermal inertia. An air-cooled hall coasts. The thermal mass of the room, the raised-floor plenum, and the chilled-water volume buys you minutes after a cooling failure: time for a generator to start, a pump to fail over, an operator to react. A DLC rack at 1+ kW per GPU has none of that. The water volume in contact with the silicon is tiny, the heat flux is enormous, and the junction can climb to its throttle limit, and then to its trip point, within seconds of flow stopping. The sourced figure is blunt: 1 kW+ GPUs can thermal-trip within seconds of a loss-of-cooling event, which is why UPS-backed pumps and N+1/2N heat rejection are not a tier upgrade. They are the entry ticket.

That single fact splits the redundancy question into two layers that fail on completely different timescales. The pumping layer (CDU pumps, secondary-loop circulation) must never stop, because stopping it trips the rack in seconds — so the pumps go on the UPS, on the same protected bus as the IT load, and the CDU carries redundant pump circuits with independent power feeds. The heat-rejection layer (chillers, dry coolers, towers, the facility loop) has slightly more grace because the loop volume gives a few tens of seconds of ride-through, but on a warm-water plant with no chilled-water flywheel even that is thin. The design choice is whether to buy back ride-through with thermal storage — a buffer tank, a chilled-water volume, a deliberately oversized loop — or to accept the thin margin and lean on fast, reliable failover. → the electrical spine that powers all of this, and the BBU/BESS layering that backs it, is in Chapter 4.5; CDU sizing and N+1 pump topology in Chapter 5.6; heat-rejection redundancy and the lack of chilled-water inertia in DLC in Chapter 5.8.
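A first-order way to size the heat-rejection ride-through is to treat the loop fluid as a single well-mixed thermal mass that absorbs the full heat load while the pumps keep circulating and the plant is down. The sketch below does that; the fluid properties, volumes, load, and allowable rise are hypothetical inputs, and the result is an idealized upper bound because real loops do not mix perfectly.

```python
def ride_through_seconds(loop_volume_l: float, heat_load_w: float,
                         allowable_rise_k: float,
                         density_kg_per_l: float = 1.02,   # assumed, PG25-like
                         cp_j_per_kg_k: float = 3900.0) -> float:  # assumed
    """Seconds until a well-mixed loop warms by allowable_rise_k with no heat rejection."""
    if heat_load_w <= 0:
        raise ValueError("heat load must be positive")
    thermal_mass_j_per_k = loop_volume_l * density_kg_per_l * cp_j_per_kg_k
    return thermal_mass_j_per_k * allowable_rise_k / heat_load_w


if __name__ == "__main__":
    # Hypothetical 1 MW of racks on a 1,000 L loop with 8 K of usable supply headroom.
    bare = ride_through_seconds(1000.0, 1.0e6, 8.0)
    # Same loop with a hypothetical 1,500 L buffer tank added.
    buffered = ride_through_seconds(2500.0, 1.0e6, 8.0)
    print(f"bare loop: {bare:.0f} s, with buffer tank: {buffered:.0f} s")
```

Under these assumptions the bare loop holds for about 32 s and the buffered loop for about 80 s, which is the "tens of seconds" scale discussed above. The usable headroom is the gap between normal supply temperature and the highest inlet the worst branch can tolerate, so it comes straight from the thermal ladder.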

Cooling redundancy postures — what you buy and what it costs

Posture	Pumping (CDU)	Heat rejection	Ride-through	Best fit
N	Single pump path	No spare plant	Seconds; trip on any failure	Never for production DLC — checkpoint-tolerant batch only
N+1	Redundant pump in CDU, shared feed	One spare chiller/cooler	Seconds–tens of seconds	Checkpointable training; goodput-optimized
N+1 + UPS pumps	Redundant pumps, dual UPS-backed feeds	N+1 plant	Pumps ride the UPS; plant fails over	The practical 2026 default for dense DLC
2N + thermal store	Fully duplicated pump paths	2N plant + buffer-tank flywheel	Tens of seconds of stored cooling	Always-on inference; SLA-bound; no checkpoint to fall back on

The fork is per-layer, not per-facility: most designs run a higher posture on pumping (which fails in seconds) than on heat rejection. Map the posture to the workload's interruption tolerance, not to a tier badge.

The posture is a goodput decision, and it is the cooling twin of the redundancy argument made for power in Part 1. A synchronous training cluster already restarts from a checkpoint when a node fails; spending on 2N cooling and a thermal flywheel to prevent a trip that costs you a checkpoint-resume is often capital better spent on faster checkpointing or more GPUs. An always-on inference business has no checkpoint to fall back to — a loss-of-cooling trip is dropped revenue and a breached SLA — so the flywheel and the 2N plant earn their keep. The anti-pattern, again, is buying nines the workload does not value. → the goodput-vs-availability reframing is in Chapter 12.2; the checkpoint math that makes training trip-tolerant in Chapter 9.4.
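To make the goodput argument concrete, it helps to price a cooling trip in GPU-hours and compare that with the cost of the redundancy that would prevent it. The sketch below assumes a trip lands at a uniformly random point in the checkpoint interval, so half an interval of work is lost on average, plus a restart period. Every number in the usage lines is hypothetical.

```python
def lost_gpu_hours_per_trip(n_gpus: int, checkpoint_interval_min: float,
                            restart_min: float) -> float:
    """Expected GPU-hours lost when a synchronous job trips and resumes from checkpoint."""
    expected_rework_min = checkpoint_interval_min / 2.0  # assumption: uniform trip timing
    return n_gpus * (expected_rework_min + restart_min) / 60.0


def annual_trip_cost_usd(trips_per_year: float, n_gpus: int,
                         checkpoint_interval_min: float, restart_min: float,
                         gpu_hour_usd: float) -> float:
    """Expected yearly cost of cooling trips, valued at a flat GPU-hour price."""
    per_trip = lost_gpu_hours_per_trip(n_gpus, checkpoint_interval_min, restart_min)
    return trips_per_year * per_trip * gpu_hour_usd


if __name__ == "__main__":
    # Hypothetical job: 8,192 GPUs, 30-minute checkpoints, 15-minute restart,
    # two cooling trips a year, GPU-hour valued at 2.50 USD.
    print(lost_gpu_hours_per_trip(8192, 30.0, 15.0))
    print(annual_trip_cost_usd(2.0, 8192, 30.0, 15.0, 2.50))
```

If the annualized cost of stepping up a redundancy posture is far above this figure, the spend is buying nines the training workload does not value. For an inference service the same function understates the loss, because there is no checkpoint to resume from and the cost is dropped revenue and SLA penalties.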

On a 1 MW rack, UPS-backed pumps may not be enough

The ride-through math gets worse, not better, as racks scale to 600 kW and 1 MW. UPS-backed pumps solve the loss-of-power-to-the-pump failure — they keep coolant moving when the grid drops. They do nothing for a loss-of-flow failure: a seized pump, a closed valve, a blocked filter, a leak that drops loop pressure. At 1 kW+ per GPU the junction climbs so fast that even a clean failover may not beat the trip. The mitigations move up the stack: per-GPU and per-rack capacitance and power-capping to shed heat-generating load the instant flow is lost (the same energy-storage layering used for electrical transients), valve and pump health telemetry that predicts the failure before it happens, and a control response fast enough to throttle the GPUs before the thermal sensor does. This is an open frontier — independent fleet leak-rate and loss-of-flow MTBF data outside the hyperscalers is still thin, and vendor reliability claims warrant independent validation. Design for the loss-of-flow case, not just loss-of-power. → transient control in Chapter 5.12; power-capping coordination in Chapter 4.5.
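One way to picture the fast control response is a power cap that tracks measured flow. To first order, holding the coolant delta-T at its rated value when flow falls means scaling power by the same fraction. The sketch below implements only that proportional rule; it ignores the loss of film coefficient at low flow, so a real controller would cap more aggressively. The function name and inputs are hypothetical.

```python
def power_cap_w(rated_power_w: float, rated_flow_l_min: float,
                measured_flow_l_min: float) -> float:
    """First-order cap that keeps coolant delta-T at or below its rated value.

    Returns 0.0 at zero flow, meaning shed all load; no nonzero cap is safe then.
    """
    if rated_flow_l_min <= 0:
        raise ValueError("rated flow must be positive")
    fraction = measured_flow_l_min / rated_flow_l_min
    fraction = max(0.0, min(1.0, fraction))  # never cap above rated power
    return rated_power_w * fraction


if __name__ == "__main__":
    # Hypothetical 1.2 kW GPU rated at 1.8 L/min, with flow sagging to half.
    print(power_cap_w(1200.0, 1.8, 0.9))
```

The design point is ordering: the cap should be driven by flow and pressure telemetry so that it acts before the junction sensor reports the problem.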

Leak detection and containment: the cascade you must never start

Water and energized electronics are an old enemy, and the industry's tolerance for leaks in a 200 kW rack is effectively zero. The containment philosophy is layered defense: keep the fluid in the loop; if it escapes, detect it within seconds; if it is detected, isolate it before it reaches a busbar; and at every stage, make the response proportional so a drip pauses a rack rather than a job. The connectors do most of the work — modern in-rack plumbing is built on dripless quick-disconnects (UQD/UQDB-class couplings) and blind-mate floating-tray connections that self-align and seal on insertion, so that the act of servicing a node does not itself create a spill. But connectors fail, hoses chafe, cold plates corrode, and gaskets age, so detection and isolation are the backstop.

Detection runs on three complementary modalities, and serious designs use more than one. Point sensors — float switches and conductive pads in drip trays and at low points — are cheap and certain but only see fluid that has already pooled where you guessed it would. Rope/cable leak-detection snakes along manifolds, under racks, and through cable trays, localizing a leak to a length of cable rather than a point. Pressure and flow telemetry on the loop catches the leak that never reaches a sensor at all — a slow drop in loop pressure or a mismatch between supply and return flow betrays a leak upstream of any pooling. The 2026 frontier adds ML/IoT leak forecasting: correlating pressure, flow, makeup-water consumption, and acoustic signatures to flag a degrading connector days before it weeps. The response must be zoned: a leak signal should auto-isolate the affected rack or row (close the manifold valves, depressurize the branch) and alarm — not silently dump the whole hall's cooling, which would convert a contained drip into a hall-wide loss-of-cooling trip.
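The zoned response can be written down as a small decision function, which also makes it testable during commissioning. The sketch below is one possible mapping, not a standard: a wet drip-tray sensor isolates the rack, a wet rope sensor or a supply/return flow mismatch isolates the row, and nothing ever escalates to the whole hall. The 2% mismatch limit and all names are hypothetical.

```python
from enum import Enum


class Scope(Enum):
    NONE = "none"
    RACK = "rack"
    ROW = "row"


def flow_mismatch(supply_l_min: float, return_l_min: float) -> float:
    """Fraction of supply flow that does not come back on the return line."""
    if supply_l_min <= 0:
        return 0.0  # no flow at all is a loss-of-flow event, handled elsewhere
    return (supply_l_min - return_l_min) / supply_l_min


def isolation_scope(tray_sensor_wet: bool, rope_sensor_wet: bool,
                    supply_l_min: float, return_l_min: float,
                    mismatch_limit: float = 0.02) -> Scope:
    """Pick the smallest zone that contains the evidence; never the whole hall."""
    if rope_sensor_wet or flow_mismatch(supply_l_min, return_l_min) > mismatch_limit:
        return Scope.ROW   # fluid along the manifold, or fluid missing from the branch
    if tray_sensor_wet:
        return Scope.RACK  # fluid pooled in one rack's drip tray
    return Scope.NONE


if __name__ == "__main__":
    print(isolation_scope(True, False, 100.0, 100.0))
    print(isolation_scope(False, False, 100.0, 97.0))
```

Every non-NONE result should also raise an alarm; the function only decides how much of the hall is isolated.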

Deep dive: the coolant-leak cascade and the thermal-runaway twin (the two failure modes that define DLC risk)

Two cascades dominate the DLC failure catalog, and they are mirror images. The coolant-leak cascade starts with a breach — a chafed hose, a failed gasket, a cross-threaded quick-disconnect, a corroded cold-plate channel. Fluid escapes onto energized electronics; if PG25 reaches a busbar or PSU it can flash a short, and if loop pressure drops far enough the remaining racks on that branch lose flow. The contained version stops at the drip tray with a rack isolated and a service ticket. The uncontained version is a short, a fire-risk event, and a multi-rack loss-of-cooling trip in the same minute. The entire detection-and-isolation architecture above exists to keep this cascade in its contained form. The insurability and FM Global gating that shapes fluid choice (and that stalled two-phase immersion) is the financial expression of this same risk — see Chapter 5.5.

The thermal-runaway cascade is the leak's twin: instead of fluid leaving the loop, heat fails to leave the silicon. A blocked filter, a fouled cold plate, a seized pump, a closed valve, or a controls hang stops or starves flow with the loop still full. The junction climbs through throttle to trip in seconds (the ride-through problem above). On a synchronous job, one rack's trip cascades into a job-wide checkpoint restart across thousands of GPUs. The defenses differ from the leak case — they are flow/pressure health telemetry, redundant pumping, fast power-capping, and thermal storage — but the lesson is the same: the failure is fast, so the detection and response must be faster than the thermal time constant. Both cascades, treated as explicitly dual-use (random fault and attacker-induced — a maliciously closed valve looks identical to a seized one), are consolidated in the FMEA catalog in Appendix F. The transient-controls failure that can cause a thermal-runaway trip — a hunting valve, a setpoint oscillation — is in Chapter 5.12.

Serviceability and MTTR: the hot-swap contract

A liquid-cooled rack will need service (a cold plate fouls, a hose ages out, a node fails), and the question that decides your fleet availability is not whether but how long. Mean-time-to-repair is a design parameter you set at procurement, not a number you discover in operations. Two architectures bracket it. A rack built on dripless quick-disconnects and blind-mate trays lets a technician pull a node, swap a cold plate, and re-seat it in minutes with no tools and no spill: the connectors break and seal cleanly, and the rest of the rack keeps running. A rack plumbed with conventional fittings and shared manifold segments that cannot be isolated forces a drain-and-fill of a whole branch to touch one node. That means hours of downtime, a re-bleed to purge air, and a re-commission of that branch before it carries load again.

The serviceability contract is therefore a set of concrete commitments to extract at procurement: every fluid connection is a dripless quick-disconnect rated for the cycle count of expected service; every node is blind-mate so insertion seats both power and coolant without manual alignment; every branch has isolation valves so a single cold plate can be isolated without draining the row; and the CDU carries side-stream filtration (Deschutes-class designs run 0.2-micron filtration) and redundant pumps so filter and pump service never stop cooling. OCP is standardizing exactly these interfaces — rack/manifold geometry, UQD/UQDB couplings, leak-detection practice — precisely so a hall is not locked to one vendor's serviceability story. The cross-vendor interoperability of couplings, manifolds, and coolant chemistry remains an open standardization gap and a real fragmentation risk to weigh in procurement.
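The effect of MTTR on availability follows from the steady-state relation A = MTBF / (MTBF + MTTR). The sketch below compares a 20-minute hot-swap with a multi-hour drain-and-fill; the 10,000-hour MTBF and the 4-hour repair time are hypothetical inputs chosen only to show the arithmetic.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability of one repairable unit."""
    if mtbf_h <= 0 or mttr_h < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_h / (mtbf_h + mttr_h)


def annual_downtime_h(mtbf_h: float, mttr_h: float) -> float:
    """Expected hours per year the unit is out for repair."""
    return (1.0 - availability(mtbf_h, mttr_h)) * HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf_h = 10_000.0  # hypothetical per-node cooling-related MTBF
    hot_swap = annual_downtime_h(mtbf_h, 20.0 / 60.0)  # 20-minute hot-swap
    drain_fill = annual_downtime_h(mtbf_h, 4.0)        # hypothetical 4-hour drain-and-fill
    print(f"hot-swap: {hot_swap:.2f} h/yr, drain-and-fill: {drain_fill:.2f} h/yr")
```

For the same failure rate, the drain-and-fill architecture costs roughly twelve times the downtime per node, and that is before counting the other nodes on the drained branch.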

Key figures and sources

Figure	What it describes	Year	Source
~1.2–2.0 L/min/kW	DLC flow rule of thumb (PG25); ~7.5–12 °C coolant delta-T target across the TCS	2025	Dober PG25 / NVIDIA OCP
~3–5 °C	CDU heat-exchanger approach (FWS→TCS); Deschutes 2 MW CDU targets ~3 °C ATD	2025	ASHRAE TC 9.9; Google OCP / DCD
~99.999%	Google fleet-wide CDU availability since 2020 (redundant pump+HX, UPS-backed)	2025	Google Cloud (Project Deschutes, OCP EMEA)
seconds	loss-of-cooling thermal-trip window for 1 kW+ GPUs; UPS-backed pumps + N+1/2N rejection mandatory	2026	Domain synthesis (ASHRAE TC 9.9; NVIDIA OCP)
~0.2 micron	side-stream filtration on Deschutes-class CDUs for extended coolant life and uptime	2025	Google Cloud / DCD; Boyd
L1–L5	commissioning ladder: factory → install → startup → functional/fault → integrated systems test	2025	BCxA / hyperscale Cx practice; Aggreko, Techsite
~45 °C	warm-water inlet band (W40/W45) GB200 & Rubin target; enables near-year-round chiller-less free cooling	2026	ASHRAE TC 9.9 (5th ed.); NVIDIA
~55%	single-phase D2C share of the liquid-cooling market; the 2026 default this chapter commissions	2026	DCD / IDTechEx

Commissioning: the L1–L5 ladder for a wet hall

Commissioning is where the design meets reality, and for a liquid-cooled hall it is the most consequential and least skippable phase in the build. The industry runs a five-level ladder, and a wet hall adds steps to every rung.

Level 1, factory acceptance: CDUs, manifolds, and cold-plate assemblies are tested and pressure-checked at the manufacturer before they ship, because finding a weeping weld on a loading dock is cheap and finding it in a live hall is not.
Level 2, installation verification: confirm the pipe runs, manifolds, sensors, and rack interfaces were installed to drawing, with correct slopes, correct isolation valves, and leak-detection cable routed where leaks will actually go.
Level 3, startup: the wet-hall-specific work happens here, and it is multi-stage. The loop is flushed (often multiple passes) to remove manufacturing debris, flux, and biofilm precursors before a single drop touches a cold plate; it is filled and bled to purge air that would otherwise hot-spot a plate; it is pressure-tested to prove integrity; and every leak sensor, flow meter, and temperature probe is calibrated.
Level 4, functional/fault: each subsystem is run at part-load, full-load, and through failure scenarios (kill a pump, simulate a leak signal, drop a chiller) to prove redundancy and control sequences actually behave.
Level 5, integrated systems test (IST): the whole facility runs together under simulated full IT load, including the case the brochure never tests, namely the worst-case branch at full load with the design-day plant, plus a pull-the-plug ride-through demonstration that proves the UPS-backed pumps and failover hold the racks through a power and a plant event.
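Because each level builds on the one below it, the ladder works naturally as a gate: a hall has cleared only as many levels as it has passed consecutively from L1. The sketch below encodes that rule; the result dictionary and function names are hypothetical, and a real program would track each test step rather than one flag per level.

```python
LEVELS = ("L1", "L2", "L3", "L4", "L5")


def highest_level_cleared(results: dict) -> int:
    """Count consecutive passed levels from L1; a failed or missing rung stops the climb."""
    cleared = 0
    for level in LEVELS:
        if not results.get(level, False):
            break
        cleared += 1
    return cleared


def may_energize_gpus(results: dict, required_level: int = 5) -> bool:
    """Gate GPU energization on the ladder being cleared up to required_level."""
    return highest_level_cleared(results) >= required_level


if __name__ == "__main__":
    schedule_pressure = {"L1": True, "L2": True, "L3": True}
    print(highest_level_cleared(schedule_pressure), may_energize_gpus(schedule_pressure))
```

A passed L4 does not count if L2 failed, which is the point of the gate: a functional test on a loop that was never verified against drawing proves little.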

The fork: how deep do you commission, and who carries the risk if you skip it?

The real commissioning decision is how far up the ladder you go before energizing GPUs, and it is a risk-allocation fork, not a budget line. Commission to L5 with a full IST and a ride-through demonstration and you find the open thermal budget, the mis-balanced manifold, the leak sensor wired to the wrong zone, and the failover that hangs, all on a test heat load, before a single accelerator is at risk. Stop at L3 and energize to hit a schedule and you have moved that discovery into production, where the worst-case branch reveals itself as an afternoon throttle, the untested failover reveals itself as a job-wide trip, and the mis-zoned leak sensor reveals itself as a hall-wide loss-of-cooling event. The cost of the skipped IST is not the price of the test; it is the first incident the test would have caught, paid in GPU-hours and a breached commissioning warranty. Commission the wet hall to L5. The worst-case-branch full-load test and the ride-through demonstration are the two that earn their cost many times over.

Ongoing fluid analysis: keeping the loop in spec for life

Commissioning hands you a clean, balanced, leak-tight loop. Operations has to keep it that way for five-plus years, and the loop degrades whether you watch it or not. PG25 and similar glycol coolants oxidize and lose inhibitor; dissimilar metals in cold plates, manifolds, and the CDU set up galvanic corrosion; biofilm grows wherever the biocide thins; and particulate sheds from every wetted surface to foul the microchannels that close your thermal budget. A fouled cold plate raises junction-to-coolant resistance — the largest term in the thermal ladder — and quietly eats the margin you commissioned. The countermeasure is a fluid-analysis program: scheduled sampling for pH, inhibitor concentration, conductivity, dissolved metals (the corrosion signature), and biological activity, with side-stream filtration (sub-micron on hyperscale CDUs) and biocide dosing tuned to the results. The cadence is a real cost and a real decision — quarterly sampling and an annual filter program is a defensible baseline — and it is cheaper than the alternative, which is discovering corrosion by the leak it eventually causes.
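A fluid-analysis program is easier to sustain when each lab report is checked mechanically against acceptance bands. The sketch below flags any parameter that is missing or out of band. The parameter names and every limit are placeholders invented for illustration; real bands come from the coolant supplier and the cold-plate and CDU vendors.

```python
# Placeholder acceptance bands (low, high), invented for illustration only.
HYPOTHETICAL_LIMITS = {
    "ph": (8.0, 10.5),
    "inhibitor_pct_of_nominal": (80.0, 120.0),
    "conductivity_us_per_cm": (0.0, 3000.0),
    "dissolved_copper_ppm": (0.0, 0.5),
    "bacteria_cfu_per_ml": (0.0, 1000.0),
}


def out_of_spec(sample: dict, limits: dict = HYPOTHETICAL_LIMITS) -> list:
    """Return the sorted names of parameters that are missing or outside their band."""
    flagged = []
    for name, (low, high) in limits.items():
        value = sample.get(name)
        if value is None or not (low <= value <= high):
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    quarterly_sample = {
        "ph": 7.2,  # below the placeholder band
        "inhibitor_pct_of_nominal": 95.0,
        "conductivity_us_per_cm": 1800.0,
        "bacteria_cfu_per_ml": 50.0,
        # dissolved copper not reported by the lab in this hypothetical sample
    }
    print(out_of_spec(quarterly_sample))
```

Treating a missing measurement as a failure is deliberate: a parameter nobody sampled is the one most likely to drift unnoticed. Trending the values across quarters matters as much as the pass/fail check, since a steady rise in dissolved metals is the corrosion signature described above.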

This closes the loop, literally and figuratively, on Part 5's mechanical content. The coolant chemistry and material-compatibility envelope is set in Chapter 5.4 (coolant selection) and Chapter 5.6 (CDU filtration and chemistry); the facility-water treatment and biocide/Legionella program for the heat-rejection side is in Chapter 5.7 and Chapter 5.8; the pressure-system mechanical engineering — pipe code, water-hammer, NDE, and hydrostatic acceptance that underwrites the L3 pressure test — is in Chapter 5.13; and the flush/fill/pressure acceptance procedures referenced here are detailed in the construction-execution path at Chapter 13.5.

This chapter sits at the end of Part 5's mechanical arc. Upstream, the density wall and thermal metrics are in Chapter 5.1; the DLC architecture and cold-plate/flow design in Chapter 5.4; CDUs and the secondary loop in Chapter 5.6; the warm-water facility loop in Chapter 5.7; heat rejection and its lack of chilled-water inertia in Chapter 5.8; and the retrofit/live-commissioning leak risk in Chapter 5.10. The transient-controls twin — anti-hunting, setpoint stability, dew-point excursions on a synchronized load slam — is Chapter 5.12; the pressure-system code, surge, and NDE basis is Chapter 5.13. The electrical spine that powers the UPS-backed pumps and times the ride-through is Chapter 4.5; the goodput-vs-availability framing behind the redundancy posture is Chapter 12.2; the checkpoint math behind training's trip-tolerance is Chapter 9.4; the construction-phase flush and pressure acceptance is Chapter 13.5. The consolidated coolant-leak and thermal-runaway FMEA catalog — treated as dual-use — lives in Appendix F.