The Definitive Guide to AI Data Centers

Part 5

Cooling & Thermal Management

13 chapters

5.1
Thermal Fundamentals & the Density Wall
Heat is the binding physical constraint of an AI data center: every watt you put into a chip must leave it through a stack of thermal resistances that air can no longer clear, so the density of the machine you intend to run silently dictates which cooling regime you are forced into — and the boundary between regimes is a cliff, not a slope.
5.2
Air Cooling at the Limit
Air cooling did not die at the density wall — it was pushed to a hard, well-defined economic ceiling around 40–50 kW per rack, and the engineering decision is no longer whether to use it but where it still wins outright and how far you can responsibly push it before the fan-power curve, not the chip, makes the choice for you.
5.3
Rear-Door Heat Exchangers & Air-Assisted Liquid Cooling (The Bridge)
Rear-door heat exchangers and air-assisted liquid cooling are bridge technologies: they let a building that was never plumbed for full DLC still host 40–75 kW racks today, at the price of a dew-point discipline and a density ceiling you will hit again in one GPU generation.
5.4
Direct-to-Chip Liquid Cooling (DLC) — The 2026 Default
Direct-to-chip liquid cooling stopped being a choice in 2026. Once a rack draws past the air ceiling, the only open decisions are single-phase versus two-phase, how you plumb the rack, and how tightly you budget flow, delta-T, and pressure. Each of those forks sets a downstream serviceability, reliability, and capex bill you live with for the asset's life.
5.5
Immersion Cooling (Single-Phase & Two-Phase)
Immersion wins the PUE and density argument on paper and loses the deployment argument in practice — single-phase is a serviceable niche with real heat-reuse appeal, two-phase is a stalled technology whose enabling fluid the chemical industry is walking away from, and direct-to-chip beat both to the 2026 rack.
5.6
CDUs & the Secondary Loop
The CDU is the firewall between a $40M GPU loop and the dirty, corrosive, dew-pointed water of the building, and the four ways to draw that line (in-rack L2A, in-rack L2L, row-level L2L, central L2L) decide your fleet's stranded capacity, your blast radius, and whether a single pump failure throttles a training job.
5.7
Facility Water Loops & Warm-Water Cooling
The single temperature you pick for the facility water supply — chilled W17/W27 or warm W32/W40/W45 — is the master setpoint of the whole thermal plant: it decides how many hours a year you run compressors, how much water you evaporate, and whether your waste heat is worth selling or worth nothing.
5.8
Heat Rejection: Chillers, Dry Coolers, Towers, Adiabatic & Economizers
Heat rejection is where the loop temperature you chose upstream gets cashed out against the climate you sited in — and the fork between a chiller, a dry cooler, a wet tower, and an adiabatic hybrid is really a fork between burning kilowatt-hours and burning liters, paid every hour for the life of the plant.
5.9
Heat Reuse & Waste-Heat Recovery (Engineering)
The grade of the heat you can deliver — not the quantity, which is enormous either way — decides whether your waste heat is a sellable district-heating commodity or a thermodynamic nuisance you pay to dump, and that grade was already fixed upstream by the facility-water temperature you chose before the slab was poured.
5.10
Retrofitting Air-Cooled Facilities for Liquid
A liquid retrofit is a negotiation with four fixed quantities the original building locked in (floor strength, plenum volume, electrical headroom, available water), and whichever one runs out first sets your real density ceiling and strands the rest.
5.11
Thermal Design, Reliability, Leak Detection & Commissioning
A liquid-cooled AI hall lives or dies on three things the brochure never shows you: the thermal budget that must close end-to-end at the worst-case branch, the leak-and-loss-of-cooling failure modes that can trip a 1 kW GPU in seconds, and a commissioning sequence rigorous enough to find those failures before the GPUs do.
5.12
Cooling-Controls Transient Dynamics & Setpoint Stability
A direct-to-chip loop has almost no thermal inertia to hide behind, so when thousands of GPUs slam their power in unison the cooling controls must answer in seconds — and the line between a stable answer and a self-oscillating one is set at design time by slew limits, loop tuning, and a dew-point margin you cannot tune away.
5.13
Facility Piping & Pressure-System Mechanical Engineering
Once a data hall is plumbed for liquid, the cooling problem becomes a pressure-system mechanical-engineering problem — and the code you build to, the surge you fail to model, and the metal you couple to the wrong metal are the three quiet ways a 132 kW rack gets shut down by a pipe rather than a chip.