Chapter 5.6
CDUs & the Secondary Loop
The CDU is the firewall between a $40M GPU loop and the dirty, corrosive, dew-pointed water of the building, and the four ways to draw that line (in-rack L2A, in-rack L2L, row-level L2L, central L2L) decide your fleet's stranded capacity, your blast radius, and whether a single pump failure throttles a training job.
What you'll decide here
- Where you draw the loop-isolation boundary — in-rack liquid-to-air (no facility water at the rack), in-rack liquid-to-liquid, row-level CDU, or central plant CDUs — and therefore your per-CDU blast radius, serviceability, and stranded-capacity exposure on the density ramp.
- The CDU sizing and redundancy posture: how much margin you carry over the rated rack load, whether you run N+1 internal pumps inside one cabinet or N+1 whole CDUs, and whether the pumps sit on UPS for thermal ride-through.
- The secondary-supply temperature setpoint relative to white-space dew point — the single control loop that, set wrong, condenses water onto live 800 VDC busbars and 1.2 kW GPUs.
- The fluid-chemistry and filtration regime (PG25 vs treated water, 50-micron vs sub-micron side-stream, additive replenishment cadence) that determines whether cold plates foul and clog over a 10-15 year life.
- How leak detection, dew-point control, and pump VFDs integrate into the BMS/DCIM and the GPU telemetry — because a CDU that cannot see the rack's coolant-inlet temperature cannot defend goodput.
Direct-to-chip liquid cooling (Chapter 5.4) gives you a closed loop of clean, conditioned coolant running through cold plates a millimetre above a 1.2 kW die. The building gives you facility water: chilled or warm, chemically treated for a cooling tower, full of the wrong dissolved solids, at a pressure and cleanliness the cold plates would never tolerate. These two fluids must exchange heat without ever mixing. The component that enforces that separation — and pumps, filters, monitors, and controls the clean side of it — is the Coolant Distribution Unit (CDU). It is the most operationally consequential box in a liquid-cooled hall that nobody outside the mechanical team thinks about, right up until a pump trips and a training run loses a thousand GPUs to a thermal throttle.
The CDU defines two loops. The primary or facility water system (FWS) is the building side — the loop that runs out to the chillers, dry coolers, or towers (Chapter 5.8). The secondary or technology cooling system (TCS) is the clean side — the loop that runs from the CDU heat exchanger out to the in-rack manifolds and cold plates (Chapter 5.4). The CDU is the membrane between them: a brazed-plate or shell-and-tube heat exchanger, a redundant pump set, a filtration package, an expansion/make-up provision, and a controls stack that holds the TCS supply temperature, flow, and pressure to the accelerator vendor's envelope. ASHRAE TC 9.9 and OCP both codify this FWS/TCS split because it is the line that makes everything downstream tractable: the facility side can be dirty, warm, and variable; the technology side is clean, tightly controlled, and the GPU vendor's warranty depends on it.
The membrane raises four decisions: where you place it (the four CDU architectures), how big and how redundant you make it, what fluid and filtration you run on the clean side, and how you control the dew-point margin and detect leaks. Each carries a downstream cost measured in stranded megawatts, blast radius, or throttled GPUs.
What the CDU actually does
Strip away the marketing and a CDU performs five functions, and you size and select it against all five — not just the headline kW.
- Isolate. The heat exchanger keeps facility water (FWS) and technology coolant (TCS) physically separate. A leak on the dirty primary side cannot contaminate the cold plates; a leak on the clean secondary side can spill only the secondary loop's own bounded coolant volume into the white space, never an open-ended supply of treated tower water. This is the entire reason the box exists.
- Pump. The CDU drives the secondary loop — it provides the flow and head to push coolant through the in-rack manifolds, the ~150-200 quick-disconnects, and the cold plates against their pressure-drop budget. For an NVL72-class rack that is roughly 80 L/min of flow at the rack; a row or central CDU multiplies that across many racks.
- Control temperature. The CDU modulates a facility-water control valve and pump speed (VFDs) to hold the TCS supply temperature to the GPU vendor's inlet spec — for GB200 that is roughly 20-25 °C today, with the warm-water roadmap (Chapter 5.7) pushing toward ~45 °C inlet. The approach temperature of the heat exchanger (how close the TCS supply can get to the FWS supply) is a first-order selection criterion: Google's 2 MW Project Deschutes CDU advertises a 3 °C approach, which directly buys warmer facility water and more free-cooling hours.
- Filter and condition. A side-stream filter keeps the secondary loop clean so cold-plate microchannels do not foul; the CDU also hosts make-up, de-gassing, and (often) chemical dosing for the TCS.
- Monitor and protect. Flow, supply/return temperature, differential pressure, conductivity, and leak sensors feed the BMS/DCIM and, increasingly, the cluster scheduler. The CDU is the sensor platform that tells the cluster whether its coolant is in spec before a chip throttles.
The master fork: where you draw the isolation boundary
There are four places to put the FWS/TCS membrane, and the choice is the defining decision of this chapter. They differ on whether facility water reaches the rack at all, on how many racks share one CDU (the blast radius), on serviceability, and on how gracefully the architecture absorbs the density ramp from 132 kW to 600 kW racks.
In-rack liquid-to-air (L2A). A self-contained CDU sits inside or atop the rack and rejects the captured heat back into the room air through an air-cooled heat exchanger. There is no facility water at the rack: the only utilities are power and room CRAH/CRAC capacity. This is the brownfield superpower: you can land a liquid-cooled rack in a hall with no plumbing, with no facility-water leak path into the white space, and with no facility-water commissioning. The cost is capacity and efficiency. An L2A unit is bounded by what the room air can absorb (commonly ~22-100 kW depending on the unit), it dumps the heat back into a hall that now needs more air cooling, and at a ~15 °C approach it is the least thermally efficient option. It is a bridge, not a destination (and it overlaps with AALC in Chapter 5.3).
In-rack liquid-to-liquid (L2L). A CDU in the rack exchanges heat to facility water. Facility water now reaches the rack, but the blast radius is one rack — a CDU failure throttles only its own GPUs. Capacity is bounded by what fits in a few U (often ~50-110 kW per unit, sometimes more), which made it natural for early DLC but is increasingly tight as a single NVL72 rack alone draws ~115 kW of liquid load. You also pay for many small CDUs and many facility-water drops.
Row-level (in-row) L2L. One larger CDU serves a row of racks. This is the 2026 mainstream for dense AI rows: an MW-class cabinet (Google's Project Deschutes is a 2 MW unit; commercial in-row units run ~600 kW to >1 MW) feeds 10-20+ racks through a row manifold. You amortize one high-quality, internally-redundant CDU across the row, with a CDU-to-rack run kept short (commonly ≤20 m). The cost of this option is blast radius and concurrent maintainability: lose a row CDU that has no N+1 partner and you lose the whole row, so the redundancy posture (below) becomes non-negotiable.
Central / plant-level L2L. A few very large CDUs (or a CDU plant) serve a whole hall or pod from a mechanical room, distributing TCS through a building-scale secondary loop. This maximizes amortization, redundancy pooling, and serviceability (the CDUs are out of the white space), at the cost of the largest blast radius, the longest secondary runs (more pumping head, more fluid volume, harder hydraulic balancing), and the heaviest commissioning. It blurs into the facility-loop design of Chapter 5.7.
| Architecture | Facility water at rack? | Typical capacity | Blast radius | Serviceability | Best fit |
|---|---|---|---|---|---|
| In-rack L2A (liquid-to-air) | No — rejects to room air | ~22–100 kW/rack | One rack | Hot-swap in white space; no plumbing | Brownfield bridge, no facility water, pilots — see 5.3 |
| In-rack L2L (liquid-to-liquid) | Yes — one drop per rack | ~50–110+ kW/rack | One rack | Per-rack service; many small units | Mixed-density halls; per-rack isolation priority |
| Row-level L2L (in-row) | Yes — row manifold | ~600 kW–2 MW/CDU | One row (mitigate with N+1) | Service in row; fewer, larger units | 2026 mainstream for dense AI rows (NVL72) |
| Central / plant L2L | Yes — building secondary loop | Multi-MW plant | Hall / pod (largest) | CDUs out of white space; pooled spares | Large purpose-built campuses; max amortization |
The honest summary of the L2L-vs-L2A trade is capacity-per-dollar at scale: liquid-to-liquid delivers on the order of 8-10x the heat-rejection capacity of liquid-to-air at under ~2x the cost once you are building dense rows, because L2A is fundamentally capped by the room air it rejects into. L2A wins only where its singular advantage — no facility water at the rack — is worth more than capacity and PUE, which is exactly the brownfield retrofit case (Chapter 5.10). For a purpose-built training hall the decision collapses to row-level vs central L2L, and that is a blast-radius-vs-amortization argument, not a thermodynamics one.
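To make the blast-radius column concrete, here is a minimal sketch that counts how many CDUs a hall needs and how many racks one unprotected CDU failure takes down under each placement. The racks-per-CDU values are illustrative assumptions, not vendor figures (the row-level value is simply picked from the 10-20+ range quoted above), and `cdu_footprint` is a hypothetical helper name.

```python
import math

# Assumed racks served per CDU for each architecture (illustrative only).
RACKS_PER_CDU = {
    "in_rack_l2a": 1,
    "in_rack_l2l": 1,
    "row_l2l": 16,       # assumption inside the 10-20+ racks-per-row-CDU range
    "central_l2l": 128,  # assumption: one plant CDU serving a whole pod
}


def cdu_footprint(total_racks: int, architecture: str) -> dict:
    """Return the CDU count and the racks lost to one unprotected CDU failure."""
    per_cdu = RACKS_PER_CDU[architecture]
    return {
        "cdus": math.ceil(total_racks / per_cdu),
        "blast_radius_racks": min(per_cdu, total_racks),
    }


# Compare all four placements for a hypothetical 256-rack hall.
hall = {arch: cdu_footprint(256, arch) for arch in RACKS_PER_CDU}
for arch, result in hall.items():
    print(arch, result)
```

The two columns move in opposite directions, which is the amortization-versus-blast-radius argument in numbers: fewer, larger CDUs mean fewer units to buy and service, and more racks exposed per failure.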
Sizing: the margin question nobody wants to pay for
Sizing a CDU is a four-variable problem, not a matter of picking the kW that matches the rack. The four are heat load, flow, approach temperature, and pressure/head, and you must satisfy all of them simultaneously at the worst-case facility-water temperature, not the design-day average. A CDU rated '2 MW at 18 °C facility water' may deliver far less on the hottest day when the tower can only supply warmer water; the rating is a curve, not a number, and you size against the corner of that curve that your climate and heat-rejection plant actually produce (Chapter 5.8).
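The "rating is a curve, not a number" point can be sketched as a lookup against the worst-case facility-water temperature. The curve points below are invented for illustration; a real curve comes from the vendor's data sheet at your flow and approach conditions. `capacity_kw` and `meets_load` are hypothetical helper names.

```python
from bisect import bisect_right

# Hypothetical rating curve: (facility-water supply in °C, deliverable kW).
RATING_CURVE = [(18.0, 2000.0), (22.0, 1700.0), (26.0, 1300.0), (30.0, 900.0)]


def capacity_kw(fws_supply_c: float, curve=RATING_CURVE) -> float:
    """Linearly interpolate deliverable capacity at a facility-water temperature."""
    temps = [t for t, _ in curve]
    if fws_supply_c <= temps[0]:
        # Colder than the rated point: conservatively claim only the rated value.
        return curve[0][1]
    if fws_supply_c > temps[-1]:
        # Never extrapolate past the hot end of the published curve.
        raise ValueError("facility water is warmer than the rated curve covers")
    i = min(bisect_right(temps, fws_supply_c), len(curve) - 1)
    (t0, k0), (t1, k1) = curve[i - 1], curve[i]
    return k0 + (k1 - k0) * (fws_supply_c - t0) / (t1 - t0)


def meets_load(load_kw: float, worst_case_fws_c: float) -> bool:
    """Check the load against the hottest-day corner, not the nameplate."""
    return capacity_kw(worst_case_fws_c) >= load_kw


print(meets_load(1600.0, 18.0), meets_load(1600.0, 24.0))
```

Under these assumed points, a 1.6 MW row fits at the nameplate condition but not at 24 °C facility water, which is the corner the sizing has to survive.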
The flow side is set by the accelerator: roughly 1.2-2.0 L/min per kW of liquid load for PG25 coolant at the target delta-T. The head side is set by the worst-case hydraulic path — the longest, most restrictive run from CDU through row manifold, in-rack manifold, UQDs, and cold plates — and getting it wrong starves the far racks (the ±5% branch-balance target in Chapter 5.4 is a CDU-plus-manifold co-design problem, not a manifold-only one). The approach temperature is the efficiency lever: a tighter HX approach (3 °C vs 5 °C) lets you run warmer facility water for the same chip inlet, which is what unlocks free cooling and heat reuse (Chapter 5.9).
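The flow and approach figures follow from a simple heat balance, heat = density x volumetric flow x specific heat x delta-T. The sketch below works that through for PG25. The fluid properties are approximate assumed values (about 1.02 kg/L and 3.9 kJ/(kg·K)); check the actual fluid data sheet at operating temperature. The 10 K delta-T is also an assumption, and all function names are hypothetical.

```python
# Approximate PG25 properties (assumptions; verify against the fluid data sheet).
PG25_DENSITY_KG_PER_L = 1.02
PG25_CP_KJ_PER_KG_K = 3.9


def flow_l_per_min(heat_kw, delta_t_k,
                   rho=PG25_DENSITY_KG_PER_L, cp=PG25_CP_KJ_PER_KG_K):
    """Volumetric flow needed to carry heat_kw at a given coolant delta-T."""
    litres_per_second = heat_kw / (rho * cp * delta_t_k)
    return litres_per_second * 60.0


def delta_t_k(heat_kw, flow_lpm,
              rho=PG25_DENSITY_KG_PER_L, cp=PG25_CP_KJ_PER_KG_K):
    """Inverse relation: the delta-T implied by a given load and flow."""
    return heat_kw * 60.0 / (rho * cp * flow_lpm)


def max_fws_supply_c(tcs_supply_c, approach_k):
    """Warmest facility water that still yields the target TCS supply."""
    return tcs_supply_c - approach_k


rack_flow = flow_l_per_min(heat_kw=115.0, delta_t_k=10.0)
print(f"{rack_flow:.0f} L/min, {rack_flow / 115.0:.2f} L/min per kW")
print(max_fws_supply_c(25.0, 3.0), max_fws_supply_c(25.0, 5.0))
```

At an assumed 10 K delta-T the result lands inside the 1.2-2.0 L/min per kW band, and the approach comparison shows the 2 K of warmer facility water that a 3 °C heat exchanger buys over a 5 °C one for the same chip inlet. Running the relation backwards with `delta_t_k` is a quick way to sanity-check a quoted rack flow against a quoted load.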
The decision that costs real money is oversize factor against the density ramp. A CDU plant sized exactly for today's 132 kW NVL72 rows has no headroom for the 190-230 kW VR200 generation, let alone ~600 kW Kyber. Buy the headroom now and you carry idle capex and worse part-load efficiency for two years; buy it later and you re-plumb a live hall mid-life. There is no clean answer — the practitioner question 'can you economically future-proof from 120 kW to 600 kW loops, or is a 2-3 year re-fit cycle now structural?' is genuinely open. The defensible move is to oversize the irreversible substrate (pipe risers, valve stations, mechanical-room footprint, facility-water capacity) while keeping the reversible CDU modules matched to current generation — the same reversible-vs-irreversible discipline as the slab and power chain in Chapter 1.1.
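The stranded-capacity arithmetic is easy to run for yourself. This minimal sketch counts how many racks a fixed-size CDU can carry at each generation's rack power. It assumes the liquid-cooled share of rack power stays at the NVL72 ratio quoted earlier (~115 kW of 132 kW), takes the top of the VR200 range, and uses a 2 MW CDU; all three are assumptions for illustration, and `racks_supported` is a hypothetical helper.

```python
import math

# Rack power by generation in kW, taken from the figures quoted in the text.
GENERATIONS_KW = {"NVL72": 132.0, "VR200": 230.0, "Kyber": 600.0}

# Assumption: the NVL72 liquid share (~115 of 132 kW) holds for later generations.
LIQUID_FRACTION = 115.0 / 132.0


def racks_supported(cdu_kw, rack_kw, liquid_fraction=LIQUID_FRACTION):
    """Whole racks one CDU can carry, ignoring redundancy and derating."""
    return math.floor(cdu_kw / (rack_kw * liquid_fraction))


supported = {gen: racks_supported(2000.0, kw) for gen, kw in GENERATIONS_KW.items()}
print(supported)
```

Under these assumptions the same cabinet goes from carrying a full row to carrying a handful of racks, while the risers and valve stations feeding it need capacity for the final state. That is the case for oversizing the irreversible substrate and keeping the CDU modules swappable.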
Redundancy: N+1 pumps inside the box vs N+1 boxes
Because a CDU sits in the goodput path, redundancy is mandatory. Where you put it is a fork with different cost and different blast radius.
Internal pump redundancy (N+1 within one CDU). Most quality CDUs ship with dual or N+1 pumps and often redundant power feeds per pump circuit (Project Deschutes runs fully redundant power feeds for each pump circuit), so a single pump or VFD failure does not stop flow. This protects against the most common failure mode — a pump — but the heat exchanger, the cabinet, and the controls are still single points. Lose the box and you lose the row.
Unit-level redundancy (N+1 CDUs). A spare CDU per row (or per N rows) covers the whole-box failure and enables concurrent maintainability — you can valve out and service a CDU without dropping the row. This is the posture that earns Tier-III/IV-class concurrent maintainability on the cooling side, and it is increasingly table stakes for training halls where the row is one job. The cost is a redundant MW-class cabinet, its facility-water drops, and the floor space.
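A unit-level N+1 claim can be checked mechanically: remove the largest CDU and see whether the rest still carry the load. The sketch below assumes all listed CDUs are manifolded onto a common secondary loop so they can back each other up, and it ignores temperature derating; `survives_single_failure` is a hypothetical helper.

```python
def survives_single_failure(unit_capacities_kw, load_kw):
    """True if the remaining CDUs carry the load after the largest unit is lost."""
    if len(unit_capacities_kw) < 2:
        # A single box has no unit-level redundancy, whatever its pump count.
        return False
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= load_kw


# One 2 MW CDU on a 1.8 MW row: internal pump redundancy only.
single_box = survives_single_failure([2000.0], 1800.0)
# Two 2 MW CDUs sharing the same row: a whole-box failure is covered.
n_plus_one = survives_single_failure([2000.0, 2000.0], 1800.0)
# Three 1 MW CDUs on 2.4 MW: looks redundant, but two units cannot carry it.
undersized = survives_single_failure([1000.0, 1000.0, 1000.0], 2400.0)
print(single_box, n_plus_one, undersized)
```

The same test doubles as a concurrent-maintainability check, since valving a unit out for service is equivalent to losing it.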
UPS-backed pumps for thermal ride-through. This is the subtle one. DLC loops have almost no thermal inertia — there is no big chilled-water buffer, so on loss of facility power the coolant stops moving and a 1 kW+ die can trip in seconds. Air halls tolerated a few seconds of CRAH coast-down; DLC does not. The fix is to put the CDU pumps (and ideally the heat-rejection plant's critical pumps) on UPS/BESS so flow continues through a power transient and across a generator start. Skipping this is the cooling-side equivalent of skipping ride-through on the GPUs — see the transient/ride-through treatment in Chapter 5.8 and Chapter 5.12.
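The lack of thermal inertia is easy to put a number on. The sketch below estimates how long a well-mixed secondary loop takes to warm by a given amount when the pumps keep running but heat rejection is lost. The loop volume, load, and allowable rise are invented for illustration, and the PG25 properties are approximate assumptions. Treat the result as an upper bound: if the pumps stop, the small coolant volume inside each cold plate heats far faster than the loop average.

```python
def seconds_to_limit(loop_volume_l, heat_kw, allowable_rise_k,
                     rho_kg_per_l=1.02, cp_kj_per_kg_k=3.9):
    """Time for a well-mixed loop to warm by allowable_rise_k with no heat rejection."""
    thermal_mass_kj_per_k = loop_volume_l * rho_kg_per_l * cp_kj_per_kg_k
    return thermal_mass_kj_per_k * allowable_rise_k / heat_kw


# Hypothetical row: 200 L of coolant, 1 MW of load, 15 K of headroom to throttle.
ride_through_s = seconds_to_limit(loop_volume_l=200.0, heat_kw=1000.0,
                                  allowable_rise_k=15.0)
print(f"{ride_through_s:.1f} s")
```

With these assumed numbers the buffer is on the order of ten seconds, which is why pump ride-through has to be designed together with the heat-rejection plant's restart time rather than in isolation.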
Dew point: the control loop that condenses water on your busbars
The single most important CDU control setpoint is the secondary-supply temperature relative to the white-space dew point. The coolant runs through manifolds, hoses, cold plates, and quick-disconnects, much of it exposed to room air. If the outer surface of any coolant-carrying component drops below the room's dew point, atmospheric moisture condenses on it: water beading on hoses above live 800 VDC busbars and 1.2 kW GPUs. The rule is absolute: hold the TCS supply temperature above the white-space dew point, with a safety margin, at all times. That keeps the cooling 100% sensible (no condensation, no latent load) and is one of the structural reasons warm-water cooling is winning: a ~45 °C supply is nowhere near any realistic dew point.
This couples the CDU to the room. The dew point depends on white-space humidity, so the CDU's minimum supply setpoint is a function of the air-side environmental control (and of any humidification policy). Run the room too humid and you raise the dew-point floor, forcing a warmer minimum coolant supply and giving up cold-plate margin; run a chilled-water trim loop too aggressively and you risk dipping below dew point on a transient. The dew-point margin is therefore a negotiated setpoint between the mechanical (CDU) and environmental (CRAH/RH) controls, and it is exactly the kind of cross-loop setpoint interaction that Chapter 5.12 treats as a stability problem.
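The coupling between room humidity and the coolant floor can be sketched with the Magnus approximation for dew point. The constants below are one commonly used Magnus parameter set, and the 3 K margin is an assumed placeholder rather than a recommendation; `min_supply_setpoint_c` is a hypothetical helper.

```python
import math

# One commonly used Magnus parameter set (temperatures in °C).
MAGNUS_A = 17.62
MAGNUS_B_C = 243.12


def dew_point_c(dry_bulb_c, rh_percent):
    """Approximate dew point from dry-bulb temperature and relative humidity."""
    gamma = (math.log(rh_percent / 100.0)
             + MAGNUS_A * dry_bulb_c / (MAGNUS_B_C + dry_bulb_c))
    return MAGNUS_B_C * gamma / (MAGNUS_A - gamma)


def min_supply_setpoint_c(dry_bulb_c, rh_percent, margin_k=3.0):
    """Lowest TCS supply setpoint that stays above dew point by margin_k (assumed)."""
    return dew_point_c(dry_bulb_c, rh_percent) + margin_k


for rh in (40.0, 50.0, 60.0):
    print(rh, round(dew_point_c(24.0, rh), 1), round(min_supply_setpoint_c(24.0, rh), 1))
```

Sweeping relative humidity at a fixed room temperature shows the floor rising with humidity, which is the cold-plate margin the humidification policy trades away.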
Fluid chemistry and filtration: the slow failure mode
Leaks and pump trips are the fast failure modes. The slow one is the secondary loop fouling itself over a 10-15 year life — and it is the one most likely to be under-engineered at commissioning because it does not show up for years. The TCS fluid is typically PG25 (25% propylene glycol / 75% deionized or treated water): the glycol provides freeze protection and biostatic properties, the water carries the heat. The trade is real — glycol raises viscosity and lowers specific heat, so a higher glycol fraction costs you pumping power and flow margin; you run the minimum glycol that meets freeze and biological requirements, not the maximum.
Three chemistry failure modes stalk the loop over its life, and the CDU is where you manage all three:
- Galvanic corrosion. Mixed metallurgy (copper cold plates, aluminium components, steel pipe, brazed-plate HX) plus a conductive fluid sets up galvanic cells. Corrosion inhibitors are dosed into the fluid, but inhibitors deplete — so this is a maintenance cadence, not a one-time fill.
- Biofilm. Warm water is a microbial habitat; biofilm fouls cold-plate microchannels and degrades heat transfer. The warm-water roadmap raises this risk exactly as it lowers the chiller bill, so biocide/biostat management gets more important as inlet temperatures climb.
- Particulate / additive depletion. Particulate from manufacturing residue, wear, and corrosion clogs the tightest channels in the cold plates; side-stream filtration (commonly 50-micron, sub-micron on premium units) catches it, but filters load and must be serviced, and additives must be replenished on schedule.
The open question the industry has not answered with public data is how often the TCS fluid must actually be replaced over a facility life, and what the real long-run material-compatibility failure rates are — vendor claims need independent validation. The practical consequence: the CDU's filtration spec, fluid-quality sampling cadence, and inhibitor-replenishment regime are commissioning decisions that determine whether you are servicing clogged cold plates in year three. Get the fluid program wrong and the failure shows up as a slow, fleet-wide rise in coolant delta-T and a quiet loss of goodput — the hardest kind to diagnose. Commissioning and fluid-quality acceptance are detailed in Chapter 5.11.
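Because the fouling signature is a slow drift rather than an alarm, one hedged way to surface it is a trend fit on coolant delta-T sampled at comparable load and flow. The sketch below uses an ordinary least-squares slope. The sample data is synthetic and the alert threshold is an assumed placeholder; `slope_per_day` and `fouling_suspected` are hypothetical names.

```python
def slope_per_day(days, delta_t_k):
    """Ordinary least-squares slope of delta-T (K) against time (days)."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(delta_t_k) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, delta_t_k))
    return sxy / sxx


def fouling_suspected(days, delta_t_k, limit_k_per_day=0.005):
    """Flag a sustained upward drift; the limit is an assumed placeholder."""
    return slope_per_day(days, delta_t_k) > limit_k_per_day


# Synthetic example: delta-T at constant load creeping up 0.01 K per day.
sample_days = list(range(0, 90, 10))
sample_delta_t = [10.0 + 0.01 * d for d in sample_days]
drift = slope_per_day(sample_days, sample_delta_t)
print(round(drift, 4), fouling_suspected(sample_days, sample_delta_t))
```

The comparison only means something if the samples are taken at matched load and flow, since delta-T moves with both; in practice that means normalizing or filtering the telemetry before fitting.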
Deep dive: positive- vs negative-pressure secondary loops, and why leak strategy starts at the CDU
The CDU does not just move coolant — it sets the pressure regime of the secondary loop, and that choice is the first line of the leak-defense strategy. In a conventional positive-pressure loop, the coolant in the cold plates and manifolds is above atmospheric pressure, so any breach sprays coolant outward onto the electronics. Detection and fast isolation are everything: leak-detection rope and point sensors in drip trays, at manifold joints, and under the rack feed the BMS/DCIM, and the control logic must valve out and (if needed) drain the affected zone before the spray reaches a busbar. Dripless, dry-break quick-disconnects (UQD/UQDB per OCP) exist precisely to bound the spill at every connection point.
A negative-pressure (sub-atmospheric) loop inverts the failure mode: the CDU holds the loop below atmospheric pressure, so a breach draws air in rather than pushing coolant out — a leak becomes an ingress of air the CDU can detect as a pressure/level anomaly, not a spray onto live silicon. The cost is mechanical complexity, tighter sealing requirements, and a smaller margin before pump cavitation. Several vendors have built negative-pressure CDUs specifically to de-risk leaks on 100+ kW racks where a positive-pressure spray onto an 800 VDC busbar is unacceptable.
Either way, the leak strategy is a CDU-plus-rack-plus-DCIM system, not a sensor you bolt on. The CDU provides the pressure regime and the make-up/level telemetry that turns a leak into an early, actionable signal; the rack provides drip containment and dripless couplings; the DCIM correlates a coolant-level drop with a leak sensor and a rising cold-plate temperature into a single alarm with a known blast radius. ML/IoT leak forecasting — catching the slow level decline before the alarm — is moving from differentiator to table stakes. Detailed leak engineering lives in Chapter 5.11; the life-safety overlap with fire and electrical hazard is in Chapter 6.5.
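The correlation step described above can be sketched as a small rule: several weak signals from one zone combine into a single action. Everything here is hypothetical: the field names, the thresholds, and the three action labels are assumptions for illustration, not any DCIM product's schema.

```python
from dataclasses import dataclass


@dataclass
class ZoneSnapshot:
    """Hypothetical per-zone telemetry gathered from CDU, rack, and DCIM."""
    level_drop_l_per_h: float   # coolant make-up/level decline at the CDU
    leak_sensor_wet: bool       # rope or point sensor in the zone's drip tray
    cold_plate_rise_k: float    # cold-plate temperature rise above baseline


def classify(snapshot, level_limit_l_per_h=0.5, temp_limit_k=2.0):
    """Fuse three signals into one action; thresholds are assumed placeholders."""
    signals = [
        snapshot.level_drop_l_per_h > level_limit_l_per_h,
        snapshot.leak_sensor_wet,
        snapshot.cold_plate_rise_k > temp_limit_k,
    ]
    corroborating = sum(signals)
    if snapshot.leak_sensor_wet and corroborating >= 2:
        return "isolate_zone"   # wet sensor plus a second signal: valve out
    if corroborating >= 1:
        return "investigate"    # one signal alone: raise a ticket, keep watching
    return "normal"


print(classify(ZoneSnapshot(0.8, True, 0.5)))
print(classify(ZoneSnapshot(0.8, False, 0.5)))
```

Requiring corroboration before isolating a zone is a design choice that trades a slower response for fewer false trips; a site that weights spray risk more heavily could isolate on the wet sensor alone.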
Controls and integration: the CDU as a telemetry platform
A modern CDU is a controls node, and how it integrates decides whether the cooling can defend the compute or merely report on it after the fact. At minimum it modulates pump speed (VFDs) and the facility-water control valve to hold TCS supply temperature, flow, and differential pressure to setpoint, while streaming supply/return temperatures, flow, dP, conductivity, filter state, coolant level, and leak status to the BMS/DCIM. The VFDs themselves matter beyond control: ultra-low-harmonic drives (Project Deschutes specifies IEEE 519 ULHD VFDs) keep the CDU from polluting facility power quality — a real concern when you have dozens of MW-class pump drives on a hall (Chapter 4.10 on power quality).
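As a rough illustration of the supply-temperature loop, the sketch below is a PI controller that drives the facility-water valve opening from the TCS supply error, with conditional integration so the integral term does not wind up while the valve is pinned. It is a teaching sketch, not a description of any vendor's control stack: the gains are arbitrary, real CDUs coordinate valve and pump speed together, and `SupplyTempPI` is a hypothetical class name.

```python
class SupplyTempPI:
    """PI loop: TCS supply temperature error -> facility-water valve opening (0..1)."""

    def __init__(self, kp, ki, out_min=0.0, out_max=1.0):
        self.kp = kp
        self.ki = ki
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0

    def step(self, setpoint_c, measured_c, dt_s):
        # Positive error means the coolant is too warm, so the valve should open.
        error = measured_c - setpoint_c
        trial = self.kp * error + self.ki * (self.integral + error * dt_s)
        # Anti-windup: skip integration only when the output is saturated
        # and the error would push it further into saturation.
        pushing_high = trial > self.out_max and error > 0
        pushing_low = trial < self.out_min and error < 0
        if not (pushing_high or pushing_low):
            self.integral += error * dt_s
        output = self.kp * error + self.ki * self.integral
        return min(self.out_max, max(self.out_min, output))


loop = SupplyTempPI(kp=0.2, ki=0.01)       # arbitrary illustrative gains
slightly_warm = loop.step(25.0, 26.0, 1.0)
far_too_warm = loop.step(25.0, 40.0, 1.0)  # saturates: valve fully open
too_cold = loop.step(25.0, 20.0, 1.0)      # saturates: valve fully closed
print(slightly_warm, far_too_warm, too_cold)
```

The same structure applies to the flow and differential-pressure loops, with pump speed as the actuator instead of the valve.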
The decision that separates a goodput-aware facility from a facilities-aware one is whether the CDU telemetry reaches the cluster scheduler, not just the BMS. If the scheduler can see a row's coolant-inlet temperature trending toward the throttle threshold, it can drain or de-rate the job before the GPUs throttle and stall a training step; if the data dies in the BMS, the first the cluster knows of a cooling problem is a 50% throttle on a thousand chips. This is the cooling-side instantiation of the DCIM/observability argument in Chapter 14.2 and the autonomy ladder in Chapter 14.13: the CDU's value is not just holding setpoint, it is being a sensor the cluster can act on.
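What "trending toward the throttle threshold" could look like on the scheduler side is sketched below: fit a line through recent coolant-inlet samples, extrapolate to the threshold, and pick an action from the time remaining. The threshold, the checkpoint time, and the action names are all assumptions for illustration; no real scheduler API is implied.

```python
def seconds_to_threshold(samples, threshold_c):
    """Extrapolate a least-squares trend; samples are (t_seconds, inlet_c) pairs.

    Returns 0.0 if already at or above the threshold, None if not warming.
    """
    times = [t for t, _ in samples]
    temps = [c for _, c in samples]
    latest_c = temps[-1]
    if latest_c >= threshold_c:
        return 0.0
    n = len(samples)
    mean_t = sum(times) / n
    mean_c = sum(temps) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    slope_k_per_s = sxy / sxx
    if slope_k_per_s <= 0:
        return None
    return (threshold_c - latest_c) / slope_k_per_s


def scheduler_action(eta_s, checkpoint_s=300.0):
    """Map time-to-threshold onto a hypothetical scheduler response."""
    if eta_s is None:
        return "run"
    if eta_s <= checkpoint_s:
        return "derate_now"            # no time left to checkpoint cleanly
    if eta_s <= 3 * checkpoint_s:
        return "checkpoint_and_drain"  # act while there is still lead time
    return "run"


inlet_samples = [(0.0, 24.0), (60.0, 24.5), (120.0, 25.0)]
eta = seconds_to_threshold(inlet_samples, threshold_c=30.0)  # assumed threshold
print(eta, scheduler_action(eta))
```

The useful property is lead time: the action is chosen while the coolant is still in spec, which is only possible if the inlet samples reach the scheduler at all.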
Where this sits in the loop
The CDU is the hinge of the whole thermal stack. Following the heat, upstream of it is the rack-side world (cold plates, in-rack manifolds, and quick-disconnects; Chapter 5.4), whose pressure-drop and flow-balance budgets the CDU must satisfy. Downstream is the facility water loop and its temperature class (Chapter 5.7), then heat rejection (Chapter 5.8) or heat reuse (Chapter 5.9). The CDU's approach temperature is the dial that connects them: a tighter approach lets the facility loop run warmer, which is what unlocks year-round free cooling and high-grade heat capture. Choose the CDU architecture and approach badly and you have quietly capped your PUE, your free-cooling hours, and your heat-reuse grade before the first GPU boots.