Chapter 13.5
Cooling Acceptance: Air, Liquid-to-Chip & CDU Commissioning
Cooling acceptance is the one part of commissioning where the facility cannot test what it is built to do: a load bank rejects heat to air, never into a cold plate. The liquid loop, the CDU controls, and the worst-case branch therefore see realistic transient heat flux only when real GPUs arrive, which makes mechanical Cx and GPU burn-in a single overlapping gate rather than two sequential ones.
What you'll decide here
- Where the mechanical-Cx-to-GPU-burn-in boundary actually sits: a clean handoff at full flow on a load bank, or an explicit overlapping gate where the liquid loop is only proven once real silicon is dissipating into the cold plates.
- The fluid-cleanliness acceptance criterion you will hold the secondary loop to before coolant first touches a cold plate (conductivity limit, particulate class, and how many flush cycles you budget for), because under-flushing fouls cold-plate microchannels you cannot clean in place.
- Whether you attend the CDU factory witness test (FWT) or accept the unit on a datasheet: the fork that decides whether a control-loop or pump-curve defect surfaces in a factory bay or in your live hall.
- How you prove the worst-case (hydraulically furthest, highest-rejection) branch makes flow at full load, given that you cannot create full load until the cluster exists — the load-realism limit that reshapes the entire acceptance sequence.
- What the leak-detection and cooling-failover interlocks must demonstrate, and how they tie into cluster burn-in so a CDU trip throttles or parks GPUs before junction temperatures run away.
Every other acceptance domain in Part 13 can be exercised to its design point with surrogate load. Electrical acceptance drives the switchgear and UPS with load banks (Chapter 13.3); generators and microgrids are paralleled and islanded against resistive and reactive banks (Chapter 13.4); integrated systems testing pulls the plug on a fully-loaded building (Chapter 13.6). Cooling acceptance is the exception that defines the whole part. A facility load bank is a resistor stack with a fan: it converts megawatts into hot air and rejects that air into the room. It does not, and cannot, push heat through a cold plate into the secondary liquid loop. So the very thing the liquid plant exists to do — absorb a synchronized, transient, kilowatt-per-chip heat flux at the die and carry it to rejection — is the one thing the facility cannot demonstrate before the GPUs are racked.
This chapter is organized around that limit. We walk through airside acceptance, then the secondary-loop work that can be done dry or with surrogate heat (flushing, fluid-quality qualification, fill and purge, hydrostatic and pressure acceptance), then CDU acceptance and the worst-case-branch problem, and finally the leak-integrity, failover, and burn-in interlocks that close the gate. Each fork carries a downstream cost that comes due when real silicon arrives. Mechanical commissioning and GPU burn-in (Chapter 13.8) are not adjacent phases with a clean baton-pass; they overlap, by physics, because the liquid loop's true acceptance test is the first dense training run.
The acceptance map: what can be proven, and with what load
Cooling acceptance spans two physically distinct systems joined at the CDU. The facility water system (FWS) — chillers or dry coolers, towers, the primary loop, pumps, and the airside plant — is conventional mechanical Cx, and most of it can be driven to design with surrogate load: load banks dump heat into the room for the air handlers to reject, and the primary loop can be exercised by the CDU's own heat exchanger or by temporary process loads. The technology cooling system (TCS) — the secondary loop the CDU isolates from facility water, the in-rack manifolds, the quick-disconnects, and the cold plates themselves — is where the load-realism limit bites. You can flush it, fill it, pressure-test it, and run the pumps; you cannot subject the cold plates to a realistic per-die transient without dies dissipating into them. The CDU/TCS separation and loop architecture are engineered in Chapter 5.6; here we accept what was built there.
| Acceptance item | System | Provable pre-GPU? | Surrogate used | What only real GPUs reveal |
|---|---|---|---|---|
| Airside / room cooling capacity | FWS (air) | Yes — fully | Load banks reject to air | Nothing material; air is the load bank's native sink |
| Primary loop, heat rejection, free-cooling changeover | FWS (liquid) | Yes, to design heat | CDU HX or process load | Actual free-cooling sequencing across a full year of seasons |
| Flushing, fluid quality, fill/purge | TCS | Yes — must precede GPUs | Deionized water then coolant | Long-term chemistry drift, biofouling onset |
| Hydrostatic / pressure-integrity test | Both | Yes, must precede fill | Hydrostatic pressure | Nothing; integrity depends on pressure, not heat |
| CDU flow, head, pump redundancy | CDU | Yes — at rated flow | Pump-only or balancing valves | Control response to a real synchronized load slam |
| Worst-case-branch flow at full load | TCS | Partially | Throttling to mimic full draw | True simultaneous full-rack rejection across all branches |
| Loop thermal-hydraulic transient stability | CDU + TCS | No | — | Setpoint stability under a kW/chip step (Chapter 5.12) |
| Leak-detection + cooling-failover interlock to throttle | TCS + IT | Partially | Manual trip injection | GPU throttle/park actually fires before Tj runaway |
Airside acceptance: the part that behaves
Even a fully liquid-cooled hall has a residual air load — roughly 15–17 kW per GB200 NVL72 rack stays on air (NICs, DIMMs, PSUs, optics, switch trays), and storage, networking, and any modest-density inference rows may be entirely air-cooled (Chapter 5.2). Airside acceptance is the conventional, well-understood half of this chapter, and it is genuinely provable pre-GPU because air is the load bank's native sink. The work: verify CRAH/RDHx/in-row capacity at design heat with load banks placed to mimic the rack thermal map; commission containment (hot/cold-aisle or rear-door) for leakage and bypass; tune supply-air setpoints against the ASHRAE A1–A4 envelope; and prove airflow balance so no rack starves. For hybrid halls running DLC plus rear-door exchangers (Chapter 5.3), the RDHx water side is part of this acceptance and its condensation/dew-point margin is set here.
The decision that matters in airside acceptance is how much residual-air capacity you commission relative to the liquid fraction. Over-commission and you have paid for air-handling you will idle as the hall liquid-cools more of the load through the density ramp; under-commission and a generation step-up that shifts the air/liquid split — or a cold-plate fault that dumps a rack's load to air — finds the room plant short. The conservative posture matches air capacity to the worst-case air fraction across the planned ramp, not to day-one steady state.
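The sizing rule in the paragraph above can be sketched in a few lines. This is a minimal illustration, not a design tool: the `RampStage` type, the stage names, the rack counts, the per-rack powers, and the air fractions are all hypothetical values chosen to show that the worst air load can fall in the middle of the ramp rather than at day one or full build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RampStage:
    """One step of a hypothetical density ramp (all values illustrative)."""
    name: str
    racks: int
    rack_kw: float       # total heat per rack, kW
    air_fraction: float  # share of that heat rejected to room air, 0..1


def air_load_kw(stage: RampStage) -> float:
    """Heat the room plant must reject for this stage, in kW."""
    return stage.racks * stage.rack_kw * stage.air_fraction


def required_air_capacity_kw(ramp, margin=0.0):
    """Return (worst stage, capacity) sized to the largest air load in the ramp."""
    if not ramp:
        raise ValueError("ramp must contain at least one stage")
    worst = max(ramp, key=air_load_kw)
    return worst, air_load_kw(worst) * (1.0 + margin)


# Hypothetical ramp: the hybrid phase, not day one or full build, sets the size.
ramp = [
    RampStage("day one", racks=40, rack_kw=60.0, air_fraction=0.40),
    RampStage("hybrid phase", racks=100, rack_kw=60.0, air_fraction=0.40),
    RampStage("full liquid", racks=100, rack_kw=130.0, air_fraction=0.12),
]
worst, capacity_kw = required_air_capacity_kw(ramp, margin=0.10)
print(f"size air plant for '{worst.name}': {capacity_kw:.0f} kW")
```

The `margin` argument is where an allowance for a cold-plate fault dumping a rack's load to air would go; the value used here is an assumption, not a recommendation.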
Secondary-loop flushing and fluid quality: the gate before first coolant
Before a drop of working coolant touches a cold plate, the secondary loop must be flushed and qualified, and this is the single most under-budgeted step in cooling acceptance. The cold-plate microchannels that make DLC work — sub-millimeter passages that drive the convective coefficient — are precisely what particulate and biological fouling block, and once a cold plate is fouled you cannot clean it in place; you replace it, in a live rack, with the loop drained. The flush is therefore not housekeeping. It is the gate that protects the most expensive and least serviceable surface in the building.
Practice converging in 2025–2026 is a multi-stage flush: circulate deionized water (commonly specified at ≥0.5 MΩ·cm resistivity, which is ≤2 µS/cm conductivity) through manifolds and hoses, then through the full loop, until effluent conductivity stabilizes below a limit in the single-digit µS/cm range (a 5 µS/cm target is widely cited), with particulate held to a declared ISO 4406 cleanliness class. Only then is the system charged with the working fluid, typically PG25 (25% propylene glycol), chosen for the freeze/biocide/material-compatibility envelope DLC loops need (Chapter 5.4). ASHRAE TC 9.9 frames the target chemistry through its water-quality classes; the flush is what gets you into class and the ongoing fluid-analysis program is what keeps you there.
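Two pieces of the flush criterion lend themselves to a short sketch: the unit conversion between resistivity and conductivity, and a test for whether the effluent has stabilized rather than merely dipped under the target once. The function names, the three-sample window, and the 0.5 µS/cm spread tolerance are assumptions made for illustration; the actual stabilization rule belongs in the project's flush procedure.

```python
def resistivity_to_conductivity_us_cm(resistivity_mohm_cm: float) -> float:
    """Convert resistivity in MΩ·cm to conductivity in µS/cm (simple reciprocal)."""
    if resistivity_mohm_cm <= 0:
        raise ValueError("resistivity must be positive")
    return 1.0 / resistivity_mohm_cm


def flush_is_stable(samples_us_cm, limit_us_cm=5.0, window=3, max_spread_us_cm=0.5):
    """True when the last `window` effluent samples are all at or under the
    limit and agree with each other to within `max_spread_us_cm`."""
    if len(samples_us_cm) < window:
        return False
    tail = samples_us_cm[-window:]
    return max(tail) <= limit_us_cm and (max(tail) - min(tail)) <= max_spread_us_cm


# 0.5 MΩ·cm supply water corresponds to 2 µS/cm.
print(resistivity_to_conductivity_us_cm(0.5))

# Hypothetical effluent readings over successive flush cycles, µS/cm.
readings = [40.0, 12.0, 6.0, 4.8, 4.6, 4.5]
print(flush_is_stable(readings))
```

Requiring several consecutive samples is the point of the check: a single reading under the target says little about a loop that is still shedding contamination.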
Hydrostatic and pressure-integrity acceptance
Pressure-integrity is the one cooling-acceptance item that is fully provable pre-GPU and entirely independent of heat, because it tests the pressure boundary, not the thermal duty. The charged-piping code basis is ASME B31.x in North America or the EU PED / EN 13480 fork in Europe (Chapter 5.13), and the hydrostatic test typically pressurizes to 1.5× the rated working pressure and holds — durations of several hours are common practice — watching for any decay that betrays a joint, a gasket, or a quick-disconnect that did not seat. The acceptance sequence is strict and ordered: pressure-test before fill, fill before flush-to-quality, flush before coolant charge. Reorder it and you either flush a loop you have not proven leak-tight or charge working fluid into a loop you have not cleaned.
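The hold described above reduces to two calculations: the test pressure and the allowable decay over the hold. The sketch below assumes a hypothetical 1% decay tolerance and gauge readings in kPa; neither figure comes from a code or from this chapter, and a real procedure would also correct for fluid temperature drift, which this sketch ignores.

```python
def hydro_test_pressure(rated_working_pressure_kpa: float, factor: float = 1.5) -> float:
    """Test pressure as a multiple of rated working pressure."""
    if rated_working_pressure_kpa <= 0:
        raise ValueError("rated working pressure must be positive")
    return factor * rated_working_pressure_kpa


def hold_passes(readings_kpa, max_decay_fraction=0.01):
    """True when pressure never falls more than `max_decay_fraction` below
    the first reading of the hold (no temperature compensation)."""
    if not readings_kpa:
        raise ValueError("need at least one reading")
    start = readings_kpa[0]
    return (start - min(readings_kpa)) / start <= max_decay_fraction


test_kpa = hydro_test_pressure(600.0)            # hypothetical 600 kPa rating
print(test_kpa)
print(hold_passes([900.0, 899.0, 898.5, 898.0]))  # small drift: passes
print(hold_passes([900.0, 880.0, 860.0]))         # steady decay: fails
```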
The decision embedded here concerns the quick-disconnects. A GB200-class rack carries on the order of 150–200 dripless quick-disconnects, and every one is a potential leak path that a hydrostatic hold exercises but a blind-mate cycle stresses differently. The fork: accept the QDs as installed on a single hydrostatic hold, or cycle a sample (mate/de-mate under pressure) to catch couplings that seal statically but weep after a service cycle. The second path costs acceptance time; the first risks a leak during the first board swap. Given that serviceability is the whole point of dripless QDs, cycling a representative sample is the defensible call.
CDU acceptance: factory witness, flow verification, and the worst-case branch
The CDU is the seam of the entire cooling system — it isolates the technology loop from facility water, sets secondary flow and temperature, and carries the controls that must respond to load. It is also, as Uptime Intelligence has flagged, the component most likely to complicate commissioning, because many CDU vendors arrived from outside the data-center world and some had never integrated a unit into a complex fluid network before. That makes the factory witness test (FWT) fork consequential: witness the CDU's flow, head, pump-redundancy failover, and control response in the vendor's bay, or accept it on a datasheet and discover a defect in your live hall. The cost asymmetry is stark — a pump-curve or PID defect found at the factory is a vendor rework; found in the field it is a hall-level schedule hit with the cluster waiting. For any first-of-a-kind CDU model or vendor, FWT is the rational default.
On site, CDU acceptance proves rated flow and head, pump N+1 failover (kill the lead pump, confirm the lag pump holds flow without a thermal excursion), filtration and dew-point control, and the leak-detection integration. Flow verification is staged in increments — load added 25% → 50% → 75% → 100% with temperature differential, flow rate, and pressure drop logged across the CDU, the piping, and the rack manifolds at each step. But that staging runs against surrogate or balancing-valve load, which brings us to the hardest problem in the chapter.
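At each load step, the logged temperature differential can be sanity-checked against a simple heat balance, ΔT = Q / (ṁ·cp). The sketch below assumes round-figure PG25 properties (density 1020 kg/m³, specific heat 3.9 kJ/kg·K), a hypothetical 1000 kW design load, a constant 1500 L/min flow, and a 10% agreement tolerance. All of these are illustrative, and the real fluid properties should come from the coolant datasheet at the operating temperature.

```python
def expected_delta_t_k(heat_kw, flow_lpm, density_kg_m3=1020.0, cp_kj_per_kg_k=3.9):
    """Coolant temperature rise for a given heat load and volumetric flow."""
    if flow_lpm <= 0:
        raise ValueError("flow must be positive")
    mass_flow_kg_s = density_kg_m3 * flow_lpm / 60_000.0  # L/min -> m^3/s -> kg/s
    return heat_kw / (mass_flow_kg_s * cp_kj_per_kg_k)    # kW / (kW/K) = K


def step_within_tolerance(measured_dt_k, heat_kw, flow_lpm, tol_fraction=0.10):
    """True when the measured rise agrees with the heat balance within tolerance."""
    expected = expected_delta_t_k(heat_kw, flow_lpm)
    return abs(measured_dt_k - expected) <= tol_fraction * expected


design_heat_kw = 1000.0  # hypothetical surrogate load at 100%
flow_lpm = 1500.0        # hypothetical constant secondary flow
for fraction in (0.25, 0.50, 0.75, 1.00):
    dt = expected_delta_t_k(design_heat_kw * fraction, flow_lpm)
    print(f"{fraction:.0%} load: expected rise {dt:.2f} K")
```

A measured rise well outside the expected value at a given step usually points at a flow-meter, sensor-placement, or load-measurement problem before it points at the CDU itself.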
The worst-case branch. A liquid loop balances flow across many parallel branches; the branch that is hydraulically furthest from the CDU and carries the highest heat rejection is the one most likely to starve at full system load. Acceptance practice instruments that worst-case branch and verifies it makes its minimum flow when the whole system is loaded. The catch is the load-realism limit: you cannot create full simultaneous load across every branch without the full cluster, so the pre-GPU worst-case-branch test must simulate full draw, typically by throttling other branches with balancing valves to force the hydraulically furthest node into its worst case, or by running a dummy thermal load. This proves the hydraulics under a static worst case. It does not prove the branch holds flow when every rack is simultaneously rejecting a synchronized training transient; that proof is deferred into the GPU-burn-in overlap.
| Acceptance item | Surrogate-load result | Deferred to GPU burn-in | Consequence of skipping the deferred test |
|---|---|---|---|
| CDU rated flow & head | Proven at 100% flow | — | None — flow is heat-independent |
| Pump N+1 failover | Proven (kill lead pump) | Failover under live thermal load | Failover may hold flow but not Tj margin under real heat |
| Worst-case-branch flow | Static worst case via throttling | Dynamic worst case, all racks live | A branch starves only when the whole hall slams together |
| Control-loop / setpoint stability | Not provable | Tuning under kW/chip step | Hunting, oscillation, or dew-point excursion in production |
| Leak-detect to GPU-throttle interlock | Manual trip injection only | Real trip throttles/parks GPUs | Interlock fires too slow and Tj runs away on a real loss |
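The static worst-case-branch check in this section amounts to comparing each instrumented branch against its minimum flow and reporting the thinnest margin. The sketch below assumes hypothetical branch names, minimum flows, and measured flows in L/min; it covers only the static test, not the dynamic case deferred to burn-in.

```python
def branch_margins(measured_lpm, minimum_lpm):
    """Fractional flow margin per branch: 0.10 means 10% above minimum."""
    missing = set(minimum_lpm) - set(measured_lpm)
    if missing:
        raise KeyError(f"no measurement for branches: {sorted(missing)}")
    return {b: measured_lpm[b] / minimum_lpm[b] - 1.0 for b in minimum_lpm}


def worst_branch(measured_lpm, minimum_lpm):
    """Return (branch name, margin) for the branch with the least margin."""
    margins = branch_margins(measured_lpm, minimum_lpm)
    name = min(margins, key=margins.get)
    return name, margins[name]


# Hypothetical readings taken with the other branches throttled to full draw.
minimum = {"row A near": 80.0, "row B mid": 80.0, "row C far": 80.0}
measured = {"row A near": 96.0, "row B mid": 88.0, "row C far": 78.0}

name, margin = worst_branch(measured, minimum)
status = "PASS" if margin >= 0 else "FAIL"
print(f"{name}: margin {margin:+.1%} ({status})")
```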
Leak integrity, cooling failover, and the interlock with burn-in
Leak detection in a liquid-cooled hall is not a smoke-detector afterthought — it is a real-time interlock that must throttle or park the GPUs before a coolant loss drives junction temperatures past their limit. Two architectural choices set the acceptance work. First, positive vs. negative-pressure operation: a negative-pressure (sub-atmospheric) secondary loop draws air in on a breach instead of pushing coolant out, turning a spray onto live electronics into an air ingress — a fundamentally safer failure mode that some designs adopt specifically to de-risk leaks. Acceptance must confirm the pressure regime behaves as designed under a fault. Second, the detection-to-action chain: rope/spot leak sensors, flow and pressure anomaly detection, and the logic that converts a detection into a GPU power-cap, throttle, or park. Acceptance injects faults — manual trips, simulated sensor alarms, a forced CDU pump loss — and confirms the action fires fast enough.
This is exactly where the chapter's thesis becomes an operational gate. A cooling-failover test against a load bank proves the CDU swings to its standby pump and the alarm propagates. It does not prove that the GPU throttle-or-park interlock actually protects real silicon, because there is no real silicon to protect and the thermal time constants of a die-to-coolant path are not in play. The DLC loop has almost no chilled-water inertia to ride through a slam (Chapter 5.12), so the margin between a CDU trip and a thermal runaway is small and must be proven against real heat. The leak-and-failover interlock therefore cannot be fully accepted until it is wired to a cluster that can be throttled — which is the burn-in overlap.
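For planning purposes, the margin between a trip and a thermal limit can be framed as a time budget: the time the silicon takes to reach its limit with no coolant flow, minus the summed latency of the detect, decide, and throttle chain. The sketch below uses a deliberately crude lumped, adiabatic model and entirely hypothetical numbers (1000 W per device, 200 J/K lumped heat capacity, a 70 °C start, a 90 °C limit, and three latency stages). It is a rough budgeting aid only; as the text says, the real margin has to be proven against live heat.

```python
def time_to_limit_s(power_w, heat_capacity_j_per_k, t_start_c, t_limit_c):
    """Time for a lumped mass to heat from t_start_c to t_limit_c when no
    heat is removed at all (adiabatic assumption)."""
    if power_w <= 0 or heat_capacity_j_per_k <= 0:
        raise ValueError("power and heat capacity must be positive")
    if t_limit_c <= t_start_c:
        return 0.0
    return heat_capacity_j_per_k * (t_limit_c - t_start_c) / power_w


def interlock_margin_s(stage_latencies_s, budget_s):
    """Budget minus total chain latency; negative means the action is too slow."""
    return budget_s - sum(stage_latencies_s.values())


budget = time_to_limit_s(power_w=1000.0, heat_capacity_j_per_k=200.0,
                         t_start_c=70.0, t_limit_c=90.0)
chain = {"detect": 0.5, "logic": 0.2, "throttle": 1.0}  # hypothetical latencies, s
margin = interlock_margin_s(chain, budget)
print(f"budget {budget:.1f} s, chain {sum(chain.values()):.1f} s, margin {margin:.1f} s")
```

The lumped heat capacity is the weakest assumption here: a die can run ahead of the mass around it, so a positive margin in this model is not evidence that the interlock is fast enough.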
Deep dive: why the proxy training run is the only true cooling emulator
The load-realism limit is structural, not a gap you can engineer away with a better load bank, and understanding why clarifies the whole acceptance strategy. A resistive load bank reproduces the magnitude of heat but neither the path (it rejects to air, not through a cold plate) nor the dynamics (it steps in coarse resistive increments, not a millisecond-scale synchronized die transient). Reactive and AI-emulating load banks improve the electrical realism — they reproduce the power-factor and the synchronized current slam the BBU/BESS/GPU-capacitance stack must absorb (the canonical treatment of this dynamic-load realism gap lives in Chapter 13.6) — but even an AI-emulating load bank still rejects its heat to air. No load bank pushes a kilowatt-per-die transient through a microchannel cold plate into the secondary loop. The only emulator that does is a real cluster running a real job.
That is why the proxy/reference training run (Chapter 13.9) is the true acceptance test for the thermal-hydraulic system, the same way it is for the network fabric and the scheduler. A synchronized training step is the worst case for the cooling loop: every rack ramps and drops die heat together, the worst-case branch faces simultaneous full rejection, the CDU control loops face their hardest slam, and the dew-point margin faces its real excursion on the down-step. Acceptance criteria that bridge load-bank IST to first-real-workload should explicitly name the proxy run as the moment the cooling loop is finally proven — coolant inlet held in band (20–25 °C for GB200) under synchronized load, no throttling traceable to thermal-hydraulics, worst-case branch holding flow, and the leak-throttle interlock demonstrated against live silicon. Until that run, the cooling system is conditionally accepted, not accepted.
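The four proxy-run criteria named above can be written down as an explicit check against a run summary, which makes "conditionally accepted" a state with a definite exit condition. The `ProxyRunSummary` fields and the sample values below are hypothetical; only the 20–25 °C inlet band is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRunSummary:
    """Hypothetical summary of telemetry collected during the proxy run."""
    inlet_min_c: float
    inlet_max_c: float
    thermal_throttle_events: int      # throttles traced to thermal-hydraulics
    worst_branch_min_flow_lpm: float
    worst_branch_required_lpm: float
    interlock_demonstrated: bool      # leak-throttle interlock shown on live silicon


def unmet_criteria(run: ProxyRunSummary, inlet_band_c=(20.0, 25.0)):
    """Return the list of acceptance criteria the run did not meet."""
    low, high = inlet_band_c
    failures = []
    if run.inlet_min_c < low or run.inlet_max_c > high:
        failures.append("coolant inlet left band")
    if run.thermal_throttle_events > 0:
        failures.append("thermal-hydraulic throttling observed")
    if run.worst_branch_min_flow_lpm < run.worst_branch_required_lpm:
        failures.append("worst-case branch below minimum flow")
    if not run.interlock_demonstrated:
        failures.append("leak-throttle interlock not demonstrated")
    return failures


run = ProxyRunSummary(inlet_min_c=21.0, inlet_max_c=25.6, thermal_throttle_events=0,
                      worst_branch_min_flow_lpm=82.0, worst_branch_required_lpm=80.0,
                      interlock_demonstrated=True)
failures = unmet_criteria(run)
print("accepted" if not failures else f"conditionally accepted: {failures}")
```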
Deep dive: the acceptance sequence, ordered (and why order is load-bearing)
Cooling acceptance is one of the few domains where getting the order wrong silently invalidates downstream tests. The defensible sequence:
1. Pressure-integrity / hydrostatic: hold at 1.5× rated working pressure before any fill, so you never flush or charge a loop you have not proven leak-tight. Cycle a QD sample under pressure.
2. Flush to fluid quality: DI water circulation to the conductivity limit (≤5 µS/cm) and declared ISO 4406 particulate class, before coolant charge, so debris never reaches a cold plate.
3. Charge working fluid and purge: PG25 (or specified coolant), air-purge the loop, sample and label the fluid; trapped air destroys pump performance and the convective coefficient.
4. CDU acceptance at rated flow: flow, head, N+1 pump failover, filtration, dew-point control; FWT first for any new model.
5. Static worst-case-branch verification: throttle or dummy-load the hydraulically furthest node to its worst case; confirm minimum flow.
6. Surrogate failover and interlock injection: manual trips, simulated leak alarms, forced pump loss; confirm alarm propagation and action logic.
7. Deferred to the burn-in overlap: transient loop stability, dynamic worst-case branch, and the real GPU-throttle interlock, against the cluster.
The first six are mechanical Cx; the seventh is the burn-in overlap. Break the ordering (flush before pressure-test, charge before flush, accept the CDU before the loop is clean) and each violation invalidates the step that follows it. The order is not bureaucracy; it is the dependency graph of the physics.
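Because the order is a dependency graph, it can be checked mechanically against a commissioning schedule. The sketch below encodes the seven steps as a simple chain in which each step requires the one before it. That linear encoding is a simplification assumed here, and the step identifiers are hypothetical labels rather than terms from any Cx standard.

```python
# Each step maps to the set of steps that must already be complete.
PREREQS = {
    "hydrostatic": set(),
    "flush": {"hydrostatic"},
    "charge_purge": {"flush"},
    "cdu_acceptance": {"charge_purge"},
    "static_worst_branch": {"cdu_acceptance"},
    "surrogate_interlock": {"static_worst_branch"},
    "burn_in_overlap": {"surrogate_interlock"},
}


def order_violations(sequence):
    """Return (step, missing prerequisite) pairs for a proposed sequence."""
    done = set()
    violations = []
    for step in sequence:
        if step not in PREREQS:
            raise KeyError(f"unknown step: {step}")
        for missing in sorted(PREREQS[step] - done):
            violations.append((step, missing))
        done.add(step)
    return violations


planned = ["flush", "hydrostatic", "charge_purge", "cdu_acceptance",
           "static_worst_branch", "surrogate_interlock", "burn_in_overlap"]
for step, missing in order_violations(planned):
    print(f"'{step}' scheduled before '{missing}'")
```

Run against a schedule export, a check like this flags the flush-before-pressure-test case before it happens on site instead of after.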
Anti-patterns
The recurring cooling-acceptance failures all share a root cause: treating the liquid loop as if a load bank could accept it, or treating mechanical Cx and burn-in as cleanly separable. Four are worth naming:
- Signing off cooling on a load-bank pass. Declaring the cooling system accepted because the facility load-bank IST hit nameplate. The load bank proved FWS capacity and nothing about TCS transient behavior, worst-case-branch dynamics, or the throttle interlock. The first synchronized training run finds what the load bank could not.
- Under-flushing to recover schedule. Calling the loop clean before conductivity and particulate truly stabilize. The debt is paid as a cold-plate replacement campaign in a live hall, weeks later, after unexplained throttling.
- Accepting a first-of-kind CDU on a datasheet. Skipping factory witness on a new CDU model or vendor. A pump-curve or control defect that would have been a factory rework becomes a hall-level schedule hit with the cluster idle.
- Treating mechanical Cx and burn-in as a clean handoff. Closing cooling Cx and walking away before the deferred items (transient stability, dynamic worst-case branch, real throttle interlock) are proven against silicon. The gate has two halves; signing only the first one leaves the loop conditionally accepted while everyone behaves as if it is done.