Chapter 13.5

Cooling Acceptance: Air, Liquid-to-Chip & CDU Commissioning

Cooling acceptance is the one part of commissioning where the facility cannot test what it is built to do. A load bank rejects heat to air, never into a cold plate, so the liquid loop, the CDU controls, and the worst-case branch see realistic transient heat flux only when real GPUs arrive. That makes mechanical Cx and GPU burn-in a single overlapping gate, not two sequential ones.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

Where the mechanical-Cx-to-GPU-burn-in boundary actually sits: a clean handoff at full flow on a load bank, or an explicit overlapping gate where the liquid loop is only proven once real silicon is dissipating into the cold plates.
The fluid-cleanliness acceptance criterion you will hold the secondary loop to before first coolant touches a cold plate — conductivity floor, particulate class, and how many flush cycles you budget for — because under-flushing fouls cold-plate microchannels you cannot clean in place.
Whether you attend the CDU factory witness test (FWT) or accept on a datasheet: the fork that decides whether a control-loop or pump-curve defect surfaces in a factory bay or in your live hall.
How you prove the worst-case (hydraulically furthest, highest-rejection) branch makes flow at full load, given that you cannot create full load until the cluster exists — the load-realism limit that reshapes the entire acceptance sequence.
What the leak-detection and cooling-failover interlocks must demonstrate, and how they tie into cluster burn-in so a CDU trip throttles or parks GPUs before junction temperatures run away.

Every other acceptance domain in Part 13 can be exercised to its design point with surrogate load. Electrical acceptance drives the switchgear and UPS with load banks (Chapter 13.3); generators and microgrids are paralleled and islanded against resistive and reactive banks (Chapter 13.4); integrated systems testing pulls the plug on a fully-loaded building (Chapter 13.6). Cooling acceptance is the exception that defines the whole part. A facility load bank is a resistor stack with a fan: it converts megawatts into hot air and rejects that air into the room. It does not, and cannot, push heat through a cold plate into the secondary liquid loop. So the very thing the liquid plant exists to do — absorb a synchronized, transient, kilowatt-per-chip heat flux at the die and carry it to rejection — is the one thing the facility cannot demonstrate before the GPUs are racked.

This chapter is organized around that limit. We walk airside acceptance, then the secondary-loop work that can be done dry or with surrogate heat — flushing, fluid-quality qualification, fill and purge, hydrostatic and pressure acceptance — then CDU acceptance and the worst-case-branch problem, and finally the leak-integrity, failover, and burn-in interlocks that close the gate. Each fork carries a downstream cost that comes due when real silicon arrives. Mechanical commissioning and GPU burn-in (Chapter 13.8) are not adjacent phases with a clean baton-pass; they overlap, by physics, because the liquid loop's true acceptance test is the first dense training run.

The acceptance map: what can be proven, and with what load

Cooling acceptance spans two physically distinct systems joined at the CDU. The facility water system (FWS) — chillers or dry coolers, towers, the primary loop, pumps, and the airside plant — is conventional mechanical Cx, and most of it can be driven to design with surrogate load: load banks dump heat into the room for the air handlers to reject, and the primary loop can be exercised by the CDU's own heat exchanger or by temporary process loads. The technology cooling system (TCS) — the secondary loop the CDU isolates from facility water, the in-rack manifolds, the quick-disconnects, and the cold plates themselves — is where the load-realism limit bites. You can flush it, fill it, pressure-test it, and run the pumps; you cannot subject the cold plates to a realistic per-die transient without dies dissipating into them. The CDU/TCS separation and loop architecture are engineered in Chapter 5.6; here we accept what was built there.

What cooling acceptance can prove before GPUs — and what it cannot

Acceptance item	System	Provable pre-GPU?	Surrogate used	What only real GPUs reveal
Airside / room cooling capacity	FWS (air)	Yes — fully	Load banks reject to air	Nothing material; air is the load bank's native sink
Primary loop, heat rejection, free-cooling changeover	FWS (liquid)	Yes — to design heat	CDU HX or process load	Real annualized climate sequencing over seasons
Flushing, fluid quality, fill/purge	TCS	Yes — must precede GPUs	Deionized water then coolant	Long-term chemistry drift, biofouling onset
Hydrostatic / pressure-integrity test	Both	Yes — must precede fill	Hydrostatic pressure	Nothing; integrity is pressure-not-heat dependent
CDU flow, head, pump redundancy	CDU	Yes — at rated flow	Pump-only or balancing valves	Control response to a real synchronized load slam
Worst-case-branch flow at full load	TCS	Partially	Throttling to mimic full draw	True simultaneous full-rack rejection across all branches
Loop thermal-hydraulic transient stability	CDU + TCS	No	—	Setpoint stability under a kW/chip step (Chapter 5.12)
Leak-detection + cooling-failover interlock to throttle	TCS + IT	Partially	Manual trip injection	GPU throttle/park actually fires before Tj runaway

The fork in every row is whether surrogate load suffices or whether the test is deferred into the GPU-burn-in overlap. 'Surrogate' = load bank, dummy thermal load, or pump-only circulation.

Why the load bank lies about the loop

A resistive load bank and a rack of GPUs reject the same number of watts, so a facility team is tempted to treat them as interchangeable for cooling acceptance. They are not. The load bank rejects to air, which the room's CRAH/RDHx plant handles; the GPUs reject to liquid, through a thermal-resistance stack — die, TIM, cold-plate baseplate, microchannel, coolant film — that the load bank never engages. Worse, the dynamics differ: a synchronized training step ramps tens of megawatts of die heat in milliseconds and drops it just as fast, while a load bank steps in coarse resistive increments. So even a 100%-of-nameplate load-bank test leaves the secondary loop's control valves, pump-VFD slew, and dew-point margin under synchronized slam untested — exactly the failure modes Chapter 5.12 warns about. The honest acceptance position: the load bank proves the FWS capacity envelope and nothing about TCS transient behavior.

Airside acceptance: the part that behaves

Even a fully liquid-cooled hall has a residual air load — roughly 15–17 kW per GB200 NVL72 rack stays on air (NICs, DIMMs, PSUs, optics, switch trays), and storage, networking, and any modest-density inference rows may be entirely air-cooled (Chapter 5.2). Airside acceptance is the conventional, well-understood half of this chapter, and it is genuinely provable pre-GPU because air is the load bank's native sink. The work: verify CRAH/RDHx/in-row capacity at design heat with load banks placed to mimic the rack thermal map; commission containment (hot/cold-aisle or rear-door) for leakage and bypass; tune supply-air setpoints against the ASHRAE A1–A4 envelope; and prove airflow balance so no rack starves. For hybrid halls running DLC plus rear-door exchangers (Chapter 5.3), the RDHx water side is part of this acceptance and its condensation/dew-point margin is set here.

The decision that matters in airside acceptance is how much residual-air capacity you commission relative to the liquid fraction. Over-commission and you have paid for air-handling you will idle as the hall liquid-cools more of the load through the density ramp; under-commission and a generation step-up that shifts the air/liquid split — or a cold-plate fault that dumps a rack's load to air — finds the room plant short. The conservative posture matches air capacity to the worst-case air fraction across the planned ramp, not to day-one steady state.
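A minimal sketch of that sizing rule, assuming a hypothetical three-stage ramp. Only the roughly 17 kW per-rack air figure echoes the text; the rack counts, the 22 kW step-up, and the names `RampStage` and `required_air_capacity_kw` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RampStage:
    name: str
    racks: int
    air_kw_per_rack: float  # residual heat left on air, per rack


def required_air_capacity_kw(stages):
    """Largest total air load seen at any stage of the planned ramp."""
    return max(s.racks * s.air_kw_per_rack for s in stages)


# Hypothetical ramp: rack counts and the 22 kW step-up are placeholders.
ramp = [
    RampStage("day one", racks=40, air_kw_per_rack=17.0),
    RampStage("full build", racks=80, air_kw_per_rack=17.0),
    RampStage("generation step-up", racks=80, air_kw_per_rack=22.0),
]

day_one_kw = ramp[0].racks * ramp[0].air_kw_per_rack
worst_case_kw = required_air_capacity_kw(ramp)
print(f"day one: {day_one_kw:.0f} kW, commission for: {worst_case_kw:.0f} kW")
```

The gap between the two printed numbers is the capacity a day-one sizing would leave uncommissioned. A fuller model would also add the fault case in which a rack's liquid load is dumped to air.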

Secondary-loop flushing and fluid quality: the gate before first coolant

Before a drop of working coolant touches a cold plate, the secondary loop must be flushed and qualified, and this is the single most under-budgeted step in cooling acceptance. The cold-plate microchannels that make DLC work — sub-millimeter passages that drive the convective coefficient — are precisely what particulate and biological fouling block, and once a cold plate is fouled you cannot clean it in place; you replace it, in a live rack, with the loop drained. The flush is therefore not housekeeping. It is the gate that protects the most expensive and least serviceable surface in the building.

Practice converging in 2025–2026 is a multi-stage flush: circulate deionized water (commonly specified at ≥0.5 MΩ·cm resistivity) through manifolds and hoses, then through the full loop, until effluent conductivity stabilizes below a floor in the single-digit µS/cm range (a 5 µS/cm target is widely cited), with particulate held to a declared ISO 4406 cleanliness class. Only then is the system charged with the working fluid — typically PG25 (25% propylene glycol) for the freeze/biocide/material-compatibility envelope DLC loops need (Chapter 5.4). ASHRAE TC 9.9 frames the target chemistry through its water-quality classes; the flush is what gets you into class and the ongoing fluid-analysis program is what keeps you there.
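The conductivity half of that gate can be sketched as below. The unit conversion is exact (conductivity in µS/cm is the reciprocal of resistivity in MΩ·cm, so 0.5 MΩ·cm corresponds to 2 µS/cm). The stabilization rule is an assumption: the three-sample window and the 0.5 µS/cm spread are placeholders, and the particulate check against the declared ISO 4406 class is left out.

```python
def resistivity_to_conductivity_us_cm(resistivity_mohm_cm):
    """Conductivity in µS/cm is the reciprocal of resistivity in MΩ·cm."""
    return 1.0 / resistivity_mohm_cm


def flush_conductivity_ok(samples_us_cm, limit_us_cm=5.0, window=3, max_spread=0.5):
    """True when the last `window` effluent samples are all at or under the
    limit AND agree with each other within `max_spread` (i.e. have stabilized).
    `window` and `max_spread` are assumed values, not from any standard."""
    if len(samples_us_cm) < window:
        return False
    recent = samples_us_cm[-window:]
    under_limit = max(recent) <= limit_us_cm
    stabilized = (max(recent) - min(recent)) <= max_spread
    return under_limit and stabilized


# Invented effluent readings over successive flush cycles, in µS/cm.
effluent = [48.0, 21.5, 9.2, 4.8, 4.1, 3.9, 3.8]

print(resistivity_to_conductivity_us_cm(0.5))   # DI supply spec expressed as µS/cm
print(flush_conductivity_ok(effluent[:6]))      # under 5 µS/cm but still falling
print(flush_conductivity_ok(effluent))          # under the limit and stable
```

The second call is the under-flush case in miniature: every recent sample is already under 5 µS/cm, but the readings are still moving, so the gate stays closed.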

The under-flush trap (and its cousin, the wrong-fluid trap)

Two fluid-side mistakes recur, and both are expensive precisely because they surface late. Under-flushing — declaring the loop clean before conductivity and particulate counts truly stabilize, usually to recover schedule — seeds the cold plates with debris that does not announce itself until weeks of operation have raised per-die thermal resistance and the cluster starts throttling for no obvious reason. The fix is a cold-plate replacement campaign across a live hall. Wrong-fluid or mixed-fluid — topping a PG25 loop with a different glycol, an incompatible inhibitor package, or plain water — quietly attacks gaskets and dissimilar-metal joints (Chapter 5.13) and shifts the freeze and biocide envelope. Hold the flush acceptance to a hard conductivity/particulate gate, sample and label the working fluid, and never let schedule pressure shorten the flush. The hours you save are borrowed against a cold-plate swap you will repay with interest.

Hydrostatic and pressure-integrity acceptance

Pressure-integrity is the one cooling-acceptance item that is fully provable pre-GPU and entirely independent of heat, because it tests the pressure boundary, not the thermal duty. The charged-piping code basis is ASME B31.x in North America or the EU PED / EN 13480 fork in Europe (Chapter 5.13), and the hydrostatic test typically pressurizes to 1.5× the rated working pressure and holds — durations of several hours are common practice — watching for any decay that betrays a joint, a gasket, or a quick-disconnect that did not seat. The acceptance sequence is strict and ordered: pressure-test before fill, fill before flush-to-quality, flush before coolant charge. Reorder it and you either flush a loop you have not proven leak-tight or charge working fluid into a loop you have not cleaned.
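As an illustration of the hold criterion, the sketch below computes the 1.5× test pressure and screens a logged hold for decay. The 600 kPa working pressure, the 5 kPa allowable decay, and the readings are all hypothetical; the actual allowable decay and any temperature compensation come from the governing code and the project test procedure, not from this sketch.

```python
def hydro_test_pressure_kpa(working_pressure_kpa, factor=1.5):
    """Test pressure as a multiple of rated working pressure."""
    return factor * working_pressure_kpa


def hold_passes(readings_kpa, test_pressure_kpa, max_decay_kpa):
    """Pass only if the hold started at or above test pressure and the
    pressure never fell more than `max_decay_kpa` below the first reading.
    No temperature compensation is applied (a simplifying assumption)."""
    if not readings_kpa or readings_kpa[0] < test_pressure_kpa:
        return False
    return (readings_kpa[0] - min(readings_kpa)) <= max_decay_kpa


test_kpa = hydro_test_pressure_kpa(600.0)      # hypothetical 600 kPa loop
tight_loop = [902.0, 901.0, 901.0, 900.0]      # invented hourly readings
weeping_qd = [902.0, 897.0, 890.0, 884.0]      # invented: steady decay

print(test_kpa)
print(hold_passes(tight_loop, test_kpa, max_decay_kpa=5.0))
print(hold_passes(weeping_qd, test_kpa, max_decay_kpa=5.0))
```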

The decision embedded here concerns the quick-disconnects. A GB200-class rack carries on the order of 150–200 dripless quick-disconnects, and every one is a potential leak path that a hydrostatic hold exercises but a blind-mate cycle stresses differently. The fork: accept the QDs as installed on a single hydrostatic hold, or cycle a sample (mate/de-mate under pressure) to catch couplings that seal statically but weep after a service cycle. The second path costs acceptance time; the first costs a leak during the first board-swap. Given that serviceability is the whole point of dripless QDs, cycling a representative sample is the defensible call.

CDU acceptance: factory witness, flow verification, and the worst-case branch

The CDU is the seam of the entire cooling system — it isolates the technology loop from facility water, sets secondary flow and temperature, and carries the controls that must respond to load. It is also, as Uptime Intelligence has flagged, the component most likely to complicate commissioning, because many CDU vendors arrived from outside the data-center world and some had never integrated a unit into a complex fluid network before. That makes the factory witness test (FWT) fork consequential: witness the CDU's flow, head, pump-redundancy failover, and control response in the vendor's bay, or accept it on a datasheet and discover a defect in your live hall. The cost asymmetry is stark — a pump-curve or PID defect found at the factory is a vendor rework; found in the field it is a hall-level schedule hit with the cluster waiting. For any first-of-a-kind CDU model or vendor, FWT is the rational default.

On site, CDU acceptance proves rated flow and head, pump N+1 failover (kill the lead pump, confirm the lag pump holds flow without a thermal excursion), filtration and dew-point control, and the leak-detection integration. Flow verification is staged in increments — load added 25% → 50% → 75% → 100% with temperature differential, flow rate, and pressure drop logged across the CDU, the piping, and the rack manifolds at each step. But that staging runs against surrogate or balancing-valve load, which brings us to the hardest problem in the chapter.
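The logged quantities at each step tie together through the sensible-heat balance, heat = mass flow × specific heat × temperature rise. The sketch below applies it to an invented four-step log and checks that no step was skipped. The fluid properties (about 1.02 kg/L and 3.9 kJ/(kg·K)) are rough placeholder values for a PG25-type coolant and should be replaced with the datasheet values at the actual loop temperature; every reading in `step_log` is made up.

```python
REQUIRED_STEPS = (25, 50, 75, 100)  # percent of load, as staged in the text


def heat_removed_kw(flow_l_min, delta_t_k, density_kg_per_l=1.02, cp_kj_per_kg_k=3.9):
    """Sensible heat carried by the coolant, in kW.
    Default density and cp are assumed placeholder properties."""
    mass_flow_kg_s = flow_l_min / 60.0 * density_kg_per_l
    return mass_flow_kg_s * cp_kj_per_kg_k * delta_t_k


def missing_steps(log):
    """Required load steps that have no logged row."""
    seen = {row["step_pct"] for row in log}
    return [s for s in REQUIRED_STEPS if s not in seen]


# Invented log: one row per step with flow, temperature rise, pressure drop.
step_log = [
    {"step_pct": 25, "flow_l_min": 80.0, "delta_t_k": 2.5, "dp_kpa": 61.0},
    {"step_pct": 50, "flow_l_min": 80.0, "delta_t_k": 5.0, "dp_kpa": 61.5},
    {"step_pct": 75, "flow_l_min": 80.0, "delta_t_k": 7.5, "dp_kpa": 62.0},
    {"step_pct": 100, "flow_l_min": 80.0, "delta_t_k": 10.0, "dp_kpa": 62.5},
]

heat_by_step = {
    row["step_pct"]: heat_removed_kw(row["flow_l_min"], row["delta_t_k"])
    for row in step_log
}
for pct, kw in heat_by_step.items():
    print(f"{pct:>3}% step: {kw:.1f} kW carried by the loop")
print("missing steps:", missing_steps(step_log))
```

Comparing the computed heat at each step against the surrogate load applied is a quick sanity check on the flow and temperature instrumentation before anything is signed.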

The worst-case branch. A liquid loop balances flow across many parallel branches; the branch that is hydraulically furthest from the CDU and carrying the highest rejection is the one most likely to starve at full system load. Acceptance practice instruments that worst-case branch and verifies it makes its minimum flow when the whole system is loaded. The catch is the load-realism limit: you cannot create full simultaneous load across every branch without the full cluster, so the pre-GPU worst-case-branch test must simulate full draw — typically by throttling other branches with balancing valves to force the hydraulically-furthest node into its worst case, or by running a dummy thermal load. This proves the hydraulics under a static worst case. It does not prove the branch holds flow when every rack is simultaneously rejecting a synchronized training transient — that proof is deferred into the GPU-burn-in overlap.
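A small sketch of the static check, assuming each instrumented branch reports a measured flow against a declared minimum. Branch names, minimums, and readings are hypothetical. The result covers only the static case; it says nothing about the dynamic case deferred to burn-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchReading:
    name: str
    min_flow_l_min: float       # declared minimum for the branch
    measured_flow_l_min: float  # measured with the loop forced to worst case

    @property
    def margin_l_min(self):
        return self.measured_flow_l_min - self.min_flow_l_min


def tightest_branch(readings):
    """Branch with the smallest flow margin: the one acceptance should watch."""
    return min(readings, key=lambda r: r.margin_l_min)


# Invented readings taken under the simulated full-draw condition.
readings = [
    BranchReading("row A, near CDU", 10.0, 12.6),
    BranchReading("row C, mid header", 10.0, 11.4),
    BranchReading("row F, far end", 10.0, 10.3),
]

worst = tightest_branch(readings)
static_pass = worst.margin_l_min >= 0.0
print(f"{worst.name}: margin {worst.margin_l_min:.1f} L/min, static pass = {static_pass}")
```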

CDU/TCS acceptance: surrogate-load test vs. real-silicon test

Acceptance item	Surrogate-load result	Deferred to GPU burn-in	Consequence of skipping the deferred test
CDU rated flow & head	Proven at 100% flow	—	None — flow is heat-independent
Pump N+1 failover	Proven (kill lead pump)	Failover under live thermal load	Failover may hold flow but not Tj margin under real heat
Worst-case-branch flow	Static worst case via throttling	Dynamic worst case, all racks live	A branch starves only when the whole hall slams together
Control-loop / setpoint stability	Not provable	Tuning under kW/chip step	Hunting, oscillation, or dew-point excursion in production
Leak-detect to GPU-throttle interlock	Manual trip injection only	Real trip throttles/parks GPUs	Interlock fires too slow and Tj runs away on a real loss

The same acceptance items, split by what a pre-GPU surrogate proves versus what is necessarily deferred into the burn-in overlap. This split is the operational definition of the load-realism limit.

Figure	What it describes	Source (year)
20–25 °C	GB200 NVL72 coolant inlet spec; deviation can throttle GPUs up to ~50%	NVIDIA OCP / Introl (2025)
~80 L/min	DLC flow per GB200 NVL72 rack (~1.2–2.0 L/min per kW design rule)	Dober / NVIDIA OCP (2025)
~2.4 MW	NVL72 CDU/row-level cooling capacity (per-rack heat is ~132 kW: ~115 kW liquid + ~17 kW air)	NVIDIA OCP / Introl (2025)
≤5 µS/cm	secondary-loop conductivity floor flushed to before coolant charge (DI ≥0.5 MΩ·cm)	Liquid-cooling commissioning practice, XD Thermal / Introl synthesis (2026)
1.5×	rated working pressure for hydrostatic acceptance hold (ASME B31.x / EN 13480 basis)	Liquid-cooling commissioning practice; ASME B31 (2025)
2–3 weeks	install + commissioning per GB200 NVL72 system; load staged 25→50→75→100%	Introl GB200 NVL72 deployment (2026)
~55%	single-phase direct-to-chip share of the liquid-cooling market (the loop you are commissioning)	DCD / IDTechEx (2026)
~96%	best-in-class training goodput the loop must protect; a cooling trip is lost goodput	SemiAnalysis ClusterMAX / CoreWeave (2025)

Leak integrity, cooling failover, and the interlock with burn-in

Leak detection in a liquid-cooled hall is not a smoke-detector afterthought — it is a real-time interlock that must throttle or park the GPUs before a coolant loss drives junction temperatures past their limit. Two architectural choices set the acceptance work. First, positive vs. negative-pressure operation: a negative-pressure (sub-atmospheric) secondary loop draws air in on a breach instead of pushing coolant out, turning a spray onto live electronics into an air ingress — a fundamentally safer failure mode that some designs adopt specifically to de-risk leaks. Acceptance must confirm the pressure regime behaves as designed under a fault. Second, the detection-to-action chain: rope/spot leak sensors, flow and pressure anomaly detection, and the logic that converts a detection into a GPU power-cap, throttle, or park. Acceptance injects faults — manual trips, simulated sensor alarms, a forced CDU pump loss — and confirms the action fires fast enough.
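"Fast enough" only becomes testable once it is written down as a latency budget. The sketch below screens a set of fault injections against one, assuming each injection records when the fault was applied and when the protective action fired. The 2-second budget and all timestamps are hypothetical; the real budget has to come from the thermal margin of the design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InjectionResult:
    fault: str
    injected_at_s: float
    action_at_s: Optional[float]  # None means the action never fired


def interlock_failures(results, budget_s):
    """Faults whose protective action was missing or slower than the budget."""
    failed = []
    for r in results:
        if r.action_at_s is None or (r.action_at_s - r.injected_at_s) > budget_s:
            failed.append(r.fault)
    return failed


# Invented results for the three injection types named in the text.
results = [
    InjectionResult("manual trip", injected_at_s=0.0, action_at_s=0.8),
    InjectionResult("simulated sensor alarm", injected_at_s=10.0, action_at_s=11.5),
    InjectionResult("forced CDU pump loss", injected_at_s=20.0, action_at_s=None),
]

print(interlock_failures(results, budget_s=2.0))
```

Treating a missing action the same as a late one keeps a silent interlock from passing by omission.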

This is exactly where the chapter's thesis becomes an operational gate. A cooling-failover test against a load bank proves the CDU swings to its standby pump and the alarm propagates. It does not prove that the GPU throttle-or-park interlock actually protects real silicon, because there is no real silicon to protect and the thermal time constants of a die-to-coolant path are not in play. The DLC loop has almost no chilled-water inertia to ride through a slam (Chapter 5.12), so the margin between a CDU trip and a thermal runaway is small and must be proven against real heat. The leak-and-failover interlock therefore cannot be fully accepted until it is wired to a cluster that can be throttled — which is the burn-in overlap.

The overlapping gate: mechanical Cx ↔ GPU burn-in

The instinct is to draw a clean line — finish mechanical cooling Cx, sign it off, then start GPU burn-in (Chapter 13.8). That line is fiction, and pretending it is real strands risk. The honest model is an explicit overlapping, sequenced gate: mechanical Cx proves everything surrogate load can prove (airside capacity, flush/fluid quality, pressure integrity, CDU rated flow, static worst-case branch, manual failover) and hands a conditionally accepted loop to burn-in. Burn-in then closes the deferred items — transient loop stability, dynamic worst-case branch under synchronized load, and the real leak-throttle interlock — using the GPUs as the only load source that exercises the cold plates. The acceptance artifact that survives this is a single combined criterion that bridges load-bank IST (Chapter 13.6) to first-real-workload, with the cooling sign-off explicitly conditioned on burn-in results. Treat the two phases as one gate with two halves, schedule them as overlapping, and the load-realism limit becomes a managed sequence instead of a surprise.
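One way to keep the two halves of the gate honest is to make "conditionally accepted" an explicit state rather than a footnote. The sketch below does that with the item lists taken from the paragraph above; the status names and the function `cooling_status` are illustrative, not a standard.

```python
from enum import Enum


class CoolingStatus(Enum):
    NOT_ACCEPTED = "not accepted"
    CONDITIONALLY_ACCEPTED = "conditionally accepted"
    ACCEPTED = "accepted"


# What surrogate load can prove (mechanical Cx half of the gate).
MECHANICAL_CX = (
    "airside capacity",
    "flush/fluid quality",
    "pressure integrity",
    "CDU rated flow",
    "static worst-case branch",
    "manual failover",
)

# What only real silicon can prove (burn-in half of the gate).
BURN_IN = (
    "transient loop stability",
    "dynamic worst-case branch",
    "real leak-throttle interlock",
)


def cooling_status(passed):
    """Map the set of passed items to the gate state."""
    if not all(item in passed for item in MECHANICAL_CX):
        return CoolingStatus.NOT_ACCEPTED
    if not all(item in passed for item in BURN_IN):
        return CoolingStatus.CONDITIONALLY_ACCEPTED
    return CoolingStatus.ACCEPTED


after_mechanical_cx = set(MECHANICAL_CX)
after_burn_in = set(MECHANICAL_CX) | set(BURN_IN)
print(cooling_status(after_mechanical_cx).value)
print(cooling_status(after_burn_in).value)
```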

Deep dive: why the proxy training run is the only true cooling emulator

The load-realism limit is structural, not a gap you can engineer away with a better load bank, and understanding why clarifies the whole acceptance strategy. A resistive load bank reproduces the magnitude of heat but neither the path (it rejects to air, not through a cold plate) nor the dynamics (it steps in coarse resistive increments, not a millisecond-scale synchronized die transient). Reactive and AI-emulating load banks improve the electrical realism — they reproduce the power-factor and the synchronized current slam the BBU/BESS/GPU-capacitance stack must absorb (the canonical treatment of this dynamic-load realism gap lives in Chapter 13.6) — but even an AI-emulating load bank still rejects its heat to air. No load bank pushes a kilowatt-per-die transient through a microchannel cold plate into the secondary loop. The only emulator that does is a real cluster running a real job.

That is why the proxy/reference training run (Chapter 13.9) is the true acceptance test for the thermal-hydraulic system, the same way it is for the network fabric and the scheduler. A synchronized training step is the worst case for the cooling loop: every rack ramps and drops die heat together, the worst-case branch faces simultaneous full rejection, the CDU control loops face their hardest slam, and the dew-point margin faces its real excursion on the down-step. Acceptance criteria that bridge load-bank IST to first-real-workload should explicitly name the proxy run as the moment the cooling loop is finally proven — coolant inlet held in band (20–25 °C for GB200) under synchronized load, no throttling traceable to thermal-hydraulics, worst-case branch holding flow, and the leak-throttle interlock demonstrated against live silicon. Until that run, the cooling system is conditionally accepted, not accepted.
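The four criteria named above can be written as a single pass/fail screen over proxy-run telemetry. This is a sketch under assumptions: the function name, argument shapes, and sample data are hypothetical, and attributing a throttle event to thermal-hydraulics is taken as already done upstream. Only the 20–25 °C band comes from the text.

```python
def proxy_run_findings(inlet_temps_c, thermal_throttle_events,
                       worst_branch_flows_l_min, branch_min_flow_l_min,
                       interlock_demonstrated, inlet_band_c=(20.0, 25.0)):
    """Return the list of criteria the proxy run failed (empty list = pass)."""
    low, high = inlet_band_c
    findings = []
    if any(t < low or t > high for t in inlet_temps_c):
        findings.append("coolant inlet left band")
    if thermal_throttle_events > 0:
        findings.append("thermal-hydraulic throttling observed")
    if min(worst_branch_flows_l_min) < branch_min_flow_l_min:
        findings.append("worst-case branch lost flow")
    if not interlock_demonstrated:
        findings.append("leak-throttle interlock not demonstrated")
    return findings


# Invented telemetry from two hypothetical proxy runs.
clean_run = proxy_run_findings(
    [21.8, 23.4, 24.1, 22.0], 0, [10.4, 10.1, 10.6], 10.0, True)
bad_run = proxy_run_findings(
    [21.8, 25.7, 24.1], 2, [10.4, 9.2], 10.0, True)

print(clean_run)
print(bad_run)
```

An empty list is what moves the loop from conditionally accepted to accepted; any entry keeps the condition open.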

Deep dive: the acceptance sequence, ordered (and why order is load-bearing)

Cooling acceptance is one of the few domains where getting the order wrong silently invalidates downstream tests. The defensible sequence:

1. Pressure-integrity / hydrostatic — 1.5× rated working pressure hold, before any fill, so you never flush or charge a loop you have not proven leak-tight. Cycle a QD sample under pressure.
2. Flush to fluid quality — DI water circulation to the conductivity floor (≤5 µS/cm) and declared ISO 4406 particulate class, before coolant charge, so debris never reaches a cold plate.
3. Charge working fluid & purge — PG25 (or specified coolant), air-purge the loop, sample and label the fluid; trapped air destroys pump performance and the convective coefficient.
4. CDU acceptance at rated flow — flow, head, N+1 pump failover, filtration, dew-point control; FWT first for any new model.
5. Static worst-case-branch verification — throttle/dummy-load the hydraulically-furthest node to its worst case; confirm minimum flow.
6. Surrogate failover & interlock injection — manual trips, simulated leak alarms, forced pump loss; confirm alarm propagation and action logic.
7. (Deferred / burn-in overlap) — transient loop stability, dynamic worst-case branch, and the real GPU-throttle interlock, against the cluster.

The first six are mechanical Cx; the seventh is the burn-in overlap. Break the ordering (flush before pressure-test, charge before flush, accept the CDU before the loop is clean) and each violation invalidates the steps that follow it. The order is not bureaucracy; it is the dependency graph of the physics.
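The dependency idea can be made mechanical. The sketch below encodes the seven steps as a prerequisite map and flags any step run before its prerequisite. It assumes a strictly linear chain (each step depends only on the one before it), which is a simplification of the numbered list; the step labels are shortened for readability.

```python
# Each step maps to the steps that must be complete before it starts.
PREREQS = {
    "pressure test": set(),
    "flush to quality": {"pressure test"},
    "charge and purge": {"flush to quality"},
    "CDU acceptance": {"charge and purge"},
    "static worst-case branch": {"CDU acceptance"},
    "surrogate failover injection": {"static worst-case branch"},
    "burn-in overlap": {"surrogate failover injection"},
}


def order_violations(sequence):
    """Return (step, missing prerequisites) for every step run too early."""
    done = set()
    violations = []
    for step in sequence:
        missing = PREREQS[step] - done
        if missing:
            violations.append((step, sorted(missing)))
        done.add(step)
    return violations


planned = list(PREREQS)  # dict order matches the numbered list
rushed = ["flush to quality", "pressure test"] + planned[2:]

print(order_violations(planned))
print(order_violations(rushed))
```

Running a proposed schedule through a check like this before the test window opens is cheaper than discovering the violation in the test record afterwards.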

Anti-patterns

The recurring cooling-acceptance failures all share a root cause: treating the liquid loop as if a load bank could accept it, or treating mechanical Cx and burn-in as cleanly separable. Four are worth naming:

Signing off cooling on a load-bank pass. Declaring the cooling system accepted because the facility load-bank IST hit nameplate. The load bank proved FWS capacity and nothing about TCS transient behavior, worst-case-branch dynamics, or the throttle interlock. The first synchronized training run finds what the load bank could not.
Under-flushing to recover schedule. Calling the loop clean before conductivity and particulate truly stabilize. The debt is paid as a cold-plate replacement campaign in a live hall, weeks later, after unexplained throttling.
Accepting a first-of-kind CDU on a datasheet. Skipping factory witness on a new CDU model or vendor. A pump-curve or control defect that would have been a factory rework becomes a hall-level schedule hit with the cluster idle.
Treating mechanical Cx and burn-in as a clean handoff. Closing cooling Cx and walking away before the deferred items (transient stability, dynamic worst-case branch, real throttle interlock) are proven against silicon. The gate has two halves; signing only the first one leaves the loop conditionally accepted while everyone behaves as if it is done.

The systems accepted here are engineered upstream: the density wall and cooling hierarchy in Chapter 5.1; air at the limit in Chapter 5.2 and the rear-door bridge in Chapter 5.3; direct-to-chip DLC, cold plates, manifolds, and QDs in Chapter 5.4; the CDU and secondary-loop isolation in Chapter 5.6; warm-water facility loops in Chapter 5.7; leak detection and the thermal-design commissioning sequence in Chapter 5.11; and the cooling-controls transient dynamics that the load-realism limit defers into burn-in in Chapter 5.12. The pressure-system code basis and water-hammer/surge analysis are in Chapter 5.13. Within Part 13: commissioning levels and governance in Chapter 13.1; scripts and ATPs in Chapter 13.2; electrical acceptance, which this chapter is sequenced after, in Chapter 13.3; IST and the dynamic-load realism gap in Chapter 13.6; GPU burn-in, which this chapter overlaps, in Chapter 13.8; and the proxy training run that is the true cooling emulator in Chapter 13.9. Day-2 liquid-cooling observability picks up in Chapter 14.2.