
Chapter 14.7

Capacity, Power & Thermal Management in Operation

In a power-bound facility the megawatts you energized are a fixed, capital-intensive ceiling — the operational job is to fill that ceiling with goodput without tripping it, and every fork in this chapter is a trade between how full you run the budget and how violently the workload can swing it.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Whether to oversubscribe the power budget at all — and if so, by how much — given that training leaves only ~3% headroom while inference leaves ~21%, so the same overbooking policy that is safe on an inference hall trips the breakers on a training hall.
What the rack-, row-, and facility-level power-cap hierarchy is, who is allowed to override it, and how fast it engages — because a software cap that reacts in milliseconds is the only thing standing between a synchronous all-reduce and a feeder trip.
How much stranded capacity you will tolerate, and which of the three strands (cooling-limited, power-chain-limited, fragmentation-limited) you are willing to spend capital to recover versus simply schedule around.
Whether the facility participates in demand response / curtailment, and which workloads are eligible — because the revenue is real but the wrong workload curtailed at the wrong moment destroys more goodput value than the demand-response payment is worth.
What forecasting cadence and reserved-capacity buffer the capacity-planning function runs at, so that the density ramp (40 kW → 132 kW → 600 kW racks) never arrives faster than the power and cooling substrate can absorb it.

By the time a facility reaches steady-state operation, its single most expensive and least reversible decision — how many megawatts it can energize — is already made and paid for. The interconnection slot is won or lost, the transformers are humming, the cooling plant is sized. What remains is an operational discipline that the chip-bound era never had to learn: how to run a fixed power budget hot enough to earn its capital back, without running it so hot that a workload transient trips it. This is the day-2 face of the power-bound thesis. The constraint is no longer how many GPUs you can buy or even how many you can cool — it is how completely you can convert energized megawatts into goodput, request by request and step by step.

This chapter treats four operational functions as a single problem because they share one ceiling: live power management and oversubscription (running the budget full), transient management in operation (keeping the swings inside the budget), stranded-capacity and thermal operations (recovering megawatts the design left on the table), and capacity planning, workload-aware operations, and demand response (matching the ramp and the grid). The physics of the transients themselves is canonical in Chapter 4.5 (storage and ride-through) and Chapter 5.12 (cooling-controls dynamics); here we manage them operationally.

The master fork: how full do you run the power budget?

A facility provisioned for 100 MW of IT load rarely draws 100 MW. Nameplate is the sum of every device's worst-case rating; real workloads almost never align their peaks. The gap between nameplate and realized peak is power headroom, and the central operational question is whether to sell that headroom — to oversubscribe, deploying more racks than the budget would support at nameplate, betting that they will not all peak together. Get the bet right and you have manufactured capacity out of statistics: the canonical result is that an inference cluster can host roughly 30% more servers under the same power budget with negligible performance loss. Get it wrong and the budget overruns, the protection scheme fires, and you convert a statistical near-miss into a real outage.

The fork is not whether oversubscription is a good idea in the abstract — it is whether your workload leaves headroom to sell. Here the two archetypes diverge sharply. A synchronous training job runs every GPU in lockstep: the all-reduce barrier forces the whole cluster to draw peak power at the same instant, then idle together while gradients sync. That correlated swing leaves almost nothing to overbook — measured headroom on training clusters is on the order of ~3%. Inference is the opposite: thousands of independent requests arrive at random phases, their prefill (compute-bound, high-power) and decode (memory-bound, lower-power) stages uncorrelated across the fleet, so the aggregate draw is smooth and well below the sum of peaks — headroom on the order of ~21%. The same overbooking policy is free money on one hall and a feeder trip on the other.
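The relationship between headroom and deployable servers can be sketched with simple arithmetic. The snippet below is a minimal illustration, not a sizing tool: it assumes headroom means the fraction of the budget left unused at realized peak, and that added servers draw like the existing ones. The function name is hypothetical.

```python
def extra_servers_fraction(headroom: float) -> float:
    """Fraction of additional servers that fit under the same budget.

    Assumes realized peak is (1 - headroom) of the budget and that new
    servers behave like the existing fleet (an illustrative assumption).
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return 1.0 / (1.0 - headroom) - 1.0


for name, h in [("synchronous training", 0.03), ("online inference", 0.21)]:
    print(f"{name}: {extra_servers_fraction(h):.1%} more servers")
```

Under these assumptions ~21% headroom maps to roughly 27% more servers and ~3% to roughly 3%. The arithmetic does not model the priority-based capping that the ~30% figure relies on, so treat it as an order-of-magnitude check only.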

Never oversubscribe a synchronous training hall on inference-derived headroom

The most expensive operational error in this chapter is importing an inference oversubscription policy onto a training cluster. Inference leaves ~21% headroom because its peaks are uncorrelated; a synchronous training run leaves ~3% because the all-reduce barrier correlates every GPU's peak to the same microsecond. Overbook a training hall as if it were inference and the first hard step after a checkpoint reload — when all ranks resume compute simultaneously — draws the entire fleet to peak at once and overruns the budget. The fix is not a bigger margin guess; it is to characterize the actual workload's correlated-peak profile (see the deep dive below) and size oversubscription to that, with a fast power cap as the backstop. → goodput framing in Chapter 14.1.

Oversubscription posture by workload archetype

Archetype	Peak correlation	Power headroom	Oversubscription posture	Backstop if budget overruns
Synchronous pre-training	High — all-reduce barrier aligns every GPU's peak	~3%	Minimal to none; size to measured correlated peak	Fast rack/row power cap → frequency throttle; checkpoint-tolerant of the slowdown
Post-training / RL (rollouts)	Mixed — async rollout pool smooth, trainer correlated	Between training and inference; disaggregate to measure separately	Oversubscribe the rollout pool; protect the trainer	Cap rollout pool first (interruption-tolerant); shield trainer
Online inference	Low — independent requests, uncorrelated prefill/decode	~21%	Aggressive (POLCA-class ~30% more servers viable)	Priority-based capping; throttle low-SLO tenants first
Batch inference	Low — embarrassingly parallel, schedulable	Highest realized headroom; queue-shapeable	Most aggressive; the natural overbooking sink	Pause/defer the queue; zero SLA cost

Headroom figures are measured production/research values (training ~3% vs inference ~21%); POLCA-class oversubscription yields ~30% more inference servers under a fixed budget. Postures are 2026 practitioner consensus, not a standard.

The table is a posture ladder. As you descend from synchronous training to batch inference, two things move together: the workload's peaks become less correlated (more headroom to sell) and its tolerance for being throttled rises (a cheaper backstop). That is not a coincidence — it is why batch inference is the natural sink for oversubscription risk across a mixed fleet. The sophisticated operator does not pick one posture; it runs a heterogeneous budget where the interruptible, uncorrelated workloads absorb the overbooking that the synchronous, latency-bound workloads cannot. → the disaggregated RL picture is in Chapter 1.4; the inference burst profile in Chapter 1.3.

The power-cap hierarchy: the budget's enforcement layer

Oversubscription is a bet, and every bet needs a stop-loss. In a power budget that stop-loss is the power-cap hierarchy — a layered set of enforcement points that hold realized draw under the energized ceiling no matter what the workload does. The hierarchy spans four levels, each slower and broader than the one below it. At the GPU/node level, the accelerator's own power-management firmware clamps board power in microseconds by reducing clock frequency — the fastest and most surgical control. At the rack/PDU level, intelligent rack PDUs and the BMC enforce a rack budget across nodes. At the row/lineup level, the DCIM control plane allocates a shared budget across racks fed from a common busway or RPP. At the facility level, the EMS holds the campus draw under the interconnection limit and any demand-response obligation.

The fork here is where you set the binding cap and how fast it engages. Cap too low and you leave goodput on the table — GPUs throttled that the budget could have fed. Cap too high, or too slowly, and a correlated transient breaches the budget before the control loop catches it. The hard constraint is speed: a synchronous all-reduce drives the fleet from idle to peak in milliseconds, far faster than any facility-level EMS can react. That is why the binding, time-critical cap must live in the GPU firmware and the rack BMC — the only layers fast enough — with the DCIM and EMS layers setting budgets the fast layers enforce, not reacting to transients themselves. A facility EMS that tries to chase a training transient is always one control-loop period too late.
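The division of labor described above, where slow layers set budgets and fast layers enforce them, can be sketched as a budget tree. This is a minimal sketch under stated assumptions: `BudgetNode`, `wanted_kw`, and `leaf_caps` are hypothetical names, and the proportional scale-back is one assumed policy among many (priority-based schemes would differ). The returned caps are what a rack-level controller would then enforce locally.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BudgetNode:
    """One level of a hypothetical budget tree: facility, row, or rack."""
    name: str
    budget_kw: float
    demand_kw: float = 0.0  # used only on leaves (racks)
    children: List["BudgetNode"] = field(default_factory=list)


def wanted_kw(node: BudgetNode) -> float:
    """Demand a node would pass upward, already clipped to its own budget."""
    if not node.children:
        return min(node.budget_kw, node.demand_kw)
    return min(node.budget_kw, sum(wanted_kw(c) for c in node.children))


def leaf_caps(node: BudgetNode, granted_kw: Optional[float] = None) -> Dict[str, float]:
    """Push budgets down the tree and return the cap for each leaf.

    When children want more than the parent can grant, every child is
    scaled back by the same factor (an assumed policy, not a standard).
    """
    limit = node.budget_kw if granted_kw is None else min(node.budget_kw, granted_kw)
    if not node.children:
        return {node.name: min(limit, node.demand_kw)}
    wants = [wanted_kw(c) for c in node.children]
    total = sum(wants)
    scale = 1.0 if total <= limit else limit / total
    caps: Dict[str, float] = {}
    for child, want in zip(node.children, wants):
        caps.update(leaf_caps(child, want * scale))
    return caps


row = BudgetNode("row-1", budget_kw=250.0, children=[
    BudgetNode("rack-a", budget_kw=132.0, demand_kw=130.0),
    BudgetNode("rack-b", budget_kw=132.0, demand_kw=130.0),
    BudgetNode("rack-c", budget_kw=132.0, demand_kw=40.0),
])
caps = leaf_caps(row)
print({name: round(kw, 1) for name, kw in caps.items()})
```

In this example the three racks ask for 300 kW against a 250 kW row budget, so each is scaled to five sixths of its demand and the caps sum to the row budget. The calculation itself can run at DCIM cadence; only the resulting per-rack caps need to be enforced at firmware or BMC speed.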

Power capping moves the bottleneck, it does not remove it

Capping a GPU's power reduces its clock, which lengthens the compute phase, which on a synchronous job means every other GPU waits at the barrier — so a cap applied to relieve a power transient can silently convert into a goodput tax across the whole run. Worse, recent characterization work shows capping is far less effective during memory-bound decode than during compute-bound prefill: clamping board power barely touches a phase that was never power-limited to begin with. The operational lesson is that the power cap is a real-time safety device, not a free efficiency lever — every cap you set should be accounted against goodput, and decode-heavy inference responds to capping very differently from prefill-heavy or training load. → MFU/goodput accounting in Chapter 14.1; the telemetry to see it in Chapter 14.2.
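The barrier effect can be made concrete with a toy step-time model. This is a sketch under a loud simplification: it assumes compute time scales inversely with clock fraction, which overstates the effect for memory-bound phases. The function name and the 0.8 s / 0.2 s split are illustrative, not measured values.

```python
from typing import List


def step_time_s(compute_s: float, comm_s: float, clock_fractions: List[float]) -> float:
    """Synchronous step time: the barrier waits for the slowest rank.

    Assumes compute time scales as 1 / clock_fraction (a simplification).
    """
    slowest = min(clock_fractions)
    return compute_s / slowest + comm_s


baseline = step_time_s(0.8, 0.2, [1.0] * 8)
one_capped = step_time_s(0.8, 0.2, [1.0] * 7 + [0.8])
goodput_tax = 1.0 - baseline / one_capped
print(f"step time {baseline:.2f}s -> {one_capped:.2f}s, goodput tax {goodput_tax:.1%}")
```

In this model, capping one rank of eight to 80% clock slows the whole job by the same amount as capping all eight, which is the reason every cap should be booked against goodput.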

Transient management in operation

The power transient is what makes oversubscription dangerous in the first place. A synchronous cluster does not draw a steady load; it pulses. Compute phases pull every GPU to peak; communication and checkpoint phases let them fall together; a job start, a job kill, or a checkpoint reload steps the entire facility load in one swing. At fleet scale these swings are enormous and abrupt. NERC documented a single 230 kV fault in Virginia in July 2024 on which roughly 1,500 MW (~1.5 GW) of data-center load dropped off the grid in 82 seconds, an event severe enough to trigger NERC's rare Level 3 alert and to make grid-side ride-through a mandatory planning input rather than a courtesy. The physics of absorbing these swings (UPS, BESS, ride-through, and the controls that damp them) is engineered in Chapter 4.5 and Chapter 5.12. The operational question this chapter owns is narrower and continuous: how do you run the live facility so the transients you generate stay inside what your absorption layer and your grid contract can take?

The answer is a set of operational levers that shape the workload's power profile before it reaches the power chain. Power smoothing deliberately holds a floor under idle phases (a synthetic GPU load during the all-reduce trough) so the swing the grid sees is smaller than the swing the GPUs actually produce, trading a little energy for a large cut in transient amplitude. Ramp-rate limiting staggers job starts and checkpoint reloads so a 100 MW cluster does not step its entire load in one second; the scheduler launches ranks in waves. Stagger and phase-offset scheduling deliberately spreads correlated peaks across time. And the storage layer (UPS and BESS) absorbs whatever residual swing the workload levers leave. The fork is how much of the transient you suppress in software (a small energy cost and no capital, but a goodput tax and an ongoing burden on the operator) versus in hardware (BESS sized to the swing: capital up front, no goodput tax). Most facilities do both; the ratio is the design decision.
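Ramp-rate limiting by launching ranks in waves can be sketched as a simple schedule calculation. This is a minimal sketch, assuming equal-sized waves and a fixed interval between them; `launch_waves`, the 10 MW step limit, and the 5 s interval are hypothetical values chosen for illustration, and a real scheduler would align waves to rank topology.

```python
import math
from typing import List, Tuple


def launch_waves(total_mw: float, max_step_mw: float,
                 wave_interval_s: float) -> List[Tuple[float, float]]:
    """Split a job start into waves so no single step exceeds max_step_mw.

    Returns (start_time_s, cumulative_mw) for each wave. Equal-sized waves
    are an assumption made for simplicity.
    """
    if total_mw <= 0 or max_step_mw <= 0:
        raise ValueError("power values must be positive")
    n_waves = math.ceil(total_mw / max_step_mw)
    per_wave = total_mw / n_waves
    return [(i * wave_interval_s, per_wave * (i + 1)) for i in range(n_waves)]


waves = launch_waves(total_mw=100.0, max_step_mw=10.0, wave_interval_s=5.0)
for start_s, cumulative_mw in waves:
    print(f"t={start_s:>4.0f}s  load={cumulative_mw:.0f} MW")
```

With these illustrative numbers a 100 MW start becomes ten 10 MW steps over 45 seconds instead of one 100 MW step. The cost is the idle time of the later waves, which is the goodput tax the paragraph above refers to.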

Deep dive: why the all-reduce barrier is a power-systems event, not just a networking one

It is tempting to file gradient synchronization under networking and forget it at the switchboard. That is the mistake that produced the NERC Level 3 alert. Consider the mechanism. A synchronous data-parallel training step has two phases with opposite power signatures: a compute phase where every GPU runs forward/backward at or near TDP, and a communication phase — the all-reduce — where GPUs largely idle while the fabric exchanges gradients. Because the barrier forces every rank to the same phase at the same time, the cluster's aggregate power is not the smooth average of thousands of independent devices; it is a near-square wave swinging between the fleet's correlated peak and its correlated trough, cycling at the step frequency.

Now scale it. On a 100 MW training cluster that swing can be tens of megawatts, stepping in well under a second, thousands of times a day. To the GPU it is a clock-frequency artifact; to the upstream transformer, the UPS, and the utility feeder it is a load transient with real di/dt, voltage-sag, and frequency-excursion consequences — the July 2024 event saw frequency rise to 60.047 Hz and voltage to 1.07 pu when ~1.5 GW dropped. This is why oversubscription headroom on training is ~3% and not 21%: there is no statistical smoothing to exploit when the workload is, by construction, a synchronized oscillator. And it is why the operational levers above — power smoothing to fill the trough, ramp-rate limiting to soften the edge, BESS to absorb the residual — are not efficiency niceties but the difference between a facility the utility will interconnect and one it will not. → the absorption hardware in Chapter 4.5; the grid-side instantaneous-loss data in Chapter 4.5 and siting/interconnection in Chapter 3.2.
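The synchronized-oscillator argument can be checked with a toy square-wave model. This is a sketch under idealized assumptions: each unit draws `peak_kw` for a fixed duty fraction of the step and `idle_kw` otherwise, and a unit stands for either a GPU inside one synchronous job (zero offsets) or an independent job that the scheduler has deliberately phase-offset. All names and numbers are illustrative.

```python
from typing import List


def aggregate_swing_kw(n_units: int, peak_kw: float, idle_kw: float, duty: float,
                       phase_offsets: List[float], samples: int = 1000) -> float:
    """Peak-to-trough swing of the summed draw over one step period.

    Each unit is an ideal square wave: peak_kw for `duty` of the period,
    idle_kw for the rest. phase_offsets are fractions of one period.
    """
    totals = []
    for s in range(samples):
        t = s / samples
        total = 0.0
        for i in range(n_units):
            phase = (t + phase_offsets[i]) % 1.0
            total += peak_kw if phase < duty else idle_kw
        totals.append(total)
    return max(totals) - min(totals)


n = 100
sync = aggregate_swing_kw(n, 1.0, 0.3, 0.5, [0.0] * n)
spread = aggregate_swing_kw(n, 1.0, 0.3, 0.5, [i / n for i in range(n)])
print(f"synchronized swing: {sync:.1f} kW, evenly offset swing: {spread:.1f} kW")
```

With zero offsets the swing is the full fleet-wide difference between peak and idle (about 70 kW for these 100 units); with evenly spread offsets the total stays within a unit or two of flat. A single synchronous job cannot offset its own ranks, which is why smoothing, ramp limits, and storage carry the load there.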

Stranded capacity: the megawatts the design left on the floor

Stranded capacity is energized, paid-for power or cooling that cannot be converted into compute because some other resource caps out first. In a power-bound world that is capital spent on megawatts that earn nothing. It comes in three distinct strands, and the fork is which one you are actually suffering — because the recovery for each is different and confusing them wastes the recovery capital too.

Cooling-limited. The power chain can feed the racks but the cooling plant cannot remove the heat, so racks sit half-populated or throttled. This is the dominant strand in halls built for air and asked to host liquid-class density — the cooling cliff showing up as a utilization ceiling. Recovery is mechanical: more heat-rejection capacity, warmer water, or a cooling retrofit. → Chapter 5.4.
Power-chain-limited. Cooling has headroom but a transformer, busway, RPP, or breaker is the binding constraint — often because the design reserved fault-margin or redundancy capacity that nameplate planning never released to load. Recovery is electrical and partly statistical: oversubscription releases reserved headroom; rebalancing phases and circuits recovers fragmented capacity. → Chapter 4.6.
Fragmentation-limited. Both power and cooling have aggregate headroom, but it is scattered — a kilowatt free here, two there — in pockets too small to land a 132 kW rack. This is the most insidious strand because the DCIM dashboard shows spare capacity that the placement engine cannot use. Recovery is operational: workload-aware placement, defragmentation campaigns, and consolidating partial rows.

Misdiagnose the strand and you spend in the wrong place. Add chillers to a fragmentation problem and you have bought cooling you still cannot fill; oversubscribe a cooling-limited hall and you trip the thermal protection you were already brushing against. The diagnostic discipline — read which resource is actually binding, per row, from telemetry before committing recovery capital — is the operational core of stranded-capacity management. → the telemetry that reveals the binding constraint is in Chapter 14.2.
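The diagnostic discipline can be sketched as a per-row classification over telemetry. This is a minimal sketch, not a DCIM feature: the `RowTelemetry` fields and the decision order are assumptions made for illustration, and real data would need per-circuit and per-loop detail.

```python
from dataclasses import dataclass


@dataclass
class RowTelemetry:
    """Hypothetical per-row summary; field names are illustrative."""
    power_headroom_kw: float     # spare electrical capacity summed over the row
    cooling_headroom_kw: float   # spare heat-rejection capacity summed over the row
    largest_free_slot_kw: float  # biggest single position with both power and cooling
    target_rack_kw: float        # rack you want to land, e.g. 132


def diagnose_strand(row: RowTelemetry) -> str:
    """Name the binding constraint that prevents landing one more rack."""
    if row.largest_free_slot_kw >= row.target_rack_kw:
        return "not stranded"
    power_ok = row.power_headroom_kw >= row.target_rack_kw
    cooling_ok = row.cooling_headroom_kw >= row.target_rack_kw
    if power_ok and cooling_ok:
        return "fragmentation-limited"  # aggregate headroom exists but is scattered
    if power_ok:
        return "cooling-limited"
    if cooling_ok:
        return "power-chain-limited"
    return "no aggregate headroom"


print(diagnose_strand(RowTelemetry(200.0, 60.0, 40.0, 132.0)))
```

The example row has 200 kW of electrical headroom but only 60 kW of cooling headroom, so it is classified as cooling-limited. Adding electrical capacity or oversubscribing it would not recover anything.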

Thermal operations: running the cooling plant against the load

Thermal operations is the cooling-side twin of power management: the continuous job of matching heat rejection to a load that swings as violently as the power draw does. The same all-reduce that pulses the power chain pulses the heat load — and the cooling plant's thermal mass and control loops respond on a slower timescale than the electrical ones, which is its own hazard. A GB200-class rack with ~1 kW+ GPUs can thermal-trip within seconds of a cooling-loop loss; there is no air buffer at liquid density to ride through on. That makes UPS-backed coolant pumps and N+1/2N heat rejection not redundancy luxuries but operational survival requirements.

The day-2 levers are setpoint and flow management against the DLC envelope: holding coolant inlet inside the ~20–25 °C window, flow at the rated ~1.2–2.0 L/min per kW, and the secondary supply above white-space dew point to stay 100% sensible. The fork in steady state is how warm you run the loop. Warmer water expands free-cooling hours and crushes the cooling share of PUE (warm-water DLC plants approach ~1.1 versus 1.3–1.5+ for air), but it shrinks the thermal margin to the throttle threshold — a GB200 that deviates from inlet spec can lose up to ~50% of its clocks. Run cold and you waste compressor energy buying margin you may not need; run warm and you are operating closer to the cliff, where a transient that would have been absorbed becomes a throttle event and a goodput loss. Cooling-controls stability under these transients is canonical in Chapter 5.12; the piping and CDU mechanics in Chapter 5.13; predictive detection of plant degradation (cooling-capacity loss 14–30 days out, chiller-bearing wear 4–8 weeks out) in Chapter 14.5.
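The three envelope conditions named above (inlet window, flow per kW, supply above dew point) can be expressed as a simple check. This is a sketch, assuming the ~20–25 °C and ~1.2–2.0 L/min per kW figures from the text as default limits and treating flow above the range as out of envelope too; the function and parameter names are hypothetical, and vendor specifications take precedence.

```python
from typing import List, Tuple


def dlc_envelope_issues(inlet_c: float, flow_lpm: float, rack_kw: float,
                        supply_c: float, dew_point_c: float,
                        inlet_window_c: Tuple[float, float] = (20.0, 25.0),
                        flow_lpm_per_kw: Tuple[float, float] = (1.2, 2.0)) -> List[str]:
    """Return the list of envelope violations for one rack (empty if none)."""
    issues = []
    low_c, high_c = inlet_window_c
    if not low_c <= inlet_c <= high_c:
        issues.append("inlet temperature outside window")
    per_kw = flow_lpm / rack_kw
    if not flow_lpm_per_kw[0] <= per_kw <= flow_lpm_per_kw[1]:
        issues.append("flow per kW outside rated range")
    if supply_c <= dew_point_c:
        issues.append("secondary supply at or below dew point")
    return issues


print(dlc_envelope_issues(inlet_c=22.0, flow_lpm=200.0, rack_kw=132.0,
                          supply_c=22.0, dew_point_c=14.0))
print(dlc_envelope_issues(inlet_c=27.0, flow_lpm=120.0, rack_kw=132.0,
                          supply_c=12.0, dew_point_c=14.0))
```

At the rated range, a 132 kW rack needs roughly 158–264 L/min, so the first call (200 L/min) reports no issues and the second (120 L/min, warm inlet, supply below dew point) reports all three.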

Figure	What it measures	Source	Year
~3% vs ~21%	Power oversubscription headroom: synchronous training vs inference (correlated vs uncorrelated peaks)	Measurement of Generative AI Workload Power Profiles (arXiv) / Day-2 ops research	2026
~30%	More inference servers deployable under a fixed power budget via priority-based oversubscription	POLCA: Power Oversubscription in LLM Cloud Providers (Microsoft Research)	2024
~1,500 MW	Data-center load lost on a single 230 kV fault; ~1.5 GW dropped in 82 s (VA, 2024); NERC Level 3 alert	NERC Level 3 Alert / Utility Dive	2026
~98–100 GW	US flexible load integratable at 0.5% annual curtailment; average demand-response event ~2 hours	Nicholas Institute / Duke 'Rethinking Load Growth' synthesis	2025
1–2.8%	Rate reduction from a 1–2% peak-demand cut; the demand-response value lever	Demand-flexibility headroom analysis (grid procurement research)	2025
~1.1 vs 1.3–1.5+	PUE: well-designed warm-water DLC plant vs air halls; the thermal-ops efficiency spread	SemiAnalysis / Uptime Institute	2025
Seconds	Thermal-trip window for 1 kW+ GPUs after coolant-loop loss; no air buffer at liquid density	Cooling & thermal-management research synthesis	2025
45%	Share of impactful data-center outages attributable to power (mostly UPS); the operational risk weighting	Uptime Institute outage analysis / Day-2 ops research	2025

Capacity planning and forecasting under a hard ceiling

Capacity planning in a power-bound facility is the discipline of never letting demand for megawatts outrun the supply you have energized, while never leaving so much energized capacity idle that the capital decays unused. It is a forecasting problem with an asymmetric loss function: under-provision and you turn away revenue against a depreciation clock that runs whether the racks are full or not; over-provision and you have stranded energized megawatts, the costliest resource on the site. The forecast that matters is not steady-state load — it is the density ramp. The substrate must be planned against the generation you will deploy next, because the irreversible parts (transformer capacity, busway ampacity, cooling-plant tonnage, floor loading) cannot be retrofitted on the timescale that rack density is climbing: 40 kW air-cooled racks gave way to ~132 kW NVL72, heading to ~190–230 kW Rubin-class and ~600 kW Kyber-class. A hall whose power and cooling were sized to today's density strands its own future.

The operational fork is the reserved-capacity buffer: how much energized headroom you hold back against the next ramp step and against failures. Hold too little and the next GPU generation arrives with nowhere to land — the density-ramp trap, now an operational reality rather than a scoping one. Hold too much and you are paying to idle megawatts. The discipline is to forecast the ramp curve generation-by-generation, size the irreversible substrate to it, and keep the reversible IT fit-out matched to current draw — reserving the headroom you cannot retrofit (floor, water, ampacity) while deferring the spend you can. The interconnection and energization sequencing that feeds this forecast (50–100 MW tranches, energized capacity leading commissioned load) is in Chapter 3.2; the scoping-time version of the ramp decision is in Chapter 1.1.
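The effect of the density ramp on a fixed hall can be sketched with integer arithmetic. This is an illustration only: the 10 MW hall and the 10% reserve are hypothetical inputs, the rack figures are the approximate values quoted in this chapter, and the function name is an assumption.

```python
GENERATIONS_KW = {"air-cooled": 40.0, "NVL72": 132.0, "Kyber-class": 600.0}


def racks_supported(hall_it_kw: float, rack_kw: float,
                    reserve_fraction: float = 0.0) -> int:
    """Whole racks that fit in the hall's IT budget after holding a reserve."""
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_kw = hall_it_kw * (1.0 - reserve_fraction)
    return int(usable_kw // rack_kw)


hall_kw = 10_000.0  # hypothetical 10 MW hall
for name, rack_kw in GENERATIONS_KW.items():
    full = racks_supported(hall_kw, rack_kw)
    buffered = racks_supported(hall_kw, rack_kw, reserve_fraction=0.10)
    print(f"{name:>12}: {full} racks, {buffered} with a 10% reserve")
```

The same 10 MW lands 250 racks at 40 kW but only 16 at 600 kW, so the per-position feed, busway ampacity, and floor loading change by more than an order of magnitude even though the hall's total budget does not. That is the part of the substrate the forecast has to size ahead of time.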

Workload-aware operations and demand response

The most powerful lever an AI factory has over its own power profile is the scheduler. Because a large share of the fleet's load is interruptible (batch inference) or schedulable (RL rollouts, off-peak training), the operator can shape when and where power is drawn — not just react to it. Workload-aware operations is the practice of treating the power budget and the job queue as one optimization: placing interruptible work where stranded capacity hides, throttling low-SLO tenants first when a cap engages, smoothing the aggregate profile by phase-offsetting correlated jobs, and shifting deferrable load to cheap or low-carbon hours.

That same flexibility is what makes demand response and curtailment economically real for AI loads, and what makes it dangerous. The grid will pay handsomely for flexible load: studies find roughly 98–100 GW of US load integratable at just 0.5% annual curtailment, with a 1–2% peak-demand cut translating into 1–2.8% rate reductions system-wide. For a facility, participating means accepting curtailment events (averaging ~2 hours) in exchange for a better interconnection deal or a demand-response payment; this is increasingly a condition of large-load tariffs, not an option. The fork is which workloads you make eligible. Curtail batch inference or defer a checkpointable training run and the cost is near zero, which is exactly the load demand response was designed for. Curtail an online-inference SLA or a synchronous run mid-step and you destroy goodput value that dwarfs the demand-response payment: a breached latency SLA, or a forced checkpoint-and-restart that throws away hours of correlated compute. The revenue is real, but it must be earned from the interruptible tail of the fleet, never from its latency-bound or synchronous core.

Demand-response eligibility by workload

Workload	Interruption cost	DR eligibility	Mechanism on a curtailment event
Batch inference	Near zero (queue-and-retry)	Ideal — the primary DR sink	Pause/defer the queue; resume after the event
RL rollouts	Low (staleness-tolerant, restartable)	Strong — shed rollout pool first	Throttle or pause rollout generation; protect the trainer
Synchronous training	High (forced checkpoint/restart)	Conditional — only at checkpoint boundaries	Power-smooth/ramp-limit; checkpoint then idle if event is long
Online inference	Highest (breached latency SLA, lost revenue)	Avoid — never the DR source	Shed only after exhausting all interruptible load; geo-shift if possible

Eligibility heuristic, not a standard. Curtailment value is real but asymmetric: the cost of curtailing the wrong workload exceeds the DR payment. Maps to the oversubscription posture table above.
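The eligibility ordering in the table can be sketched as a greedy shed plan. This is a sketch of the heuristic only, not a demand-response protocol: the workload keys, the two boolean switches, and the MW figures are hypothetical, and online inference is excluded unless the caller explicitly allows it.

```python
from typing import Dict, Tuple


def plan_curtailment(target_mw: float, sheddable_mw: Dict[str, float],
                     at_checkpoint_boundary: bool = False,
                     allow_online: bool = False) -> Tuple[Dict[str, float], float]:
    """Greedy shed plan: cheapest-to-interrupt workloads first.

    Returns (MW to shed per workload, MW of the target left unmet).
    """
    order = ["batch_inference", "rl_rollouts"]
    if at_checkpoint_boundary:
        order.append("sync_training")     # conditional: only at a checkpoint
    if allow_online:
        order.append("online_inference")  # last resort, off by default
    plan: Dict[str, float] = {}
    remaining = target_mw
    for workload in order:
        if remaining <= 0:
            break
        take = min(remaining, sheddable_mw.get(workload, 0.0))
        if take > 0:
            plan[workload] = take
            remaining -= take
    return plan, max(remaining, 0.0)


fleet = {"batch_inference": 20.0, "rl_rollouts": 5.0,
         "sync_training": 50.0, "online_inference": 40.0}
plan, unmet = plan_curtailment(30.0, fleet)
print(plan, "unmet MW:", unmet)
```

With these illustrative numbers a 30 MW request is met only to 25 MW from the interruptible tail, and the 5 MW shortfall is reported rather than silently taken from training or online inference. Whether to close that gap at the next checkpoint boundary or to decline it is the operator's decision, not the scheduler's.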

Energy and efficiency operations

The final function folds power, thermal, and workload management into a single objective: useful work per megawatt-hour. PUE measures only the facility overhead — and at warm-water DLC's ~1.1 there is little overhead left to chase. The frontier has moved inside the white space, to metrics like work-per-watt and goodput-per-MWh that count whether the IT load itself is doing useful computation or burning power on idle GPUs, retries, throttled clocks, and failed jobs. A facility at PUE 1.1 running GPUs at 40% MFU is wasting far more energy than one at PUE 1.2 running at 90%. Energy operations therefore reaches past the cooling plant into the cluster: raising MFU, cutting failure-induced rework, and timing flexible load to low-carbon, low-cost grid hours. The efficiency levers that matter most in 2026 are no longer mechanical; they are scheduling and reliability levers. → the full KPI and reliability-economics treatment is in Chapter 14.1.
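The PUE-versus-MFU comparison above can be checked with a crude ratio. This is a sketch under a loud simplification: it treats MFU as a proxy for the share of IT energy doing useful work, which ignores that GPU power does not scale linearly with utilization. The function name is hypothetical.

```python
def useful_energy_fraction(pue: float, mfu: float) -> float:
    """Crude share of facility energy that ends up as useful computation.

    Model: IT energy = facility energy / PUE; useful share of IT energy
    is approximated by MFU (a simplification, not a measured relation).
    """
    if pue < 1.0 or not 0.0 <= mfu <= 1.0:
        raise ValueError("expected pue >= 1 and 0 <= mfu <= 1")
    return mfu / pue


efficient_plant = useful_energy_fraction(pue=1.1, mfu=0.40)
efficient_cluster = useful_energy_fraction(pue=1.2, mfu=0.90)
print(f"PUE 1.1 @ 40% MFU: {efficient_plant:.0%} useful")
print(f"PUE 1.2 @ 90% MFU: {efficient_cluster:.0%} useful")
```

On this rough model the better plant with the worse cluster converts about 36% of its energy into useful work, against 75% for the worse plant with the better cluster. The PUE difference moves the result by under ten percent while the MFU difference more than doubles it.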

Deep dive: the integrated operating loop — one budget, four control planes

The reason this chapter binds power, transients, stranded capacity, planning, and demand response into one subject is that in a live facility they are not separable — they are four control planes acting on a single shared resource, the energized power budget, and an action on one ripples through all four. Make this concrete with a single event: a 50 MW training run finishes and frees its budget at 2 a.m.

Capacity planning sees freed headroom and the placement engine looks for work to fill it. Workload-aware ops routes a deferred batch-inference queue into the gap. A demand-response event is forecast for the morning peak, when that queue would be the first load curtailed, so the scheduler front-loads it overnight to finish before curtailment. As the batch load ramps, the transient-management layer applies ramp-rate limiting so the 50 MW does not step back in one swing, and the power-cap hierarchy holds the row budget while the freed training racks and the new batch racks briefly overlap. Thermal ops tracks the heat load shifting from one row to another and rebalances flow so neither row drifts toward its throttle threshold. Stranded-capacity diagnostics note that the batch work filled power headroom but left cooling headroom unused in an adjacent row, which is a fragmentation signal for the next placement pass.

No single dashboard owns that sequence; it is the integrated operating loop running continuously. The organizational consequence — that facility-ops and ML-platform-ops must share one view of the budget and one definition of goodput, rather than optimizing their halves in conflict — is the operating-model question taken up in Chapter 14.11, and the change-control discipline that governs any deliberate move within this loop (a cap change, a setpoint change, a DR enrollment) is the MOP/SOP regime in Chapter 14.12.

Human error, not physics, is the dominant outage cause, and operation is where it strikes

Every lever in this chapter — setting a cap, changing a setpoint, enrolling a workload in demand response, rebalancing a circuit — is a live manual action on energized, oversubscribed plant running close to its ceiling. The fleet data is unambiguous: power accounts for ~45% of impactful outages (mostly UPS), but human error underlies 70–80% of all outages, ~85% of those from process failures rather than slips. The implication for operation is that the dominant risk is not that the physics is too aggressive — it is that an undisciplined change to the live budget tips a margin that was deliberately thin. This is why none of the operational levers here are improvised: they run under the procedures-and-error-trap framework canonical in Chapter 14.12, and the unified incident-command model that responds when one trips is in Chapter 14.11.

The transient physics this chapter manages operationally is engineered in Chapter 4.5 (UPS, BESS, ride-through) and Chapter 5.12 (cooling-controls dynamics); the LV distribution that the power-cap hierarchy enforces against is in Chapter 4.6 and the DC-power path in Chapter 4.7. The cooling cliff behind cooling-limited stranding is in Chapter 5.4, with piping mechanics in Chapter 5.13. The interconnection and energization sequencing that bounds capacity planning is in Chapter 3.2, and the scoping-time density-ramp decision in Chapter 1.1. The goodput and reliability-economics framing this chapter serves is canonical in Chapter 14.1; the telemetry that reveals the binding constraint in Chapter 14.2; predictive maintenance of the plant in Chapter 14.5; the operating-model and incident-command boundary in Chapter 14.11; and the change-management discipline governing every live action here in Chapter 14.12.