Chapter 14.5

Predictive & Preventive Maintenance of Power and Cooling Plant

In a 24/7 synchronous AI factory the maintenance question is not 'is the plant reliable?' but 'can you service it without dropping the job?' — and the answer is set years earlier by whether you bought concurrent maintainability and built the condition-based program that lets you intervene on the equipment's schedule instead of the failure's.

POWER-BOUND · GOODPUT

What you'll decide here

Where each asset class sits on the run-to-failure / time-based / condition-based / predictive spectrum, and why putting a 132 kW-per-rack liquid loop or a 2N UPS string on the wrong rung is the most expensive default in the building.
Whether the facility was commissioned as concurrently maintainable (Tier III-class and up) — because if any single power or cooling path cannot be isolated and worked live, every preventive task becomes a planned outage you must negotiate against a synchronous training run.
Which leading indicators you instrument now (coolant chemistry, UPS cell impedance, switchgear partial discharge, bearing vibration, oil dielectric) versus the lagging alarms you inherit if you skip the sensors — the gap between a scheduled intervention and a 02:00 trip.
How you forecast and pre-position spares for power and cooling plant against failure-rate data and long supply leads — because a CDU pump or a switchgear breaker you do not have on the shelf converts a 4-hour repair into a multi-week capacity strand.
Who owns the maintenance calendar across the IT/facility boundary and the colo/operator contract — because the body that schedules a transformer outage and the body that owns goodput are usually not the same body, and that seam is where avoidable downtime lives.

A traditional enterprise data center can take a maintenance window. Drain a hall, fail traffic to the other availability zone, swap the UPS batteries on a Sunday at 03:00, and bring it back — the workload never noticed because the workload was designed to be moved. An AI factory in a pre-training run is the opposite kind of machine. It is one tightly-coupled job spread across thousands of accelerators that all move at the speed of the slowest straggler, drawing tens of megawatts through a power chain and rejecting it through a liquid loop that cannot be paused. There is no other zone to fail to. The job does not migrate; it checkpoints and restarts, and every restart is goodput you will never recover. The maintenance problem is therefore whether you can service the plant without dropping the job, and that answer was largely decided at commissioning, by whether you bought the redundancy topology that lets you isolate a path and work it live.

This chapter is about maintaining the power and cooling plant that keeps the accelerators fed. We walk the maintenance-strategy spectrum — run-to-failure, time-based/preventive, condition-based, and predictive — and assign each asset class to a rung, with the cost of getting it wrong attached. We then go deep on the predictive program for each subsystem: the power chain (transformers, switchgear, UPS, batteries, generators), the cooling plant (chillers, pumps, CRAHs, cooling towers), and the liquid-cooling-specific failure modes that have no precedent in air-cooled operations. We close on the hardest operational constraint of all — running maintenance in a synchronous world — and on the spares forecasting that turns a fast repair into a slow one when you get it wrong. The failure taxonomy and rates that feed this program are catalogued in Chapter 14.3; the telemetry stack that carries the signals lives in Chapter 14.2; this chapter is about what you do with them.

The maintenance-strategy spectrum: four rungs, one decision per asset

Maintenance is not one philosophy applied uniformly; it is a per-asset decision along a spectrum, and the discipline is matching the strategy to the asset's failure economics. Four rungs, from cheapest-to-set-up and most-expensive-when-it-bites, to most-instrumented and cheapest-over-life:

Run-to-failure (reactive). Run the asset until it breaks, then fix it. Rational only where the asset is cheap, redundant, non-critical, and its failure is benign and instantly detectable — a single redundant fan in an N+2 bank, a corridor light. Apply it to anything in the critical path of a synchronous job and you have chosen to be surprised at 02:00.

Time-based / preventive (PM). Service on a fixed calendar or runtime interval regardless of condition — the annual generator load-bank test, the quarterly switchgear inspection, the CDU filter change every N hours. This is the regulatory and warranty baseline, and for many assets it is correct. Its weakness is that it is blind to actual condition: you replace healthy components on schedule (waste) and you still miss the component that degrades faster than the interval (surprise). PM intervals also generate outages — every scheduled task on a non-redundant path is a planned hit you must negotiate.

Condition-based (CBM). Service when a measured parameter crosses a threshold: coolant conductivity past a setpoint, a UPS cell's internal impedance up 20% from baseline, a bearing's vibration RMS above its ISO 20816 severity-zone limit. You intervene only when the asset tells you it needs it, on your schedule rather than the failure's. CBM is the workhorse rung for AI-factory plant: it converts most surprises into scheduled interventions, but it requires the sensors and the baselines, which is the up-front cost most operators under-budget.

Predictive (PdM). CBM plus a model that projects remaining useful life from the trend, so you schedule the intervention against a forecasted failure date and a lead time rather than a static threshold. This is where 2026 DCIM and digital-twin tooling is heading — refrigerant-loss detection before cooling degrades, battery RUL across a whole string, generator-readiness prognostics — and where the agentic-ops vendors are concentrating. The payoff is real but it is the most data-hungry and most over-promised rung; treat vendor 'AI predictive' claims as CBM-plus-a-trendline until proven otherwise.
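As a reference point for what "CBM-plus-a-trendline" means in practice, here is a minimal sketch that fits a straight line to a rising indicator and projects when it reaches its threshold. The function names, the linear-degradation assumption, and the 14-day safety margin are illustrative assumptions, not any vendor's method; real degradation is often non-linear, so treat the output as a planning hint rather than a remaining-useful-life guarantee.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class TrendForecast:
    slope_per_day: float                # fitted rate of change of the indicator
    days_to_threshold: Optional[float]  # None when the trend is flat or falling


def fit_line(days: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(days)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share one timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_crossing(days: Sequence[float], values: Sequence[float],
                      threshold: float) -> TrendForecast:
    """Project how many days after the last sample a rising indicator hits `threshold`."""
    slope, intercept = fit_line(days, values)
    if values[-1] >= threshold:
        return TrendForecast(slope, 0.0)   # already past the limit: plain CBM alarm
    if slope <= 0:
        return TrendForecast(slope, None)  # not degrading on this trend
    crossing_day = (threshold - intercept) / slope
    return TrendForecast(slope, max(0.0, crossing_day - days[-1]))


def must_order_now(forecast: TrendForecast, lead_time_days: float,
                   safety_days: float = 14.0) -> bool:
    """True when the projected crossing falls inside lead time plus a safety margin."""
    if forecast.days_to_threshold is None:
        return False
    return forecast.days_to_threshold <= lead_time_days + safety_days


if __name__ == "__main__":
    # Hypothetical indicator sampled every 10 days, threshold at 1.5 (arbitrary units).
    f = forecast_crossing([0, 10, 20, 30], [1.0, 1.1, 1.2, 1.3], threshold=1.5)
    print(f, must_order_now(f, lead_time_days=10))
```

The design point is the last function: a predictive program schedules against the forecast date and the replacement lead time together, where a static CBM threshold knows nothing about how long the part takes to arrive.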

Power & cooling plant → maintenance-strategy assignment

Asset class	Right rung (AI factory)	Leading indicator	If you default to run-to-failure
MV/LV transformers	CBM + time-based	Dissolved-gas analysis (DGA) of oil; winding/IR thermography; load history	Catastrophic, weeks-to-months to replace; ~2-4 yr new-unit lead — a capacity strand, not a repair
Switchgear & breakers	CBM (online) + time-based	Partial-discharge monitoring; IR thermography on connections; mechanism cycle count	Arc-flash / bus fault takes a whole lineup; risk to personnel, not just uptime
UPS (static / modular)	Time-based + CBM	Capacitor ESR/temperature; fan hours; eco-mode transfer logs; thermal imaging	Loss of ride-through during the next utility event; the load you were protecting trips
UPS batteries (VRLA / Li)	CBM (impedance) + predictive RUL	Cell internal impedance vs baseline; float-voltage drift; cell temperature spread	A dead string discovered only at the transfer it was supposed to cover
Generators	Time-based + CBM	Oil analysis; coolant/fuel quality; start-reliability log; vibration; load-bank result	A no-start at the one moment the grid is gone; wet-stacking from chronic light-load runs
Chillers / pumps / fans	CBM (vibration) + time-based	Vibration RMS & spectrum; bearing temperature; motor current signature; ΔP across filters	Bearing seizure cascades to a thermal event in a hall with no air margin
CDUs (liquid loop)	CBM + predictive	Coolant chemistry (pH, conductivity, ORP); filter ΔP; pump vibration; leak/level sensors	Fouling or a pump failure throttles GPUs up to ~50% — goodput loss, not a clean trip

The right rung per asset class for an AI factory in a synchronous-workload posture. 'Leading indicator' is the measured parameter that drives a condition-based or predictive intervention; the consequence column names what a wrong-rung default costs.

The master fork: concurrent maintainability is bought, not bolted on

Every maintenance decision in this chapter is downstream of one design decision you cannot make in operations: was the facility commissioned to be concurrently maintainable? Uptime's Tier III and IV require that any single capacity component or distribution path can be removed from service — for planned maintenance — without dropping the critical load. If you have it, preventive work on a transformer, a UPS module, or a CDU is a routine isolate-and-service task on the redundant path while the job keeps running. If you do not have it — a Tier II or an N-topology hall, common in cost-optimized training builds that traded availability for goodput-per-dollar — then every PM task on a non-redundant path is a planned outage you must schedule against the workload. The consequence is stark: the same battery swap is a Tuesday-afternoon non-event in a 2N hall and a negotiated, checkpoint-coordinated production stoppage in an N hall. Decide your maintainability posture at design time (Chapter 12.1); you will live with its maintenance calendar for the asset's life.
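For scale, the availability percentages usually quoted for Tier III and Tier IV (99.982% and 99.995%, as in this chapter's key figures) can be converted into an annual downtime budget. The sketch below is plain arithmetic; the function name is an illustrative assumption, and the calculation ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes per year of allowed downtime at a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for tier, pct in (("Tier III", 99.982), ("Tier IV", 99.995)):
        print(f"{tier}: {annual_downtime_minutes(pct):.1f} min/year")
```

That works out to roughly 95 minutes and 26 minutes per year respectively. Note what the number does not tell you: an availability figure says nothing about whether a given maintenance task can be done live, which is the property this section is about.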

Predictive maintenance of the power chain

The power chain is the most mature predictive-maintenance domain because the failure physics are well-understood and the instrumentation is decades-proven — but AI loads change the stress profile, which changes what you watch for. Walk it from the utility inward.

Transformers are the canonical predictive asset: dissolved-gas analysis of the insulating oil is a chromatograph reading the fault before it propagates — acetylene signals arcing, ethylene signals overheating, hydrogen signals partial discharge — and the trend over months gives weeks-to-months of warning. The AI twist is the load profile. Training power swings hard and fast (see Chapter 4.5 on transients), and large synchronized step-loads thermally cycle windings in a way steady enterprise load never did, accelerating insulation aging. Online winding-temperature and load-history monitoring is no longer optional at these duty cycles. The replacement reality is what raises the stakes: a failed power transformer is a procurement, not a repair — standard lead times ran to roughly 128 weeks and beyond into 2026, so the only acceptable failure mode is one you saw coming.

Switchgear and breakers fail two ways: insulation breakdown (which partial-discharge monitoring catches as the corona current rises before the flashover) and connection degradation (which infrared thermography catches as a loose or corroded busbar joint glows hot under load relative to its neighbors — distinguishable within seconds of a scan). Online PD sensors and periodic IR surveys are the standard CBM pair. The consequence of skipping them is categorically worse than a trip: a bus fault or arc-flash event takes an entire switchgear lineup and is a personnel-safety event, not merely a goodput event. NERC's 2026 Level 3 alert — roughly 1,500 MW of data-center load lost on a single 230 kV fault — is the reminder that the upstream electrical plant is now a grid-scale reliability concern, not just a building one.
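The "hot relative to its neighbors" test behind an IR survey can be sketched as a comparison of each joint against the median of its peers on the same bus under the same load. The 15 °C default limit and the joint names are illustrative assumptions; use the severity criteria from your own thermography standard or site procedure.

```python
from statistics import median
from typing import Dict


def hot_joints(readings_c: Dict[str, float], delta_limit_c: float = 15.0) -> Dict[str, float]:
    """Return {joint: degrees above the median of its peers} for joints over the limit.

    `readings_c` holds surface temperatures of similar joints carrying similar load,
    taken in one survey pass so ambient and load conditions are comparable.
    """
    if len(readings_c) < 3:
        raise ValueError("need at least three similar joints to compare")
    flagged = {}
    for name, temp in readings_c.items():
        peers = [t for other, t in readings_c.items() if other != name]
        delta = temp - median(peers)
        if delta > delta_limit_c:
            flagged[name] = round(delta, 1)
    return flagged


if __name__ == "__main__":
    # Hypothetical survey of four bolted joints on one busbar section.
    survey = {"A": 41.0, "B": 42.5, "C": 63.0, "D": 40.5}
    print(hot_joints(survey))
```

Comparing against peers rather than an absolute temperature is the point: under a heavy training load every joint runs warm, and it is the outlier, not the warm one, that indicates a degrading connection.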

UPS and batteries are where predictive maintenance pays for itself fastest, because the battery is the single most failure-prone element in the power chain and its failure is invisible until the moment it matters. The discipline is per-cell internal impedance trended against the commissioning baseline: a cell rising 20-25% from baseline is failing and will be the open link in the string at the next transfer, regardless of what its float voltage reads. Layer on float-voltage drift, per-cell temperature spread (thermal imaging across the string reveals the differential signature of a degrading cell), and discharge-event depth/count. For the UPS itself, the wear items are the DC-bus and filter capacitors (ESR and temperature trend) and the cooling fans (runtime hours). The fork worth naming: VRLA strings demand aggressive impedance-based CBM because they degrade silently and on a 4-6 year clock; lithium strings carry their own BMS telemetry and shift the program toward consuming the BMS state-of-health rather than instrumenting cells yourself. → ride-through and energy-storage roles in Chapter 4.5.
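The per-cell impedance discipline described above reduces to a small calculation: percentage rise of each cell against its commissioning baseline, with the string judged by its worst cell. In this sketch the split into a 20% "watch" level and a 25% "replace" level is an assumption layered on the 20-25% band in the text, and the cell IDs and milliohm values are hypothetical.

```python
from typing import Dict


def impedance_rise_pct(baseline_mohm: float, latest_mohm: float) -> float:
    """Percentage rise of a cell's internal impedance over its commissioning baseline."""
    if baseline_mohm <= 0:
        raise ValueError("baseline impedance must be positive")
    return (latest_mohm - baseline_mohm) / baseline_mohm * 100.0


def classify_cells(baseline: Dict[str, float], latest: Dict[str, float],
                   watch_pct: float = 20.0, replace_pct: float = 25.0) -> Dict[str, str]:
    """Label each cell 'ok', 'watch', or 'replace' from its impedance rise."""
    status = {}
    for cell_id, base in baseline.items():
        rise = impedance_rise_pct(base, latest[cell_id])
        if rise >= replace_pct:
            status[cell_id] = "replace"
        elif rise >= watch_pct:
            status[cell_id] = "watch"
        else:
            status[cell_id] = "ok"
    return status


def string_status(cell_status: Dict[str, str]) -> str:
    """A series string is only as good as its worst cell."""
    for level in ("replace", "watch"):
        if level in cell_status.values():
            return level
    return "ok"


if __name__ == "__main__":
    baseline = {"c1": 4.0, "c2": 4.0, "c3": 4.0}  # milliohms at commissioning
    latest = {"c1": 4.2, "c2": 4.9, "c3": 5.2}    # milliohms at latest survey
    cells = classify_cells(baseline, latest)
    print(cells, string_status(cells))
```

Float voltage is deliberately absent from the classification: as the text notes, a failing cell can read a normal float voltage, so it belongs in the log as supporting evidence rather than as the trigger.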

Generators are a readiness problem, not an uptime problem — they spend almost all their life not running and must start on the one occasion the grid is gone. Predictive practice is oil analysis (wear-metal spectroscopy reading bearing and ring wear), coolant and fuel-quality testing (microbial contamination in stored diesel is a classic no-start cause), vibration on the genset, and a logged start-reliability statistic. The AI-specific trap is wet-stacking: behind-the-meter gas and standby diesel that run chronically light-loaded glaze the cylinders and foul the exhaust, so the periodic load-bank test is both a PM task and a health check. → black-start and load-bank testing detail in commissioning practice; behind-the-meter generation economics in Part 3.

Predictive maintenance of the cooling plant

Cooling-plant predictive maintenance is rotating-machinery maintenance plus heat-transfer maintenance, and the AI factory raises the stakes on both because the thermal margin has collapsed. In an air-cooled enterprise hall a chiller hiccup is buffered by minutes of thermal mass; in a 132 kW-per-rack liquid hall the time-to-throttle after a cooling fault is measured in tens of seconds (cooling-controls transient dynamics are canonical in Chapter 5.12). That collapsed margin is why CBM on the cooling plant is not a cost-optimization — it is the thing standing between a degrading bearing and a hall-wide thermal event.

The rotating equipment — chillers, primary and secondary pumps, CRAH/CRAC fans, cooling-tower fans — is classic vibration-analysis territory. Bearing-defect frequencies appear in the vibration spectrum weeks before the bearing seizes; rising RMS, bearing temperature, and motor-current signature analysis triangulate the failure. The heat-transfer side adds its own indicators: approach temperature (the gap between the refrigerant and the water it is cooling) widening signals fouled tubes; differential pressure across filters and strainers rising signals a clogging element; refrigerant charge trending down signals a leak — and 2026 predictive tooling increasingly flags refrigerant loss before cooling capacity visibly degrades. Cooling towers and evaporative systems add water chemistry (scaling, Legionella control, blowdown) to the program. Running this rung as run-to-failure carries the worst consequence of the cooling program: a pump or chiller that fails uncaught in a dense liquid hall does not give you a clean trip and a restart, it gives you a thermal excursion that throttles or trips accelerators while you scramble, and in a synchronous job that is a job-wide restart.
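The "triangulate" step for rotating equipment can be sketched as a simple vote: compute vibration RMS from raw samples, then count how many independent indicators have risen past their limits relative to baseline. The indicator names, the rise limits, and the two-of-three rule are illustrative assumptions; a real program would take limits from the machine class and its commissioning data.

```python
import math
from typing import Dict, List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square of a vibration sample window."""
    if not samples:
        raise ValueError("empty sample window")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def agreeing_indicators(baseline: Dict[str, float], current: Dict[str, float],
                        rise_limits: Dict[str, float]) -> List[str]:
    """Names of indicators whose rise over baseline exceeds the configured limit."""
    return sorted(
        name for name, limit in rise_limits.items()
        if current[name] - baseline[name] > limit
    )


def recommendation(agreeing: List[str]) -> str:
    """Two independent indicators agreeing is treated as enough to schedule work."""
    if len(agreeing) >= 2:
        return "schedule service"
    if len(agreeing) == 1:
        return "investigate"
    return "monitor"


if __name__ == "__main__":
    # Hypothetical pump: velocity RMS in mm/s, bearing temperature in C,
    # and a motor-current sideband amplitude in dB relative to the line frequency.
    baseline = {"vibration_rms_mm_s": 2.0, "bearing_temp_c": 55.0, "current_sideband_db": -50.0}
    current = {"vibration_rms_mm_s": 3.2, "bearing_temp_c": 58.0, "current_sideband_db": -42.0}
    limits = {"vibration_rms_mm_s": 1.0, "bearing_temp_c": 10.0, "current_sideband_db": 6.0}
    hits = agreeing_indicators(baseline, current, limits)
    print(hits, recommendation(hits))
```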

Liquid-cooling-specific maintenance: the new operational surface

Direct-to-chip liquid cooling — the 2026 default for any dense AI hall (Chapter 5.4) — introduces a maintenance surface that simply did not exist in air-cooled operations, and most of it is chemistry and fluid mechanics, disciplines the average DC ops team had never owned. Three enemies dominate, and all three are slow, cumulative, and invisible to a temperature alarm until they have already cost you: corrosion (galvanic and general, attacking cold plates, manifolds, and the wetted metals), scaling/mineral deposition (narrowing the sub-millimeter cold-plate microchannels), and biological fouling (microbial growth in the warm, nutrient-bearing loop). Each reduces effective channel cross-section, raises thermal resistance and pressure drop, and ends in hot spots and GPU throttling — the GB200-class envelope throttles up to ~50% on coolant deviation, so this is a direct, silent goodput tax.

The predictive program is built on coolant chemistry monitoring: pH (corrosion onset), conductivity (ionic contamination and inhibitor depletion), ORP/oxidation state, and turbidity (particulate and biological load) — the four parameters that, per 2026 CDU-sensor practice, reveal corrosion, scaling, fouling, and contamination long before any temperature alarm trips. On top of chemistry sit the mechanical CBM points: filter differential pressure (the single best clogging indicator, and the most common scheduled service), CDU pump vibration, coolant flow and supply-temperature trends per loop, and the leak-detection and level-sensing layer that turns a slow seep into an alarm before it becomes a puddle on a live busbar. The CDU itself is the unit of concurrent maintainability here: a CDU with internal N+1 pumps and redundant heat-exchanger paths can have a pump serviced live; a single-pump CDU cannot, which is why the CDU redundancy decision at design time is a maintenance decision (Chapter 5.4).
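A minimal sketch of chemistry trending, assuming the four parameters named above are sampled and compared against the loop's commissioning baseline. The parameter keys, baseline values, and tolerances below are placeholders for illustration only; the real limits come from the coolant supplier's and CDU vendor's specifications.

```python
from typing import Dict


def chemistry_deviations(baseline: Dict[str, float], sample: Dict[str, float],
                         tolerance: Dict[str, float]) -> Dict[str, float]:
    """Return {parameter: drift from baseline} for parameters outside tolerance."""
    out = {}
    for param, allowed in tolerance.items():
        drift = sample[param] - baseline[param]
        if abs(drift) > allowed:
            out[param] = round(drift, 3)
    return out


if __name__ == "__main__":
    # Placeholder numbers: not a coolant specification.
    baseline = {"ph": 9.0, "conductivity_us_cm": 1000.0, "orp_mv": 200.0, "turbidity_ntu": 1.0}
    sample = {"ph": 8.2, "conductivity_us_cm": 1100.0, "orp_mv": 210.0, "turbidity_ntu": 4.0}
    tolerance = {"ph": 0.5, "conductivity_us_cm": 300.0, "orp_mv": 50.0, "turbidity_ntu": 2.0}
    for param, drift in chemistry_deviations(baseline, sample, tolerance).items():
        print(f"{param}: drifted {drift:+} from commissioning baseline")
```

Comparing against the commissioning baseline rather than a fixed setpoint is a design choice: it flags the slow drift that precedes a temperature alarm, which is where the value of the chemistry log lies.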

Coolant chemistry is a maintenance program, not a fill-and-forget

The most common 2026 liquid-cooling operational failure is treating the coolant as a one-time fill. Inhibitor packages deplete, glycol-water blends drift, and biological growth establishes — on a timescale of months, not years. An operator that commissioned a beautiful DLC hall and then never sampled the loop will discover the problem as creeping GPU throttling and rising pump ΔP that no temperature alarm explained, because the loop was fouling from the inside. Stand up coolant sampling and chemistry trending from day one, budget the fluid-replacement and side-stream filtration cadence into the maintenance calendar, and treat the chemistry log with the same seriousness as the UPS battery impedance log. The fluid is a consumable with a maintenance program; the alternative is a slow, untraceable goodput leak.

Figure	What it tells you	Source (year)
~7 days	Best-in-class MTBF per 512 H100s; new clusters fail far more often during the 3-4 week burn-in. This is the failure environment maintenance must manage.	SemiAnalysis, 100k H100 clusters (2025)
~90% / ~96%	Industry-average vs best-in-class goodput; reliability overhead is 6-21% of TCO. This is the prize a good maintenance program protects.	SemiAnalysis ClusterMAX / CoreWeave (2025)
99.982% / 99.995%	Uptime Tier III / Tier IV availability; concurrent maintainability is the property that lets you service live.	Uptime Institute (2025)
Inlet 20-25 °C, ~80 L/min	GB200 NVL72 DLC spec; deviation throttles GPUs by up to ~50%, which is why coolant CBM is a goodput control.	NVIDIA OCP / Introl (2025)
~128 weeks	Standard HV power-transformer lead time (up to ~60 months when constrained); a failed unit is a capacity strand, which mandates predictive catch.	Wood Mackenzie / pv magazine (2025)
~1,500 MW	Data-center load lost on one 230 kV fault, triggering NERC's rare Level 3 alert; upstream plant is now grid-critical.	NERC Level 3 Alert / Utility Dive (2026)
20-25%	UPS battery cell impedance rise vs baseline that flags a failing cell before any transfer event exposes it.	VRLA UPS predictive-maintenance literature (2025)
~0 L/kWh	WUE of closed-loop liquid designs: the loop eliminates evaporative water but adds a chemistry-maintenance program.	Microsoft / Vertiv synthesis (2025)

Maintenance windows in a 24/7 synchronous world

Here is the operational crux that distinguishes AI-factory maintenance from every prior generation of critical-facility maintenance: in a traditional data center you fail traffic away and take the window; in a synchronous AI cluster there is nowhere to fail to, so the window must be created without dropping the job. A pre-training run is one job. You cannot move 40% of it to another availability zone for an afternoon. You have exactly three levers, and they form a clean decision tree.

Lever one: concurrent maintainability. If the facility is Tier III/IV-class, the preventive task happens on the redundant path while the job runs — the cleanest answer, and the one bought at design time. This is why the maintainability fork above is the master decision: it determines whether maintenance is invisible to goodput or a negotiation with it.

Lever two: coordinate with the checkpoint cadence. When a task genuinely requires the load down (a non-redundant-path job, an N-topology hall, a fault that breaks concurrent maintainability) you do not pick a clock time; you pick a checkpoint boundary. The job is already writing checkpoints at the Young/Daly optimal interval (Chapter 9.4). A planned stoppage taken right after a checkpoint completes loses almost no completed work, and at worst one interval of recompute; the job pays the restart overhead, not a from-scratch restart. An unplanned trip, by contrast, lands at a random point in the interval and loses about half an interval on average. The maintenance calendar and the ML platform's checkpoint schedule must therefore be planned together, which is an organizational requirement, not a technical one (below).
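To put rough numbers on the checkpoint-boundary argument, the sketch below uses Young's first-order approximation for the optimal checkpoint interval, sqrt(2 × checkpoint cost × MTBF), and a simple uniform-arrival model for lost work. The function names and the example inputs are assumptions for illustration; Chapter 9.4 remains the reference for how the interval is set in practice.

```python
import math


def young_daly_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_recompute_unplanned_s(interval_s: float) -> float:
    """A stop at a uniformly random point in the interval loses half of it on average."""
    return interval_s / 2.0


def recompute_planned_s(seconds_since_last_checkpoint: float) -> float:
    """A planned stop loses only the work done since the last completed checkpoint."""
    return max(0.0, seconds_since_last_checkpoint)


if __name__ == "__main__":
    # Hypothetical job: 50 s to write a checkpoint, 10,000 s cluster MTBF.
    interval = young_daly_interval_s(checkpoint_cost_s=50.0, mtbf_s=10_000.0)
    print(f"checkpoint interval: {interval:.0f} s")
    print(f"unplanned stop, expected recompute: {expected_recompute_unplanned_s(interval):.0f} s")
    print(f"planned stop at boundary, recompute: {recompute_planned_s(0.0):.0f} s")
```

Under this model a stop taken exactly at a boundary loses no completed work, a stop at a random point loses half an interval on average, and one full interval is the worst case. Restart overhead is paid in every case and is left out of the comparison.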

Lever three: batch the work. If the load is coming down anyway — for a firmware campaign (Chapter 14.8), a fabric upgrade, a planned migration — the discipline is to pull every deferrable preventive task into that same window. A drained hall is the scarcest resource in an AI factory; wasting it on a single task is a planning failure.
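The three levers can be written down as a small decision function. This is one possible ordering of the tree, offered as a sketch: the enum names and the three boolean inputs are illustrative assumptions, and a real scheduler would also weigh risk during the window and the notice period.

```python
from enum import Enum


class Window(Enum):
    LIVE_ON_REDUNDANT_PATH = "service live on the redundant path"
    BATCH_INTO_PLANNED_DRAIN = "fold into an already planned drain"
    STOP_AT_CHECKPOINT_BOUNDARY = "negotiate a stop at a checkpoint boundary"


def choose_window(concurrently_maintainable: bool,
                  drain_already_planned: bool,
                  can_wait_for_drain: bool) -> Window:
    """Pick the least disruptive way to create a maintenance window.

    concurrently_maintainable: the affected path can be isolated without dropping load.
    drain_already_planned:     the load is coming down anyway (firmware, fabric work).
    can_wait_for_drain:        the task's condition data allows deferral until that drain.
    """
    if concurrently_maintainable:
        return Window.LIVE_ON_REDUNDANT_PATH
    if drain_already_planned and can_wait_for_drain:
        return Window.BATCH_INTO_PLANNED_DRAIN
    return Window.STOP_AT_CHECKPOINT_BOUNDARY


if __name__ == "__main__":
    print(choose_window(True, False, False).value)
    print(choose_window(False, True, True).value)
    print(choose_window(False, True, False).value)
```

The third input is where condition-based data earns its keep: only a measured trend can say whether a task can safely wait for the next planned drain.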

The consequence of getting this wrong is not abstract. An operator that runs an N-topology training hall on a time-based PM calendar — fixed-date transformer inspections, fixed-interval battery service — without coordinating those dates with the workload will either (a) defer the PM and accept rising failure risk, or (b) take the dates and bleed goodput on the platform team's behalf without telling them. Both are common; both are avoidable; both trace to treating the maintenance calendar as a facility artifact rather than a joint facility/ML-ops artifact.

Deep dive: who owns the maintenance calendar across the IT/facility and colo/operator boundary

The single most reliable source of avoidable AI-factory downtime is not an equipment failure — it is the seam between the body that schedules a power or cooling outage and the body that owns goodput. In a self-build hyperscaler these can be aligned by org design. In the far more common case — a tenant running GPUs in a wholesale colo, or an enterprise on a neocloud — they are structurally separate. The colo operator owns the transformer, the UPS, the chillers, and their maintenance calendar; the tenant owns the accelerators and the job. The operator's standard maintenance notice ('we are taking UPS module B2 out for service on the 14th, no expected impact under N+1') is written for an enterprise tenant whose workload tolerates it. For a synchronous training tenant, 'no expected impact' is true for the facility and false for the job if anything goes wrong during the window with the redundant path carrying full load — the moment of a PM is exactly when the surviving path is most stressed.

The discipline that closes the seam: (1) a contractual maintenance-notice window long enough for the tenant to align it with a checkpoint boundary; (2) a shared change calendar visible to both the facility-ops body and the ML-platform body, not two private calendars; (3) explicit ownership of who calls 'go/no-go' on a window when a training run is mid-epoch; and (4) a standing agreement that the redundant-path stress during a PM is treated as an elevated-risk period, not business-as-usual. This is the operational-organization and incident-command material of Chapter 14.11 and the change-management framework of Chapter 14.12, but it surfaces here because the maintenance calendar is where the boundary is tested most often. The colo/operator boundary is, in practice, the boundary that decides whether your preventive maintenance is invisible or expensive.

Spares forecasting for power and cooling plant

Predictive maintenance buys you warning; spares strategy decides whether the warning matters. A CDU pump you can see failing three weeks out via vibration trending is only useful if you have the replacement pump — or can get it inside three weeks. The entire value of a predictive program collapses if the spare is a 16-week procurement. Spares forecasting for plant is therefore the necessary complement to the predictive program, and it obeys a different logic than IT-component sparing (the FRU/RMA machinery of Chapter 14.6).

The forecasting inputs are three: the component's failure rate (from fleet reliability data — Chapter 14.3), the component's replacement lead time, and the criticality of its function (does its loss drop load, or merely consume redundancy?). The product of failure rate and lead time, weighted by criticality, sets the on-site stocking level. The plant-specific wrinkle is the brutal lead-time asymmetry: commodity items (fans, filters, pumps, capacitors, contactors, sensors) are days-to-weeks and cheaply over-stocked; long-lead items (power transformers at ~128 weeks, switchgear lineups, CDUs, chillers) cannot be spared on the shelf at all for cost and space reasons, which forces a different hedge — predictive catch with maximum warning, manufacturer hold agreements, designed-in redundancy so the unit's loss is survivable, and in the extreme, a shared regional spare across sites. The fork: you either hold the spare or you hold the warning, and which one is available is a function of lead time and unit cost, not preference.
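One common way to turn "failure rate times lead time, weighted by criticality" into a stocking number is to model failures during the replenishment lead time as a Poisson process and stock enough to cover demand at a chosen service level. That Poisson model is an assumption introduced here, not something the chapter prescribes, and the service levels attached to each criticality class are placeholders.

```python
import math

# Placeholder mapping from criticality to the probability of not stocking out
# during one replenishment lead time. Set these from your own risk appetite.
SERVICE_LEVEL = {"consumes_redundancy": 0.95, "drops_load": 0.99}


def poisson_cdf(k: int, lam: float) -> float:
    """P(demand <= k) for Poisson-distributed demand with mean `lam`."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def stock_level(units_installed: int, annual_failure_rate: float,
                lead_time_weeks: float, service_level: float) -> int:
    """Smallest on-site stock that covers lead-time demand at the given service level."""
    if not 0.0 < service_level < 1.0:
        raise ValueError("service level must be strictly between 0 and 1")
    expected_failures = units_installed * annual_failure_rate * lead_time_weeks / 52.0
    stock = 0
    while poisson_cdf(stock, expected_failures) < service_level:
        stock += 1
    return stock


if __name__ == "__main__":
    # Hypothetical fleet: 40 pumps, 5% annual failure rate, 26-week lead time.
    for criticality, level in SERVICE_LEVEL.items():
        print(criticality, stock_level(40, 0.05, 26, level))
```

Run the same function with a multi-year lead time and a large, costly unit and the answer becomes a number you cannot put on a shelf, which is the arithmetic behind the hold-the-warning posture for transformers and switchgear.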

Spares posture by plant component class

Component class	Typical lead time	Posture	Why
Filters, fans, sensors, contactors	Days	Hold spare (deep)	Cheap, small, high turnover; consumed by the PM program itself
Pumps, capacitors, breakers, valves	Weeks	Hold spare (working stock)	Failure-rate × lead-time justifies on-site; repairs must beat thermal/ride-through margin
UPS modules, CRAH units	Weeks to months	Hold 1-2 + redundancy	Modular redundancy covers the gap; one floater per facility is common
CDUs	Months	Hold warning + N+1 design	Too costly/large to shelf many; redundancy plus predictive catch is the hedge
Chillers	Months	Hold warning + N+1 plant	Plant-level redundancy carries the loss; vibration/approach-temp CBM gives runway
Power transformers, switchgear lineups	~1-4+ years	Hold warning + hold agreement	Un-shelvable; DGA/PD predictive catch and a manufacturer slot are the only real hedge

Stocking posture follows lead time and criticality, not uniform policy. 'Hold spare' = on-site inventory; 'Hold warning' = predictive catch + redundancy + hold agreements, because the unit is too costly/long-lead to shelf.

Predictive maintenance is a goodput investment, not a facilities cost line

The reframe that justifies the whole program: in an AI factory, a power or cooling fault does not just risk availability; it risks goodput, and reliability overhead puts 6-21% of TCO at stake (Chapter 14.1). A CDU pump that fails uncaught throttles a hall of GPUs and may restart a synchronous job from its last checkpoint; the cost is not the pump, it is the lost compute-hours across thousands of accelerators. That asymmetry, a few-thousand-dollar component whose silent failure costs millions in lost training time, is exactly the economics that makes condition-based and predictive maintenance pay for itself many times over on the critical-path plant, even though the same math would never justify it for a corridor light. Budget the sensors, the baselines, and the chemistry program out of the goodput line, not the facilities line, because that is where the return actually lands.
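The asymmetry can be checked with back-of-envelope arithmetic. The sketch below prices lost compute as GPUs × hourly rate × fraction of throughput lost × duration; every input in the example (fleet size, rate, throttle depth, duration) is a hypothetical placeholder, so substitute your own figures.

```python
def lost_compute_cost(gpus: int, cost_per_gpu_hour: float,
                      throughput_lost_fraction: float, hours: float) -> float:
    """Cost of compute that was paid for but did not produce training progress."""
    if not 0.0 <= throughput_lost_fraction <= 1.0:
        raise ValueError("lost fraction must be between 0 and 1")
    return gpus * cost_per_gpu_hour * throughput_lost_fraction * hours


if __name__ == "__main__":
    # Hypothetical: a 16,384-GPU hall at $3 per GPU-hour, throttled 50%
    # for two weeks (336 hours) by loop fouling that nobody traced.
    silent_throttle = lost_compute_cost(16_384, 3.0, 0.5, 336.0)
    # Hypothetical: the same hall fully stopped for two hours by a restart.
    clean_restart = lost_compute_cost(16_384, 3.0, 1.0, 2.0)
    print(f"silent throttle: ${silent_throttle:,.0f}")
    print(f"clean restart:   ${clean_restart:,.0f}")
```

With these placeholder inputs the slow, silent degradation costs far more than the clean stop, which is why the sensors that catch it belong on the goodput budget.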

This chapter assumes the failure taxonomy and fleet failure-rate data it acts on — those are canonical in Chapter 14.3, and they feed the availability model in Chapter 12.1. The telemetry and DCIM stack that carries the condition signals, plus alarm rationalization, is Chapter 14.2. The redundancy topologies that determine concurrent maintainability are Chapter 12.1; the goodput-vs-availability reframe that values maintenance is Chapter 12.2. Power-chain physics behind the stress profile (transients, ride-through, energy storage) live in Chapter 4.5; the DLC system and CDU redundancy that this chapter maintains are engineered in Chapter 5.4, with cooling-controls transient dynamics in Chapter 5.12. The checkpoint cadence that maintenance windows must align to is Chapter 9.4. Spares logistics and the RMA lifecycle continue in Chapter 14.6; firmware-campaign windows in Chapter 14.8; the org boundary, incident command, and change-management discipline that own the maintenance calendar in Chapter 14.11 and Chapter 14.12.