
Chapter 14.2

DCIM, Telemetry & Observability for GPU-Dense, Liquid-Cooled Facilities

A GPU-dense liquid-cooled facility runs two telemetry universes, second-scale IT health and minute-scale OT plant, and the operator's defining decision is whether to correlate them into one goodput-aware control plane or leave them as two siloed pipelines that each blame the other when a cluster stalls.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Whether facility (OT/BMS/SCADA) and IT (DCGM/Redfish/telemetry) data live in one correlated time-series plane or in two siloed stacks that meet only in a human's head during an incident.
What you sample, at what rate, and where you keep it — the cardinality-and-retention bargain that decides whether your observability bill or your blind-spot risk dominates.
Where the alarm philosophy lives: an ISA-18.2 / EEMUA-191-rationalized OT alarm system distinct from SLO-burn IT alerting, with shelving and flood-suppression designed in before the first transient.
How far up the autonomy ladder you commit the DCIM — read-only dashboards, advisory analytics, closed-loop setpoint control, or an operational digital twin that closes the loop against an as-built model.
Which signals are load-bearing for the metrics that pay the bills — goodput, PUE, time-to-detect, time-to-cordon — versus which are vanity telemetry you are paying to store and never query.

A GPU-dense, liquid-cooled AI factory is the most heavily instrumented building most operators will ever run, and also the one where the instrumentation most often fails to answer the only question that matters: why did goodput just drop? The reason is structural. Two telemetry universes coexist in the same hall and almost never share a clock, a namespace, or an on-call rotation. On one side is the IT plane (per-GPU temperature, power, SM clocks, ECC and XID/SXID error events, NVLink and fabric counters), sampled at one-to-ten-second cadence by agents like NVIDIA's DCGM exporter and scraped into a time-series database. On the other is the OT plane (CDU flow and pressure, coolant supply/return temperature, dew point, valve positions, pump speeds, leak-detection zones, switchgear and UPS state), polled at roughly ten-to-sixty-second cadence over Modbus, BACnet, SNMP, and increasingly Redfish, surfaced through a BMS/SCADA front end and a DCIM platform.

This chapter is about the decisions that determine whether those two universes become one observable system or stay two. We start from the stack itself — what each layer measures and at what rate — then take the central fork: siloed telemetry versus IT/facility correlation, and the protocol and data-architecture choices that make correlation possible or impossible. We treat liquid-cooling observability as its own discipline, because liquid is the subsystem where a missed signal turns into a flooded electrical room. We separate alarm management (an OT engineering discipline on the ISA-18.2 / EEMUA-191 lineage) from alerting (an SRE discipline of SLO burn), because conflating them is how you build a system nobody trusts. And we close on the operational twin — the live, as-built sibling of the design-validation twin from Chapter 2.7 — and the autonomy ladder a DCIM climbs from dashboard to closed-loop control.

Two telemetry universes, one building

The instinct of a newcomer is to treat telemetry as a volume problem — collect everything, store it forever, and the answers will be in there somewhere. In a 100,000-GPU campus that instinct produces a monitoring system that is simultaneously too expensive and unable to answer an incident. The right framing is a layered stack, where each layer has a native protocol, a native cadence, and a native owner, and the engineering question is how the layers connect, not how much each one emits.

The IT/AI telemetry layer is the fast one. Per-GPU health flows from the device through NVML into DCGM and out via an exporter in the Prometheus text exposition format; XID and SXID events (the GPU and NVSwitch error codes) are the load-bearing failure signals, and on a healthy device the error counters (remapped rows, PCIe replay, uncorrectable ECC) sit at zero, so a nonzero value is itself the alarm. Above the device sit the fabric counters (InfiniBand/UFM or Ethernet/NetQ), the scheduler's view of job and node state, and the application's own loss curve and step time. This layer's cadence is seconds, its cardinality is very high (a single GB200 NVL72 rack is 72 GPUs times dozens of metrics each), and its consumer is the SRE who needs to cordon a lemon node before it poisons a collective.
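The "nonzero is the alarm" rule can be sketched as a filter over exporter output. This is a minimal sketch, not a DCGM integration: the sample text is hand-written, the metric names follow DCGM field naming but should be verified against your exporter's configuration, and the labels and the `ZERO_EXPECTED` set are assumptions made for illustration.

```python
import re

# Hand-written sample in Prometheus text format (illustrative, not real exporter output).
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 61
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="1",Hostname="node-a"} 2
"""

# Assumption: counters that should read zero on a healthy device.
ZERO_EXPECTED = {
    "DCGM_FI_DEV_ROW_REMAP_FAILURE",
    "DCGM_FI_DEV_PCIE_REPLAY_COUNTER",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
}

# metric_name{labels} value
LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')


def nonzero_error_counters(text):
    """Return (metric, gpu, value) for every zero-expected counter that is nonzero."""
    hits = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # comment lines and blanks do not match the sample pattern
        name, labels, value = match.groups()
        if name in ZERO_EXPECTED and float(value) != 0.0:
            label_map = dict(re.findall(r'(\w+)="([^"]*)"', labels))
            hits.append((name, label_map.get("gpu"), float(value)))
    return hits


print(nonzero_error_counters(SAMPLE))
```

Temperature and other gauges are deliberately ignored here; only counters with a known healthy value of zero are treated as self-alarming.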

The facility/OT layer is the slow, durable one. Power chain (switchgear, UPS, PDU, branch-circuit metering), cooling plant (chillers, dry coolers, CDUs, pumps, valves), environmental sensors, and physical-security systems report through PLCs and a BMS into a SCADA/DCIM front end. Its cadence is tens of seconds to minutes, its cardinality is modest, and its consumer is the facility engineer and the alarm system. Critically, this layer is governed by a different culture: it values determinism, alarm rationalization, and long-horizon trending over the high-cardinality, ephemeral-by-design ethos of the IT side. A thermal event originates in the OT layer and manifests as a goodput collapse in the IT layer, and the two layers, left to themselves, will not tell you they are the same event.

The master fork: siloed telemetry vs IT/facility correlation

Decide whether OT and IT telemetry share one time-series plane before you buy the DCIM. In a siloed architecture the BMS/DCIM and the IT observability stack are separate products with separate databases, separate clocks, and separate on-call rotations. Silos are cheaper to procure and each product is best-of-breed, but when a CDU pump degrades and 72 GPUs throttle, the facility dashboard shows a flow anomaly, the IT dashboard shows an MFU cliff, and a human has to guess they are the same incident. In a correlated architecture, OT and IT telemetry land in a common time-series plane with synchronized time and a shared asset namespace (this GPU is in this rack, which is on this CDU, which is on this branch circuit), so the throttle and the flow anomaly appear on one timeline as cause and effect. Correlation costs more in integration and forces an organizational merge of two cultures, but it is the only architecture in which mean-time-to-diagnose for a power-or-cooling-induced goodput loss is measured in seconds rather than the duration of a bridge call. The timing matters because retrofitting a shared namespace onto two mature silos is a multi-quarter project.
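What "one timeline as cause and effect" means mechanically can be sketched as a temporal join: pair each IT event with any OT event on the same CDU that preceded it within a short window. The `Event` type, the asset keys, and the 30-second window are all hypothetical, and the sketch assumes timestamps already sit on a synchronized clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float   # seconds since epoch, assumed to be on a synchronized clock
    cdu: str    # shared asset key (hypothetical naming)
    kind: str


def correlate(ot_events, it_events, window_s=30.0):
    """Pair each IT event with OT events on the same CDU that precede it by at most window_s."""
    pairs = []
    for it_event in it_events:
        for ot_event in ot_events:
            lag = it_event.ts - ot_event.ts
            if ot_event.cdu == it_event.cdu and 0.0 <= lag <= window_s:
                pairs.append((ot_event, it_event))
    return pairs


ot = [Event(100.0, "cdu-07", "flow_low"), Event(400.0, "cdu-02", "flow_low")]
it = [Event(112.0, "cdu-07", "gpu_throttle"), Event(115.0, "cdu-03", "gpu_throttle")]
pairs = correlate(ot, it)
print(pairs)
```

In a siloed stack neither list carries the shared `cdu` key, so this join cannot be written at all; that is the practical difference between the two architectures.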

The telemetry stack, layer by layer

Layer	What it measures	Native protocol	Typical cadence	Cardinality	Primary consumer
Application / job	Loss, step time, MFU, collective stall, checkpoint events	Framework + scheduler API; OTel traces	Per step (sub-second to seconds)	Low per job, high across fleet	ML platform / training SRE
GPU / node health	Temp, power, SM clock, ECC, XID/SXID, NVLink, remapped rows	NVML → DCGM → exporter (Prometheus)	1–10 s	Very high (per-GPU × dozens)	Cluster SRE / autonomous recovery
Fabric	Port counters, BER, congestion, link flaps, PFC/ECN	UFM / SNMP / gNMI / NetQ	1–30 s	High (per-port)	Network ops
Rack / power chain	Branch-circuit power, PDU, busbar, BBU/UPS state	Redfish / SNMP / Modbus	5–30 s	Medium	Facility electrical
Liquid-cooling plant	CDU flow/pressure, supply/return temp, dew point, leak zones, valve/pump state	Redfish (DSP2064) / BACnet / Modbus	10–60 s	Medium	Facility mechanical / alarm system
Heat rejection / utility	Chiller/tower/dry-cooler state, ambient, WUE, meter, switchgear	BACnet / Modbus / DNP3 / SCADA	30–60 s	Low–medium	Facility / energy ops

Cadence and cardinality are 2026 practitioner norms for a GPU-dense liquid-cooled hall; protocols are the dominant choices, not the only ones. DCGM/Redfish figures cross-checked against NVIDIA and DMTF documentation.

Protocols and data architecture: the seam where correlation succeeds or fails

Correlation is not a dashboard feature you switch on; it is a data-architecture decision made at the seam between the layers above. Three sub-decisions determine whether the seam holds.

The protocol convergence decision. Historically the IT plane spoke IPMI and the OT plane spoke Modbus/BACnet, and never the twain met. The 2026 lever is Redfish (the DMTF's RESTful successor to IPMI), which has crossed from a server-management API into a facility-telemetry API: the DMTF added liquid-cooling equipment to Redfish in the 2023.1 release (the DSP2064 model for CDUs, coolant loops, leak detection, pumps, and reservoirs), so a CDU and a GPU can now expose their state through the same schema and HTTP transport. That does not eliminate the BMS — chillers, switchgear, and legacy plant still speak BACnet and Modbus — but it gives you a convergence target. The fork is whether you adopt Redfish-as-the-common-model and bridge the stragglers, or keep a polyglot architecture and pay for translation at every integration. Above the wire, OpenTelemetry is becoming the lingua franca for the IT side's metrics, traces, and logs, which lets application step-time and node health share a schema with the SRE tooling.

The time-sync and namespace decision. Correlation is impossible without a shared clock and a shared identity. A throttle event timestamped by a GPU's monotonic clock and a flow anomaly timestamped by a PLC's local clock cannot be aligned to causality if the clocks drift by seconds — and at one-second IT cadence, seconds of skew destroy the signal. PTP/NTP discipline across both planes is a precondition, not a nicety. Equally, every datum needs an asset key that resolves the physical topology: GPU → node → rack → CDU → branch circuit → PDU → room. Without that graph you cannot answer "which GPUs are downstream of the pump that just slowed," which is the entire value proposition of correlation.
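The asset graph described above can be sketched as a child-to-parent map with a downward traversal. The asset keys and the `COOLED_BY` topology are hypothetical; a real deployment would load this from the DCIM's asset inventory and would carry the power chain as a second set of edges.

```python
from collections import defaultdict

# Hypothetical asset keys: child -> parent edges of the cooling topology.
COOLED_BY = {
    "gpu-0": "node-a", "gpu-1": "node-a", "gpu-2": "node-b",
    "node-a": "rack-12", "node-b": "rack-13",
    "rack-12": "cdu-07", "rack-13": "cdu-08",
}


def build_children(parent_of):
    """Invert child -> parent edges into parent -> [children]."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return children


def downstream(asset, children, prefix="gpu-"):
    """Return every asset beneath `asset` whose key starts with `prefix`."""
    found, stack = [], [asset]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child.startswith(prefix):
                found.append(child)
            stack.append(child)
    return sorted(found)


children = build_children(COOLED_BY)
print(downstream("cdu-07", children))  # → ['gpu-0', 'gpu-1']
```

The query "which GPUs are downstream of the pump that just slowed" reduces to one call once every datum carries a key that appears in this graph.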

The store-and-retain decision. A GPU-dense fleet is a cardinality bomb. Store every per-GPU metric at full resolution forever and your time-series database cost rivals a non-trivial slice of the power bill; store too little and the one transient you needed is gone before the postmortem. The mature pattern is tiered: high-resolution hot storage for days, downsampled rollups for months, and event-triggered full-fidelity capture (a "flight recorder" that keeps raw samples around any XID, throttle, or leak event). The decision you are actually making is where to put the blind spot — and the only defensible answer is to keep full fidelity exactly where incidents originate and accept lossy aggregation everywhere else.
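The flight-recorder pattern can be sketched with a bounded ring buffer: raw samples are overwritten unless an event fires, at which point the preceding history and a short tail are persisted. The class name and the `pre`/`post` sizes are assumptions for illustration, and an in-memory list stands in for durable storage.

```python
from collections import deque


class FlightRecorder:
    """Keep the last `pre` raw samples; on a trigger, persist them plus the next `post` samples."""

    def __init__(self, pre=5, post=3):
        self.ring = deque(maxlen=pre)
        self.post = post
        self.remaining = 0
        self.captured = []  # stands in for durable full-fidelity storage

    def add(self, sample, trigger=False):
        if trigger and self.remaining == 0:
            self.captured.extend(self.ring)  # raw history leading up to the event
            self.ring.clear()
            self.captured.append(sample)
            self.remaining = self.post
        elif self.remaining > 0:
            self.captured.append(sample)     # tail after the event
            self.remaining -= 1
        else:
            self.ring.append(sample)         # silently overwritten once older than `pre`


rec = FlightRecorder(pre=3, post=2)
for t in range(10):
    rec.add(t, trigger=(t == 6))
print(rec.captured)  # → [3, 4, 5, 6, 7, 8]
```

Samples 0 through 2 are lost by design: that is the blind spot, placed deliberately outside the window around the event.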

Figure	What it describes	Year	Source
~90% / ~96%	Goodput (effective training time), industry average vs best-in-class; the metric correlation exists to protect	2025	SemiAnalysis ClusterMAX / CoreWeave
~7 days	Top-tier H100 operator MTBF per 512 GPUs; the failure cadence the IT telemetry plane must catch	2025	SemiAnalysis (100k H100 clusters)
419	Unplanned interruptions over 54 days on 16,384 H100s (~1 every 3 hr); 78% hardware-caused	2024	Meta (Llama 3 405B paper)
<25 °C / <10 °C	GB200 NVL72 coolant inlet ceiling and cold-plate delta-T; deviation throttles GPUs up to ~50%	2025	NVIDIA OCP / Introl
10–60 s	CDU/liquid-loop telemetry cadence (flow, pressure, temp, dew point, leak) in production AI halls	2026	Maintech / vendor liquid-cooling monitoring guidance
2023.1	Redfish release that added liquid-cooling equipment (DSP2064): CDUs, loops, leak detection, pumps	2023	DMTF Redfish / SPMF
~99.999%	Fleet CDU availability sustained since 2020 (Google Project Deschutes); the bar telemetry-driven ops targets	2025	Google Cloud (OCP EMEA)
>10 in 10 min	EEMUA-191 / ISA-18.2 alarm-flood threshold; the rate above which an operator can no longer respond	2016 (std.)	EEMUA 191 / ANSI-ISA-18.2

Liquid-cooling observability is its own discipline

In an air-cooled hall, a cooling failure is a slow thermal ramp that buys you minutes and at worst trips a thermal limit. In a direct-to-chip liquid hall, the failure modes are faster and more consequential, and the telemetry must be designed around them specifically. There are three you instrument for above all others.

Throttle-by-deviation. The GB200 NVL72 envelope is tight — coolant inlet below ~25 °C, roughly 80 L/min flow, under ~10 °C rise across the cold plates — and a deviation does not fail the rack, it throttles the GPUs up to ~50%. That signal is invisible to a naive availability monitor: every node is "up," every GPU is "healthy," and goodput has silently halved. Only by correlating the CDU's flow/temperature telemetry with the GPUs' clock-throttle reason codes do you see that a cooling deviation, not a software bug, is eating the run. This is the single most important correlation in a liquid-cooled facility.
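A minimal sketch of that correlation follows. It uses the approximate envelope figures quoted above (not a vendor specification), takes loop return minus inlet temperature as a proxy for cold-plate rise, and reduces the GPU side to a single boolean that the caller would derive from throttle-reason telemetry. All names are hypothetical.

```python
from dataclasses import dataclass

# Approximate envelope from the text above; treat as illustrative, not a vendor spec.
MAX_INLET_C = 25.0
MAX_DELTA_T_C = 10.0


@dataclass
class CduSample:
    inlet_c: float    # coolant temperature supplied to the rack
    return_c: float   # coolant temperature returning from the rack


def cooling_deviation(sample):
    """List which envelope limits the CDU sample violates."""
    reasons = []
    if sample.inlet_c >= MAX_INLET_C:
        reasons.append("inlet_high")
    if sample.return_c - sample.inlet_c >= MAX_DELTA_T_C:
        reasons.append("delta_t_high")
    return reasons


def attribute_throttle(gpu_thermal_throttled, cdu):
    """Call a throttle cooling-induced only if the CDU is also out of envelope."""
    if not gpu_thermal_throttled:
        return "no_throttle"
    reasons = cooling_deviation(cdu)
    if reasons:
        return "cooling_deviation:" + ",".join(reasons)
    return "throttle_unexplained_by_cdu"


print(attribute_throttle(True, CduSample(inlet_c=27.0, return_c=33.0)))
```

The third outcome matters as much as the first: a throttle with the CDU inside its envelope points the investigation away from the plant.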

Leak detection. Liquid above an electrical room is a step-change in risk, so leak telemetry is zoned, redundant, and wired to act: rope/spot sensors per rack and per CDU, pressure-decay detection on the loops, and reservoir-level trending that catches slow loss before it becomes a puddle. The mature posture pairs detection with negative-pressure loop design and automated isolation valves, so a confirmed leak can drain rather than spray. The telemetry decision is the confidence threshold: a false trip drops a rack of GPUs, a missed trip wets a busbar, and the gate is set so an isolation command fires only when leak sensors agree and flow/pressure confirm.
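The confidence gate can be sketched as sensor voting combined with a hydraulic confirmation. The two-sensor quorum and the 5% drop threshold are assumptions chosen for illustration; the source states only that sensors must agree and that flow or pressure must confirm.

```python
def should_isolate(leak_sensors, flow_lpm, design_flow_lpm,
                   pressure_kpa, design_pressure_kpa,
                   min_agree=2, drop_fraction=0.05):
    """Fire isolation only if enough leak sensors agree AND flow or pressure confirms a loss."""
    agreeing = sum(1 for wet in leak_sensors if wet)
    sensors_agree = agreeing >= min_agree
    flow_drop = flow_lpm < design_flow_lpm * (1.0 - drop_fraction)
    pressure_drop = pressure_kpa < design_pressure_kpa * (1.0 - drop_fraction)
    return sensors_agree and (flow_drop or pressure_drop)


# Two of three sensors wet and flow is more than 5% below design: isolate.
print(should_isolate([True, True, False], 70.0, 80.0, 200.0, 200.0))
```

Raising `min_agree` trades missed trips for fewer false ones; that single parameter is the confidence threshold the paragraph describes.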

Fluid quality and slow drift. The signals that predict a failure weeks out are not dramatic: conductivity and pH creep, particulate and filter-differential pressure, biological fouling, dew-point margin narrowing toward condensation. These feed the predictive-maintenance program (canonical in Chapter 14.5) rather than the alarm system, and they are exactly the trends a siloed IT stack never sees and a correlated plane surfaces as a degrading-CDU early warning.

Deep dive: instrumenting the technology-cooling vs facility-water loop boundary

A direct-to-chip system is two loops joined at the CDU's heat exchanger: the closed, clean technology cooling system (TCS) loop that touches the cold plates, and the facility water system (FWS) loop that carries heat to rejection. The CDU isolates them, and the most diagnostic telemetry lives at that boundary, because most thermal incidents present as a divergence across it. Supply and return temperature on both sides give you the approach temperature of the heat exchanger; a rising approach at constant load means the HX is fouling or a pump is degrading long before any GPU throttles. Flow on both sides, differenced, localizes whether a problem is in the rack manifolds (TCS) or the plant (FWS). Filter differential pressure on the TCS side trends toward a clog. Dew point versus TCS supply temperature is the condensation margin you must never cross.

The decision this forces is where you place the redundant instrumentation. CDUs ship with internal sensors, but a single internal flow meter is a single point of diagnostic failure — when it drifts, you cannot tell a real flow loss from a bad sensor. Mature operators add independent facility-side metering so the CDU's self-report can be cross-checked, and they alarm on the disagreement between the two as a sensor-health signal in its own right. This is the liquid-cooling expression of a general telemetry principle: instrument the boundary, cross-check the critical sensor, and treat sensor disagreement as a first-class event. Loop separation, CDU redundancy, and commissioning of this instrumentation are detailed in Chapter 5.11 and Chapter 13.5.
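The boundary checks in this deep dive reduce to a few small calculations. In the sketch below, approach temperature is taken as TCS supply minus FWS supply, the 5% disagreement tolerance and the 1 °C rise threshold are hypothetical, and the trend check is deliberately crude.

```python
def approach_temp_c(tcs_supply_c, fws_supply_c):
    """Approach: how close the TCS supply gets to the facility water entering the heat exchanger."""
    return tcs_supply_c - fws_supply_c


def sensor_disagreement(cdu_flow_lpm, facility_flow_lpm, tolerance=0.05):
    """True when the CDU's self-reported flow and the independent meter differ by more than tolerance."""
    reference = max(abs(facility_flow_lpm), 1e-9)  # avoid division by zero
    return abs(cdu_flow_lpm - facility_flow_lpm) / reference > tolerance


def approach_rising(history_c, min_rise_c=1.0):
    """Crude trend check at constant load: latest approach versus the earliest in the window."""
    return len(history_c) >= 2 and history_c[-1] - history_c[0] >= min_rise_c


print(approach_temp_c(24.0, 20.0))      # → 4.0
print(sensor_disagreement(80.0, 72.0))  # → True
```

A true result from `sensor_disagreement` says nothing about which meter is wrong; it is raised as its own event so that a later flow alarm can be weighed against known sensor health.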

Alarm management is not alerting

The most common cultural failure in an AI-factory operations center is treating the OT alarm system and the IT alert system as the same thing. They are different disciplines with different standards, and merging them naively produces the worst of both: an alarm flood nobody can action and an SLO-burn signal nobody trusts.

Alarm management is an OT engineering discipline on the ISA-18.2 / IEC 62682 / EEMUA-191 lineage, and its core insight is that an alarm is a request for operator action, not a notification. Every alarm must survive rationalization: it is justified against a written alarm philosophy, documented with a consequence, a required operator action, and a time-to-respond, and anything that fails that test is not an alarm. The discipline exists because of a quantifiable failure mode — the alarm flood, conventionally defined as more than ten new alarms in ten minutes, the rate above which a human operator is overwhelmed and starts ignoring the board. The standard's countermeasures are specific: rationalization to cut nuisance alarms, prioritization so the critical ones stand out, shelving (the operator-controlled, time-bounded, audited suppression of a known nuisance alarm), and state-based suppression so a planned maintenance evolution does not generate a thousand expected alarms. In a liquid-cooled hall this matters acutely: a single CDU trip can cascade flow, temperature, and throttle alarms across an entire row, and without designed flood suppression the operator sees a wall of red and cannot find the root cause inside it.
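The flood definition above is a sliding-window count, which can be sketched as follows. The class name is hypothetical, and a real alarm system would count per operator console and apply suppression before counting.

```python
from collections import deque


class FloodDetector:
    """Flag a flood when more than `limit` new alarms arrive within `window_s` seconds."""

    def __init__(self, limit=10, window_s=600.0):
        self.limit = limit
        self.window_s = window_s
        self.times = deque()

    def on_alarm(self, ts):
        """Record a new alarm at time ts (seconds) and report whether a flood is in progress."""
        self.times.append(ts)
        while self.times and ts - self.times[0] >= self.window_s:
            self.times.popleft()  # drop alarms that have aged out of the window
        return len(self.times) > self.limit


detector = FloodDetector()
burst = [detector.on_alarm(float(i)) for i in range(12)]
print(burst.index(True))  # → 10
```

The eleventh alarm inside ten minutes is the first to trip the detector, matching the "more than ten" wording.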

Alerting is the SRE discipline of paging on SLO burn — you define goodput, latency, or error budgets and alert when the rate of consumption threatens the objective, deliberately avoiding alerts on every transient symptom. The two systems must coexist, and the operator's decision is how they relate. The wrong answer is to dump every OT alarm into the IT pager or every IT symptom onto the OT board. The right answer is a federated model: each system rationalized to its own discipline, with a deliberate, narrow set of cross-links — a confirmed cooling alarm that is eating goodput should reach the training SRE, and a fleet-wide goodput collapse should prompt the facility operator to look at the plant. Correlation feeds both; it does not merge them.
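Paging on SLO burn can be sketched as a burn-rate calculation checked over two windows. The 95% goodput objective, the threshold of 3, and the choice of windows are hypothetical; the two-window form follows common SRE practice rather than anything the chapter prescribes.

```python
def burn_rate(bad_fraction, slo_target):
    """How many times faster than sustainable the error budget is being consumed."""
    budget = 1.0 - slo_target
    return bad_fraction / budget


def should_page(long_window_bad, short_window_bad, slo_target, threshold):
    """Page only if both windows burn fast: the long one shows it matters, the short one that it is ongoing."""
    return (burn_rate(long_window_bad, slo_target) >= threshold
            and burn_rate(short_window_bad, slo_target) >= threshold)


# Hypothetical goodput SLO of 95%: losing 20% over the long window burns budget about 4x too fast.
print(should_page(long_window_bad=0.20, short_window_bad=0.25,
                  slo_target=0.95, threshold=3.0))
```

A transient that has already recovered fails the short-window test and does not page, which is the deliberate contrast with alarming on every symptom.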

Alarm management (OT) vs alerting (IT) — two disciplines, deliberately federated

Dimension	Alarm management (OT)	Alerting / SRE (IT)
Governing standard	ISA-18.2 / IEC 62682 / EEMUA-191	SRE practice (SLO/error-budget, e.g. Google SRE)
Trigger philosophy	Abnormal condition requiring operator action	SLO burn rate threatening an objective
Unit of concern	The alarm (rationalized, prioritized, documented)	The symptom stream → the page
Failure mode designed against	Alarm flood (>10 in 10 min); operator overload	Alert fatigue; paging on noise
Suppression mechanism	Shelving, state-based suppression, MoC-governed	Inhibition, deduplication, burn-rate windows
Owner / consumer	Facility operator; control room	On-call SRE; incident response
Cadence / horizon	Seconds–minutes; long-trended, audited	Seconds; ephemeral, fast-cycling

The columns are the design distinctions an AI-factory ops center must preserve. Conflating the two is the recurring anti-pattern.

The unrationalized-alarm trap

The fastest way to make a brand-new, fully-instrumented facility operationally blind is to ship it with every sensor wired to an alarm and none of them rationalized. Day one looks impressive — thousands of points, a wall of dashboards — and the first real transient produces a flood of hundreds of correlated alarms in seconds, inside which the one alarm that matters is invisible. Operators learn within a week that the board is noise and stop trusting it, and from then on the alarm system is decorative. The fix is not more dashboards; it is rationalization before go-live — an alarm philosophy, a documented action and priority for every alarm, designed shelving and state-based suppression for known maintenance evolutions, and flood-handling tested during commissioning (Chapter 13.5). An alarm the operator cannot act on is not safety; it is the thing that hides the alarm they could have acted on.

The operational twin and the autonomy ladder

A DCIM is not one thing; it is a position on a ladder, and the operator's strategic decision is how far up that ladder to climb — because each rung trades a real capability against a real risk. The rungs are: read-only visualization (dashboards and trends, no analytics); advisory analytics (anomaly detection and predictive-maintenance lead times, human acts); closed-loop control (the DCIM moves setpoints — pump speed, valve position, power caps — within bounded authority); and the operational digital twin (a live, as-built model that simulates the consequence of an action before committing it, and increasingly runs an agentic control loop). The 2026 reference designs — NVIDIA's Omniverse DSX digital-twin blueprint, the hyperscalers' in-house plant analytics — point at the top of the ladder, but most operators are correct to climb deliberately rather than leap.

The crucial distinction this chapter draws is between the operational twin and the design-validation twin of Chapter 2.7. The design twin is built before steel is cut to validate a thermal and power design against a model; it is a planning artifact. The operational twin is the design twin's live descendant, continuously reconciled against as-built reality (real sensor streams, real failed pumps, real firmware revisions) so that when it predicts the effect of raising a setpoint, it is predicting against the building that exists, not the one that was drawn. The decision that makes or breaks an operational twin is the as-built fidelity loop: a twin that drifts from reality (a swapped CDU, a re-cabled rack, a derated chiller the model does not know about) gives confidently wrong predictions, which is worse than no twin at all. Closing that loop, with telemetry from continuous commissioning and re-commissioning feeding the model, is the subject of Chapter 14.14, and the autonomy decisions the twin enables are taken up in Chapter 14.13.
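One small piece of the as-built fidelity loop can be sketched as a reconciliation diff between what the twin believes and what inventory or telemetry reports. The asset keys and attribute dictionaries are hypothetical, and a real loop would also compare sensor-derived behavior, not just static attributes.

```python
def twin_drift(model, observed):
    """Return asset keys whose attributes differ between the twin and the as-built record."""
    drift = {}
    for key in sorted(set(model) | set(observed)):
        modeled, actual = model.get(key), observed.get(key)
        if modeled != actual:
            drift[key] = {"model": modeled, "observed": actual}
    return drift


# Hypothetical records: the chiller was derated and a rack was re-cabled to another CDU.
model = {
    "chiller-1": {"capacity_kw": 1200},
    "rack-12": {"cdu": "cdu-07"},
    "cdu-07": {"firmware": "2.1"},
}
observed = {
    "chiller-1": {"capacity_kw": 1000},
    "rack-12": {"cdu": "cdu-08"},
    "cdu-07": {"firmware": "2.1"},
}
print(sorted(twin_drift(model, observed)))  # → ['chiller-1', 'rack-12']
```

Any non-empty result is a reason to distrust the twin's predictions for the affected assets until the model is updated.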

Deep dive: why closed-loop DCIM control is gated on correlation and an as-built twin

It is tempting to let the DCIM close the loop early — let it raise pump speed when GPU temperatures climb, let it cap power when a transient threatens the UPS. The reason mature operators gate this is that a closed-loop action on a liquid-cooled GPU hall is coupled across the IT/OT seam in ways a single-plane controller cannot see. Raise the facility-water flow to cool a hot rack and you may pull supply temperature below the dew point and condense on a busbar; cap GPU power to relieve a thermal margin and you stall a synchronous collective and tank goodput across thousands of GPUs that were fine. A controller that sees only the OT plane optimizes the plant and breaks the workload; one that sees only the IT plane does the reverse.

This is why the autonomy ladder cannot be climbed faster than the correlation and twin underneath it. Closed-loop control is safe only when the controller can (a) see both planes on one clock and one asset graph, so it knows the GPUs downstream of the valve it is about to move, and (b) simulate the action against an as-built model before committing, so it catches the dew-point excursion or the goodput hit in silico. The practitioner gates of the production literature make this concrete — an automated cooling command fires only when leak sensors are precise, flow and pressure are within roughly ±5% of design, and GPU die temperature sits a couple of degrees below throttle. Those gates are correlation made executable. Build the correlated plane and the as-built twin first; earn the closed loop second. The full autonomy framing lives in Chapter 14.13.
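The gates can be sketched as a single permission check that returns which gate failed. This is an illustration under stated assumptions: the leak gate is reduced to one boolean supplied by the caller, the ±5% band and two-degree margin follow the rough figures above, and the dew-point margin and the `predicted_supply_c` input (standing in for the twin's simulated result) are hypothetical.

```python
def within(value, design, tolerance=0.05):
    """True if value is within ±tolerance (as a fraction) of the design figure."""
    return abs(value - design) <= tolerance * abs(design)


def command_permitted(leak_gate_ok, flow, design_flow, pressure, design_pressure,
                      gpu_die_c, throttle_c, predicted_supply_c, dew_point_c,
                      temp_margin_c=2.0, dew_margin_c=2.0):
    """Return (permitted, failed_gates) for an automated cooling command."""
    gates = {
        "leak_gate": leak_gate_ok,
        "flow_in_band": within(flow, design_flow),
        "pressure_in_band": within(pressure, design_pressure),
        "gpu_below_throttle": gpu_die_c <= throttle_c - temp_margin_c,
        # Assumed to come from simulating the action against the as-built model.
        "above_dew_point": predicted_supply_c >= dew_point_c + dew_margin_c,
    }
    failed = [name for name, ok in gates.items() if not ok]
    return (len(failed) == 0, failed)


print(command_permitted(True, 78.0, 80.0, 200.0, 200.0,
                        gpu_die_c=80.0, throttle_c=85.0,
                        predicted_supply_c=15.0, dew_point_c=14.0))
```

Here every plant-side gate passes and the command is still refused, because the simulated supply temperature lands too close to the dew point. Returning the failed gate names rather than a bare boolean keeps the refusal auditable.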

What to instrument for the metrics that pay the bills

The final discipline is editorial: a GPU-dense facility can emit more telemetry than anyone will ever query, so the question is not what you can measure but what is load-bearing for the four numbers that govern the business — goodput, PUE, time-to-detect, and time-to-cordon. Work backward from each.

Goodput (the metric canonicalized in Chapter 14.1) is protected by the correlation of throttle-reason codes, XID/SXID events, fabric link flaps, and CDU deviation — the signals that explain why effective training time dropped. If a metric does not help you attribute a goodput loss, it is not load-bearing for goodput.
PUE / WUE is protected by the power-chain and heat-rejection metering — branch-circuit power, CDU and chiller energy, water draw — trended over seasons. This is the OT plane's classic job and the place its long-horizon retention earns its keep.
Time-to-detect is set by the fastest plane's cadence and the quality of the failure signals on it — which is why per-GPU XID monitoring at one-second cadence, not minute-scale facility polling, is what catches a lemon node before it poisons a collective.
Time-to-cordon is set by how directly detection wires to action — the cordon-and-drain path that ejects a failing node, fed by the same health telemetry, treated in Chapter 10.7 (autonomous recovery) and the failure taxonomy of Chapter 14.3.

Everything else is, at best, useful context and, at worst, vanity telemetry you are paying to store. The cardinality-and-retention bargain from earlier in the chapter is the same decision seen from the cost side: keep full fidelity where it serves one of these four numbers, downsample everywhere else, and resist the impulse to retain everything forever because storage felt cheap on day one and did not stay that way at fleet scale.

This chapter is the observability spine of Part 14. The goodput and reliability-economics metrics it exists to protect are canonical in Chapter 14.1; the failure taxonomy and SDC mechanisms its telemetry must catch live in Chapter 14.3; the autonomous-recovery loop that consumes its signals is in Chapter 10.7 and operationally in Chapter 14.4. The predictive-maintenance program fed by its slow-drift signals is Chapter 14.5; the change-management and procedures discipline that governs alarm shelving and MoC is Chapter 14.12; the agentic-ops and closed-loop-control decisions the twin enables are Chapter 14.13; and the continuous/re-commissioning loop that keeps the as-built twin honest is Chapter 14.14. On the engineering side, the design-validation twin it descends from is Chapter 2.7; the liquid-cooling reliability, leak-detection, and CDU commissioning it instruments are Chapter 5.11 and Chapter 13.5; and the cooling-controls transient dynamics behind every setpoint decision are Chapter 5.12.