The Definitive Guide to AI Data Centers

Chapter 13.8

GPU Node Burn-In, Diagnostics & Stress Validation

A GPU node that boots and passes a smoke test is not a commissioned node — burn-in is the deliberate, time-bounded campaign that converts a hall full of accelerators into a fleet whose failures have already happened on your clock instead of mid-training-run on the customer's.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

  1. How long you soak — the 72-hour minimum that catches infant mortality versus the 168-hour (7-day) campaign that the strictest acceptance standards now demand — and what that schedule costs you in deferred revenue against a depreciation clock that is already running.
  2. Which diagnostic depth you run at each gate: light DCGM health checks (seconds) for triage versus the full long-stress / memtest / EUD passes (1.5+ hours per node) that actually exercise the silicon, and how you parallelize them across thousands of nodes without serializing the ramp.
  3. Where you draw the SDC-hunting line — what fraction of compute you spend on silent-corruption detection at commissioning versus pushing it into the day-2 fleet-scanner regime, given that no burn-in catches every marginal die.
  4. Your throttle-free thermal acceptance criterion: whether a node must hold full clocks for the entire soak under a realistic thermal load, or whether you accept documented throttle margin — and how you separate a bad GPU from a bad cold-plate, a bad CDU, or a bad rack position.
  5. The accept / RMA / quarantine decision boundary itself: the numeric thresholds (ECC error counts, XID codes, NVLink replay rates, straggler delta) that move a node from the production pool to the vendor return queue before it is ever scheduled.

Every prior chapter in Part 13 commissioned the building — the power chain energized, the cooling loop proven, the fabric link-tested, the integrated system demonstrated to ride through faults. This chapter commissions the thing the building exists to run: the GPU nodes themselves. It is the point in the program where the facility hands off to the machine, and it is the gate most often skipped under schedule pressure — because the nodes boot, the smoke test passes, and the temptation to declare victory and start scheduling is enormous. That temptation is the most expensive mistake in the bring-up. A GPU fleet has a bathtub-curve failure distribution: a population of marginal dies, cold solder joints, mis-seated HBM stacks, and borderline cold-plate contacts that will fail early and predictably. Burn-in is the deliberate forcing of that infant-mortality population to fail on your acceptance clock, in a controlled window, before any of it touches a synchronous training job whose entire run restarts from checkpoint when a single GPU dies.

The consequences here are unusually legible because they are denominated in goodput. Skip the soak and the infant-mortality failures simply move downstream into the first real workload, where a single bad GPU at hour 40 of a synchronous run means a checkpoint restart across the whole job, a straggler that drags the slowest-link collective, or worse, a silent corruption that poisons gradients for days before anyone notices. This chapter covers node bring-up and inventory, the DCGM diagnostic ladder, the 72-to-168-hour soak, HBM/ECC validation, thermal screening under load, silent-data-corruption (SDC) hunting, straggler detection, and power-behavior validation, with the downstream cost attached to each acceptance decision. It qualifies the individual node; Chapter 13.9 takes the qualified fleet to cluster scale.

Bring-up and inventory: trust nothing the BOM claims

The first act of burn-in is not a stress test — it is an inventory reconciliation against ground truth. At the scale of a modern AI hall, a non-trivial fraction of nodes arrive with something quietly wrong: a GPU running an older VBIOS than its siblings, an HBM stack that enumerates at the wrong capacity, a PCIe link that trained at x8 instead of x16, an NVLink that came up degraded, a NIC seated in the wrong slot, a firmware revision that does not match the qualified baseline. None of these prevent the node from booting. All of them silently destroy performance or reliability once the node is scheduled into a tightly-coupled job. The inventory pass enumerates every accelerator, every memory stack, every link, every firmware and driver version, and reconciles it against the qualified design baseline — not the purchase order, the baseline.

This is where the fork between homogeneity-as-acceptance-criterion and tolerate-drift is decided, and it is a real fork with a real cost. A synchronous training job moves at the speed of its slowest node; mixed firmware, mixed clock states, or one node with a degraded NVLink turns the whole scale-up domain into a straggler. The disciplined posture treats any deviation from the golden baseline — firmware, VBIOS, driver, link width, power limit, even GPU clock-offset — as a defect to be remediated before acceptance, not a curiosity to be noted. The cost of the alternative is paid every step of every collective for the life of the cluster. Out-of-band firmware management (Redfish / PLDM-over-MCTP, per the OCP GPU firmware spec) is what makes baseline enforcement tractable at fleet scale; the management fabric that carries it was commissioned in Chapter 8.7.
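The reconciliation described above can be sketched as a field-by-field comparison against a golden baseline. This is a minimal illustration, not a fleet tool: the field names, the baseline values, and the `reconcile` helper are all hypothetical, and a real inventory would be collected out-of-band or from the driver rather than typed into a dictionary.

```python
from typing import Dict, List

# Hypothetical golden baseline. Every key and value is a placeholder,
# not a real firmware or driver version.
GOLDEN_BASELINE: Dict[str, str] = {
    "vbios": "vbios-rev-B",
    "driver": "driver-rev-7",
    "pcie_width": "x16",
    "nvlink_state": "full",
    "power_limit_w": "1000",
}


def reconcile(node_inventory: Dict[str, str],
              baseline: Dict[str, str] = GOLDEN_BASELINE) -> List[str]:
    """Return one readable deviation per field that differs from the baseline.

    A field missing from the node inventory counts as a deviation, so a node
    that fails to report something cannot pass silently.
    """
    deviations = []
    for field, expected in baseline.items():
        observed = node_inventory.get(field, "<missing>")
        if observed != expected:
            deviations.append(f"{field}: expected {expected}, observed {observed}")
    return deviations


if __name__ == "__main__":
    node = dict(GOLDEN_BASELINE, pcie_width="x8")  # link trained at x8
    for line in reconcile(node):
        print(line)
```

Under the disciplined posture, an empty list is the only passing result; any entry sends the node to remediation before acceptance.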

The DCGM diagnostic ladder

NVIDIA's Data Center GPU Manager (DCGM) provides the de facto diagnostic ladder for the Blackwell/Hopper-class fleet, and its four run-levels are the vocabulary of GPU acceptance. The fork is depth-versus-throughput: you cannot afford to run the deepest level on every node continuously, and you cannot afford to not run it before acceptance. The right program runs each level at the right gate.

  • -r 1 (seconds): deployment / software sanity — does the GPU enumerate, is the driver healthy, are NVML/persistence-mode and the basic plumbing correct. This is the triage check you run constantly, including as a day-2 health gate before scheduling a job onto a node.
  • -r 2 (~2 min): adds quick integration and PCIe/NVLink bandwidth sanity. Still fast enough to gate a node before each job in production.
  • -r 3 (several min): hardware stress — targeted compute, memory bandwidth, and power-stress plugins that actually load the silicon. This is the lightest level that meaningfully exercises the GPU, and the minimum that belongs in an acceptance run.
  • -r 4 (~1.5 hr, GPU-count dependent): the deepest pass — adds the memtest plugin (memtest86-style HBM pattern testing) and, from DCGM 3.1+, the End-User Diagnostic (EUD), a vendor-supplied post-mortem-grade test. This is the acceptance-grade diagnostic: it is what you run at the start and end of the soak window.

The DCGM ladder is diagnostic, not durational — it answers "is this GPU defective right now," not "will this GPU survive a week under load." The latter question is what the soak exists to answer, and the two tools are complementary: long-running synthetic stress (GPU-Burn, dense GEMM loops, NCCL collective floods) holds the silicon at thermal and electrical full-load for hours-to-days while DCGM is sampled at the boundaries and on every fault.
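One way to keep "the right level at the right gate" from drifting is to encode the mapping once and generate the command from it. The sketch below assumes the `dcgmi diag -r <level>` invocation; the gate names and the `dcgm_diag_command` helper are hypothetical, and the block only builds the command, it does not run it.

```python
from typing import Dict, List

# Gate names are hypothetical labels; the run levels follow the ladder above.
RUN_LEVEL_BY_GATE: Dict[str, int] = {
    "pre_job_triage": 1,      # seconds: enumerate, driver and plumbing sanity
    "pre_job_bandwidth": 2,   # ~2 min: adds PCIe/NVLink bandwidth sanity
    "acceptance_minimum": 3,  # hardware stress: lightest level that loads the silicon
    "soak_entry": 4,          # deepest pass at the start of the soak
    "soak_exit": 4,           # and again at the end of the soak
}


def dcgm_diag_command(gate: str) -> List[str]:
    """Build (but do not execute) the diagnostic command for a given gate."""
    if gate not in RUN_LEVEL_BY_GATE:
        raise KeyError(f"unknown gate: {gate}")
    return ["dcgmi", "diag", "-r", str(RUN_LEVEL_BY_GATE[gate])]


if __name__ == "__main__":
    for gate in RUN_LEVEL_BY_GATE:
        print(gate, "->", " ".join(dcgm_diag_command(gate)))
```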

The soak: 72 hours, 168 hours, and the schedule cost of each

The central fork of this chapter is the soak duration, and it is a genuine economic decision, not a best-practice platitude. The candidate windows cluster around three values, and each buys a different slice of the bathtub curve at a different cost in deferred revenue.

Burn-in soak duration — the acceptance fork
| Soak window | What it catches | What it misses | Schedule / revenue cost | Typical use |
|---|---|---|---|---|
| Smoke test only (minutes–hours) | Dead-on-arrival, mis-cabled, won't-enumerate, gross firmware drift | Nearly all thermal, time-, and load-dependent infant mortality | Negligible | Triage gate; never an acceptance gate on its own |
| 72 hours (3-day minimum) | The bulk of infant-mortality: marginal dies, bad HBM stacks, weak solder, cold-plate contact failures | Slow-onset thermal drift; rare marginal cases on the long tail | ~3 days of deferred utilization per node-batch | Industry minimum for a credible acceptance |
| 168 hours (7-day full soak) | Adds the long-tail marginal population and diurnal/thermal-cycling-sensitive faults; spans a full operational week | Wear-out failures (those are day-2, not infant mortality) | ~7 days deferred; the largest revenue drag in the program | Strict acceptance; hyperscaler / top-tier neocloud standard |
| Thermal-cycled soak (power-cycle loops within the window) | Solder-joint fatigue, connector seating, quick-disconnect integrity under expansion/contraction | Same long-tail wear-out limits | Adds complexity; overlaps the 72–168 hr window | Where mechanical/seating risk is high (dense liquid racks) |
Durations are 2026 practitioner ranges (Together AI seven-phase guide; Introl validation frameworks; ClusterMAX 2.0). Revenue-cost column assumes neocloud-class GPU economics; absolute figures scale with cluster size and contract.

The decision turns on marginal cull per deferred revenue-day, not on "longer is better." The 72-hour soak captures the steep part of the infant-mortality curve and is the floor below which an acceptance claim is not credible. Extending to 168 hours captures the long tail and exposes the fleet to a full operational week of diurnal thermal cycling, which is exactly the regime that shakes out seating and contact defects in dense liquid-cooled racks — but every incremental day is a day the depreciation clock runs on an idle asset (Chapter 1.8's 2–3 year accelerated economic life makes this drag real). The defensible answer is workload-conditioned: a checkpoint-tolerant batch-inference fleet can accept a 72-hour cull and absorb the residual failures cheaply; a node destined for a 50,000-GPU synchronous pre-training cluster, where one infant-mortality failure restarts the whole job, earns the full 168-hour soak many times over. This is the same goodput-versus-availability logic that Chapter 12.2 applies to redundancy, pushed down to the node.
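The "marginal cull per deferred revenue-day" framing can be made concrete with a toy model. The sketch below assumes, purely for illustration, that latent infant-mortality defects surface with an exponential decay of time constant `tau_hours`; that model, the time constant, and the defect count are assumptions, not measured values from any fleet.

```python
import math


def fraction_caught(soak_hours: float, tau_hours: float) -> float:
    """Share of latent infant-mortality defects surfaced by soak_hours,
    under an ASSUMED exponential decay with time constant tau_hours."""
    return 1.0 - math.exp(-soak_hours / tau_hours)


def marginal_cull_per_day(short_h: float, long_h: float,
                          latent_defects: float, tau_hours: float) -> float:
    """Extra defects caught per extra deferred day when extending the soak."""
    extra_defects = latent_defects * (
        fraction_caught(long_h, tau_hours) - fraction_caught(short_h, tau_hours)
    )
    extra_days = (long_h - short_h) / 24.0
    return extra_defects / extra_days


if __name__ == "__main__":
    # Hypothetical batch: 100 latent defects, 48-hour time constant.
    print(f"caught by 72 h:  {fraction_caught(72, 48):.1%}")
    print(f"caught by 168 h: {fraction_caught(168, 48):.1%}")
    print(f"extra culls per extra day: {marginal_cull_per_day(72, 168, 100, 48):.2f}")
```

The output is only as good as the assumed curve. The point of the sketch is the comparison: weigh the extra culls per day against the cost of one failure in the target workload, which is why the same number justifies 168 hours for synchronous pre-training and not for checkpoint-tolerant inference.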

HBM and ECC validation

High-bandwidth memory is the single most failure-prone subsystem on a modern accelerator, and the Llama 3 405B data makes the case quantitatively: across 419 unplanned interruptions in 54 days on 16,384 H100s, faulty GPUs accounted for ~30% and HBM3 specifically for another ~17%. Together that is roughly 47% of all unplanned interruptions and more than half of the hardware-caused ones, which makes the GPU/HBM complex the largest single failure category. HBM validation is therefore not a sub-step of GPU testing; it is a first-class acceptance gate. The DCGM memtest plugin (-r 4) walks the HBM with memtest86-style patterns to surface hard stuck bits and addressing faults. But the more important signal during soak is the ECC error-rate trajectory.

The fork here is correctable-error tolerance. ECC silently corrects single-bit errors, so a node accumulating a high but correctable error rate looks healthy to a smoke test while it is, in fact, a die degrading toward an uncorrectable failure. The disciplined acceptance criterion goes past "zero uncorrectable errors" (necessary but trivial) to a ceiling on correctable-error rate and on row-remapping events over the soak window, because a rising correctable rate predicts the uncorrectable failure that will later halt a job. Modern GPUs expose row-remapping (sparing out bad memory rows): a node that exhausts or rapidly consumes its spare rows during burn-in is signalling marginal HBM and belongs in the RMA queue, not the production pool. Accept it and you have scheduled a future uncorrectable double-bit error into a synchronous run, where it presents as an XID, a crashed rank, and a checkpoint restart.
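A trajectory-based criterion of this kind might be sketched as follows. The thresholds (`rate_ceiling`, `trend_ceiling`), the sample format, and the `ecc_disposition` helper are hypothetical; real ceilings are operator- and generation-specific, and the counters would come from the GPU's ECC and row-remapping telemetry rather than hand-entered lists.

```python
from typing import List, Tuple


def slope(points: List[Tuple[float, float]]) -> float:
    """Least-squares slope of y over x (0.0 if all x are identical)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den if den else 0.0


def ecc_disposition(uncorrectable: int,
                    correctable_rate: List[Tuple[float, float]],
                    spare_rows_left: int,
                    rate_ceiling: float = 5.0,
                    trend_ceiling: float = 0.05) -> str:
    """Classify a node from its soak-window ECC record.

    correctable_rate: non-empty list of (soak_hour, correctable errors per hour).
    Ceilings are ASSUMED placeholders, not vendor limits.
    """
    if uncorrectable > 0 or spare_rows_left == 0:
        return "rma"                      # any uncorrectable, or spares exhausted
    if slope(correctable_rate) > trend_ceiling:
        return "rma"                      # rising correctable trend
    if max(rate for _, rate in correctable_rate) > rate_ceiling:
        return "extend_soak"              # isolated spike: watch the trajectory
    return "pass"


if __name__ == "__main__":
    rising = [(0, 1.0), (24, 3.0), (48, 5.0), (72, 7.0)]
    print(ecc_disposition(0, rising, spare_rows_left=10))
```

Note the ordering: the trend check runs before the ceiling check, so a node whose rate is still under the ceiling but climbing steadily is rejected rather than passed.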

Thermal screening under load: separating a bad GPU from a bad cold-plate

In an air-cooled world, thermal screening was a property of the chip. In the liquid-cooled, 120-to-600 kW-per-rack world of 2026, thermal screening is a property of the chip-plus-cold-plate-plus-loop-plus-rack-position, and the entire diagnostic challenge is attribution: a node throttling under load might have a marginal GPU, a poorly-seated cold plate, an under-flowing quick-disconnect, a partially-blocked manifold, or simply a rack position at the warm end of the CDU's loop. The GB200 NVL72 envelope is unforgiving — coolant inlet 20–25 °C, ~80 L/min, with deviation throttling GPUs up to ~50% — so the thermal acceptance criterion is throttle-free operation at full sustained clocks under realistic load for the entire soak, and a single throttling node forces a structured attribution before you can RMA the GPU.

This is the canonical place where node burn-in and cooling commissioning interlock. The cooling loop and CDU were proven in Chapter 5.11 and the integrated thermal ride-through in Chapter 13.6, but those tests typically used resistive load banks that heat uniformly. A GPU under a dense GEMM workload produces a spatially and temporally spiky heat flux that a resistive bank cannot reproduce, which is exactly why a node can pass cooling IST and still throttle under real compute. Burn-in is the first time the cooling system meets a true thermal load. The decision that falls out: when a node throttles, you re-seat the GPU and the cold plate as a unit and re-soak before concluding the silicon was bad, because the cheaper and more common defect is mechanical contact, not a dead die. Mislabel a cold-plate problem as a GPU defect and you ship a good GPU back to the vendor and re-seat the same bad plate under its replacement.
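The attribution step can start with a very coarse heuristic: if throttling clusters in one rack, suspect the loop or rack position first; if it is isolated, suspect the node's own mechanical contact. The sketch below is a hypothetical first-pass triage, not a diagnostic procedure; the `rack_fraction` threshold and the verdict labels are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def attribute_throttling(throttled: List[Tuple[str, str]],
                         nodes_per_rack: Dict[str, int],
                         rack_fraction: float = 0.5) -> Dict[str, str]:
    """Map each throttling node to the first thing to investigate.

    throttled: (rack_id, node_id) pairs observed throttling during the soak.
    rack_fraction: ASSUMED share of a rack that must throttle before the
    loop, CDU, manifold, or rack position is suspected ahead of the node.
    """
    per_rack = Counter(rack for rack, _ in throttled)
    verdict = {}
    for rack, node in throttled:
        if per_rack[rack] / nodes_per_rack[rack] >= rack_fraction:
            verdict[node] = "loop_or_rack_position"
        else:
            verdict[node] = "gpu_and_cold_plate_as_unit"
    return verdict


if __name__ == "__main__":
    observed = [("rack-1", "n1"), ("rack-2", "n5"), ("rack-2", "n6"), ("rack-2", "n7")]
    print(attribute_throttling(observed, {"rack-1": 18, "rack-2": 4}))
```

Only after the suggested remediation and a re-soak does persistent throttling point at the silicon itself.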

SDC hunting: the failures that do not announce themselves

Hard failures are merciful: they crash the rank, throw an XID, and force a restart you can see. Silent data corruption is the adversary: a GPU computes the wrong answer and returns it without any error signal, poisoning gradients or inference outputs while every dashboard stays green. At hyperscale this has moved from anomaly to expectation. Meta's fleet analysis finds on the order of one machine per thousand affected by SDC and reports that a large-scale training run should expect an SDC event every one to two weeks; Google has reported a similar cadence, roughly one event every week or two, during Gemini training. That one-in-a-thousand rate, attributed to rising silicon density, sits far above the cosmic-ray soft-error floor that older reliability models assumed.

The fork is how much commissioning compute you spend hunting silent corruption versus deferring it to the day-2 fleet-scanner regime, and there is no free answer because no burn-in catches every marginal die. Commissioning-time SDC hunting runs deterministic, self-checking workloads — known-answer GEMMs, redundant computation with bit-exact comparison, NCCL collectives whose results are verified against a golden reference — across the fleet, flagging any node whose arithmetic diverges. This is expensive (it is compute spent producing no model) and incomplete (SDC is often data-pattern- and temperature-dependent, so a node clean at commissioning can corrupt later). The disciplined posture treats commissioning SDC hunting as a coarse cull that removes the grossly-defective dies, while explicitly handing the residual, marginal, condition-dependent population to a continuous day-2 detection program — Meta's Fleetscanner / Ripple / Hardware Sentinel lineage of out-of-band and in-band scanning. The hand-off is the decision: what you do not catch at acceptance, you must commit to catching in production, or you have simply chosen to ship silent corruption. The fault taxonomy that frames hard / transient / silent failures is developed in Chapter 14.3.
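The known-answer, bit-exact comparison described above can be illustrated on the CPU. The sketch below is a stand-in, not a GPU test: it uses a small pure-Python matrix multiply, hashes the raw IEEE-754 bytes of the result, and flags any "node" whose digest differs from the golden reference. All helper names are hypothetical, and on real accelerators bit-exactness additionally assumes deterministic kernels with a fixed reduction order.

```python
import hashlib
import struct
from typing import Dict, List

Matrix = List[List[float]]


def make_matrix(n: int, seed: int) -> Matrix:
    """Deterministic pseudo-random n x n matrix from a simple LCG."""
    state, rows = seed, []
    for _ in range(n):
        row = []
        for _ in range(n):
            state = (state * 1103515245 + 12345) % (2 ** 31)
            row.append(state / 2 ** 31)
        rows.append(row)
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def digest(m: Matrix) -> str:
    """SHA-256 over the raw float bytes, so a single flipped bit changes the hash."""
    h = hashlib.sha256()
    for row in m:
        h.update(struct.pack(f"<{len(row)}d", *row))
    return h.hexdigest()


def flip_lowest_bit(x: float) -> float:
    """Simulate a silent miscompute: flip the least-significant mantissa bit."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ 1))[0]


def diverging_nodes(results: Dict[str, str], golden: str) -> List[str]:
    """Nodes whose result digest does not match the golden reference."""
    return sorted(node for node, d in results.items() if d != golden)


if __name__ == "__main__":
    a, b = make_matrix(6, seed=1), make_matrix(6, seed=2)
    golden = digest(matmul(a, b))
    corrupted = matmul(a, b)
    corrupted[2][3] = flip_lowest_bit(corrupted[2][3])  # plausible wrong number
    print(diverging_nodes({"node-a": golden, "node-b": digest(corrupted)}, golden))
```

The simulated fault changes one value in its last bit, the kind of plausible wrong number that no range or NaN check would catch, which is why the comparison has to be bit-exact.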

Deep dive: why SDC is the hardest acceptance problem, and what a credible program actually does

SDC resists burn-in for three structural reasons. First, it is silent by definition — there is no error code to trigger on, so detection requires you to already know the correct answer and compare against it, which means you can only hunt SDC inside workloads whose output you can verify. Second, it is condition-dependent: recent large-scale gate-level fault-injection studies on production-class data-center GPUs (over three million simulator-hours across dozens of CUDA micro-benchmarks) show SDC outcomes are dominated by subtle wrong-result corruptions — NaN/±INF account for only ~1% of SDC outcomes and single-bit flips for under 40% of bit-flip events — meaning the corruption usually looks like a plausible wrong number, not an obvious garbage value, and surfaces only under specific data patterns and thermal states. Third, it is rare per device but common per fleet: at one-in-a-thousand, a single node is unlikely to corrupt, but a 50,000-GPU cluster will see corruption continuously.
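The "rare per device, common per fleet" point is simple arithmetic, sketched below. It assumes the one-in-a-thousand figure applies per device and that devices are affected independently; both are simplifications made for illustration.

```python
def expected_affected(devices: int, rate: float) -> float:
    """Expected number of SDC-affected devices in a population."""
    return devices * rate


def prob_at_least_one(devices: int, rate: float) -> float:
    """Probability that at least one device is affected, ASSUMING independence."""
    return 1.0 - (1.0 - rate) ** devices


if __name__ == "__main__":
    rate = 1 / 1000
    for devices in (8, 50_000):
        print(f"{devices:>6} devices: expected {expected_affected(devices, rate):.3f}, "
              f"P(at least one) = {prob_at_least_one(devices, rate):.6f}")
```

For a single 8-GPU node the chance of containing an affected device is under one percent; for a 50,000-GPU cluster the expected count is about fifty and the probability of at least one is effectively certain.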

A credible commissioning program therefore does not pretend to eliminate SDC. It does three things: (1) runs known-answer, bit-exact-verified workloads during the soak to cull the grossly-defective dies that corrupt under common conditions; (2) baselines per-node arithmetic behavior so the day-2 program has a reference to detect drift against; and (3) contractually and operationally commits to a continuous detection regime — periodic in-band test workloads, out-of-band scanners, and gradient/loss anomaly monitoring during real training — because the marginal population that burn-in cannot surface is precisely the population that day-2 scanning exists to catch. The OCP SDC-in-AI whitepaper and Meta's reliability engineering are the reference architecture here. The acceptance question is not "is this fleet SDC-free" (unanswerable) but "have we culled the gross defects and stood up the regime that catches the rest."

Straggler detection: the failure that does not fail

A straggler is the most insidious node in a tightly-coupled cluster: it works, it computes correct answers, it throws no errors — it is simply slow. In a synchronous collective, the slowest participant sets the pace for every other GPU, so one node running 10% slow taxes the entire scale-up domain by 10% on every step, indefinitely, with no alarm. Stragglers come from the drift the inventory pass was supposed to catch (a lower power limit, a degraded NVLink running at reduced lane count, a node thermally throttling at the warm end of a loop) and from defects that only manifest under collective load. Commissioning is the cheapest possible time to find them, because in production a straggler hides inside aggregate throughput metrics and can persist for weeks.

The technique is an offline node-sweep qualification: run identical, fixed workloads (single-GPU compute, then pairwise and ring NCCL collectives) across every node and every link, and flag any node or link whose completion time deviates from the fleet median beyond a tight threshold. The fork is the threshold itself — set it too loose and you ship slow nodes that quietly erode goodput; too tight and you RMA nodes for noise and stall the ramp. The published research lineage (Guard and related straggler-detection work) frames this as the same problem at two timescales: an offline sweep at commissioning to qualify the fleet, and online straggler detection as a day-2 SLO-burn signal. The acceptance criterion is a per-node performance band — typically expressed as a maximum allowed deviation from the fleet median on a reference collective — and a node outside the band is remediated (re-seat, re-flash, re-cable) and re-tested, or rejected. The NCCL bandwidth acceptance gates that this feeds into are developed at cluster scale in Chapter 13.9; the fabric link-health baseline it builds on came from Chapter 13.7.
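A node-sweep band check of this kind reduces to a comparison against the fleet median. The sketch below is illustrative only: the 3% default for `max_deviation` is a placeholder, not a recommended threshold, and the timings would come from a fixed reference workload rather than a literal dictionary.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(completion_s: Dict[str, float],
                    max_deviation: float = 0.03) -> List[str]:
    """Return nodes slower than the fleet median by more than max_deviation.

    completion_s: node id -> completion time (seconds) on the same fixed workload.
    The check is one-sided: only slow nodes are flagged, since a fast node
    does not hold back a synchronous collective.
    """
    fleet_median = median(completion_s.values())
    return sorted(
        node for node, t in completion_s.items()
        if (t - fleet_median) / fleet_median > max_deviation
    )


if __name__ == "__main__":
    sweep = {"n1": 10.0, "n2": 10.1, "n3": 9.9, "n4": 11.0, "n5": 10.0}
    print(find_stragglers(sweep))  # n4 is 10% over the median
```

Using the median rather than the mean keeps a handful of slow nodes from shifting the reference they are being judged against.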

Power-behavior validation: the load-step the grid will actually see

The last acceptance dimension is the one that reaches back into the power chain, and it is uniquely an AI-cluster problem. GPU clusters do not draw smooth power — a synchronous training job produces violent synchronized load swings as thousands of GPUs simultaneously transition between compute (full draw) and communication (collective idle) phases, every few hundred milliseconds. At node scale, burn-in validates that an individual node's power draw, transient behavior, and power-capping response match the qualified envelope: that it holds its rated power limit under sustained load, that it responds correctly to power-capping commands, and that its draw does not exceed the budget the rack PDU and busbar were sized for. A node that over-draws or fails to honor a power cap is a node that can trip a breaker or push a rack past its provisioned budget — a real risk in a power-bound facility where the rack is sized close to its limit.
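The node-level checks named above (holding the rated limit and honoring a power cap) can be expressed as a pass over a power trace. The sketch below is hypothetical throughout: the sample format, the `settle_s` window allowed for the cap to take effect, and the 2% tolerance are assumptions chosen for illustration.

```python
from typing import List, Tuple


def power_violations(samples: List[Tuple[float, float]],
                     limit_w: float,
                     cap_w: float,
                     cap_issued_s: float,
                     settle_s: float = 1.0,
                     tolerance: float = 0.02) -> List[str]:
    """Check a power trace against the rated limit and a power-cap command.

    samples: (timestamp_s, watts). Before the cap is issued, draw must stay
    within the rated limit; once the ASSUMED settle window has elapsed, draw
    must stay within the cap. Samples inside the settle window are not judged.
    """
    problems = []
    for t, watts in samples:
        if t < cap_issued_s and watts > limit_w * (1 + tolerance):
            problems.append(f"t={t}s: {watts} W exceeds rated limit {limit_w} W")
        elif t >= cap_issued_s + settle_s and watts > cap_w * (1 + tolerance):
            problems.append(f"t={t}s: {watts} W ignores cap {cap_w} W")
    return problems


if __name__ == "__main__":
    trace = [(0.0, 990.0), (1.0, 1005.0), (2.0, 1000.0), (3.5, 700.0), (4.0, 820.0)]
    for line in power_violations(trace, limit_w=1000.0, cap_w=700.0, cap_issued_s=2.0):
        print(line)
```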

The decision that connects node burn-in to facility commissioning: node-level power validation establishes the per-node transient signature, but the aggregate synchronized-swing behavior only appears at cluster scale, when many nodes phase-lock. This is the dynamic-load-realism gap that Chapter 13.6 treats as canonical — resistive load banks during IST cannot reproduce the reactive, fast-transient, phase-correlated swing of a real collective, so node burn-in (real GPUs, real workloads, real transients) is the first hardware that produces the genuine electrical signature. It is also why the staged load ramp of Chapter 13.10 and the proxy training run of Chapter 13.9 are the only true validations of how the BBU/BESS/GPU-capacitance mitigation stack behaves under the load dynamics the cluster will actually impose. Accept a node whose power behavior is out of envelope and you have seeded a future breaker trip or a brownout-induced throttle into the production fleet.

Node acceptance criteria → accept / remediate / RMA decision boundary
| Dimension | Signal measured | Pass | Remediate & re-soak | RMA / reject |
|---|---|---|---|---|
| Inventory / baseline | Firmware, VBIOS, driver, link width, power limit vs golden baseline | Exact match to baseline | Re-flash / re-seat to baseline, re-verify | Hardware mismatch that cannot be brought to baseline |
| Diagnostics | DCGM -r 4 (compute, memtest, EUD) | Clean pass at start and end of soak | Single transient fault: re-run, investigate | Reproducible diagnostic failure |
| HBM / ECC | Uncorrectable errors; correctable-error rate; row-remap events | Zero uncorrectable; correctable rate under ceiling; stable spare rows | Isolated correctable spike: extend soak, watch trajectory | Any uncorrectable; rising correctable trend; spare-row exhaustion |
| Thermal | Sustained clocks under realistic load over full soak | Throttle-free at rated clocks for the window | Re-seat GPU+cold-plate as a unit, re-soak | Throttling persists after mechanical remediation |
| SDC | Known-answer / bit-exact verified workloads | No arithmetic divergence in soak | Single anomaly: re-test under varied patterns/temps | Reproducible silent miscompute |
| Straggler / power | Node-sweep deviation from fleet median; power-cap response | Within performance band; honors power envelope | Re-cable / re-flash / re-seat NVLink, re-test | Persistent out-of-band perf or power behavior |
Representative acceptance thresholds synthesized from DCGM diagnostics docs, Together AI's seven-phase guide, ClusterMAX 2.0, and operator practice. Exact numeric thresholds are operator- and generation-specific; treat as the structure of the decision, not fixed limits.
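The table's structure implies a worst-of rule: a node is accepted only if every dimension passes, and a single RMA verdict outranks any number of passes. A minimal sketch of that roll-up follows; the dimension keys and the `node_disposition` helper are hypothetical names chosen to mirror the rows above.

```python
from enum import IntEnum
from typing import Dict


class Disposition(IntEnum):
    PASS = 0
    REMEDIATE = 1   # remediate and re-soak
    RMA = 2         # reject / return to vendor


# Hypothetical keys, one per row of the acceptance table.
REQUIRED_DIMENSIONS = frozenset(
    {"inventory", "diagnostics", "hbm_ecc", "thermal", "sdc", "straggler_power"}
)


def node_disposition(per_dimension: Dict[str, Disposition]) -> Disposition:
    """Roll per-dimension verdicts up to one node verdict (worst one wins).

    A missing dimension is an error rather than an implicit pass, so a node
    cannot be accepted on an incomplete record.
    """
    missing = REQUIRED_DIMENSIONS - per_dimension.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return max(per_dimension.values())


if __name__ == "__main__":
    verdicts = {name: Disposition.PASS for name in REQUIRED_DIMENSIONS}
    verdicts["thermal"] = Disposition.REMEDIATE
    print(node_disposition(verdicts).name)
```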
  • 72–168 hr: GPU node burn-in / soak window (3-day minimum to 7-day strict acceptance). Source: Together AI seven-phase guide; Introl validation frameworks; ClusterMAX 2.0 (2025).
  • 3–4 weeks: bring-up burn-in period before a new cluster's failure rate decays toward the mature baseline. Source: SemiAnalysis, 100k H100 clusters (2025).
  • ~7 days / 512 GPUs: mature best-in-class H100 MTBF; freshly-racked clusters fail far more often. Source: SemiAnalysis, 100k H100 clusters (2025).
  • 419 in 54 days: unplanned interruptions on 16,384 H100s during Llama 3 405B, about one every 3 hr. Source: Meta (Llama 3 paper) / Tom's Hardware (2024).
  • ~30% + ~17%: Llama 3 interruptions attributed to faulty GPU and to HBM3, together more than half of hardware faults. Source: Meta (Llama 3 paper) / DataCenterDynamics (2024).
  • ~1 in 1,000: machines affected by silent data corruption (SDC) at fleet scale. Source: Meta Engineering, "How Meta keeps its AI hardware reliable" (2025).
  • Every 1–2 weeks: expected SDC events during a large-scale training run (Meta; Google reports similar for Gemini). Source: Meta Engineering; IEEE / arXiv SDC studies (2025–2026).
  • ~1.5 hr: DCGM -r 4 (deep, incl. memtest + EUD) runtime per node, GPU-count dependent. Source: NVIDIA DCGM Diagnostics documentation (2026).

The acceptance package: what burn-in must produce

Burn-in is done when it produces a signed acceptance artifact that the next phase and the eventual operator inherit, not when the soak timer expires. The package is the bridge from "the nodes survived a week" to "the fleet is qualified to schedule," and it is what distinguishes acceptance from a soak that was merely run. It records: the inventory reconciliation against baseline (every node at qualified firmware/driver/link state); the DCGM results at soak entry and exit; the HBM/ECC trajectory with the correctable-error ceiling and any remapping events; the thermal record proving throttle-free operation under load; the SDC-hunt results and the explicit hand-off to the day-2 detection regime; the straggler node-sweep with the per-node performance band; the power-behavior envelope; and the disposition of every node that failed — what it failed, what was remediated, what was RMA'd, and the re-soak result for anything repaired.
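The contents of the package listed above lend themselves to a completeness check: a record that cannot be signed while any section is absent. The sketch below is one hypothetical shape for that record; the field names paraphrase the list, and the string values stand in for whatever artifact references an operator actually stores.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class AcceptanceRecord:
    """Hypothetical per-node acceptance artifact; None means 'not yet recorded'."""
    node_id: str
    inventory_reconciliation: Optional[str] = None
    dcgm_soak_entry: Optional[str] = None
    dcgm_soak_exit: Optional[str] = None
    hbm_ecc_trajectory: Optional[str] = None
    thermal_record: Optional[str] = None
    sdc_hunt_and_handoff: Optional[str] = None
    straggler_sweep: Optional[str] = None
    power_envelope: Optional[str] = None
    # May legitimately stay empty if the node never failed anything.
    failure_dispositions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of required sections that have not been recorded."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_signable(self) -> bool:
        return not self.missing_sections()


if __name__ == "__main__":
    record = AcceptanceRecord(node_id="node-001", dcgm_soak_entry="results/entry.json")
    print(record.is_signable(), record.missing_sections())
```

The check enforces the distinction the text draws: an expired soak timer with a missing section is a soak that was merely run, not an acceptance.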

This artifact is the per-node analogue of the commissioning documentation discipline established in Chapter 13.2, and it feeds directly into two downstream gates: the cluster-scale benchmarking and reference training run of Chapter 13.9, which assumes a fleet of qualified nodes and will expose any straggler or SDC the node-level cull missed; and the day-2 reliability program seeded at handover in Chapter 13.10, which inherits the burn-in baselines as the reference against which fleet drift is measured. A node that was never burned-in does not have a baseline — which means the day-2 program has nothing to detect degradation against, and the first time anyone learns the node is marginal is when it corrupts a training run.

This chapter qualifies the individual node; the fleet it produces is taken to scale in Chapter 13.9 (NCCL collective gates, reference/proxy training, goodput accounting). It inherits the management/firmware fabric from Chapter 8.7, the cooling-loop proof from Chapter 5.11, the fabric link-health baseline from Chapter 13.7, and the integrated-system and dynamic-load context from Chapter 13.6. Its documentation discipline follows Chapter 13.2; its output feeds the staged ramp and day-2 hand-off of Chapter 13.10. The goodput-versus-availability logic behind the soak-duration fork is developed in Chapter 12.2; the hard/transient/silent failure taxonomy and fleet reliability data live in Chapter 14.3; the operational KPIs that the burn-in baselines seed are in Chapter 14.1.