Chapter 13.8

GPU Node Burn-In, Diagnostics & Stress Validation

A GPU node that boots and passes a smoke test is not a commissioned node — burn-in is the deliberate, time-bounded campaign that converts a hall full of accelerators into a fleet whose failures have already happened on your clock instead of mid-training-run on the customer's.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

How long you soak — the 72-hour minimum that catches infant mortality versus the 168-hour (7-day) campaign that the strictest acceptance standards now demand — and what that schedule costs you in deferred revenue against a depreciation clock that is already running.
Which diagnostic depth you run at each gate: light DCGM health checks (seconds) for triage versus the full long-stress / memtest / EUD passes (1.5+ hours per node) that actually exercise the silicon, and how you parallelize them across thousands of nodes without serializing the ramp.
Where you draw the SDC-hunting line — what fraction of compute you spend on silent-corruption detection at commissioning versus pushing it into the day-2 fleet-scanner regime, given that no burn-in catches every marginal die.
Your throttle-free thermal acceptance criterion: whether a node must hold full clocks for the entire soak under a realistic thermal load, or whether you accept documented throttle margin — and how you separate a bad GPU from a bad cold-plate, a bad CDU, or a bad rack position.
The accept / RMA / quarantine decision boundary itself: the numeric thresholds (ECC error counts, XID codes, NVLink replay rates, straggler delta) that move a node from the production pool to the vendor return queue before it is ever scheduled.

Every prior chapter in Part 13 commissioned the building — the power chain energized, the cooling loop proven, the fabric link-tested, the integrated system demonstrated to ride through faults. This chapter commissions the thing the building exists to run: the GPU nodes themselves. It is the point in the program where the facility hands off to the machine, and it is the gate most often skipped under schedule pressure — because the nodes boot, the smoke test passes, and the temptation to declare victory and start scheduling is enormous. That temptation is the most expensive mistake in the bring-up. A GPU fleet has a bathtub-curve failure distribution: a population of marginal dies, cold solder joints, mis-seated HBM stacks, and borderline cold-plate contacts that will fail early and predictably. Burn-in is the deliberate forcing of that infant-mortality population to fail on your acceptance clock, in a controlled window, before any of it touches a synchronous training job whose entire run restarts from checkpoint when a single GPU dies.

The consequences here are unusually legible because they are denominated in goodput. Skip the soak and the infant-mortality failures simply move downstream into the first real workload, where a single bad GPU at hour 40 of a synchronous run means a checkpoint restart across the whole job, a straggler that drags every collective down to the pace of its slowest participant, or worse, a silent corruption that poisons gradients for days before anyone notices. This chapter covers node bring-up and inventory, the DCGM diagnostic ladder, the 72-to-168-hour soak, HBM/ECC validation, thermal screening under load, silent-data-corruption (SDC) hunting, straggler detection, and power-behavior validation, with the downstream cost attached to each acceptance decision. It qualifies the individual node; Chapter 13.9 takes the qualified fleet to cluster scale.

Bring-up and inventory: trust nothing the BOM claims

The first act of burn-in is not a stress test — it is an inventory reconciliation against ground truth. At the scale of a modern AI hall, a non-trivial fraction of nodes arrive with something quietly wrong: a GPU running an older VBIOS than its siblings, an HBM stack that enumerates at the wrong capacity, a PCIe link that trained at x8 instead of x16, an NVLink that came up degraded, a NIC seated in the wrong slot, a firmware revision that does not match the qualified baseline. None of these prevent the node from booting. All of them silently destroy performance or reliability once the node is scheduled into a tightly-coupled job. The inventory pass enumerates every accelerator, every memory stack, every link, every firmware and driver version, and reconciles it against the qualified design baseline — not the purchase order, the baseline.

This is where the fork between homogeneity-as-acceptance-criterion and tolerate-drift is decided, and it is a real fork with a real cost. A synchronous training job moves at the speed of its slowest node; mixed firmware, mixed clock states, or one node with a degraded NVLink turns the whole scale-up domain into a straggler. The disciplined posture treats any deviation from the golden baseline — firmware, VBIOS, driver, link width, power limit, even GPU clock-offset — as a defect to be remediated before acceptance, not a curiosity to be noted. The cost of the alternative is paid every step of every collective for the life of the cluster. Out-of-band firmware management (Redfish / PLDM-over-MCTP, per the OCP GPU firmware spec) is what makes baseline enforcement tractable at fleet scale; the management fabric that carries it was commissioned in Chapter 8.7.
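The reconciliation described above reduces to a field-by-field diff of each node against the golden baseline. The following is a minimal sketch under stated assumptions: the field names and version strings are hypothetical placeholders, and a real pass would populate `observed` from the out-of-band management interface rather than a literal dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical golden baseline: field names and values are illustrative only.
GOLDEN_BASELINE: Dict[str, str] = {
    "vbios": "96.00.AA.00.01",
    "driver": "570.00",
    "pcie_width": "x16",
    "nvlink_state": "full",
    "power_limit_w": "700",
}


def reconcile(observed: Dict[str, str],
              baseline: Dict[str, str] = GOLDEN_BASELINE) -> List[Tuple[str, str, str]]:
    """Return (field, expected, observed) for every field that deviates.

    A field missing from the observed inventory counts as a deviation,
    because an unverified field is not a verified match.
    """
    deviations = []
    for name, expected in sorted(baseline.items()):
        actual = observed.get(name, "<missing>")
        if actual != expected:
            deviations.append((name, expected, actual))
    return deviations


# A node that boots normally but whose PCIe link trained at half width.
node = dict(GOLDEN_BASELINE, pcie_width="x8")
print(reconcile(node))
```

Under the disciplined posture, any non-empty result is a defect to remediate before the node enters the soak, not a note for the log.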

The DCGM diagnostic ladder

NVIDIA's Data Center GPU Manager (DCGM) provides the de facto diagnostic ladder for the Blackwell/Hopper-class fleet, and its four run-levels are the vocabulary of GPU acceptance. The fork is depth-versus-throughput: you cannot afford to run the deepest level on every node continuously, and you cannot afford to not run it before acceptance. The right program runs each level at the right gate.

-r 1 (seconds): deployment / software sanity — does the GPU enumerate, is the driver healthy, are NVML/persistence-mode and the basic plumbing correct. This is the triage check you run constantly, including as a day-2 health gate before scheduling a job onto a node.
-r 2 (~2 min): adds quick integration and PCIe/NVLink bandwidth sanity. Still fast enough to gate a node before each job in production.
-r 3 (several min): hardware stress — targeted compute, memory bandwidth, and power-stress plugins that actually load the silicon. This is the lightest level that meaningfully exercises the GPU, and the minimum that belongs in an acceptance run.
-r 4 (~1.5 hr, GPU-count dependent): the deepest pass — adds the memtest plugin (memtest86-style HBM pattern testing) and, from DCGM 3.1+, the End-User Diagnostic (EUD), a vendor-supplied post-mortem-grade test. This is the acceptance-grade diagnostic: it is what you run at the start and end of the soak window.

The DCGM ladder is diagnostic, not durational — it answers "is this GPU defective right now," not "will this GPU survive a week under load." The latter question is what the soak exists to answer, and the two tools are complementary: long-running synthetic stress (GPU-Burn, dense GEMM loops, NCCL collective floods) holds the silicon at thermal and electrical full-load for hours-to-days while DCGM is sampled at the boundaries and on every fault.
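One way to keep "each level at the right gate" from drifting is to encode the mapping once and generate the invocation from it. This is a sketch, not a prescribed tool: the gate names are hypothetical, and the only DCGM detail assumed is the `dcgmi diag -r <level>` form already used in this chapter. The function builds the argument list without executing it.

```python
from typing import Dict, List

# Hypothetical gate names mapped to the run-levels described above.
GATE_RUN_LEVEL: Dict[str, int] = {
    "pre_job_health": 1,      # seconds: software sanity before scheduling
    "pre_job_bandwidth": 2,   # ~2 min: adds PCIe/NVLink bandwidth sanity
    "acceptance_minimum": 3,  # hardware stress plugins
    "soak_entry": 4,          # deepest pass at the start of the soak
    "soak_exit": 4,           # deepest pass at the end of the soak
}


def diag_command(gate: str) -> List[str]:
    """Build the dcgmi argument list for a gate (nothing is executed here)."""
    if gate not in GATE_RUN_LEVEL:
        raise KeyError(f"unknown gate: {gate}")
    return ["dcgmi", "diag", "-r", str(GATE_RUN_LEVEL[gate])]


print(diag_command("soak_entry"))
```

On a host with DCGM installed, the returned list could be handed to `subprocess.run`; collecting and parsing the results is deliberately left out of this sketch.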

Why a diagnostic pass is necessary but not sufficient

A node can pass DCGM -r 4 cleanly and still be a defect you must reject. The deep diagnostic runs for ninety minutes; a marginal HBM stack, a cold-plate with poor contact, or a die that browns out only at sustained full power and elevated junction temperature will pass a ninety-minute test and fail at hour 30 of a real run. This is the entire reason diagnostics and soak are different acceptance gates. The diagnostic catches the already-broken; the soak catches the about-to-break. Treating a clean DCGM pass as acceptance is the single most common way burn-in theater substitutes for burn-in — you get the checkbox without the infant-mortality cull, and the failures you paid the schedule to surface arrive later, in production, denominated in lost goodput instead of deferred revenue.

The soak: 72 hours, 168 hours, and the schedule cost of each

The central fork of this chapter is the soak duration, and it is a genuine economic decision, not a best-practice platitude. The candidate windows cluster around three values, and each buys a different slice of the bathtub curve at a different cost in deferred revenue.

Burn-in soak duration — the acceptance fork

Soak window	What it catches	What it misses	Schedule / revenue cost	Typical use
Smoke test only (minutes–hours)	Dead-on-arrival, mis-cabled, won't-enumerate, gross firmware drift	Nearly all thermal, time-, and load-dependent infant mortality	Negligible	Triage gate; never an acceptance gate on its own
72 hours (3-day minimum)	The bulk of infant-mortality: marginal dies, bad HBM stacks, weak solder, cold-plate contact failures	Slow-onset thermal drift; rare marginal cases on the long tail	~3 days of deferred utilization per node-batch	Industry minimum for a credible acceptance
168 hours (7-day full soak)	Adds the long-tail marginal population and diurnal/thermal-cycling-sensitive faults; spans a full operational week	Wear-out failures (those are day-2, not infant mortality)	~7 days deferred; the largest revenue drag in the program	Strict acceptance; hyperscaler / top-tier neocloud standard
Thermal-cycled soak (power-cycle loops within the window)	Solder-joint fatigue, connector seating, quick-disconnect integrity under expansion/contraction	Same long-tail wear-out limits	Adds complexity; overlaps the 72–168 hr window	Where mechanical/seating risk is high (dense liquid racks)

Durations are 2026 practitioner ranges (Together AI seven-phase guide; Introl validation frameworks; ClusterMAX 2.0). Revenue-cost column assumes neocloud-class GPU economics; absolute figures scale with cluster size and contract.

The decision turns on marginal cull per deferred revenue-day, not on "longer is better." The 72-hour soak captures the steep part of the infant-mortality curve and is the floor below which an acceptance claim is not credible. Extending to 168 hours captures the long tail and exposes the fleet to a full operational week of diurnal thermal cycling, which is exactly the regime that shakes out seating and contact defects in dense liquid-cooled racks — but every incremental day is a day the depreciation clock runs on an idle asset (Chapter 1.8's 2–3 year accelerated economic life makes this drag real). The defensible answer is workload-conditioned: a checkpoint-tolerant batch-inference fleet can accept a 72-hour cull and absorb the residual failures cheaply; a node destined for a 50,000-GPU synchronous pre-training cluster, where one infant-mortality failure restarts the whole job, earns the full 168-hour soak many times over. This is the same goodput-versus-availability logic that Chapter 12.2 applies to redundancy, pushed down to the node.
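The "marginal cull per deferred revenue-day" framing can be made concrete with a toy model. The sketch below assumes infant mortality follows a Weibull distribution with shape below 1 (a decreasing hazard, the early side of the bathtub curve). The scale and shape values are hypothetical and chosen only to illustrate the diminishing return; they are not fitted to any fleet.

```python
import math


def weibull_cdf(hours: float, scale_h: float, shape: float) -> float:
    """Fraction of the infant-mortality population that has failed by `hours`."""
    return 1.0 - math.exp(-((hours / scale_h) ** shape))


# Hypothetical parameters: shape < 1 gives a hazard rate that falls over time.
SCALE_H = 60.0
SHAPE = 0.6


def cull_per_day(start_h: float, end_h: float) -> float:
    """Additional fraction culled per deferred day between two soak lengths."""
    gained = weibull_cdf(end_h, SCALE_H, SHAPE) - weibull_cdf(start_h, SCALE_H, SHAPE)
    return gained / ((end_h - start_h) / 24.0)


first_72 = cull_per_day(0.0, 72.0)     # cull rate over the 72-hour minimum
next_96 = cull_per_day(72.0, 168.0)    # cull rate over the extension to 168 hours
print(f"0-72 h: {first_72:.3f} per day; 72-168 h: {next_96:.3f} per day")
```

With any decreasing-hazard parameters the second figure is smaller than the first. Whether the smaller figure is still worth a deferred day depends on what a single escaped failure costs the destination workload, which is the workload-conditioned judgment described above.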

New clusters fail far more than mature ones — and that is the point

A mature, best-in-class H100 cluster runs at roughly one failure per 512 GPUs every seven days. A freshly racked cluster fails far more often — the bring-up burn-in period commonly runs three to four weeks before the failure rate decays toward the mature baseline (SemiAnalysis, 100k-H100 analysis, 2025). The naive read is "the new cluster is broken." The correct read is "the burn-in is working": you are watching the infant-mortality population fail on schedule, in the window you built for it, instead of in production. A burn-in that surfaces no failures on a new fleet of thousands of nodes points to a soak that was too short, too gentle, or never actually loaded the silicon. The absence of failures during burn-in is a smell, not a success.
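The mature-cluster figure gives a baseline against which a burn-in's observed failure count can be read. A small sketch of that arithmetic follows; the function names are hypothetical, and the only inputs taken from the text are the one-failure-per-512-GPUs-per-seven-days rate.

```python
def expected_failures(gpus: int, days: float,
                      gpus_per_failure: int = 512,
                      days_per_failure: float = 7.0) -> float:
    """Expected failures at the mature rate of one per 512 GPUs per 7 days."""
    return (gpus / gpus_per_failure) * (days / days_per_failure)


def burn_in_ratio(observed: int, gpus: int, days: float) -> float:
    """Observed failures as a multiple of the mature-cluster expectation."""
    return observed / expected_failures(gpus, days)


# A 16,384-GPU fleet at the mature rate: 32 groups of 512, one failure each per week.
print(expected_failures(16_384, 7))
```

A ratio well above 1 during the first weeks is the expected shape of a working burn-in. A ratio near or below 1 on a freshly racked fleet is the smell described above.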

HBM and ECC validation

High-bandwidth memory is among the most failure-prone subsystems on a modern accelerator, and the Llama 3 405B data makes the case quantitatively: across 419 unplanned interruptions in 54 days on 16,384 H100s, faulty GPUs accounted for ~30% and HBM3 specifically for another ~17%. Together that is roughly 47% of all interruptions and more than half of the hardware-attributed ones, which makes the GPU/HBM complex the largest single category. HBM validation therefore is not a sub-step of GPU testing; it is a first-class acceptance gate. The DCGM memtest plugin (-r 4) walks the HBM with memtest86-style patterns to surface hard stuck bits and addressing faults. But the more important signal during soak is the ECC error-rate trajectory.

The fork here is correctable-error tolerance. ECC silently corrects single-bit errors, so a node accumulating a high but correctable error rate looks healthy to a smoke test while it is, in fact, a die degrading toward an uncorrectable failure. The disciplined acceptance criterion goes past "zero uncorrectable errors" (necessary but trivial) to a ceiling on correctable-error rate and on row-remapping events over the soak window, because a rising correctable rate predicts the uncorrectable failure that will later halt a job. Modern GPUs expose row-remapping (sparing out bad memory rows): a node that exhausts or rapidly consumes its spare rows during burn-in is signalling marginal HBM and belongs in the RMA queue, not the production pool. Accept it and you have scheduled a future uncorrectable double-bit error into a synchronous run, where it presents as an XID, a crashed rank, and a checkpoint restart.
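The trajectory-based criterion can be sketched as a small classifier over per-hour correctable-error counts. Everything numeric here is a hypothetical placeholder: the ceiling, the row-remap allowance, and the "second half more than double the first half" test for a rising trend are illustrative choices, not vendor or operator limits.

```python
from typing import Sequence

# Hypothetical thresholds; real ceilings are operator- and generation-specific.
CORRECTABLE_PER_HOUR_CEILING = 1.0
MAX_ROW_REMAPS = 2


def ecc_disposition(correctable_by_hour: Sequence[int],
                    uncorrectable: int,
                    row_remaps: int,
                    spare_rows_exhausted: bool) -> str:
    """Classify HBM health over the soak as 'accept', 'extend_soak', or 'rma'."""
    if uncorrectable > 0 or spare_rows_exhausted or row_remaps > MAX_ROW_REMAPS:
        return "rma"
    hours = len(correctable_by_hour)
    if hours < 2:
        return "extend_soak"  # too little data to judge a trajectory
    half = hours // 2
    first_rate = sum(correctable_by_hour[:half]) / half
    second_rate = sum(correctable_by_hour[half:]) / (hours - half)
    overall_rate = sum(correctable_by_hour) / hours
    # Rising trend: the later half is over the ceiling and more than double the earlier half.
    if second_rate > CORRECTABLE_PER_HOUR_CEILING and second_rate > 2 * first_rate:
        return "rma"
    # Over the ceiling without a rising trend: treat as an isolated spike and keep watching.
    if overall_rate > CORRECTABLE_PER_HOUR_CEILING:
        return "extend_soak"
    return "accept"


print(ecc_disposition([0] * 36 + [5] * 36, uncorrectable=0,
                      row_remaps=0, spare_rows_exhausted=False))
```

The point of the structure is that "zero uncorrectable errors" is only the first branch; the remaining branches are where a degrading but still-correcting stack is caught.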

Thermal screening under load: separating a bad GPU from a bad cold-plate

In an air-cooled world, thermal screening was a property of the chip. In the liquid-cooled, 120-to-600 kW-per-rack world of 2026, thermal screening is a property of the chip-plus-cold-plate-plus-loop-plus-rack-position, and the entire diagnostic challenge is attribution: a node throttling under load might have a marginal GPU, a poorly-seated cold plate, an under-flowing quick-disconnect, a partially-blocked manifold, or simply a rack position at the warm end of the CDU's loop. The GB200 NVL72 envelope is unforgiving: coolant inlet at 20–25 °C and ~80 L/min, with deviations from that envelope throttling GPUs by up to ~50%. The thermal acceptance criterion is therefore throttle-free operation at full sustained clocks under realistic load for the entire soak, and a single throttling node forces a structured attribution before you can RMA the GPU.

This is the canonical place where node burn-in and cooling commissioning interlock. The cooling loop and CDU were proven in Chapter 5.11 and the integrated thermal ride-through in Chapter 13.6, but those tests typically used resistive load banks that heat uniformly. A GPU under a dense GEMM workload produces a spatially and temporally spiky heat flux that a resistive bank cannot reproduce, which is exactly why a node can pass cooling IST and still throttle under real compute. Burn-in is the first time the cooling system meets a true thermal load. The decision that falls out: when a node throttles, you re-seat the GPU and the cold-plate as a unit and re-soak before concluding the silicon was bad, because the cheaper and more common defect is mechanical contact, not a dead die. Mislabel a cold-plate problem as a GPU defect and you ship a good GPU back to the vendor and re-seat the same bad plate under its replacement.
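The attribution order implied here (loop or rack position first, mechanical contact second, silicon last) can be written down as a short decision function. This is a sketch under assumptions: the 50% peer-share cutoff and the action labels are hypothetical, and a real program would also weigh flow, inlet temperature, and position data before choosing a branch.

```python
from typing import Iterable, Set


def attribute_throttle(node: str,
                       throttling: Set[str],
                       loop_nodes: Iterable[str],
                       reseated: bool,
                       loop_fraction: float = 0.5) -> str:
    """Suggest the next action for a node, ordered from cheapest cause to dearest."""
    if node not in throttling:
        return "no_action"
    peers = [n for n in loop_nodes if n != node]
    peer_share = sum(n in throttling for n in peers) / len(peers) if peers else 0.0
    # Many peers on the same loop throttling points at the CDU, manifold, or rack position.
    if peer_share >= loop_fraction:
        return "investigate_loop"
    # An isolated throttle is most often mechanical contact: fix that first.
    if not reseated:
        return "reseat_and_resoak"
    # Still throttling alone after mechanical remediation: now suspect the GPU.
    return "rma_gpu"


loop = ["n1", "n2", "n3", "n4"]
print(attribute_throttle("n1", {"n1"}, loop, reseated=False))
```

The ordering matters more than the cutoff: the GPU return is the last branch reached, so a good die is not shipped back while the bad plate stays in the rack.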

SDC hunting: the failures that do not announce themselves

Hard failures are merciful: they crash the rank, throw an XID, and force a restart you can see. Silent data corruption is the adversary: a GPU computes the wrong answer and returns it without any error signal, poisoning gradients or inference outputs while every dashboard stays green. At hyperscale this has moved from anomaly to expectation. Meta's fleet analysis finds on the order of one machine per thousand affected by SDC, and reports that a large-scale training run should expect an SDC event every one to two weeks; Google has reported a similar cadence, roughly one event every week or two, during Gemini training. That one-in-a-thousand rate, a product of rising silicon density, is far above the cosmic-ray soft-error floor that older reliability models assumed.

The fork is how much commissioning compute you spend hunting silent corruption versus deferring it to the day-2 fleet-scanner regime, and there is no free answer because no burn-in catches every marginal die. Commissioning-time SDC hunting runs deterministic, self-checking workloads — known-answer GEMMs, redundant computation with bit-exact comparison, NCCL collectives whose results are verified against a golden reference — across the fleet, flagging any node whose arithmetic diverges. This is expensive (it is compute spent producing no model) and incomplete (SDC is often data-pattern- and temperature-dependent, so a node clean at commissioning can corrupt later). The disciplined posture treats commissioning SDC hunting as a coarse cull that removes the grossly-defective dies, while explicitly handing the residual, marginal, condition-dependent population to a continuous day-2 detection program — Meta's Fleetscanner / Ripple / Hardware Sentinel lineage of out-of-band and in-band scanning. The hand-off is the decision: what you do not catch at acceptance, you must commit to catching in production, or you have simply chosen to ship silent corruption. The fault taxonomy that frames hard / transient / silent failures is developed in Chapter 14.3.
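The known-answer pattern described above can be illustrated without any GPU at all. In this sketch an exact integer matrix product stands in for a deterministic GEMM, each "device" is just a callable, and the marginal die is simulated by a function that returns one plausible but wrong element. All names are hypothetical. On real hardware, bit-exact comparison of floating-point results additionally assumes a fixed kernel and reduction order.

```python
import hashlib
from typing import Callable, Dict, List

Matrix = List[List[int]]


def reference_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Exact integer matrix product; stands in for a deterministic GEMM."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def digest(m: Matrix) -> str:
    """Hash of the exact result, so a single wrong element changes the digest."""
    return hashlib.sha256(repr(m).encode()).hexdigest()


def make_inputs(n: int, seed: int) -> Matrix:
    """Deterministic pseudo-random integer matrix from a simple LCG."""
    state, rows = seed, []
    for _ in range(n):
        row = []
        for _ in range(n):
            state = (state * 1103515245 + 12345) % (2 ** 31)
            row.append(state % 1000)
        rows.append(row)
    return rows


def hunt(devices: Dict[str, Callable[[Matrix, Matrix], Matrix]], n: int = 8) -> List[str]:
    """Return names of devices whose result digest differs from the golden digest."""
    a, b = make_inputs(n, seed=1), make_inputs(n, seed=2)
    golden = digest(reference_matmul(a, b))
    return [name for name, run in sorted(devices.items())
            if digest(run(a, b)) != golden]


def flaky_device(a: Matrix, b: Matrix) -> Matrix:
    """Simulated marginal die: one element is off by one, with no error raised."""
    out = reference_matmul(a, b)
    out[0][0] += 1
    return out


print(hunt({"node-a": reference_matmul, "node-b": flaky_device}))
```

The simulated fault is deliberately subtle: nothing crashes and nothing is NaN, which is why detection has to compare against a known answer and cannot wait for an error.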

Deep dive: why SDC is the hardest acceptance problem, and what a credible program actually does

SDC resists burn-in for three structural reasons. First, it is silent by definition — there is no error code to trigger on, so detection requires you to already know the correct answer and compare against it, which means you can only hunt SDC inside workloads whose output you can verify. Second, it is condition-dependent: recent large-scale gate-level fault-injection studies on production-class data-center GPUs (over three million simulator-hours across dozens of CUDA micro-benchmarks) show SDC outcomes are dominated by subtle wrong-result corruptions — NaN/±INF account for only ~1% of SDC outcomes and single-bit flips for under 40% of bit-flip events — meaning the corruption usually looks like a plausible wrong number, not an obvious garbage value, and surfaces only under specific data patterns and thermal states. Third, it is rare per device but common per fleet: at one-in-a-thousand, a single node is unlikely to corrupt, but a 50,000-GPU cluster will see corruption continuously.
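The "rare per device, common per fleet" point is simple arithmetic. The sketch below treats the one-in-a-thousand figure as a per-device probability and assumes devices are independent. Both are simplifying assumptions made for illustration, since the source figure is quoted per machine.

```python
def expected_affected(devices: int, rate: float = 1 / 1000) -> float:
    """Expected number of SDC-affected devices in a fleet."""
    return devices * rate


def prob_at_least_one(devices: int, rate: float = 1 / 1000) -> float:
    """Probability that at least one device is affected, assuming independence."""
    return 1.0 - (1.0 - rate) ** devices


for fleet in (8, 1_000, 50_000):
    print(fleet, round(expected_affected(fleet), 3), round(prob_at_least_one(fleet), 4))
```

For a single 8-GPU node the probability stays under one percent, while at 50,000 devices an affected part is effectively certain. That is why the acceptance question shifts from elimination to culling plus continuous detection.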

A credible commissioning program therefore does not pretend to eliminate SDC. It does three things: (1) runs known-answer, bit-exact-verified workloads during the soak to cull the grossly-defective dies that corrupt under common conditions; (2) baselines per-node arithmetic behavior so the day-2 program has a reference to detect drift against; and (3) contractually and operationally commits to a continuous detection regime — periodic in-band test workloads, out-of-band scanners, and gradient/loss anomaly monitoring during real training — because the marginal population that burn-in cannot surface is precisely the population that day-2 scanning exists to catch. The OCP SDC-in-AI whitepaper and Meta's reliability engineering are the reference architecture here. The acceptance question is not "is this fleet SDC-free" (unanswerable) but "have we culled the gross defects and stood up the regime that catches the rest."

Straggler detection: the failure that does not fail

A straggler is the most insidious node in a tightly-coupled cluster: it works, it computes correct answers, it throws no errors — it is simply slow. In a synchronous collective, the slowest participant sets the pace for every other GPU, so one node running 10% slow taxes the entire scale-up domain by 10% on every step, indefinitely, with no alarm. Stragglers come from the drift the inventory pass was supposed to catch (a lower power limit, a degraded NVLink running at reduced lane count, a node thermally throttling at the warm end of a loop) and from defects that only manifest under collective load. Commissioning is the cheapest possible time to find them, because in production a straggler hides inside aggregate throughput metrics and can persist for weeks.

The technique is an offline node-sweep qualification: run identical, fixed workloads (single-GPU compute, then pairwise and ring NCCL collectives) across every node and every link, and flag any node or link whose completion time deviates from the fleet median beyond a tight threshold. The fork is the threshold itself — set it too loose and you ship slow nodes that quietly erode goodput; too tight and you RMA nodes for noise and stall the ramp. The published research lineage (Guard and related straggler-detection work) frames this as the same problem at two timescales: an offline sweep at commissioning to qualify the fleet, and online straggler detection as a day-2 SLO-burn signal. The acceptance criterion is a per-node performance band — typically expressed as a maximum allowed deviation from the fleet median on a reference collective — and a node outside the band is remediated (re-seat, re-flash, re-cable) and re-tested, or rejected. The NCCL bandwidth acceptance gates that this feeds into are developed at cluster scale in Chapter 13.9; the fabric link-health baseline it builds on came from Chapter 13.7.
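The per-node performance band reduces to a comparison against the fleet median. A minimal sketch follows, assuming completion times for one fixed reference workload have already been collected per node; the 3% band and the node names are hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(times_s: Dict[str, float], band: float = 0.03) -> List[str]:
    """Nodes whose completion time exceeds the fleet median by more than `band`."""
    if not times_s:
        return []
    limit = median(times_s.values()) * (1.0 + band)
    return sorted(node for node, t in times_s.items() if t > limit)


# Completion times in seconds for the same reference collective on each node.
sweep = {"n01": 10.0, "n02": 10.1, "n03": 9.9, "n04": 11.2, "n05": 10.0}
print(find_stragglers(sweep))
```

The median is used instead of the mean so that a few slow nodes do not drag the reference point toward themselves. Tightening or loosening `band` is exactly the threshold fork discussed above.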

Power-behavior validation: the load-step the grid will actually see

The last acceptance dimension is the one that reaches back into the power chain, and it is uniquely an AI-cluster problem. GPU clusters do not draw smooth power — a synchronous training job produces violent synchronized load swings as thousands of GPUs simultaneously transition between compute (full draw) and communication (collective idle) phases, every few hundred milliseconds. At node scale, burn-in validates that an individual node's power draw, transient behavior, and power-capping response match the qualified envelope: that it holds its rated power limit under sustained load, that it responds correctly to power-capping commands, and that its draw does not exceed the budget the rack PDU and busbar were sized for. A node that over-draws or fails to honor a power cap is a node that can trip a breaker or push a rack past its provisioned budget — a real risk in a power-bound facility where the rack is sized close to its limit.
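The power-cap part of this check can be sketched as a comparison of sampled node power against the commanded cap. The tolerance, the allowed excursion count, and the verdict labels are hypothetical. Sample collection is outside the sketch, and the sampling interval must be short enough to see the transients of interest.

```python
from typing import Sequence


def power_verdict(samples_w: Sequence[float], cap_w: float,
                  tolerance: float = 0.02, max_excursions: int = 0) -> str:
    """Check sampled node power (watts) against the commanded power cap."""
    if not samples_w:
        return "no_data"  # an unmeasured node has not demonstrated anything
    limit = cap_w * (1.0 + tolerance)
    excursions = sum(1 for s in samples_w if s > limit)
    return "within_envelope" if excursions <= max_excursions else "out_of_envelope"


print(power_verdict([690.0, 700.0, 705.0], cap_w=700.0))
print(power_verdict([690.0, 730.0], cap_w=700.0))
```

This covers only whether the cap is honored. Whether the node actually sustains its rated draw under load, and how it behaves on a load step, would need additional checks on the same sample stream.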

The decision that connects node burn-in to facility commissioning: node-level power validation establishes the per-node transient signature, but the aggregate synchronized-swing behavior only appears at cluster scale, when many nodes phase-lock. This is the dynamic-load-realism gap that Chapter 13.6 treats as canonical — resistive load banks during IST cannot reproduce the reactive, fast-transient, phase-correlated swing of a real collective, so node burn-in (real GPUs, real workloads, real transients) is the first hardware that produces the genuine electrical signature. It is also why the staged load ramp of Chapter 13.10 and the proxy training run of Chapter 13.9 are the only true validations of how the BBU/BESS/GPU-capacitance mitigation stack behaves under the load dynamics the cluster will actually impose. Accept a node whose power behavior is out of envelope and you have seeded a future breaker trip or a brownout-induced throttle into the production fleet.

Node acceptance criteria → accept / remediate / RMA decision boundary

Dimension	Signal measured	Pass	Remediate & re-soak	RMA / reject
Inventory / baseline	Firmware, VBIOS, driver, link width, power limit vs golden baseline	Exact match to baseline	Re-flash / re-seat to baseline, re-verify	Hardware mismatch that cannot be brought to baseline
Diagnostics	DCGM -r 4 (compute, memtest, EUD)	Clean pass at start and end of soak	Single transient fault: re-run, investigate	Reproducible diagnostic failure
HBM / ECC	Uncorrectable errors; correctable-error rate; row-remap events	Zero uncorrectable; correctable rate under ceiling; stable spare rows	Isolated correctable spike: extend soak, watch trajectory	Any uncorrectable; rising correctable trend; spare-row exhaustion
Thermal	Sustained clocks under realistic load over full soak	Throttle-free at rated clocks for the window	Re-seat GPU+cold-plate as a unit, re-soak	Throttling persists after mechanical remediation
SDC	Known-answer / bit-exact verified workloads	No arithmetic divergence in soak	Single anomaly: re-test under varied patterns/temps	Reproducible silent miscompute
Straggler / power	Node-sweep deviation from fleet median; power-cap response	Within performance band; honors power envelope	Re-cable / re-flash / re-seat NVLink, re-test	Persistent out-of-band perf or power behavior

Representative acceptance thresholds synthesized from DCGM diagnostics docs, Together AI's seven-phase guide, ClusterMAX 2.0, and operator practice. Exact numeric thresholds are operator- and generation-specific; treat as the structure of the decision, not fixed limits.
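The table's structure implies a simple aggregation rule: a node takes the worst verdict it earns on any dimension, and a dimension that was never measured blocks acceptance. The sketch below encodes that rule. The dimension keys are hypothetical labels for the six table rows, and raising on a missing dimension is a design choice of this sketch.

```python
from enum import IntEnum
from typing import Dict


class Verdict(IntEnum):
    PASS = 0
    REMEDIATE = 1
    RMA = 2


REQUIRED_DIMENSIONS = frozenset(
    {"inventory", "diagnostics", "hbm_ecc", "thermal", "sdc", "straggler_power"}
)


def node_disposition(per_dimension: Dict[str, Verdict]) -> Verdict:
    """Return the worst verdict across all dimensions; refuse incomplete input."""
    missing = REQUIRED_DIMENSIONS - per_dimension.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return max(per_dimension.values())


results = {name: Verdict.PASS for name in REQUIRED_DIMENSIONS}
results["thermal"] = Verdict.REMEDIATE
print(node_disposition(results).name)
```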

72–168 hr

GPU node burn-in / soak window (3-day minimum to 7-day strict acceptance)

2025 · Together AI seven-phase guide; Introl validation frameworks; ClusterMAX 2.0

3–4 weeks

bring-up burn-in period before a new cluster's failure rate decays toward the mature baseline

2025 · SemiAnalysis (100k H100 clusters)

~7 days / 512 GPUs

mature best-in-class H100 failure interval (about one failure per 512 GPUs every seven days); freshly racked clusters fail far more often

2025 · SemiAnalysis (100k H100 clusters)

419 in 54 days

unplanned interruptions on 16,384 H100s during Llama 3 405B — ~1 every 3 hr

2024 · Meta (Llama 3 paper) / Tom's Hardware

~30% + ~17%

Llama 3 interruptions attributed to faulty GPUs and to HBM3: together ~47% of all interruptions and more than half of hardware faults

2024 · Meta (Llama 3 paper) / DataCenterDynamics

~1 in 1,000

machines affected by silent data corruption (SDC) at fleet scale

2025 · Meta Engineering (How Meta keeps its AI hardware reliable)

every 1–2 weeks

expected SDC events during a large-scale training run (Meta; Google reports similar for Gemini)

2025–2026 · Meta Engineering; IEEE / arXiv SDC studies

~1.5 hr

DCGM -r 4 (deep, incl. memtest + EUD) runtime per node, GPU-count dependent

2026 · NVIDIA DCGM Diagnostics documentation

The acceptance package: what burn-in must produce

Burn-in is done when it produces a signed acceptance artifact that the next phase and the eventual operator inherit, not when the soak timer expires. The package is the bridge from "the nodes survived a week" to "the fleet is qualified to schedule," and it is what distinguishes acceptance from a soak that was merely run. It records: the inventory reconciliation against baseline (every node at qualified firmware/driver/link state); the DCGM results at soak entry and exit; the HBM/ECC trajectory with the correctable-error ceiling and any remapping events; the thermal record proving throttle-free operation under load; the SDC-hunt results and the explicit hand-off to the day-2 detection regime; the straggler node-sweep with the per-node performance band; the power-behavior envelope; and the disposition of every node that failed — what it failed, what was remediated, what was RMA'd, and the re-soak result for anything repaired.
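The contents listed above map naturally onto a per-node record that can be serialized and signed. The following is a sketch of one possible shape. Every field name is hypothetical and the record is deliberately flat; a real package would attach the underlying logs and time series, not only summary values.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeAcceptance:
    node_id: str
    baseline_match: bool          # inventory reconciled against the golden baseline
    dcgm_entry_pass: bool         # deep diagnostic at soak entry
    dcgm_exit_pass: bool          # deep diagnostic at soak exit
    uncorrectable_ecc: int
    row_remap_events: int
    throttle_free: bool           # held full clocks for the whole soak
    sdc_divergences: int          # known-answer mismatches observed
    straggler_deviation: float    # fraction above fleet median on the reference run
    power_in_envelope: bool
    disposition: str              # "accept", "remediated", or "rma"
    remediation_log: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so identical records produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)


record = NodeAcceptance(
    node_id="n01", baseline_match=True, dcgm_entry_pass=True, dcgm_exit_pass=True,
    uncorrectable_ecc=0, row_remap_events=0, throttle_free=True, sdc_divergences=0,
    straggler_deviation=0.004, power_in_envelope=True, disposition="accept",
)
print(record.to_json())
```

Stable serialization matters because the day-2 program inherits these values as its reference; a record that cannot be compared byte for byte is harder to sign and harder to diff against later measurements.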

This artifact is the per-node analogue of the commissioning documentation discipline established in Chapter 13.2, and it feeds directly into two downstream gates: the cluster-scale benchmarking and reference training run of Chapter 13.9, which assumes a fleet of qualified nodes and will expose any straggler or SDC the node-level cull missed; and the day-2 reliability program seeded at handover in Chapter 13.10, which inherits the burn-in baselines as the reference against which fleet drift is measured. A node that was never burned-in does not have a baseline — which means the day-2 program has nothing to detect degradation against, and the first time anyone learns the node is marginal is when it corrupts a training run.

Set the accept / RMA / re-soak thresholds before the soak starts

Decide your accept / RMA / re-soak boundary as numeric thresholds, before the soak starts — the correctable-ECC ceiling, the throttle tolerance, the straggler performance band, the SDC verification regime — and commit to enforcing them even when the schedule is screaming. The pressure to accept a marginal node is maximal at exactly the moment burn-in finds the marginal node, because the alternative is a slipped go-live. Every marginal node you wave through to protect the schedule is a future checkpoint restart, a silent gradient corruption, or a permanent straggler that taxes goodput for the asset's entire 2–3 year economic life. Burn-in only works if the thresholds were set when no one was under pressure and are honored when everyone is. The cull you skip is not avoided; it is rescheduled into production at a far higher price.

This chapter qualifies the individual node; the fleet it produces is taken to scale in Chapter 13.9 (NCCL collective gates, reference/proxy training, goodput accounting). It inherits the management/firmware fabric from Chapter 8.7, the cooling-loop proof from Chapter 5.11, the fabric link-health baseline from Chapter 13.7, and the integrated-system and dynamic-load context from Chapter 13.6. Its documentation discipline follows Chapter 13.2; its output feeds the staged ramp and day-2 hand-off of Chapter 13.10. The goodput-versus-availability logic behind the soak-duration fork is developed in Chapter 12.2; the hard/transient/silent failure taxonomy and fleet reliability data live in Chapter 14.3; the operational KPIs that the burn-in baselines seed are in Chapter 14.1.