
Chapter 14.4

Reliability Engineering for Training (Operational)

At day-2 scale a frontier training job fails not as an exception but at a baseline rate of roughly one interruption every few hours. The operational job is therefore not to prevent failures but to detect, isolate, and recover from them faster than they accumulate, because every minute of detect-to-recover comes straight out of goodput.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Your checkpoint tier mix and operational cadence — how much you spend on RAM/peer/persistent tiers to drop mean-time-to-recovery, given the Young/Daly interval is already fixed (the math lives in Chapter 9.4; here you tune the knobs that move ETTR).
Where the fault-detection boundary sits — what you catch with synchronous health checks at job launch versus passive monitoring mid-run versus a periodic offline node-sweep — and how aggressively you eject suspected lemon nodes before they have proven themselves bad.
Whether you run rigid (fail-stop-and-restart) or elastic/redundant training — buying continuity through spare capacity and reconfiguration against the goodput tax of running with hot spares idle.
Who owns the detect-to-recover loop and how automated it is — the facility-ops/ML-platform boundary, and how far you push autonomous remediation before a human is in the loop.
Which MTTR component you attack next — detection latency, isolation/scheduling delay, reconfiguration, or reload-and-replay — because they have wildly different costs to shave and the binding one moves with cluster size.

A frontier training job is a single synchronous computation spread across tens of thousands of accelerators, and it advances at the speed of its slowest healthy participant. When any one of those participants dies, the whole job stops. That is the structural fact that makes training reliability unlike any other operational discipline in the building: there is no graceful degradation, no shedding of load, no failing-over of a request. A 16,384-GPU job interrupts every ~3 hours; a 131,072-GPU job has a projected mean-time-to-failure of roughly fourteen minutes (Meta, Revisiting Reliability in Large-Scale ML Research Clusters, HPCA 2025). At that scale the question is no longer whether the run will be interrupted but how much wallclock you burn each time it is — and that number, summed over a multi-week run, is the difference between a 70% and a 95% effective-training-time job.

This chapter is the operational counterpart to the design-time reliability work in Part 12 and the failure-data catalog in Chapter 14.3. It is not where checkpoint interval math lives — the Young/Daly derivation is canonical in Chapter 9.4, and the goodput-vs-availability reframing is canonical in Chapter 12.2. Here we cover the four operational levers an ops team actually turns on day-2: the checkpoint tiering that sets how fast you recover, the detection and isolation that decides which node is at fault and ejects it, the elastic-vs-rigid recovery posture, and the MTTR decomposition that tells you which of these to spend the next dollar on.

The operational target: ETTR, not nines

The facility world measures itself in availability nines: Tier III at 99.982%, Tier IV at 99.995%. For a synchronous training job those numbers are nearly irrelevant, because the job does not care whether the facility was up; it cares whether its specific 16,384 GPUs all made forward progress on the same step. The operational metric is Effective Training Time Ratio (ETTR): productive runtime divided by allocated wallclock. Equivalently, the industry speaks of goodput (forward progress) net of badput (restarts, re-computation since the last checkpoint, idle time waiting for a replacement node, slow stragglers). Real large jobs land around 0.90 ETTR industry-average and ~0.96 best-in-class; well-run RSC-1 jobs at 2,048–4,096 GPUs exceed 0.90 even in a congested shared cluster, while an unoptimized 16,000-GPU job projects to only ~0.70 with naive checkpointing. Frequent checkpointing and fast restart close that 23-point goodput gap, lifting the projection to ~0.93.

The reason this matters in 2026 is the same reason everything in this guide matters: the cluster is power-bound, and the GPUs are depreciating on a 2–3 year economic clock whether or not they are doing useful work. Reliability overhead runs 6–21% of cluster TCO. On a fleet earning on the order of $10–12B per GW per year, a 5-point ETTR improvement is not an engineering nicety — it is the difference between hitting and missing the run's compute budget, and it is recovered capital, not avoided cost.

Checkpoint tiering in operation

The optimal checkpoint interval is a solved problem — Young/Daly gives it as a function of checkpoint cost and MTBF, and that derivation is canonical in Chapter 9.4. What the interval math takes as an input, and what operations actually controls, is the cost of a checkpoint and the cost of a restart. Drive those down and the optimal interval shrinks, the work-at-risk between checkpoints shrinks, and ETTR rises — without buying a single additional GPU. The lever for that is tiering: keeping checkpoints at multiple storage levels so that the common-case recovery never has to touch the slowest, most durable tier.

A checkpoint is roughly 14 bytes per parameter (model weights plus optimizer state in mixed precision), so a 405B-parameter model checkpoints ~5.7 TB. Writing that to a persistent parallel file system on every interval is bandwidth-prohibitive and stalls the job; reloading it across the fabric on every failure is the dominant restart cost. Tiering solves both. The canonical three-tier scheme keeps the freshest checkpoint in node-local RAM/host memory (seconds-scale recovery for the common case where the job restarts on the same hardware), a redundant copy on a peer node or adjacent slice (survives a single-node loss without touching durable storage), and a periodic copy to durable object/parallel storage (survives a whole-job or facility event). Google reported a 6.59% goodput uplift on a 35K-chip TPU v5p workload purely from multi-tier checkpointing, with sub-5-minute save intervals; the multi-tier recovery path cuts MTTR from a 15–30 minute persistent-storage reload to under 2 minutes for the common case.
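The sizing arithmetic above can be checked in a few lines. This is a back-of-envelope sketch: the 14 bytes per parameter and the 405B-parameter model come from the text, while the 100 GB/s aggregate read bandwidth is a hypothetical placeholder, not a measured figure.

```python
BYTES_PER_PARAM = 14  # weights + optimizer state in mixed precision (figure from the text)


def checkpoint_bytes(num_params: float) -> float:
    """Approximate full-checkpoint size in bytes."""
    return num_params * BYTES_PER_PARAM


def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Raw transfer time only; ignores metadata, deserialization and resharding."""
    return size_bytes / bandwidth_bytes_per_s


size = checkpoint_bytes(405e9)
print(f"checkpoint size: {size / 1e12:.2f} TB")  # checkpoint size: 5.67 TB

# ASSUMPTION: 100 GB/s aggregate read bandwidth is a placeholder, not a measured value.
print(f"raw reload floor: {transfer_seconds(size, 100e9):.0f} s")  # raw reload floor: 57 s
```

The raw transfer time is only a floor. A real durable-tier reload also pays for contention, metadata operations, and resharding, which is one reason observed reload times are much longer than the bandwidth arithmetic alone suggests.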

Checkpoint tier decision — what each tier buys and costs

Tier	Medium	Recovery latency	Survives	Operational cost
In-memory / host RAM	Node-local DRAM (and HBM staging)	Seconds — no network reload	Transient process/GPU error, same-node restart	DRAM capacity reserved off the training footprint; lost if node dies
Peer / in-cluster	Replica on a neighbor node or adjacent slice	Tens of seconds — intra-fabric copy	Single-node loss without a durable reload	Extra fabric traffic + 2x checkpoint memory; erasure-coding reduces it
Durable / persistent	Parallel FS or object store (GPUDirect Storage)	Minutes — full reload across the fabric	Whole-job, rack, or facility event	Storage bandwidth; the stall the async drain is hiding
Async drain (overlay)	Background copy from RAM to durable	N/A — hides write cost	Makes durable-tier cadence affordable	Engineering complexity; <10% compute overlap target

Tiers are cumulative: a production frontier run uses all three. Recovery latency and durability are 2025-26 practitioner ranges (VAST Data, Google Cloud multi-tier checkpointing, AWS); exact numbers are model- and fabric-dependent.

Two operational refinements sit on top of the tiers. Asynchronous checkpointing snapshots state to host memory in a brief synchronous pause, then drains it to durable storage in the background while training continues — the target is keeping the visible stall under ~10% of the checkpoint window, which is what makes a sub-5-minute durable cadence affordable at all. Sharded / distributed checkpoint formats let each rank write its own shard in parallel, so checkpoint wall-time scales with per-node bandwidth rather than aggregate model size — without it, the checkpoint itself becomes the straggler. The operational fork is how much memory and fabric you are willing to spend on the upper tiers: a job that checkpoints only to durable storage is simple and cheap to run but recovers slowly; a fully-tiered job recovers in seconds but reserves host RAM and burns fabric bandwidth on replica writes. The right answer is set by where your MTTR decomposition (below) says the time is going.
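The asynchronous pattern described above (short synchronous snapshot, background drain) can be sketched with the standard library alone. This is a toy illustration, not any framework's checkpoint API: `AsyncCheckpointer`, its method names, and the pickle-to-directory "durable tier" are all hypothetical stand-ins, and a real system would copy device tensors to pinned host memory and write sharded files.

```python
import copy
import pickle
import tempfile
import threading
from pathlib import Path
from typing import Optional


class AsyncCheckpointer:
    """Snapshot to host memory synchronously, drain to durable storage in the background."""

    def __init__(self, durable_dir: Path) -> None:
        self.durable_dir = durable_dir
        self._drain: Optional[threading.Thread] = None

    def save(self, step: int, state: dict) -> None:
        self.wait()                       # allow at most one drain in flight
        snapshot = copy.deepcopy(state)   # the brief synchronous pause
        self._drain = threading.Thread(target=self._write, args=(step, snapshot))
        self._drain.start()               # training resumes while this runs

    def _write(self, step: int, snapshot: dict) -> None:
        tmp = self.durable_dir / f"step_{step}.tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        # Publish by rename so a crash mid-write never leaves a partial checkpoint visible.
        tmp.rename(self.durable_dir / f"step_{step}.ckpt")

    def wait(self) -> None:
        if self._drain is not None:
            self._drain.join()
            self._drain = None


with tempfile.TemporaryDirectory() as d:
    ckpt = AsyncCheckpointer(Path(d))
    state = {"step": 10, "weights": [0.1, 0.2]}
    ckpt.save(10, state)
    state["weights"][0] = 9.9             # training mutates live state during the drain
    ckpt.wait()
    with open(Path(d) / "step_10.ckpt", "rb") as f:
        print(pickle.load(f)["weights"])  # [0.1, 0.2]
```

The design point is that only the in-memory copy sits on the critical path; the durable write is overlapped with training, and the saved checkpoint reflects the state at snapshot time even though training kept mutating it.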

Deep dive: why the restart cost, not the checkpoint cost, is the operational lever

Operators new to training reliability instinctively optimize the write side — faster checkpoints, higher storage bandwidth. But Young/Daly shows the optimal interval is proportional to the square root of checkpoint cost over failure rate, which means halving checkpoint write-time only shrinks the interval by ~30% and the work-at-risk by a similar amount. The bigger lever is the restart cost, because it is paid in full on every single failure, and at scale failures are frequent. Restart cost has four serial components: detect the failure, isolate and reschedule onto healthy hardware, reconfigure the parallel topology, and reload-plus-replay from the last checkpoint. Tiering attacks the reload term (load from RAM, not from the parallel FS); fast detection attacks the first term; hot spares and elastic reconfiguration attack the middle two.
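The square-root relationship can be made concrete with Young's first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF). The sketch below uses that approximation only to illustrate the sensitivity; the full derivation is the one in Chapter 9.4, and the 120-second checkpoint cost is a hypothetical value chosen for the example.

```python
import math


def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


mtbf = 3 * 3600                        # ~3 h between interruptions (from the text)
base = young_interval_s(120.0, mtbf)   # ASSUMPTION: 120 s checkpoint cost, illustrative
half = young_interval_s(60.0, mtbf)    # same job with checkpoint write-time halved

print(f"interval shrinks by {1 - half / base:.1%}")  # interval shrinks by 29.3%
```

Halving the write cost scales the interval by 1/sqrt(2), which is the ~30% figure quoted above regardless of the MTBF chosen.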

The empirical payoff is stark. Meta's modeling shows that reaching ETTR 0.9 on a 100,000-GPU run on an RSC-2-class failure rate requires checkpoint and restart overhead of roughly 2 minutes combined — not 2 minutes of checkpointing, 2 minutes end-to-end including the reload and replay. Hyper-checkpointing and in-memory recovery exist precisely because, at six-figure GPU counts, the only way to keep the restart term inside that budget is to never go to durable storage in the common case. The operational implication: if your detect-to-recover loop is dominated by the durable reload, you are tiering wrong, and no amount of faster checkpoint writes will save you. → Chapter 9.4 for the interval derivation.
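A first-order model shows why the restart term is the larger lever. The sketch assumes ETTR ≈ 1 − C/τ − (R + τ/2)/M, where C is checkpoint cost, R is restart overhead, M is MTTF, and τ is Young's interval. It is a textbook approximation that holds only when C, R, and τ are small compared with M, it is not Meta's model, and the cost values are illustrative placeholders.

```python
import math


def first_order_ettr(ckpt_cost_s: float, restart_s: float, mttf_s: float) -> float:
    """ETTR ~ 1 - C/tau - (R + tau/2)/M, with tau from Young's approximation.

    Valid only when C, R and tau are all small compared with M.
    """
    tau = math.sqrt(2.0 * ckpt_cost_s * mttf_s)
    checkpoint_overhead = ckpt_cost_s / tau               # time spent writing checkpoints
    per_failure_loss = (restart_s + tau / 2.0) / mttf_s   # restart plus average replay
    return 1.0 - checkpoint_overhead - per_failure_loss


MTTF_S = 3 * 3600  # ~3 h between interruptions, as for the 16,384-GPU run in the text

# ASSUMPTION: the checkpoint and restart costs below are illustrative placeholders.
baseline = first_order_ettr(ckpt_cost_s=60, restart_s=20 * 60, mttf_s=MTTF_S)
faster_writes = first_order_ettr(ckpt_cost_s=30, restart_s=20 * 60, mttf_s=MTTF_S)
faster_restart = first_order_ettr(ckpt_cost_s=60, restart_s=2 * 60, mttf_s=MTTF_S)

print(f"baseline (60 s write, 20 min restart): {baseline:.3f}")
print(f"halve the write cost:                  {faster_writes:.3f}")
print(f"cut restart to 2 min:                  {faster_restart:.3f}")
```

Under these assumptions, halving the write cost recovers about three points of ETTR, while cutting the restart from 20 minutes to 2 recovers ten. The exact numbers depend entirely on the placeholder inputs; the asymmetry is the point.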

Fault detection, isolation, and lemon-node ejection

A restart is only as good as the decision about where to restart. Resume the job on the same flaky node and it fails again within the hour; the run thrashes, ETTR collapses, and on-call burns out. So the second operational pillar is detection and isolation: catching that a failure occurred, attributing it to the right component, and removing the bad hardware from the schedulable pool before it poisons the next attempt. The hardest cases are not the hard failures (a GPU that falls off the bus is loud and easy) but the silent and gray ones — silent data corruption that produces wrong gradients with no error, and stragglers that are technically alive but running 20% slow and dragging the whole synchronous job down to their pace. The failure taxonomy (hard / transient / silent) is canonical in Chapter 14.3; here the concern is the operational detection program that sits on top of it.

Detection runs at three boundaries, and choosing where to invest is a real fork. Synchronous health checks at job launch (and after every restart) sweep the allocated nodes for known-bad signatures before training starts — cheap, but only catch what they test for and add latency to every recovery. Passive in-run monitoring watches per-step timing, collective-op latency, ECC counters, and thermal/throttle telemetry to flag stragglers and degrading nodes mid-run — catches the gray failures the launch check misses, but risks false positives that eject healthy nodes. A periodic offline node-sweep qualifies idle nodes against a benchmark to find slow/SDC-prone hardware before it is ever scheduled — the most thorough, but consumes capacity that could be training. Mature operators run all three; the question is the weighting.
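Passive in-run monitoring for stragglers can be sketched as a comparison of each node's step time against the fleet median, with a persistence requirement to limit false positives. The class name, the 10% slowness ratio, and the three-strike rule below are hypothetical tuning choices, not values from any production system.

```python
from collections import defaultdict
from statistics import median


class StragglerDetector:
    """Flag nodes that stay slower than the fleet median for several consecutive steps."""

    def __init__(self, slow_ratio: float = 1.10, strikes_to_flag: int = 3) -> None:
        self.slow_ratio = slow_ratio            # ASSUMPTION: 10% over median counts as slow
        self.strikes_to_flag = strikes_to_flag  # ASSUMPTION: require 3 slow steps in a row
        self._strikes = defaultdict(int)

    def observe(self, step_times: dict[str, float]) -> list[str]:
        """Record one step's per-node timings and return the currently flagged nodes."""
        fleet_median = median(step_times.values())
        flagged = []
        for node, seconds in step_times.items():
            if seconds > self.slow_ratio * fleet_median:
                self._strikes[node] += 1
            else:
                self._strikes[node] = 0         # one healthy step clears the count
            if self._strikes[node] >= self.strikes_to_flag:
                flagged.append(node)
        return sorted(flagged)


detector = StragglerDetector()
for _ in range(3):
    flagged = detector.observe(
        {"node-a": 1.00, "node-b": 1.02, "node-c": 1.21, "node-d": 0.99}
    )
print(flagged)  # ['node-c']
```

Raising `strikes_to_flag` trades detection latency for fewer false ejections, which is the same weighting question the paragraph above poses.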

The lemon-node fork: eject early or prove-it-bad?

A small number of nodes cause a disproportionate share of failures: the "lemons." Meta found 40 nodes (1.2–1.7% of footprint) touched 13% of daily jobs, and that proactively detecting and ejecting them dropped large-job (512+ GPU) failure rates from 14% to 4%, a 71% relative reduction in failures (job completion rises from 86% to 96%). The fork is the ejection threshold. Eject aggressively on the first suspicious signal and you keep the schedulable pool clean and goodput high, at the cost of pulling some healthy nodes (lost capacity) and needing a fast spares/RMA pipeline to refill (→ Chapter 14.6). Eject conservatively, requiring repeated confirmed failures, and you waste fewer healthy nodes but let lemons re-poison jobs in the meantime. At frontier scale the math favors aggressive ejection: the expected goodput loss from one lemon re-killing a 16,384-GPU job dwarfs the cost of mistakenly benching a healthy node for a re-test. Make the threshold a tuned parameter, not a default, and feed every ejection back into the FMEA catalog (Appendix F) and the failure-rate model in Chapter 14.3.
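The ejection threshold can be framed as an expected-cost break-even. The sketch below compares the GPU-hours lost when a lemon re-kills the job against the GPU-hours lost benching a healthy node for a re-test. Only the 16,384-GPU job size comes from the text; the recovery time, node size, and re-test duration are hypothetical inputs.

```python
def breakeven_lemon_probability(
    job_gpus: int, recovery_hours: float, node_gpus: int, retest_hours: float
) -> float:
    """Probability of 'node is a lemon' above which ejecting has lower expected cost."""
    cost_rekill = job_gpus * recovery_hours       # GPU-hours lost if a lemon kills the job again
    cost_false_eject = node_gpus * retest_hours   # GPU-hours lost benching a healthy node
    return cost_false_eject / (cost_false_eject + cost_rekill)


# ASSUMPTIONS: 15 min recovery, 8-GPU node, 24 h re-qualification are placeholders.
p_star = breakeven_lemon_probability(
    job_gpus=16_384, recovery_hours=0.25, node_gpus=8, retest_hours=24.0
)
print(f"eject when P(lemon) exceeds {p_star:.1%}")  # eject when P(lemon) exceeds 4.5%
```

With these inputs even a weak suspicion justifies ejection, and the break-even probability falls further as the job grows, which is the arithmetic behind favoring aggressive ejection at frontier scale.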

Isolation feeds automated remediation: once a node is flagged, the workflow drains the job off it, attempts an automated recovery ladder (reset the GPU, reload the driver, power-cycle the node, reflash firmware), re-qualifies it against the health suite, and either returns it to the pool or opens an RMA and pulls a spare. The degree of automation here is the facility-ops/ML-platform-ops boundary made concrete: the platform owns the detect-flag-eject loop and the job-level recovery, while facility ops owns the physical swap and the RMA logistics. Drawing that boundary cleanly — and deciding how far the autonomous ladder runs before a human is paged — is an organizational decision treated in Chapter 14.11; the autonomous-recovery mechanics overlap with Chapter 10.7.
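The recovery ladder above is a simple escalation loop: apply the next action, re-qualify, stop at the first pass. The sketch below shows only that control flow. The function and action names are hypothetical, and the actions are injected as callables because the real ones are site-specific tooling.

```python
from typing import Callable, Mapping, Sequence

# Escalation order taken from the ladder described in the text.
LADDER = ("reset_gpu", "reload_driver", "power_cycle", "reflash_firmware")


def remediate(
    node: str,
    actions: Mapping[str, Callable[[str], None]],
    passes_health_suite: Callable[[str], bool],
    ladder: Sequence[str] = LADDER,
) -> str:
    """Walk the ladder until the node re-qualifies; otherwise hand off to RMA."""
    for step in ladder:
        actions[step](node)
        if passes_health_suite(node):
            return f"returned_to_pool_after:{step}"
    return "open_rma_and_pull_spare"


# Toy harness: this fake node only becomes healthy after a power cycle.
log = []
last_action = {"name": None}


def make_action(name: str) -> Callable[[str], None]:
    def act(node: str) -> None:
        log.append(name)
        last_action["name"] = name
    return act


actions = {name: make_action(name) for name in LADDER}
outcome = remediate("node-17", actions, lambda node: last_action["name"] == "power_cycle")
print(outcome)  # returned_to_pool_after:power_cycle
print(log)      # ['reset_gpu', 'reload_driver', 'power_cycle']
```

Where the loop hands off to a human is a one-line policy choice here: truncate `ladder` and route the fall-through result to a page instead of an automatic RMA.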

Figure	What it measures	Source (year)
419 / 54 days	Unplanned interruptions on a 16,384-H100 Llama 3 run (~1 every 3 hr); 466 total, 78% hardware-caused	Meta (Llama 3 paper) / Tom's Hardware (2024)
GPU 30.1% / HBM3 17.2%	Leading Llama 3 interruption root causes; network 8.4%, software 12.9%; yet >90% effective training time	Meta (Llama 3 paper) / DCD (2024)
1.8 hr → 0.23 hr	Observed/projected MTTF at 16,384 vs 131,072 GPUs (failure rate rises roughly in proportion to job size; empirical, not a strict linear chain)	Meta, Revisiting Reliability, HPCA (2025)
14% → 4%	Large-job (512+ GPU) failure rate after lemon-node detection; 40 lemons = 1.2–1.7% of footprint touched 13% of jobs	Meta, Revisiting Reliability, HPCA (2025)
~2 min	Combined checkpoint+restart overhead needed to hit ETTR 0.9 on a 100K-GPU run (RSC-2-class failure rate)	Meta, Revisiting Reliability, HPCA (2025)
15–30 min → <2 min	MTTR reduction from persistent-storage reload to multi-tier in-cluster recovery; +6.59% goodput on 35K chips	Google Cloud, multi-tier checkpointing (2025)
~90% / ~96%	Effective training time (goodput), industry-average vs best-in-class; reliability overhead 6–21% of TCO	SemiAnalysis ClusterMAX / CoreWeave (2025)
~7 days / 512 GPUs	Best-in-class mature-cluster MTBF; new clusters fail far more during 3–4 week burn-in	SemiAnalysis, 100k H100 clusters (2025)
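Two of the figures above follow from simple arithmetic, sketched here. The scaling helper assumes failure rate is strictly proportional to GPU count, which the source itself flags as an empirical approximation rather than a law; the function names are illustrative.

```python
def mean_hours_between_interruptions(run_days: float, interruptions: int) -> float:
    """Average wallclock between interruptions over a run."""
    return run_days * 24 / interruptions


def scaled_mttf_hours(ref_mttf_hours: float, ref_gpus: int, target_gpus: int) -> float:
    """Project MTTF to another job size, ASSUMING failure rate scales with GPU count."""
    return ref_mttf_hours * ref_gpus / target_gpus


print(f"{mean_hours_between_interruptions(54, 419):.2f} h")  # 3.09 h

projected = scaled_mttf_hours(1.8, 16_384, 131_072)
print(f"{projected:.3f} h = {projected * 60:.1f} min")       # 0.225 h = 13.5 min
```

The first result matches the "~1 every 3 hr" Llama 3 figure; the second rounds to the 0.23 hr (roughly fourteen minutes) projection quoted for 131,072 GPUs.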

Elastic and redundant training

The default recovery posture is rigid: the job fails, stops entirely, the scheduler re-allocates a full healthy set of nodes, and the run reloads the last checkpoint and replays. Simple, robust, and the right choice for most jobs — but it pays the full detect-isolate-reschedule-reload cost on every failure, and at six-figure GPU counts the reschedule term alone (waiting for the scheduler to find and provision a clean replacement set) can dominate. The two operational alternatives buy faster recovery by spending capacity.

Elastic training lets the job continue on the surviving nodes at reduced width — drop the failed node's data-parallel replica, reshard, and keep stepping — then re-expand when a replacement is qualified. It eliminates the stop-and-reschedule stall for the common single-node case, at the cost of running slower (and at slightly different effective batch size) until the node is back. Redundant / spare-pool training keeps hot spares qualified and idle so a failed node is swapped in seconds rather than waiting on the scheduler or an RMA — recovery approaches the reload time alone, but the spares are a standing goodput tax (idle GPUs that depreciate without computing). The fork is a direct trade between recovery speed and reserved capacity, and it interacts with the elasticity of the parallelism scheme — nonuniform / elastic tensor-parallelism reduces the failure-amplification penalty of a node loss, which is what makes degrade-and-continue viable at all.

Recovery posture decision — rigid vs elastic vs redundant

Posture	On a node failure	Recovery latency	Standing cost	Best fit
Rigid (fail-stop)	Whole job stops, full re-allocation, reload+replay	Full detect+reschedule+reload	None — simplest	Smaller jobs; abundant clean capacity; simple ops
Elastic (degrade-and-continue)	Reshard onto survivors, keep stepping at reduced width	Near-zero stall; runs slower until refilled	Throughput dip while degraded	Large jobs with elastic parallelism; staleness-tolerant
Redundant (hot spares)	Swap a pre-qualified spare in seconds	≈ reload time only	Idle spare pool depreciating	Frontier runs where every restart is costly
Layered (elastic + spares)	Continue degraded, then swap spare, then re-expand	Lowest end-to-end	Spare pool + elasticity engineering	Six-figure-GPU jobs at best-in-class ETTR

Postures are not mutually exclusive; production runs often layer a small hot-spare pool under an elastic scheme. Cost is in reserved capacity (goodput tax) traded against recovery latency.
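The trade between recovery latency and reserved capacity can be sketched with a first-order goodput model: uptime fraction times the fraction of paid-for GPUs that are actually training. The model ignores replay and degraded-throughput effects. The spare-pool size and recovery times are hypothetical; only the 16,384-GPU job and ~3 hour MTTF echo the text.

```python
def effective_goodput(
    job_gpus: int, spare_gpus: int, mttf_hours: float, recovery_hours: float
) -> float:
    """Fraction of paid-for GPU time doing useful work (first-order estimate).

    Valid only when recovery_hours is small compared with mttf_hours.
    """
    uptime = 1.0 - recovery_hours / mttf_hours         # time not spent recovering
    utilization = job_gpus / (job_gpus + spare_gpus)   # standing tax of idle spares
    return uptime * utilization


# ASSUMPTIONS: 30 min rigid recovery, 3 min spare swap, 128 hot spares are placeholders.
rigid = effective_goodput(16_384, spare_gpus=0, mttf_hours=3.0, recovery_hours=0.5)
redundant = effective_goodput(16_384, spare_gpus=128, mttf_hours=3.0, recovery_hours=0.05)

print(f"rigid:     {rigid:.3f}")      # rigid:     0.833
print(f"redundant: {redundant:.3f}")  # redundant: 0.976
```

Under these inputs the spare pool costs under one point of utilization and buys back far more in recovery time. The conclusion flips if recovery is already fast or the spare pool is large, which is why the posture is a sizing decision rather than a default.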

MTTR decomposition: where the time actually goes

Everything above is in service of one number — mean-time-to-recovery — and you cannot improve it without decomposing it. Treat MTTR as a serial sum of four terms, each with its own owner, its own cost to shave, and its own scaling behavior:

Detection latency — wallclock from the failure to the system knowing it failed. Dominated by how fast health monitoring fires; gray/silent failures (a slow straggler, an SDC) can take many steps to surface and are the worst offenders. Shave it with passive in-run monitoring and tighter step-time anomaly thresholds.
Isolation & scheduling — attributing the fault to a component, ejecting it, and provisioning a healthy replacement set. This term grows with cluster size (more nodes to schedule around) and is the one hot spares and elastic continuation directly attack.
Reconfiguration — rebuilding the parallel topology, re-establishing collectives, and re-warming the fabric on the new node set. Roughly fixed per restart; elastic/nonuniform parallelism reduces it by avoiding a full topology rebuild.
Reload & replay — loading the last checkpoint and recomputing the work since it. This is the term tiering attacks (load from RAM, not durable storage) and the term Young/Daly's interval bounds (less work-at-risk between checkpoints).

The operational discipline is to instrument all four and attack the binding one — because which term dominates shifts with scale. On a few-thousand-GPU job the reload-and-replay term usually dominates, so tiering and interval tuning pay off most. On a six-figure-GPU job the isolation-and-scheduling term grows until it rivals reload, which is why hot spares and lemon-node pre-ejection (keeping a clean pool so scheduling is instant) become the highest-leverage investments. Optimizing the wrong term is the classic waste: buying faster checkpoint storage when your time is actually going to a slow scheduler.
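Instrumenting the four terms and picking the binding one is straightforward to express in code. The sketch below is a minimal illustration: the class and field names are hypothetical, and the sample timings are invented to mirror the small-job versus large-job contrast described above, not measured data.

```python
from dataclasses import asdict, dataclass


@dataclass
class MttrBreakdown:
    """One recovery event split into the four serial MTTR terms (seconds)."""

    detection_s: float
    isolation_scheduling_s: float
    reconfiguration_s: float
    reload_replay_s: float

    def total_s(self) -> float:
        return sum(asdict(self).values())

    def binding_term(self) -> str:
        """Name of the largest term, i.e. where the next dollar should go."""
        terms = asdict(self)
        return max(terms, key=terms.get)


# ASSUMPTION: illustrative timings only.
small_job = MttrBreakdown(
    detection_s=30, isolation_scheduling_s=60, reconfiguration_s=45, reload_replay_s=600
)
large_job = MttrBreakdown(
    detection_s=30, isolation_scheduling_s=900, reconfiguration_s=90, reload_replay_s=120
)

print(small_job.total_s(), small_job.binding_term())  # 735 reload_replay_s
print(large_job.total_s(), large_job.binding_term())  # 1140 isolation_scheduling_s
```

Logging one such record per recovery event and aggregating by `binding_term` gives a direct answer to which lever to fund next.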

The badput you forget to count

Goodput accounting routinely undercounts two costs. First, straggler badput: a node that is alive but 15% slow doesn't trigger a restart, so it never shows up in the interruption log — it just silently drags every step, and a synchronous job inherits the slowest member's pace continuously. Second, replay badput: the work recomputed between the last checkpoint and the failure is real lost compute that the "number of interruptions" metric ignores entirely. A run with few interruptions but a long checkpoint interval can have worse goodput than a run with many interruptions and tight checkpointing. Measure ETTR end-to-end against allocated wallclock, attribute badput by category, and tie those categories to the contractual goodput baseline (→ Chapter 12.4) — or you will optimize the metric you can see and miss the goodput you are actually losing.
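End-to-end badput attribution can be sketched as a small accounting function. The category names and all sample numbers below are hypothetical; they are chosen so that a single day of allocated wallclock decomposes exactly into productive steps plus the three badput categories.

```python
def straggler_badput_s(num_steps: int, nominal_step_s: float, observed_step_s: float) -> float:
    """Time lost because every step ran at the slowest member's pace."""
    return num_steps * max(0.0, observed_step_s - nominal_step_s)


def ettr_report(allocated_s: float, badput_s: dict[str, float]) -> dict[str, float]:
    """ETTR plus each badput category as a fraction of allocated wallclock."""
    productive_s = allocated_s - sum(badput_s.values())
    report = {"ettr": productive_s / allocated_s}
    for category, seconds in badput_s.items():
        report[f"badput_{category}"] = seconds / allocated_s
    return report


# ASSUMPTION: one day of wallclock, 8 interruptions, 23,000 steps; all values illustrative.
allocated = 24 * 3600.0
badput = {
    "restart": 8 * 300.0,  # 8 interruptions x 5 min detect-to-resume
    "replay": 8 * 150.0,   # 8 x 2.5 min of recomputed work
    "straggler": straggler_badput_s(23_000, nominal_step_s=3.3, observed_step_s=3.6),
}

for name, value in ettr_report(allocated, badput).items():
    print(f"{name:18s} {value:.3f}")
```

In this example the straggler category is the largest loss (about 8% of wallclock) even though it never appears in an interruption log, which is exactly the undercounting the paragraph above warns about.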

Deep dive: the detect-to-recover loop as a closed control system

The most useful way to think about operational training reliability is as a closed-loop controller running continuously over the cluster, not a sequence of incident responses. The loop has a sensing stage (health checks, step-time telemetry, ECC/throttle counters, collective-latency probes), a decision stage (is this a transient to retry, a node to eject, or a straggler to fence?), an actuation stage (the automated remediation ladder and the scheduler), and a learning stage (every event feeds the lemon-node classifier and the fleet failure-rate model). The loop's bandwidth — how fast it goes from a degraded signal to a clean recovered job — is exactly the MTTR you are trying to minimize, and its precision — how often it ejects a healthy node or fails to catch a sick one — sets your false-eject capacity loss and your missed-failure replays.

Two design principles fall out of the control framing. First, detection must lead failure where possible: passive monitoring that flags a degrading node before it hard-fails turns an unplanned restart into a scheduled drain, collapsing the detection-and-scheduling terms. Second, the loop must learn: a lemon classifier that updates from each ejection, and a failure-rate model that updates from each swap, is what lets the threshold tuning above stay calibrated as the fleet ages through its bathtub curve. This is also where the operational twin closes the loop with as-built reliability data — the same feedback discipline as the design-validation twin in Chapter 12.2, applied to the running job rather than the building.

Operational anti-patterns

The recurring failures in training-reliability operations come from optimizing one term in isolation or from treating a frequent-failure regime as if it were a rare-failure one. Three are worth naming:

Restarting on the same flaky node. No lemon-node ejection, so a degrading GPU re-kills the job every recovery cycle. The interruption count looks like bad luck; it is actually one bad node poisoning every attempt. The fix is cheap (eject before re-scheduling) and the payoff is large (14%→4% large-job failure).
Checkpointing only to durable storage. Every recovery pays a 15–30 minute reload across the fabric because there is no RAM or peer tier. At a few thousand GPUs this is survivable; at six figures it makes ETTR 0.9 mathematically unreachable. The reload term must come from memory in the common case.
Buying availability the job doesn't value. Spending on 2N facility power to prevent a restart that the job already tolerates via checkpointing — capital that returns far more as goodput (faster recovery, hot spares, lemon ejection) than as facility nines. This is the design-time anti-pattern of Chapter 12.2 showing up as an operational budgeting error.

The checkpoint-interval math this chapter tunes is canonical in Chapter 9.4; the goodput-vs-availability reframing that justifies spending on recovery over nines is canonical in Chapter 12.2, and the contractual measurement of goodput/badput lives in Chapter 12.4. The failure taxonomy and empirical fleet failure-rate data behind detection are in Chapter 14.3; the autonomous-recovery and elastic-training mechanics overlap with Chapter 10.7. Lemon-node ejection feeds the spares and RMA pipeline in Chapter 14.6; the facility-ops/ML-platform ownership boundary and the incident-command model are in Chapter 14.11; and the live power/thermal envelope the recovered job runs back into is managed in Chapter 14.7.