Chapter 10.7
Fleet Reliability, Fault Tolerance & Autonomous Recovery
At fleet scale a synchronous training job fails roughly every few hours and a 100k-GPU run more than once an hour, so reliability stops being a facility-availability number and becomes a software-and-control-plane problem: the operator that detects a fault, ejects the node, and restarts from a recent checkpoint in minutes keeps its goodput, and the one that waits for a human loses it.
What you'll decide here
- Whether your reliability target is facility availability (the legacy 'nines') or goodput/ETTR — for a checkpointable training fleet the two diverge sharply, and optimizing the wrong one buys redundancy the workload does not value (canonical rethink in Chapter 12.2).
- Where to spend the recovery budget along the detect → drain → diagnose → remediate → restart loop: faster detection and faster restart (multi-tier checkpointing, hot spares) almost always beat raw MTBF improvement at a given fleet size.
- Whether to absorb failures with hot spares and fast restart (operationally simple, capacity tax) or with elastic/redundant training and algorithmic fault tolerance (no spare tax, but couples reliability to the training framework).
- How autonomous the recovery loop is allowed to be — auto-drain and auto-restart on a confidence threshold versus human-gated remediation — and the blast-radius guardrails (rate limits, quarantine, lemon-node ejection) that keep an autonomous loop from amplifying a fault.
- The checkpoint cadence and restart-overhead target your failure rate actually requires (Young/Daly math canonical in Chapter 9.4), because at 100k GPUs an ETTR of 0.9 forces both the checkpoint interval and the restart overhead down to roughly two minutes, not the 'every few hours' habit most teams ship with.
A single GPU is a remarkably reliable device. A fleet of a hundred thousand of them, lashed into one synchronous job by a non-blocking fabric, is not, and the arithmetic is unforgiving. If any one node failing kills the whole job, and node failures are independent, the job's mean-time-to-failure is the per-node MTBF divided by the node count. Meta's Revisiting Reliability study put hard numbers on this: on their research clusters an 8-GPU job had a mean-time-to-failure of about 47.7 days, while a 1,024-GPU job failed every 7.9 hours. That is roughly two orders of magnitude worse, and close to the 128-fold drop the single-point-of-failure model predicts for 128 times as many GPUs. Extrapolate to a 100,000-GPU run and you are interrupted more than once an hour. This is the reliability problem at scale, and it is the reason fleet reliability is an engineering discipline, not a facility-uptime line item.
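The single-point-of-failure scaling can be checked in a few lines. This is a minimal sketch, assuming independent failures at a constant rate; the helper name `job_mttf_hours` is illustrative, and the per-GPU MTBF is simply backed out of the 8-GPU figure quoted above rather than taken from any measurement.

```python
HOURS_PER_DAY = 24.0


def job_mttf_hours(per_gpu_mtbf_hours: float, n_gpus: int) -> float:
    """MTTF of a job that dies when any one of n_gpus independent GPUs fails."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    return per_gpu_mtbf_hours / n_gpus


# ASSUMPTION: back out a per-GPU MTBF from the 8-GPU job MTTF of 47.7 days.
per_gpu_mtbf = 47.7 * HOURS_PER_DAY * 8

for n in (8, 1_024, 16_384, 100_000):
    print(f"{n:>7} GPUs: job MTTF ~ {job_mttf_hours(per_gpu_mtbf, n):8.2f} h")
```

Under these assumptions the model gives just under nine hours for 1,024 GPUs, against the 7.9 hours measured, and well under an hour at 100,000 GPUs. The gap at 1,024 GPUs is a reminder that the model is a first-order guide, not a fit.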
This chapter is about the system that turns that failure rate from a goodput catastrophe into a manageable tax. It is built from three moving parts that must be designed together: the detection-to-recovery loop (how fast you notice, isolate, and restart), the fault-tolerance strategy (hot spares versus elastic/redundant training versus algorithmic resilience), and the checkpoint/restore substrate that decides how much work a failure erases. Every reliability dollar can go to facility nines, to recovery speed, or to spare capacity, and the three buy very different amounts of goodput. We name the canonical homes for the supporting math — checkpoint intervals in Chapter 9.4, the failure taxonomy and fleet AFR data in Chapter 14.3, and the availability-vs-goodput rethink in Chapter 12.2 — and concentrate here on the operational loop that ties them together.
The reliability problem at scale
Two facts collide to produce the modern training-reliability problem. First, synchronous coupling makes every node a single point of failure: in a data-/tensor-/pipeline-parallel run the job advances at the speed of its slowest rank, and a dead rank halts all of them. Second, failure rate scales with node count, because the population of things that can break grows linearly while the job's tolerance for any one of them stays at zero. The product is a job-level MTBF that collapses as you scale — the headline reason a frontier run is interrupted on the order of once an hour even when each individual node is fine for weeks.
The empirical anchors are now public and consistent. Meta's Llama 3 405B run logged 419 unplanned interruptions over 54 days on 16,384 H100s — about one every three hours — of which roughly 78% were hardware-caused and 58.7% GPU-related, yet the team still achieved over 90% effective training time through aggressive automation and only three manual interventions. SemiAnalysis's teardown of mature 100k-H100 clusters puts best-in-class MTBF at around 7 days per 512 GPUs after burn-in, with new clusters failing far more during the first three to four weeks. Alibaba's Unicron production study found a 43.4% large-job failure rate, about 37% hardware-attributed and roughly 73% recoverable via restart. The throughline: at scale, failure is not an exception to plan around — it is the steady state to engineer for.
The consequence for design is that the legacy reliability target — facility availability, the Uptime Tier 'nines' — is the wrong number for a training fleet. A Tier IV facility at 99.995% availability is down about 26 minutes a year; a 100k-GPU synchronous job loses far more than that to internal hardware faults that the facility's 2N power and cooling do nothing to prevent. The metric that governs the return on a training cluster is goodput (equivalently ETTR, effective-training-time ratio): productive GPU-time divided by wall-clock GPU-time. This is the canonical pivot of Chapter 12.2; here it is the lens through which every recovery decision is scored.
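To make the gap between the two metrics concrete, the sketch below compares facility downtime at 99.995% availability with the stall time implied by the Llama 3 interruption count quoted earlier. The 10-minute stall per interruption is a hypothetical value chosen for illustration, and the calculation ignores work lost since the last checkpoint.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Minutes per year a facility at the given availability is down."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def goodput(productive_time: float, wall_clock_time: float) -> float:
    """Goodput (ETTR): productive time over wall-clock time, same units."""
    return productive_time / wall_clock_time


facility_minutes = annual_downtime_minutes(0.99995)  # Tier IV figure from the text

# Figures from the text: 419 unplanned interruptions over a 54-day run.
interruptions, run_days = 419, 54
mean_hours_between = run_days * 24 / interruptions

# ASSUMPTION: each interruption stalls the job for 10 minutes (illustrative only).
assumed_stall_min = 10.0
job_stall_minutes = interruptions * assumed_stall_min
run_minutes = run_days * 24 * 60
run_goodput = goodput(run_minutes - job_stall_minutes, run_minutes)

print(f"facility downtime: {facility_minutes:.1f} min/year")
print(f"job stall in {run_days} days: {job_stall_minutes:.0f} min "
      f"(one interruption every {mean_hours_between:.1f} h)")
print(f"goodput from stalls alone: {run_goodput:.3f}")
```

Even with a modest assumed stall, 54 days of internal faults cost far more time than a full year of facility downtime, which is the sense in which the two targets diverge.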
The detection-to-recovery loop
Every interruption runs through the same five-stage loop, and the time spent in each stage is what you actually control. Detect the fault; drain the affected node or rack out of the job; diagnose the root cause; remediate (reboot, reseat, RMA, or replace); and restart the job from the last good checkpoint onto healthy hardware. The cluster's goodput is set by how fast this loop closes and how often it has to run. Crucially, the failure taxonomy that the loop must classify — hard faults, transient faults, and silent data corruption — is canonical in Chapter 14.3; here we treat detection as a given input and focus on the loop's economics.
The non-obvious lever is that detection latency and restart latency dominate, not repair latency. Repair (an RMA, a reseat) happens asynchronously on a drained node while the job runs on a spare; it is off the critical path. What sits on the critical path is the time to notice the fault (a hung collective can stall a job for minutes before a watchdog fires) plus the time to load a checkpoint and re-establish the fabric. This is why the highest-leverage reliability spend is rarely 'better hardware': it is faster watchdogs, faster checkpoint loading, and a warm spare ready to slot in. Multi-tier checkpointing has cut restart from the legacy 15–30 minutes to under two minutes in the best cases, and that single change can move goodput by several points at frontier scale.
| Stage | Typical latency | On critical path? | Primary lever | Failure mode if neglected |
|---|---|---|---|---|
| Detect | Seconds to several minutes | Yes — job is stalled while undetected | Heartbeats, collective watchdogs, health checks, SDC scanners | A hung rank silently burns GPU-hours until a timeout fires |
| Drain / isolate | Seconds to ~1 min | Yes | Topology-aware eject; quarantine the node/rack from the scheduler | Faulty node rejoins and re-fails; flapping job |
| Diagnose | Minutes to hours | No — runs on drained node | Automated triage workflows; XID/SXID classification; burn-in re-test | Mis-triage RMAs healthy parts or returns a lemon to service |
| Remediate | Minutes (reboot) to days (RMA) | No — off-line on a spare | Reboot/reseat/reflash; RMA logistics; lemon-node ejection | Repeat-offender 'lemon' nodes silently cap fleet goodput |
| Restart | Under 2 min (multi-tier) to 15–30 min (storage-only) | Yes — all GPUs idle until resumed | Multi-tier / in-memory checkpoint; hot spare; fast fabric re-init | A slow restore multiplies every failure into a large goodput loss |
The table is a budget allocator. The three rows marked 'on critical path' — detect, drain, restart — are where wall-clock goodput is won or lost; the two marked off-path can be slow and asynchronous as long as you have spare capacity to keep the job running while they complete. This is the structural argument for hot spares: they convert remediate from an on-path stall into an off-path background task. It is also the argument for multi-tier checkpointing: it attacks restart, the most leveraged on-path stage, by keeping a recent checkpoint in node-local memory or NVMe so a restore is a memory copy rather than a read across the storage fabric. The checkpoint cadence and tiering math that governs how much a failure erases is canonical in Chapter 9.4.
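The budget view of the table can be sketched as a sum over the on-path stages only. The latencies below are illustrative values picked from inside the ranges in the table, not measurements, and the `Stage` type is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    seconds: float          # illustrative latency, not a measurement
    on_critical_path: bool


def stall_seconds(stages) -> float:
    """Wall-clock stall per failure: only on-path stages idle the job."""
    return sum(s.seconds for s in stages if s.on_critical_path)


# ASSUMPTION: slow watchdog and storage-only restore.
legacy = [
    Stage("detect", 180, True),
    Stage("drain", 60, True),
    Stage("diagnose", 3_600, False),
    Stage("remediate", 86_400, False),
    Stage("restart", 20 * 60, True),
]

# ASSUMPTION: fast watchdog, multi-tier restore, hot spare available.
tuned = [
    Stage("detect", 30, True),
    Stage("drain", 30, True),
    Stage("diagnose", 3_600, False),
    Stage("remediate", 86_400, False),
    Stage("restart", 110, True),
]

print(f"legacy stall per failure: {stall_seconds(legacy) / 60:.1f} min")
print(f"tuned stall per failure:  {stall_seconds(tuned) / 60:.1f} min")
```

Diagnose and remediate are identical in both profiles and contribute nothing to the stall, so a day-long RMA costs no goodput as long as a spare covers the gap.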
Autonomous hardware recovery: closing the loop without a human
In a fleet failing more than once an hour, a human-in-the-loop recovery process is a bottleneck: the operator's response time becomes the MTTR. The 2025–2026 answer is to make the loop autonomous: a fleet control plane that detects a drained or unhealthy node, runs automated triage to classify the fault, attempts remediation (power-cycle, reflash, re-test) without a ticket, and escalates to a human only when it cannot resolve the fault itself. NVIDIA's Mission Control packages this for GB200/GB300 NVL72 as three coupled components (autonomous job recovery, autonomous hardware recovery, and the NVIDIA Resiliency Extension, NVRx), running automated health checks at the tray, rack, and system level and executing break-fix workflows that open support tickets only for what cannot auto-resolve. Hyperscalers run their own equivalents (Meta's automation took Llama 3 to over 90% effective training time with only three manual interventions across 54 days), and neocloud operators differentiate on the maturity of exactly this loop.
The decision here is not whether to automate detection — everyone does — but how much authority to grant the loop to act. Full autonomy (auto-drain and auto-restart the job on a confidence threshold) maximizes goodput but can amplify a fault: a mis-classifying triage routine can eject healthy nodes, a restart storm can thrash the scheduler, and a shared firmware bug can be 'remediated' onto every node in turn. The guardrails that make autonomy safe are the same ones that make any control loop safe — rate limits on automated actions, quarantine of repeat offenders (lemon-node ejection), and a circuit breaker that hands control to a human when the action rate or failure rate spikes. The fork is real: an under-automated loop loses goodput to human latency; an over-automated loop without guardrails loses goodput to self-inflicted instability.
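The three guardrails named above (rate limit, lemon quarantine, circuit breaker) can be sketched as one small decision object. This is an illustrative design, not any vendor's control plane: the class name, thresholds, and return strings are all hypothetical, and timestamps are passed in explicitly so the behaviour is easy to test.

```python
from collections import defaultdict, deque


class RemediationGuard:
    """Decides whether an automated action is allowed for a reported fault."""

    def __init__(self, max_actions_per_window=5, window_s=600.0, lemon_threshold=3):
        self.max_actions = max_actions_per_window
        self.window_s = window_s
        self.lemon_threshold = lemon_threshold
        self._recent = deque()           # timestamps of recent automated actions
        self._faults = defaultdict(int)  # lifetime fault count per node
        self.quarantined = set()
        self.breaker_open = False

    def decide(self, node_id: str, now: float) -> str:
        """Return 'auto_remediate', 'quarantine', or 'escalate'."""
        if self.breaker_open:
            return "escalate"            # a human must reset the breaker
        self._faults[node_id] += 1
        if self._faults[node_id] >= self.lemon_threshold:
            self.quarantined.add(node_id)  # repeat offender: keep it out
            return "quarantine"
        # Forget actions that have aged out of the sliding window.
        while self._recent and now - self._recent[0] > self.window_s:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions:
            self.breaker_open = True     # action rate spiked: stop acting
            return "escalate"
        self._recent.append(now)
        return "auto_remediate"


guard = RemediationGuard(max_actions_per_window=2, window_s=60.0, lemon_threshold=3)
decisions = [
    guard.decide("node-a", now=0.0),
    guard.decide("node-a", now=1.0),
    guard.decide("node-a", now=2.0),      # third fault on the same node
    guard.decide("node-b", now=3.0),      # would be the third action in 60 s
    guard.decide("node-c", now=1000.0),   # breaker stays open
]
print(decisions)
```

The breaker here never closes on its own. Requiring a human reset is a deliberate choice in this sketch, since the condition that tripped it (a correlated fault) is exactly the one automation is worst at judging.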
Fault tolerance: hot spares vs elastic/redundant training vs algorithmic resilience
Once detection and recovery are fast, the next fork is how the job survives the failure — and there are three families, each trading capacity, framework coupling, and recovery speed differently. The choice determines both the spare-capacity tax you pay and how deeply reliability is wired into the training stack.
Hot spares + fast restart is the operationally simplest and most common posture: hold a pool of healthy GPUs idle (typically a few percent of the fleet), and on failure drain the bad node, slot in a spare, and restart the job from the last checkpoint. The cost is the idle spare capacity and the full restart latency; the virtue is that it is framework-agnostic and easy to reason about.
Elastic / redundant training lets the job continue on fewer nodes (shrinking the world size and re-sharding) or run with redundant replicas so a single failure does not require a full restart. This eliminates the idle-spare tax and much of the restart stall, at the price of coupling reliability tightly to the training framework: the scheduler, the parallelism plan, and the collective library must all cooperate to reshard live.
Algorithmic fault tolerance goes further into the math: techniques like nonuniform/elastic tensor parallelism limit how much goodput a single GPU failure costs the rest of the job, and redundant-computation or erasure-coded schemes let the job tolerate a fault without re-execution. This has the lowest overhead in principle, but it is the least mature and the most workload-specific. All three families map onto the operational reliability-engineering treatment in Chapter 14.4.
| Strategy | How it survives a failure | Spare-capacity tax | Recovery latency | Framework coupling | Best fit |
|---|---|---|---|---|---|
| Hot spares + fast restart | Swap in a healthy spare; restart from checkpoint | ~2–5% of fleet held idle | Restart-bound (sub-2 min with multi-tier; else 15–30 min) | Low — scheduler-level, framework-agnostic | Default for most operators; large stable runs |
| Elastic training | Shrink world size / re-shard onto survivors; resume | None (no idle reserve) | Re-shard + resume; no spare provisioning wait | High — scheduler + parallelism + collectives must reshard live | Long runs where spare capacity is scarce or costly |
| Redundant training | Redundant replicas absorb the loss; no full restart | Replica overhead (compute, not idle) | Near-zero stall on a single failure | High — requires replicated execution plan | Highest-value runs where any stall is unacceptable |
| Algorithmic fault tolerance | Nonuniform/elastic TP or coded redundancy bounds the loss | Low (math, not idle capacity) | Often no restart for the tolerated fault class | Highest — baked into the parallelism/algorithm | Frontier teams co-designing model, parallelism, and resilience |
The consequence to internalize: moving down this table trades operational simplicity for capacity efficiency and recovery speed, and pays for it in framework coupling. Hot spares are something an operations team can run with a stock framework; elastic and redundant training require the training stack itself to be reliability-aware, which is why they are most common at the labs that own their framework end-to-end. There is no universally right answer — the right one is a function of how scarce your spare capacity is (a power-bound fleet may not be able to afford 5% idle), how much a stall costs (a contracted run with a deadline cannot tolerate restart storms), and how much control you have over the training stack. Many fleets run a hybrid: hot spares as the baseline, elastic shrink as the fallback when the spare pool is exhausted.
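The hybrid posture described above (spares first, elastic shrink as the fallback) reduces to a short decision rule. The function below is a hypothetical sketch: it counts nodes only, and it ignores the constraint that a real elastic shrink usually has to remove whole parallelism groups rather than arbitrary nodes.

```python
def plan_recovery(failed_nodes: int, spare_pool: int,
                  world_size: int, min_world_size: int):
    """Return (action, new_world_size, spares_left) for one failure event."""
    if failed_nodes <= spare_pool:
        # Baseline: replace every failed node from the hot-spare pool.
        return ("swap_in_spares", world_size, spare_pool - failed_nodes)
    # Spare pool exhausted: use what is left, then shrink by the shortfall.
    shortfall = failed_nodes - spare_pool
    shrunk = world_size - shortfall
    if shrunk >= min_world_size:
        return ("elastic_shrink", shrunk, 0)
    # Too few survivors to run the parallelism plan at all.
    return ("pause_and_escalate", world_size, 0)


print(plan_recovery(failed_nodes=2, spare_pool=4, world_size=512, min_world_size=480))
print(plan_recovery(failed_nodes=6, spare_pool=4, world_size=512, min_world_size=480))
print(plan_recovery(failed_nodes=40, spare_pool=4, world_size=512, min_world_size=480))
```

The `min_world_size` floor is the knob that encodes how much control you have over the training stack: a framework that cannot reshard sets it equal to `world_size`, which collapses the policy back to spares-only.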
Sizing the recovery: where the next reliability dollar goes
The reliability budget has three competing claims — raise MTBF (better hardware, more burn-in), shrink MTTR (faster detect/restart, hot spares), or add facility nines (2N power, redundant cooling) — and at frontier scale they are not equally productive. The goodput of a checkpointable job is, to first order, uptime fraction = MTBF / (MTBF + MTTR + lost-work-per-failure), where the lost work is set by the checkpoint interval. Because MTBF falls as you scale and is hard to move (the hardware is what it is), the dominant levers are MTTR (recovery speed) and lost-work-per-failure (checkpoint cadence) — both of which are software and control-plane investments, not facility ones.
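The first-order formula above makes the relative size of the levers easy to compare. In this sketch the job MTBF, MTTR, and checkpoint intervals are hypothetical inputs, and the mean lost work per failure is assumed to be half the checkpoint interval; the time spent writing checkpoints is ignored.

```python
def uptime_fraction(mtbf_h: float, mttr_h: float, lost_work_h: float) -> float:
    """First-order goodput: MTBF / (MTBF + MTTR + lost work per failure)."""
    return mtbf_h / (mtbf_h + mttr_h + lost_work_h)


def mean_lost_work_h(checkpoint_interval_h: float) -> float:
    """ASSUMPTION: failures land uniformly within a checkpoint interval."""
    return checkpoint_interval_h / 2.0


# ASSUMPTION: job MTBF of 1 h, 20-minute restart, 30-minute checkpoint interval.
baseline = uptime_fraction(1.0, 20 / 60, mean_lost_work_h(30 / 60))

# Lever 1: hardware that is 20% more reliable, everything else unchanged.
better_hardware = uptime_fraction(1.2, 20 / 60, mean_lost_work_h(30 / 60))

# Lever 2: 2-minute restart and 2-minute checkpoint interval, same hardware.
faster_recovery = uptime_fraction(1.0, 2 / 60, mean_lost_work_h(2 / 60))

for label, value in [("baseline", baseline),
                     ("better hardware", better_hardware),
                     ("faster recovery", faster_recovery)]:
    print(f"{label:>16}: {value:.3f}")
```

With these inputs the hardware improvement moves goodput by a few points, while the recovery improvement moves it by tens of points, which is the asymmetry the paragraph describes.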
This is why the canonical recommendation for a training fleet inverts the legacy data-center instinct. Spending the marginal dollar on 2N facility power to prevent a restart is largely wasted, because the job already tolerates restart via checkpointing and most interruptions are internal hardware faults the facility cannot prevent. The same dollar spent on multi-tier checkpointing (cutting lost-work and restart latency) or on a hot-spare pool (cutting MTTR) returns more goodput. The Young/Daly optimal-interval derivation that quantifies the checkpoint-cadence side of this is canonical in Chapter 9.4; the redundancy-spend tradeoff curve — where the next dollar buys the most goodput rather than the most nines — is the engine of Chapter 12.2 and is modeled quantitatively in Chapter 12.5.
Deep dive: why the optimal checkpoint interval shrinks as you scale (and what that demands of recovery)
The Young/Daly result gives the checkpoint interval that minimizes wasted work as roughly the square root of (2 × checkpoint-cost × MTBF). The intuition that matters for fleet reliability: as you scale the job, the effective MTBF collapses (a 1,024-GPU job fails every ~7.9 hours; a 100k-GPU job far more often), so the optimal interval shrinks with it. You are forced to checkpoint more frequently at exactly the scale where each checkpoint is largest and most expensive to write. Two things have to give. First, the checkpoint cost itself must drop — asynchronous and in-memory checkpointing overlap the write with computation so the interval can shrink without the overhead exploding (the <14 bytes/param and <10%-overlap targets are developed in Chapter 9.4). Second, the restart cost must drop, because at high failure rates you pay it often — which is the entire case for multi-tier checkpointing keeping a recent copy in node-local memory or NVMe.
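The square-root relationship is simple enough to evaluate directly. This sketch uses Young's first-order form, which is a reasonable approximation only when the checkpoint cost is small relative to the MTBF; the checkpoint costs and the 15-minute MTBF are hypothetical inputs.

```python
import math


def young_daly_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


mtbf_1k_gpus_s = 7.9 * 3600   # 1,024-GPU job MTTF quoted in the text
mtbf_100k_gpus_s = 15 * 60    # ASSUMPTION: hypothetical 100k-GPU job MTBF

cases = [
    ("1k GPUs, 30 s checkpoint", 30.0, mtbf_1k_gpus_s),
    ("100k GPUs, 30 s checkpoint", 30.0, mtbf_100k_gpus_s),
    ("100k GPUs, 5 s checkpoint", 5.0, mtbf_100k_gpus_s),
]
for label, cost_s, mtbf_s in cases:
    interval = young_daly_interval_s(cost_s, mtbf_s)
    print(f"{label:>28}: checkpoint every {interval / 60:.1f} min")
```

Holding the checkpoint cost fixed, the optimal interval falls from roughly twenty minutes to a few minutes as the MTBF collapses, and cutting the checkpoint cost pushes it lower still.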
The concrete target from Meta's analysis sharpens this: to hold an ETTR of about 0.9 on a hypothetical 100,000-GPU run at their RSC-2-like failure rate, both checkpoint interval and restart overhead need to be on the order of two minutes. That is a completely different operating point from the 'checkpoint every few hours, restart in twenty minutes' habit that works fine at single-cluster scale. The reliability problem at 100k GPUs is therefore not solved by any single component — it is solved by co-designing checkpoint cadence, checkpoint tiering, and the autonomous recovery loop so the whole detect→restart cycle fits inside a couple of minutes. Miss that, and goodput falls off a cliff that no amount of facility availability can arrest.
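A rough way to see why the target lands at a couple of minutes is to invert the first-order goodput formula, ETTR = MTBF / (MTBF + time lost per failure), and ask how much time each failure is allowed to cost. This is not Meta's model; it is a simplified sketch in which the 30-minute job MTBF is a hypothetical input and lost work is assumed to average half a checkpoint interval.

```python
def recovery_budget_min(mtbf_min: float, target_ettr: float) -> float:
    """Minutes each failure may cost (restart + lost work) at the target ETTR."""
    if not 0.0 < target_ettr < 1.0:
        raise ValueError("target_ettr must be strictly between 0 and 1")
    return mtbf_min * (1.0 - target_ettr) / target_ettr


def max_checkpoint_interval_min(mtbf_min: float, target_ettr: float,
                                restart_min: float) -> float:
    """Largest interval that fits the budget, assuming lost work = interval / 2."""
    return 2.0 * (recovery_budget_min(mtbf_min, target_ettr) - restart_min)


# ASSUMPTION: a hypothetical job-level MTBF of 30 minutes.
budget = recovery_budget_min(30.0, 0.9)
interval = max_checkpoint_interval_min(30.0, 0.9, restart_min=2.0)
print(f"budget per failure: {budget:.2f} min")
print(f"max checkpoint interval with a 2-minute restart: {interval:.2f} min")
```

Under these assumptions the whole budget is a little over three minutes per failure, so a twenty-minute restart overdraws it several times over before any lost work is counted.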
Inference fleets: a different reliability problem
Everything above is the training story, where one fault stalls one tightly-coupled job. An online-inference fleet inverts almost every assumption and therefore inverts the reliability strategy. Inference requests are loosely coupled and independent: a node failure does not halt the fleet, it drops the in-flight requests on that node, which a load balancer retries elsewhere. There is no checkpoint to restore and no synchronous restart — the unit of failure is a request, not a run. The reliability targets are correspondingly different: request success rate and SLO attainment (TTFT/TPOT within budget) rather than goodput-as-effective-training-time, and the redundancy posture is the opposite of training's — an always-on revenue workload justifies the 2N / Tier-IV-class facility availability and N+1 cooling that a checkpointable training job does not.
So the fault-tolerance toolkit shifts. For inference you invest in fast health-checking and load-balancer ejection (pull a sick replica out of rotation in seconds), geographic and zonal redundancy (so a rack, hall, or region failure degrades capacity rather than availability), and graceful degradation (shed or queue low-priority traffic, fall back to a smaller model, preserve the SLO for what remains) rather than checkpoint cadence and hot training spares. The KV-cache state that a failed inference node loses is generally cheap to reconstruct from the prompt, which is why inference recovery is a routing problem, not a restore problem — though prefix-cache locality (Chapter 9.7) means losing a node still costs you cache warmth and therefore some latency. The serving-engine and SLO mechanics that this reliability posture protects are owned in Chapter 10.11; the multi-region failover and DR design in Chapter 12.3; and the SLA/goodput-contract framing in Chapter 12.4.
| Dimension | Synchronous training fleet | Online inference fleet |
|---|---|---|
| Unit of failure | The whole job (one node stalls all ranks) | A single request (others unaffected) |
| Headline metric | Goodput / ETTR (effective training time) | Request success rate + SLO attainment (TTFT/TPOT) |
| Recovery primitive | Drain → restart from checkpoint | Eject replica from LB → retry request elsewhere |
| Facility redundancy | N or N+1 — checkpoint-and-resume tolerant | 2N / Tier-IV-class + N+1 cooling on standby |
| Spare strategy | Hot GPU spares / elastic reshard | Over-provisioned replicas + zonal/geo redundancy |
| Dominant lever | Checkpoint cadence + restart latency | Health-check speed + graceful degradation |
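The inference recovery primitive in the table (eject the replica, retry the request elsewhere) can be sketched without any serving framework. Everything here is hypothetical: `ReplicaPool`, `send_with_retry`, and the `flaky_call` stand-in for a real RPC are illustrative names, and a production load balancer would also probe ejected replicas and readmit them once they recover.

```python
class ReplicaPool:
    """Tracks consecutive failures and hides replicas that exceed a threshold."""

    def __init__(self, replicas, eject_after=2):
        self._fails = {r: 0 for r in replicas}
        self._eject_after = eject_after

    def healthy(self):
        return [r for r, n in self._fails.items() if n < self._eject_after]

    def report(self, replica, ok: bool):
        self._fails[replica] = 0 if ok else self._fails[replica] + 1


def send_with_retry(pool, request, call, max_attempts=3):
    """Try healthy replicas in turn; a failed attempt is retried elsewhere."""
    tried = set()
    for _ in range(max_attempts):
        candidates = [r for r in pool.healthy() if r not in tried]
        if not candidates:
            break
        replica = candidates[0]
        tried.add(replica)
        try:
            result = call(replica, request)
        except ConnectionError:
            pool.report(replica, ok=False)
            continue
        pool.report(replica, ok=True)
        return result
    raise RuntimeError("no healthy replica could serve the request")


def flaky_call(replica, request):
    """Stand-in for an RPC: replica 'r1' is down, the others answer."""
    if replica == "r1":
        raise ConnectionError("r1 unreachable")
    return f"{replica}:{request}"


pool = ReplicaPool(["r1", "r2"], eject_after=2)
first = send_with_retry(pool, "prompt-1", flaky_call)
second = send_with_retry(pool, "prompt-2", flaky_call)
print(first, second, pool.healthy())
```

Both requests succeed despite the dead replica, and after two consecutive failures the pool stops offering it. No state is restored at any point, which is the sense in which inference recovery is a routing problem.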
Deep dive: silent data corruption — the failure the recovery loop can't see
Every mechanism in this chapter assumes the fault announces itself: a node hangs, a NIC drops, a health check goes red, and the loop fires. Silent data corruption (SDC) is the failure class that breaks that assumption — a GPU that computes the wrong answer and keeps running, returning corrupted gradients that poison the model without tripping any watchdog. At fleet scale SDC is not theoretical: hyperscalers report it as a real and growing contributor, and an undetected SDC event can waste days of training before a loss spike or a downstream eval reveals that the weights are bad — a far larger goodput loss than any hard fault, because it corrupts work that looked productive.
SDC is therefore a reliability problem that the detect→drain→restart loop cannot solve on its own, because there is nothing to detect by conventional means. The answer is a separate detection program — fleet-wide hardware scanners run on idle cycles, in-production checkers that sample computations for correctness, and periodic re-validation against known-good references (Meta's Fleetscanner/Ripple/Hardware Sentinel family is the public reference design). When SDC is found, the node joins the lemon population and is quarantined like any other repeat offender, but the detection has to come from a dedicated program, not the recovery loop. The full SDC taxonomy, detection mechanisms, and fleet data are canonical in Chapter 14.3; the telemetry that surfaces it lives in Chapter 10.6.
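One building block of such a detection program is a known-answer test: run a fixed computation on the device under test and compare the result with a trusted reference. The sketch below is a toy in pure Python with hypothetical names, and it is not a description of Meta's tooling. It compares results bit for bit, whereas real GPU kernels are not always bit-reproducible, so a production checker would often need a tolerance or a deterministic kernel instead.

```python
import hashlib
import struct


def digest(values) -> str:
    """Bit-exact fingerprint of a list of floats."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def matvec(matrix, vector):
    """Reference matrix-vector product, standing in for a trusted device."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def passes_known_answer_test(device_matvec, matrix, vector, golden: str) -> bool:
    """True if the device reproduces the golden result exactly."""
    return digest(device_matvec(matrix, vector)) == golden


MATRIX = [[1.0, 2.0], [3.0, 4.0]]
VECTOR = [0.5, 0.25]
GOLDEN = digest(matvec(MATRIX, VECTOR))


def faulty_matvec(matrix, vector):
    """Simulated SDC: returns a slightly wrong answer and raises no error."""
    out = matvec(matrix, vector)
    out[0] += 1e-9
    return out


healthy_ok = passes_known_answer_test(matvec, MATRIX, VECTOR, GOLDEN)
faulty_ok = passes_known_answer_test(faulty_matvec, MATRIX, VECTOR, GOLDEN)
print(f"healthy device passes: {healthy_ok}, faulty device passes: {faulty_ok}")
```

The faulty device never hangs or raises, so a watchdog would see nothing; only the comparison against a reference exposes it.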
Anti-patterns
The same reliability mistakes recur, because each comes from optimizing one number in isolation rather than goodput end-to-end:
- Buying nines the workload does not value. Commissioning 2N / Tier-IV facility power for a checkpointable training fleet that already tolerates restart-and-resume. The capital returns more as goodput (faster checkpointing, hot spares, more GPUs) than as facility availability. This is the cleanest example of optimizing availability when the workload is paid for goodput (Chapter 12.2).
- Optimizing MTBF instead of MTTR. Pouring the reliability budget into hardware screening to push MTBF up a few percent, while restart still takes twenty minutes. In a fleet failing every few hours, halving MTTR buys far more goodput than nudging MTBF, and it is a software investment, not a procurement one.
- An autonomous loop without lemon tracking or rate limits. Auto-rebooting and re-admitting nodes with no health history and no remediation rate limiter, so a single lemon flaps the job and a correlated firmware fault triggers a fleet-wide remediation storm. The loop built to protect goodput becomes the thing that destroys it.
- Treating SDC as someone else's problem. Relying on the hang-and-restart loop to catch corruption it is structurally blind to, and discovering days of poisoned training only at the next eval. SDC needs a dedicated detection program (Chapter 14.3), not a louder watchdog.