
Chapter 10.7

Fleet Reliability, Fault Tolerance & Autonomous Recovery

At fleet scale, failure is routine: a synchronous training job on tens of thousands of GPUs is interrupted every few hours, and a 100k-GPU run more than once an hour. Reliability therefore stops being a facility-availability number and becomes a software-and-control-plane problem. The operator that detects a fault, ejects the node, and restarts from a recent checkpoint within minutes keeps its goodput; the one that waits for a human loses it.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

Whether your reliability target is facility availability (the legacy 'nines') or goodput/ETTR — for a checkpointable training fleet the two diverge sharply, and optimizing the wrong one buys redundancy the workload does not value (canonical rethink in Chapter 12.2).
Where to spend the recovery budget along the detect → drain → diagnose → remediate → restart loop: faster detection and faster restart (multi-tier checkpointing, hot spares) almost always beat raw MTBF improvement at a given fleet size.
Whether to absorb failures with hot spares and fast restart (operationally simple, capacity tax) or with elastic/redundant training and algorithmic fault tolerance (no spare tax, but couples reliability to the training framework).
How autonomous the recovery loop is allowed to be — auto-drain and auto-restart on a confidence threshold versus human-gated remediation — and the blast-radius guardrails (rate limits, quarantine, lemon-node ejection) that keep an autonomous loop from amplifying a fault.
The checkpoint cadence and restart-overhead target your failure rate actually requires (Young/Daly math canonical in Chapter 9.4), because at 100k GPUs an ETTR of 0.9 forces both the checkpoint interval and the restart overhead down to roughly two minutes, not the 'every few hours' habit most teams ship with.

A single GPU is a remarkably reliable device. A fleet of a hundred thousand of them, lashed into one synchronous job by a non-blocking fabric, is not, and the arithmetic is unforgiving. If any one node failing kills the whole job, then to first order (assuming independent failures) the job's mean-time-to-failure is the per-node MTBF divided by the node count. Meta's Revisiting Reliability study put hard numbers on this: on their research clusters an 8-GPU job had a mean-time-to-failure of about 47.7 days, while a 1,024-GPU job failed every 7.9 hours. That is roughly 145 times worse for 128 times the GPUs, about two orders of magnitude, much as the single-point-of-failure model predicts. Extrapolate to a 100,000-GPU run and you are interrupted more than once an hour. This is the reliability problem at scale, and it is the reason fleet reliability is an engineering discipline, not a facility-uptime line item.
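The single-point-of-failure arithmetic can be sketched in a few lines. This is a minimal sketch, assuming independent failures at a constant per-GPU rate so that job MTTF scales as 1/N; the function name `scale_mttf_hours` is illustrative, and the two anchors are the Meta figures quoted in this chapter.

```python
HOURS_PER_DAY = 24.0

def scale_mttf_hours(anchor_mttf_hours: float, anchor_gpus: int, target_gpus: int) -> float:
    """Extrapolate job MTTF to another job size, assuming MTTF is proportional to 1/N."""
    return anchor_mttf_hours * anchor_gpus / target_gpus

# Anchor 1: Meta research clusters, 8-GPU jobs at about 47.7 days between failures.
small_job_mttf_h = 47.7 * HOURS_PER_DAY
mttf_1024_h = scale_mttf_hours(small_job_mttf_h, 8, 1_024)
mttf_100k_from_small_h = scale_mttf_hours(small_job_mttf_h, 8, 100_000)

# Anchor 2: Llama 3 405B, 419 unplanned interruptions in 54 days on 16,384 GPUs.
llama3_mttf_h = 54 * HOURS_PER_DAY / 419
mttf_100k_from_llama3_h = scale_mttf_hours(llama3_mttf_h, 16_384, 100_000)

print(f"1,024 GPUs: {mttf_1024_h:.1f} h predicted vs 7.9 h observed")
print(f"100,000 GPUs (8-GPU anchor): {mttf_100k_from_small_h * 60:.0f} min")
print(f"100,000 GPUs (Llama 3 anchor): {mttf_100k_from_llama3_h * 60:.0f} min")
```

Under these assumptions the 1,024-GPU prediction comes out near 9 hours against the observed 7.9, and both 100,000-GPU extrapolations land well under an hour, although the two anchors disagree by several-fold. The 1/N model gives the order of magnitude, not a forecast: real fleets see correlated failures and cluster-specific rates.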

This chapter is about the system that turns that failure rate from a goodput catastrophe into a manageable tax. It is built from three moving parts that must be designed together: the detection-to-recovery loop (how fast you notice, isolate, and restart), the fault-tolerance strategy (hot spares versus elastic/redundant training versus algorithmic resilience), and the checkpoint/restore substrate that decides how much work a failure erases. Every reliability dollar can go to facility nines, to recovery speed, or to spare capacity, and the three buy very different amounts of goodput. We name the canonical homes for the supporting math — checkpoint intervals in Chapter 9.4, the failure taxonomy and fleet AFR data in Chapter 14.3, and the availability-vs-goodput rethink in Chapter 12.2 — and concentrate here on the operational loop that ties them together.

The reliability problem at scale

Two facts collide to produce the modern training-reliability problem. First, synchronous coupling makes every node a single point of failure: in a data-/tensor-/pipeline-parallel run the job advances at the speed of its slowest rank, and a dead rank halts all of them. Second, failure rate scales with node count, because the population of things that can break grows linearly while the job's tolerance for any one of them stays at zero. The product is a job-level MTBF that collapses as you scale — the headline reason a frontier run is interrupted on the order of once an hour even when each individual node is fine for weeks.

The empirical anchors are now public and consistent. Meta's Llama 3 405B run logged 419 unplanned interruptions over 54 days on 16,384 H100s — about one every three hours — of which roughly 78% were hardware-caused and 58.7% GPU-related, yet the team still achieved over 90% effective training time through aggressive automation and only three manual interventions. SemiAnalysis's teardown of mature 100k-H100 clusters puts best-in-class MTBF at around 7 days per 512 GPUs after burn-in, with new clusters failing far more during the first three to four weeks. Alibaba's Unicron production study found a 43.4% large-job failure rate, about 37% hardware-attributed and roughly 73% recoverable via restart. The throughline: at scale, failure is not an exception to plan around — it is the steady state to engineer for.

The consequence for design is that the legacy reliability target, facility availability expressed as Uptime Institute Tier 'nines', is the wrong number for a training fleet. A Tier IV facility at 99.995% availability is down about 26 minutes a year; a 100k-GPU synchronous job loses far more than that to internal hardware faults that the facility's 2N power and cooling do nothing to prevent. The metric that governs the return on a training cluster is goodput (equivalently ETTR, effective-training-time ratio): productive GPU-time divided by wall-clock GPU-time. This is the canonical pivot of Chapter 12.2; here it is the lens through which every recovery decision is scored.

Availability is a facility number; goodput is a fleet number

The single most expensive conceptual error in AI-cluster reliability is importing the data-center-uptime mindset wholesale. Facility availability measures whether power and cooling are present; it says nothing about whether a GPU returned the right bits, whether a NIC dropped a collective, or whether a silent error corrupted a gradient. A fleet can sit inside a Tier IV hall at 99.995% facility availability and run at 70% goodput because a node fails every few hours and recovery takes twenty minutes. Goodput is what the workload is paid for; availability is an input to it, not a proxy for it. Score reliability investments — redundancy, recovery speed, spares — by the goodput they return, not the nines they add. The full reframing, including where the marginal redundancy dollar should go, is canonical in Chapter 12.2 and quantified in Chapter 12.5.

The detection-to-recovery loop

Every interruption runs through the same five-stage loop, and the time spent in each stage is what you actually control. Detect the fault; drain the affected node or rack out of the job; diagnose the root cause; remediate (reboot, reseat, RMA, or replace); and restart the job from the last good checkpoint onto healthy hardware. The cluster's goodput is set by how fast this loop closes and how often it has to run. Crucially, the failure taxonomy that the loop must classify — hard faults, transient faults, and silent data corruption — is canonical in Chapter 14.3; here we treat detection as a given input and focus on the loop's economics.

The non-obvious lever is that detection latency and restart latency dominate, not repair latency. Repair (an RMA, a reseat) happens asynchronously on a drained node while the job runs on a spare; it is off the critical path. What sits on the critical path is the time to notice the fault (a hung collective can stall a job for minutes before a watchdog fires) plus the time to load a checkpoint and re-establish the fabric. This is why the highest-leverage reliability spend is rarely 'better hardware': it is faster watchdogs, faster checkpoint loading, and a warm spare ready to slot in. Multi-tier checkpointing has cut restart from the legacy 15–30 minutes to under two minutes, and that single change can move goodput by several points at frontier scale.

The detection-to-recovery loop: stages, levers, and what they cost

Stage	Typical latency	On critical path?	Primary lever	Failure mode if neglected
Detect	Seconds to several minutes	Yes — job is stalled while undetected	Heartbeats, collective watchdogs, health checks, SDC scanners	A hung rank silently burns GPU-hours until a timeout fires
Drain / isolate	Seconds to ~1 min	Yes	Topology-aware eject; quarantine the node/rack from the scheduler	Faulty node rejoins and re-fails; flapping job
Diagnose	Minutes to hours	No — runs on drained node	Automated triage workflows; XID/SXID classification; burn-in re-test	Mis-triage RMAs healthy parts or returns a lemon to service
Remediate	Minutes (reboot) to days (RMA)	No — off-line on a spare	Reboot/reseat/reflash; RMA logistics; lemon-node ejection	Repeat-offender 'lemon' nodes silently cap fleet goodput
Restart	Under 2 min (multi-tier) to 15–30 min (storage-only)	Yes — all GPUs idle until resumed	Multi-tier / in-memory checkpoint; hot spare; fast fabric re-init	A slow restore multiplies every failure into a large goodput loss

Stage latencies are 2024–2026 practitioner ranges for frontier synchronous training; 'on critical path' indicates whether the stage stalls the running job. Figures synthesize Meta (Llama 3 / Revisiting Reliability), Google Cloud multi-tier checkpointing, and NVIDIA Mission Control.

The table is a budget allocator. The three rows marked 'on critical path' — detect, drain, restart — are where wall-clock goodput is won or lost; the two marked off-path can be slow and asynchronous as long as you have spare capacity to keep the job running while they complete. This is the structural argument for hot spares: they convert remediate from an on-path stall into an off-path background task. It is also the argument for multi-tier checkpointing: it attacks restart, the most leveraged on-path stage, by keeping a recent checkpoint in node-local memory or NVMe so a restore is a memory copy rather than a read across the storage fabric. The checkpoint cadence and tiering math that governs how much a failure erases is canonical in Chapter 9.4.
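The budget view of the table can be made concrete with a small model. The sketch below is illustrative only: the `Stage` type and every latency in it are assumed values picked from inside the table's ranges, not measurements. It sums only the stages that stall the job, and compares a storage-only restart with a multi-tier one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    seconds: float          # assumed latency for this stage
    on_critical_path: bool  # True if the running job is stalled during the stage

def stall_seconds(stages: list[Stage]) -> float:
    """Wall-clock stall per failure: only on-path stages count."""
    return sum(s.seconds for s in stages if s.on_critical_path)

def recovery_loop(restart_seconds: float) -> list[Stage]:
    # Hypothetical latencies; diagnose and remediate run on the drained node.
    return [
        Stage("detect", 120, True),
        Stage("drain", 30, True),
        Stage("diagnose", 1_800, False),
        Stage("remediate", 86_400, False),
        Stage("restart", restart_seconds, True),
    ]

storage_only_stall = stall_seconds(recovery_loop(restart_seconds=20 * 60))
multi_tier_stall = stall_seconds(recovery_loop(restart_seconds=90))

print(f"storage-only restart: {storage_only_stall / 60:.1f} min stalled per failure")
print(f"multi-tier restart:   {multi_tier_stall / 60:.1f} min stalled per failure")
```

With these assumed numbers the day-long remediation contributes nothing to the stall, while the restart stage alone accounts for most of the difference between the two cases.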

Autonomous hardware recovery: closing the loop without a human

At a fleet failing more than once an hour, a human-in-the-loop recovery process is a bottleneck — the operator becomes the MTTR. The 2025–2026 answer is to make the loop autonomous: a fleet control plane that detects a drained or unhealthy node, runs automated triage to classify the fault, attempts remediation (power-cycle, reflash, re-test) without a ticket, and only escalates to a human when it cannot resolve the fault itself. NVIDIA's Mission Control packages this for GB200/GB300 NVL72 as three coupled components — autonomous job recovery, autonomous hardware recovery, and the NVIDIA Resiliency Extension (NVRx) — running automated health checks at the tray, rack, and system level and executing break-fix workflows that open support tickets only for what cannot auto-resolve. Hyperscalers run their own equivalents (Meta's automation took Llama 3 to over 90% effective training time with only three manual interventions across 54 days), and neocloud operators differentiate on the maturity of exactly this loop.

The decision here is not whether to automate detection — everyone does — but how much authority to grant the loop to act. Full autonomy (auto-drain and auto-restart the job on a confidence threshold) maximizes goodput but can amplify a fault: a mis-classifying triage routine can eject healthy nodes, a restart storm can thrash the scheduler, and a shared firmware bug can be 'remediated' onto every node in turn. The guardrails that make autonomy safe are the same ones that make any control loop safe — rate limits on automated actions, quarantine of repeat offenders (lemon-node ejection), and a circuit breaker that hands control to a human when the action rate or failure rate spikes. The fork is real: an under-automated loop loses goodput to human latency; an over-automated loop without guardrails loses goodput to self-inflicted instability.

The lemon node and the autonomous-loop amplifier

A small fraction of nodes fail repeatedly — the 'lemons' — and they do disproportionate damage because each failure restarts the whole synchronous job. An autonomous recovery loop that reboots and re-admits a lemon without tracking its history will cheerfully feed it back into the job to fail again, turning one bad node into a flapping goodput sink. Two guardrails are non-negotiable: persistent per-node health history (so a node that has failed N times in a window is quarantined, not re-admitted) and a rate limiter on automated remediation (so a correlated fault — a bad firmware push, a fabric event — does not trigger a fleet-wide remediation storm). Without them, the loop you built to protect goodput becomes the mechanism that destroys it. Lemon detection and AFR-driven ejection thresholds are developed in Chapter 14.3 and operationalized in Chapter 14.4.
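The two guardrails can be sketched as one small decision object. This is a minimal sketch under stated assumptions: the class name `RecoveryGuardrails`, its thresholds, and the three outcome strings are hypothetical, and a production control plane would persist the per-node history rather than hold it in memory.

```python
from collections import defaultdict, deque

class RecoveryGuardrails:
    """Per-node failure history plus a fleet-wide limit on automated actions."""

    def __init__(self, max_failures_per_node=3, node_window_s=7 * 24 * 3600,
                 max_actions=5, action_window_s=600):
        self.max_failures_per_node = max_failures_per_node
        self.node_window_s = node_window_s
        self.max_actions = max_actions
        self.action_window_s = action_window_s
        self._node_failures = defaultdict(deque)  # node -> recent failure times
        self._actions = deque()                   # recent automated remediations
        self.quarantined = set()

    def decide(self, node: str, now: float) -> str:
        """Return 'quarantine', 'escalate', or 'remediate' for a failed node."""
        history = self._node_failures[node]
        history.append(now)
        while history and now - history[0] > self.node_window_s:
            history.popleft()
        # Lemon check: repeat offenders are never re-admitted automatically.
        if node in self.quarantined or len(history) >= self.max_failures_per_node:
            self.quarantined.add(node)
            return "quarantine"
        # Circuit breaker: too many automated actions means a correlated fault.
        while self._actions and now - self._actions[0] > self.action_window_s:
            self._actions.popleft()
        if len(self._actions) >= self.max_actions:
            return "escalate"
        self._actions.append(now)
        return "remediate"

guard = RecoveryGuardrails(max_failures_per_node=3, node_window_s=3600,
                           max_actions=2, action_window_s=60)
decisions = [
    guard.decide("node-1", 0),    # first failure
    guard.decide("node-1", 10),   # second failure
    guard.decide("node-1", 20),   # third failure inside the window
    guard.decide("node-2", 25),   # action budget already spent
    guard.decide("node-3", 100),  # budget has refilled
]
print(decisions)
```

The design choice worth noting is the ordering: the lemon check runs before the rate limiter, so a known repeat offender is quarantined even while the circuit breaker is open.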

Fault tolerance: hot spares vs elastic/redundant training vs algorithmic resilience

Once detection and recovery are fast, the next fork is how the job survives the failure — and there are three families, each trading capacity, framework coupling, and recovery speed differently. The choice determines both the spare-capacity tax you pay and how deeply reliability is wired into the training stack.

Hot spares + fast restart is the operationally simplest and most common posture: hold a pool of healthy GPUs idle (typically a few percent of the fleet), and on failure drain the bad node, slot in a spare, and restart the job from the last checkpoint. The cost is the idle spare capacity and the full restart latency; the virtue is that it is framework-agnostic and easy to reason about.

Elastic / redundant training lets the job continue on fewer nodes (shrinking the world size and re-sharding) or run with redundant replicas so a single failure does not require a full restart. This eliminates the spare tax and much of the restart stall, at the price of coupling reliability tightly to the training framework (the scheduler, the parallelism plan, and the collective library must all cooperate to reshard live).

Algorithmic fault tolerance goes further into the math: techniques like nonuniform/elastic tensor parallelism reduce the goodput amplification a single GPU failure causes, and redundant-computation or erasure-coded schemes let the job tolerate a fault without re-execution. This is the lowest overhead in principle, but the least mature and the most workload-specific.

These map onto the operational reliability-engineering treatment in Chapter 14.4.

Fault-tolerance strategies for synchronous training

Strategy	How it survives a failure	Spare-capacity tax	Recovery latency	Framework coupling	Best fit
Hot spares + fast restart	Swap in a healthy spare; restart from checkpoint	~2–5% of fleet held idle	Restart-bound (sub-2 min with multi-tier; else 15–30 min)	Low — scheduler-level, framework-agnostic	Default for most operators; large stable runs
Elastic training	Shrink world size / re-shard onto survivors; resume	None (no idle reserve)	Re-shard + resume; no spare provisioning wait	High — scheduler + parallelism + collectives must reshard live	Long runs where spare capacity is scarce or costly
Redundant training	Redundant replicas absorb the loss; no full restart	Replica overhead (compute, not idle)	Near-zero stall on a single failure	High — requires replicated execution plan	Highest-value runs where any stall is unacceptable
Algorithmic fault tolerance	Nonuniform/elastic TP or coded redundancy bounds the loss	Low (math, not idle capacity)	Often no restart for the tolerated fault class	Highest — baked into the parallelism/algorithm	Frontier teams co-designing model, parallelism, and resilience

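A practitioner comparison of the strategies as of 2026; elastic and redundant training, treated as one family in the text, are split into separate rows. 'Spare-capacity tax' is idle capacity reserved purely for failover; 'framework coupling' is how tightly the strategy ties into the training stack.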

The consequence to internalize: moving down this table trades operational simplicity for capacity efficiency and recovery speed, and pays for it in framework coupling. Hot spares are something an operations team can run with a stock framework; elastic and redundant training require the training stack itself to be reliability-aware, which is why they are most common at the labs that own their framework end-to-end. There is no universally right answer — the right one is a function of how scarce your spare capacity is (a power-bound fleet may not be able to afford 5% idle), how much a stall costs (a contracted run with a deadline cannot tolerate restart storms), and how much control you have over the training stack. Many fleets run a hybrid: hot spares as the baseline, elastic shrink as the fallback when the spare pool is exhausted.
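The hybrid posture described above (spares first, elastic shrink as the fallback) reduces to a short decision rule. The sketch below is hypothetical: `plan_recovery`, its parameters, and the action names are illustrative, and it assumes the training stack can actually reshard to any world size at or above `min_world_size`.

```python
def plan_recovery(failed_nodes: int, spares_available: int,
                  world_size: int, min_world_size: int) -> dict:
    """Cover failures with hot spares first; shrink only for what spares cannot cover."""
    swapped = min(failed_nodes, spares_available)
    uncovered = failed_nodes - swapped
    new_world_size = world_size - uncovered
    if new_world_size < min_world_size:
        # Too few survivors to keep a valid parallelism plan: hand off to a human.
        action = "halt_and_escalate"
    elif uncovered == 0:
        action = "swap_spares"
    else:
        action = "elastic_shrink"
    return {"action": action, "spares_used": swapped, "world_size": new_world_size}

covered = plan_recovery(failed_nodes=2, spares_available=4, world_size=512, min_world_size=448)
exhausted = plan_recovery(failed_nodes=6, spares_available=4, world_size=512, min_world_size=448)
too_many = plan_recovery(failed_nodes=80, spares_available=4, world_size=512, min_world_size=448)
print(covered, exhausted, too_many, sep="\n")
```

The floor on world size is the part that ties this rule to the framework: where it sits depends on the parallelism plan, which is the coupling cost the table describes.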

7.9 hr

mean-time-to-failure of a 1,024-GPU job vs 47.7 days for an 8-GPU job — the single-point-of-failure penalty of scale

Source: Meta, Revisiting Reliability in Large-Scale ML Clusters (arXiv 2410.21680), 2025

6.50 / 1000

failures per thousand node-days on Meta's RSC-1 cluster (11 months, ~80%+ utilization)

Source: Meta, Revisiting Reliability (arXiv 2410.21680), 2025

419 / 54 days

unplanned interruptions on 16,384 H100s during Llama 3 405B (~1 every 3 hr); 78% hardware, 58.7% GPU-related

Source: Meta (Llama 3 paper) / Tom's Hardware, 2024

~2 min

checkpoint-and-restart overhead required to hold ETTR ~0.9 on a 100,000-GPU run at RSC-2-like failure rates

Source: Meta, Revisiting Reliability (arXiv 2410.21680), 2025

~7 days

best-in-class MTBF per 512 GPUs on a mature 100k-H100 cluster (burn-in 3–4 weeks first)

Source: SemiAnalysis, 100k H100 Clusters, 2025

43.4%

large-LLM-job failure rate; ~37% hardware-attributed; ~73% recoverable via restart

Source: Alibaba Unicron production study, 2024

~90% / ~96%

goodput (effective training time): industry average / best-in-class; reliability overhead 6–21% of TCO

Source: SemiAnalysis ClusterMAX / CoreWeave, 2025

15–30 min → <2 min

training restart latency, storage-only vs multi-tier/in-memory checkpointing

Source: Google Cloud multi-tier checkpointing, 2025

Sizing the recovery: where the next reliability dollar goes

The reliability budget has three competing claims: raise MTBF (better hardware, more burn-in), shrink MTTR (faster detect/restart, hot spares), or add facility nines (2N power, redundant cooling). At frontier scale they are not equally productive. To first order, the goodput of a checkpointable job is the uptime fraction MTBF / (MTBF + MTTR + lost-work-per-failure), where the lost work is set by the checkpoint interval. Because MTBF falls as you scale and is hard to move (the hardware is what it is), the dominant levers are MTTR (recovery speed) and lost-work-per-failure (checkpoint cadence), both of which are software and control-plane investments, not facility ones.
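The first-order formula is easy to evaluate directly. The sketch below is illustrative: the 3-hour MTBF echoes the Llama 3 interruption rate, while the restart times, the checkpoint intervals, and the assumption that a failure loses half a checkpoint interval on average are inputs chosen for the example, not figures from the sources.

```python
def goodput(mtbf_h: float, mttr_h: float, lost_work_h: float) -> float:
    """First-order uptime fraction: MTBF / (MTBF + MTTR + lost work per failure)."""
    return mtbf_h / (mtbf_h + mttr_h + lost_work_h)

def expected_lost_work_h(checkpoint_interval_h: float) -> float:
    # Assumption: failures land uniformly within an interval, so half is lost on average.
    return checkpoint_interval_h / 2.0

# Baseline: fails every 3 h, 20-minute restart, checkpoint every 2 h.
baseline = goodput(3.0, 20 / 60, expected_lost_work_h(2.0))

# Option A: double MTBF, leave the recovery path untouched.
better_hardware = goodput(6.0, 20 / 60, expected_lost_work_h(2.0))

# Option B: same MTBF, 2-minute restart, checkpoint every 10 minutes.
faster_recovery = goodput(3.0, 2 / 60, expected_lost_work_h(10 / 60))

print(f"baseline:        {baseline:.3f}")
print(f"double MTBF:     {better_hardware:.3f}")
print(f"faster recovery: {faster_recovery:.3f}")
```

Under these assumed inputs, attacking restart time and checkpoint cadence returns more goodput than doubling MTBF, which is the ordering the paragraph above argues for. The model ignores the overhead of writing checkpoints more often, which Chapter 9.4 treats.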

This is why the canonical recommendation for a training fleet inverts the legacy data-center instinct. Spending the marginal dollar on 2N facility power to prevent a restart is largely wasted, because the job already tolerates restart via checkpointing and most interruptions are internal hardware faults the facility cannot prevent. The same dollar spent on multi-tier checkpointing (cutting lost-work and restart latency) or on a hot-spare pool (cutting MTTR) returns more goodput. The Young/Daly optimal-interval derivation that quantifies the checkpoint-cadence side of this is canonical in Chapter 9.4; the redundancy-spend tradeoff curve — where the next dollar buys the most goodput rather than the most nines — is the engine of Chapter 12.2 and is modeled quantitatively in Chapter 12.5.

Deep dive: why the optimal checkpoint interval shrinks as you scale (and what that demands of recovery)

The Young/Daly result gives the checkpoint interval that minimizes wasted work as roughly the square root of (2 × checkpoint-cost × MTBF). The intuition that matters for fleet reliability: as you scale the job, the effective MTBF collapses (a 1,024-GPU job fails every ~7.9 hours; a 100k-GPU job far more often), so the optimal interval shrinks with it. You are forced to checkpoint more frequently at exactly the scale where each checkpoint is largest and most expensive to write. Two things have to give. First, the checkpoint cost itself must drop — asynchronous and in-memory checkpointing overlap the write with computation so the interval can shrink without the overhead exploding (the <14 bytes/param and <10%-overlap targets are developed in Chapter 9.4). Second, the restart cost must drop, because at high failure rates you pay it often — which is the entire case for multi-tier checkpointing keeping a recent copy in node-local memory or NVMe.
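The square-root relationship can be evaluated for a few operating points. In this sketch the 7.9-hour MTBF is the 1,024-GPU figure quoted above; the 30-minute MTBF for a 100,000-GPU job and both checkpoint costs (60 seconds blocking, 5 seconds effective with asynchronous writes) are assumptions chosen for illustration.

```python
import math

def young_daly_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

MTBF_1024_S = 7.9 * 3600   # 1,024-GPU job, from the Meta study
MTBF_100K_S = 30 * 60      # assumed MTBF for a 100,000-GPU job

interval_1024 = young_daly_interval_s(60.0, MTBF_1024_S)
interval_100k_blocking = young_daly_interval_s(60.0, MTBF_100K_S)
interval_100k_async = young_daly_interval_s(5.0, MTBF_100K_S)

for label, seconds in [
    ("1,024 GPUs, 60 s checkpoint", interval_1024),
    ("100k GPUs, 60 s checkpoint", interval_100k_blocking),
    ("100k GPUs, 5 s checkpoint", interval_100k_async),
]:
    print(f"{label}: checkpoint every {seconds / 60:.1f} min")
```

With these inputs the optimal interval drops from roughly half an hour at 1,024 GPUs to a few minutes at 100,000 GPUs, and only reaches the two-minute range once the checkpoint cost itself is cut. The formula is a first-order approximation that assumes the checkpoint cost is small relative to MTBF.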

The concrete target from Meta's analysis sharpens this: to hold an ETTR of about 0.9 on a hypothetical 100,000-GPU run at their RSC-2-like failure rate, both checkpoint interval and restart overhead need to be on the order of two minutes. That is a completely different operating point from the 'checkpoint every few hours, restart in twenty minutes' habit that works fine at single-cluster scale. The reliability problem at 100k GPUs is therefore not solved by any single component — it is solved by co-designing checkpoint cadence, checkpoint tiering, and the autonomous recovery loop so the whole detect→restart cycle fits inside a couple of minutes. Miss that, and goodput falls off a cliff that no amount of facility availability can arrest.

Inference fleets: a different reliability problem

Everything above is the training story, where one fault stalls one tightly-coupled job. An online-inference fleet inverts almost every assumption and therefore inverts the reliability strategy. Inference requests are loosely coupled and independent: a node failure does not halt the fleet, it drops the in-flight requests on that node, which a load balancer retries elsewhere. There is no checkpoint to restore and no synchronous restart — the unit of failure is a request, not a run. The reliability targets are correspondingly different: request success rate and SLO attainment (TTFT/TPOT within budget) rather than goodput-as-effective-training-time, and the redundancy posture is the opposite of training's — an always-on revenue workload justifies the 2N / Tier-IV-class facility availability and N+1 cooling that a checkpointable training job does not.

So the fault-tolerance toolkit shifts. For inference you invest in fast health-checking and load-balancer ejection (pull a sick replica out of rotation in seconds), geographic and zonal redundancy (so a rack, hall, or region failure degrades capacity rather than availability), and graceful degradation (shed or queue low-priority traffic, fall back to a smaller model, preserve the SLO for what remains) rather than checkpoint cadence and hot training spares. The KV-cache state that a failed inference node loses is generally cheap to reconstruct from the prompt, which is why inference recovery is a routing problem, not a restore problem — though prefix-cache locality (Chapter 9.7) means losing a node still costs you cache warmth and therefore some latency. The serving-engine and SLO mechanics that this reliability posture protects are owned in Chapter 10.11; the multi-region failover and DR design in Chapter 12.3; and the SLA/goodput-contract framing in Chapter 12.4.
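The "routing problem, not a restore problem" point can be shown with a toy replica pool. This is a minimal sketch: `ReplicaPool`, the `eject_after` threshold, and the `flaky_send` stand-in are hypothetical, and it assumes requests are safe to retry on another replica.

```python
class ReplicaPool:
    """Eject a replica after consecutive failures; retry the request elsewhere."""

    def __init__(self, replicas, eject_after=2):
        self.failures = {name: 0 for name in replicas}
        self.eject_after = eject_after

    def healthy(self):
        return [name for name, count in self.failures.items() if count < self.eject_after]

    def record(self, replica, ok):
        # A success resets the streak; a failure extends it.
        self.failures[replica] = 0 if ok else self.failures[replica] + 1

    def route(self, request, send):
        """Try healthy replicas in order until one serves the request."""
        for replica in self.healthy():
            try:
                response = send(replica, request)
            except ConnectionError:
                self.record(replica, ok=False)
                continue
            self.record(replica, ok=True)
            return response
        raise RuntimeError("no healthy replica could serve the request")

def flaky_send(replica, request):
    # Stand-in for a network call: replica-a is down for the whole demo.
    if replica == "replica-a":
        raise ConnectionError("replica-a is down")
    return f"{replica} served {request}"

pool = ReplicaPool(["replica-a", "replica-b"], eject_after=2)
first = pool.route("req-1", flaky_send)
second = pool.route("req-2", flaky_send)
print(first, second, pool.healthy(), sep="\n")
```

Both requests succeed despite the dead replica, and after the second failure the pool stops routing to it at all. Nothing is restored; the only state lost is whatever cache warmth the ejected replica held.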

Reliability posture: training fleet vs inference fleet

Dimension	Synchronous training fleet	Online inference fleet
Unit of failure	The whole job (one node stalls all ranks)	A single request (others unaffected)
Headline metric	Goodput / ETTR (effective training time)	Request success rate + SLO attainment (TTFT/TPOT)
Recovery primitive	Drain → restart from checkpoint	Eject replica from LB → retry request elsewhere
Facility redundancy	N or N+1 — checkpoint-and-resume tolerant	2N / Tier-IV-class + N+1 cooling on standby
Spare strategy	Hot GPU spares / elastic reshard	Over-provisioned replicas + zonal/geo redundancy
Dominant lever	Checkpoint cadence + restart latency	Health-check speed + graceful degradation

The two archetypes optimize different reliability targets and therefore spend the reliability budget on different mechanisms. Both map to the workload archetypes of Chapter 1.1.

Deep dive: silent data corruption — the failure the recovery loop can't see

Every mechanism in this chapter assumes the fault announces itself: a node hangs, a NIC drops, a health check goes red, and the loop fires. Silent data corruption (SDC) is the failure class that breaks that assumption: a GPU computes the wrong answer and keeps running, returning corrupted gradients that poison the model without tripping any watchdog. At fleet scale SDC is not theoretical. Hyperscalers report it as a real and growing contributor to lost training work, and an undetected SDC event can waste days of training before a loss spike or a downstream eval reveals that the weights are bad. That is a far larger goodput loss than any hard fault, because it corrupts work that looked productive.

SDC is therefore a reliability problem that the detect→drain→restart loop cannot solve on its own, because there is nothing to detect by conventional means. The answer is a separate detection program — fleet-wide hardware scanners run on idle cycles, in-production checkers that sample computations for correctness, and periodic re-validation against known-good references (Meta's Fleetscanner/Ripple/Hardware Sentinel family is the public reference design). When SDC is found, the node joins the lemon population and is quarantined like any other repeat offender, but the detection has to come from a dedicated program, not the recovery loop. The full SDC taxonomy, detection mechanisms, and fleet data are canonical in Chapter 14.3; the telemetry that surfaces it lives in Chapter 10.6.
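The idea of re-validation against a known-good reference can be illustrated with a toy scanner. This sketch is not how any production scanner works: the workload, the `faulty_multiply` fault model, and the digest comparison are all stand-ins, assuming only that a deterministic computation on fixed inputs must reproduce a golden result bit for bit.

```python
import hashlib

MODULUS = 2**61 - 1

def reference_workload(multiply=lambda a, b: a * b) -> str:
    """Run a fixed, deterministic computation and return a digest of the result."""
    acc = 0
    for i in range(1, 1001):
        acc = (acc + multiply(i, i + 1)) % MODULUS
    return hashlib.sha256(str(acc).encode()).hexdigest()

# Assumption: the golden digest is captured once on hardware known to be good.
GOLDEN_DIGEST = reference_workload()

def scan_node(run_workload) -> bool:
    """True if the node reproduces the golden result, False if it silently diverges."""
    return run_workload() == GOLDEN_DIGEST

def faulty_multiply(a, b):
    # Simulated silent fault: one specific product comes back with a flipped low bit.
    product = a * b
    return (product ^ 1) if a == 500 else product

healthy_ok = scan_node(reference_workload)
faulty_ok = scan_node(lambda: reference_workload(faulty_multiply))
print(f"healthy node passes: {healthy_ok}, faulty node passes: {faulty_ok}")
```

The faulty node never hangs or raises an error, so no watchdog would fire; only the comparison against the reference exposes it. A real program has to choose workloads that exercise the failing units, which is where the detection mechanisms of Chapter 14.3 come in.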

Anti-patterns

The same reliability mistakes recur, because each comes from optimizing one number in isolation rather than goodput end-to-end:

Buying nines the workload does not value. Commissioning 2N / Tier-IV facility power for a checkpointable training fleet that already tolerates restart-and-resume. The capital returns more as goodput — faster checkpointing, hot spares, more GPUs — than as facility availability. The cleanest example of optimizing availability when the workload is paid for goodput (Chapter 12.2).
Optimizing MTBF instead of MTTR. Pouring the reliability budget into hardware screening to push MTBF up a few percent, while restart still takes twenty minutes. At a fleet failing every few hours, halving MTTR buys far more goodput than nudging MTBF — and it is a software investment, not a procurement one.
An autonomous loop without lemon tracking or rate limits. Auto-rebooting and re-admitting nodes with no health history and no remediation rate limiter, so a single lemon flaps the job and a correlated firmware fault triggers a fleet-wide remediation storm. The loop built to protect goodput becomes the thing that destroys it.
Treating SDC as someone else's problem. Relying on the hang-and-restart loop to catch corruption it is structurally blind to, and discovering days of poisoned training only at the next eval. SDC needs a dedicated detection program (Chapter 14.3), not a louder watchdog.

This chapter is the operational hub of the reliability stack; the depth lives in its neighbors. The checkpoint anatomy and Young/Daly optimal-interval math are canonical in Chapter 9.4. The failure taxonomy (hard / transient / silent), the SDC detection program, and the empirical fleet AFR data are canonical in Chapter 14.3, with operational tuning — lemon-node ejection, elastic training in production, MTTR decomposition — in Chapter 14.4. The telemetry that feeds detection (DCGM/NVML, XID/SXID, goodput as the headline metric) is owned by Chapter 10.6. The availability-vs-goodput rethink that this chapter takes as its scoring lens is the subject of Chapter 12.2, quantified in Chapter 12.5 and contracted in Chapter 12.4. For inference fleets, multi-region failover is in Chapter 12.3 and the serving SLOs in Chapter 10.11. The redundancy-tier choices these reliability postures imply trace back to the workload archetypes of Chapter 1.1.