Chapter 9.4

Checkpointing for Large-Scale Training

Checkpointing is not a backup chore — it is the goodput control knob of a training cluster: the interval, the tier, and the bandwidth you size for it decide how many GPU-hours the next failure erases, and at frontier scale a failure is never far away.

GOODPUTPOWER-BOUND

What you'll decide here

How much checkpoint state you actually carry — the ~14 bytes/parameter rule and what is in it — because that number, times your save frequency, sets the checkpoint-path bandwidth you must provision and the fabric you must isolate it on.
The checkpoint interval: the Young/Daly optimum derived from your cluster's real MTBF, save cost, and restart cost — not a round number copied from a tutorial.
The tiering architecture — synchronous vs asynchronous vs in-memory/peer, and how many storage tiers — that determines whether a failure costs you minutes or tens of minutes of replay.
The on-disk format: monolithic single-rank gather vs sharded/distributed (per-rank) checkpoints, and whether the format survives a reshard to a different parallelism layout on restart.
Whether compression or sparsification is worth the CPU and the fidelity risk, versus simply provisioning more drain bandwidth — usually the latter, except in specific MoE and optimizer-state cases.

At small scale, checkpointing is an afterthought: save the model every so often, restore if something breaks. At frontier scale it becomes one of the load-bearing decisions in the whole training system, because the arithmetic flips. A 16,384-GPU run failed roughly once every three hours in Meta's Llama 3 405B snapshot — 419 unplanned interruptions over 54 days, 78% of them hardware-caused (Chapter 10.7). A 100,000-GPU cluster pushing for a multi-month run can expect a hardware fault on a cadence measured in tens of minutes. Every one of those faults is, for a synchronous job, a full-job stall: the whole machine rewinds to the last good checkpoint and replays the work in between. The question is no longer whether to checkpoint but how often, to where, and at what bandwidth — and getting those three wrong is the difference between 96% goodput and 80%.

This chapter is the canonical home for the checkpoint math that the rest of the guide cross-references. We build it decision-by-decision: the anatomy of a checkpoint (why ~14 bytes/parameter, and what is in those bytes), the failure model that makes checkpointing a goodput problem rather than a durability one, the Young/Daly optimal-interval derivation that turns MTBF and save-cost into a frequency, the tiering spectrum from synchronous-to-object up through in-memory peer copies, the format fork between monolithic and sharded checkpoints, compression and sparsification, and finally how to size the bandwidth and isolate the incast it creates.

Checkpoint anatomy: why ~14 bytes per parameter

The first sizing number you need is how big a checkpoint is, and the answer is governed by a rule of thumb that holds remarkably well across mixed-precision LLM training: ~14 bytes of checkpoint state per model parameter (VAST Data, from a survey of 85k+ checkpoints). It is not the model weights — those are the smallest part. The breakdown, for the standard BF16-compute / FP32-master / Adam-optimizer recipe, is:

2 bytes — the BF16 (or FP16) model weights used in the forward/backward pass.
4 bytes — the FP32 master copy of the weights the optimizer actually updates (mixed precision keeps a high-precision master to avoid drift).
4 bytes — Adam's first moment (the running mean of gradients), one FP32 value per parameter.
4 bytes — Adam's second moment (the running variance), again FP32 per parameter.

That sums to 14. Gradients are usually not checkpointed — they are recomputed on restart from the restored weights and a forward/backward pass, so they cost replay time, not storage. The implication that catches teams out: a checkpoint is roughly 7x the size of the model you think you are saving. A 405B-parameter model is a ~0.8 TB weight file but a ~5.7 TB checkpoint; a 1T-parameter model is a ~14 TB checkpoint. Optimizer state, not weights, dominates the I/O — which is why "just save the weights" is the wrong mental model for resumable training, and why optimizer-state sharding (ZeRO/FSDP) is as much a checkpoint-I/O strategy as a memory strategy.

The decision this forces is upstream of storage: if you change the optimizer or the precision recipe, you change the checkpoint size and therefore the bandwidth budget. An 8-bit optimizer state halves the per-parameter cost to ~7 bytes; a memory-efficient optimizer that drops the second moment changes the arithmetic again. These are training-science choices with a direct, often-unnoticed consequence on the storage and fabric you must provision. The parameter-to-bytes mapping is the bridge between the two.

The failure model: checkpointing is a goodput problem

To size a checkpoint interval you need a failure model, and the one that matters for synchronous training is starkly simple: any single component failure kills the whole job. One GPU falling off the bus, one optical transceiver degrading, one HBM uncorrectable error — and a job spanning tens of thousands of GPUs must roll back to its last checkpoint and replay. This is the defining property of training-shaped facilities (Chapter 1.2) and the reason their redundancy posture is N or N+1 rather than 2N: the job already tolerates an interruption by design, so the rational defense is not preventing the interruption but minimizing what each interruption costs — and that is exactly what a checkpoint interval controls.

The cost of a single failure, in lost GPU-hours, has three parts: the wasted compute since the last checkpoint (on average half the interval), the detection-and-restart latency (the time to notice the fault, evict the bad node, and reload state), and the checkpoint overhead you paid continuously to have a checkpoint at all. Push the interval long and you save overhead but risk losing a lot of work per failure; push it short and you lose little work but pay overhead constantly. The optimum is where these balance — and it moves as the cluster scales, because MTBF shrinks roughly inversely with GPU count. A 512-GPU pod might see ~7 days between failures at a top-tier operator; scale that to 100k GPUs and the effective MTBF collapses toward the sub-hour regime, dragging the optimal interval down with it.

~14 bytes/param

checkpoint state per parameter (2B BF16 weights + 4B FP32 master + 4B+4B Adam moments); ~7x the weight file

2025VAST Data (Optimizing Checkpoint Bandwidth; 85k+ checkpoint survey)

1 every ~3 hr

unplanned interruptions, Llama 3 405B on 16,384 H100s (419 over 54 days; 78% hardware-caused)

2024Meta (Llama 3 paper) / Tom's Hardware

~7 days / 512 GPUs

best-in-class H100 MTBF; effective MTBF falls roughly inversely with cluster size

2025SemiAnalysis (100k H100 clusters)

50–200 GB/s

observed async checkpoint drain rates across 40 production runs; no correlation to model size

2025VAST Data (Optimizing Checkpoint Bandwidth)

10% overlap

checkpoint-vs-compute overlap target found sufficient; sets the drain-bandwidth floor

2025VAST Data (Optimizing Checkpoint Bandwidth)

+6.59% goodput

ML goodput gain from multi-tier checkpointing on a 35K-chip TPU v5p workload (~$1M/wk on 1K-VM jobs)

2025Google Cloud (multi-tier checkpointing)

30 min → <1 min

restore time, object-only vs multi-tier checkpointing across thousands of nodes

2025Google Cloud (multi-tier checkpointing)

~90% / ~96%

industry-average vs best-in-class training goodput; reliability overhead 6–21% of TCO

2025SemiAnalysis ClusterMAX / CoreWeave

The Young/Daly optimal interval, derived

The interval is not a matter of taste. There is a closed-form optimum — the Young/Daly formula, the canonical result this guide refers back to from Chapters 1.2, 10.7, and 14.4 — and it falls straight out of the failure model above. The intuition: the wasted-work cost per unit time grows with the interval (longer interval, more to replay on failure), while the checkpoint-overhead cost per unit time shrinks with the interval (fewer, more widely-spaced saves). Minimize the sum and you get a square-root law:

τ_opt ≈ √( 2 · C · MTBF )

where τ_opt is the optimal interval between checkpoints, C is the cost (in wall-clock time) of writing one checkpoint, and MTBF is the mean time between failures for the whole job. The structure is the lesson: the optimal interval scales with the square root of both the save cost and the MTBF. Halve your checkpoint write time (faster tier, more drain bandwidth) and the optimal interval shrinks only by √2 — but because you also lose less work per failure, total overhead drops faster than the interval. Double your cluster size and MTBF roughly halves, so the optimal interval shrinks by √2; the practical effect is that hyperscale runs check point far more often than intuition suggests.

The Young/Daly optimum is only as good as the MTBF you feed it, and most teams feed it a fantasy. A burning-in cluster fails 3–4x more often in its first weeks than at steady state; a run that hits a lemon node (Chapter 10.7) sees its effective MTBF crater. The right practice is to measure MTBF continuously from the fleet's own fault telemetry and re-derive τ as conditions change — early in a run, interval shorter; once burn-in clears, interval can lengthen. A static interval set at scoping time is wrong on day one and wrong again at steady state.

Deep dive: working the Young/Daly numbers for a 100k-GPU run

Take a frontier run: 1T parameters, so a ~14 TB checkpoint. Suppose the cluster's effective MTBF is 30 minutes (1,800 s) — realistic for ~100k accelerators — and you have provisioned enough drain bandwidth that an asynchronous checkpoint blocks training for only C = 20 s (the GPU-to-host-RAM copy), with the slow drain to durable storage overlapped behind continued compute.

Young/Daly gives τ_opt ≈ √(2 · 20 · 1,800) ≈ √72,000 ≈ 268 s — check point roughly every 4.5 minutes. The blocking overhead is then 20 / 268 ≈ 7.5% of wall-clock if you only had the synchronous stall — but because the 20 s is a fast host-RAM copy and the real durable write overlaps compute, the realized overhead is a small fraction of that. Now break the asynchrony: force a synchronous write of 14 TB to a parallel filesystem at, say, 1 TB/s aggregate, and C jumps to ~14 s of pure stall plus the gather — but if your format gathers to rank 0 first, C can balloon to minutes. With C = 180 s the optimum stretches to ≈√(2·180·1800) ≈ 805 s (~13 min), and now each failure costs ~6.5 minutes of average replay plus restart. The architecture decision (async + sharded vs synchronous + monolithic) moves C by an order of magnitude, and through the square root, moves both the optimal interval and the goodput it buys.

The takeaway: you do not chase a smaller interval directly — you attack C (make checkpoints cheap to write) and the interval optimization follows, while lost-work-per-failure also drops. That is why the entire tiering and format apparatus below exists: it is all in service of driving C down so the Young/Daly optimum can be small and cheap at the same time.

Synchronous vs asynchronous vs in-memory: the tiering spectrum

The single biggest lever on checkpoint cost C is where the checkpoint lands and how much of the write blocks training. There is a clear spectrum, and most large runs use several tiers at once.

Synchronous checkpointing stops training, writes the full state to durable storage, and resumes. It is simple and it is correct, but it is the most expensive: the GPUs sit idle for the entire write, so C is the full write time and it grows with model size. At frontier scale a synchronous-to-object checkpoint can stall the machine for tens of minutes — unacceptable when MTBF is also tens of minutes. Synchronous-only is now a small-cluster pattern.

Asynchronous checkpointing splits the write into a fast blocking phase and a slow overlapped phase. The blocking phase is a GPU→host-RAM (or GPU→local-NVMe) snapshot that takes seconds; training resumes immediately, and a background process drains the snapshot to durable storage while compute continues. C collapses to the snapshot time, and the durable write hides behind the next few training steps. VAST's survey of 40 production runs found a 10% compute-overlap target is sufficient, implying drain rates of 50–200 GB/s — and notably, the required drain rate did not correlate with model size, because larger models also take longer per step, leaving proportionally more time to drain. Asynchronous checkpointing is the default for any serious training run today.

In-memory / peer checkpointing goes further: it keeps the most recent checkpoint in the CPU DRAM of peer nodes rather than (or in addition to) durable storage. Because aggregate host-memory bandwidth across the cluster dwarfs any storage tier, the checkpoint can be taken every single step with near-zero throughput overhead (the GEMINI/in-memory line of work demonstrated optimal per-iteration checkpointing and >13x faster recovery). On a failure, a surviving peer holds a copy of the dead node's shard, so recovery is a memory-to-memory restore in seconds rather than a storage read. The cost is RAM: you reserve host memory across the fleet to hold redundant copies, and you need a placement strategy that guarantees the failure of any node leaves its shard recoverable from a survivor. This is the frontier-operator pattern — checkpoint frequency approaching the failure rate, so that almost no work is ever lost.

Checkpoint tiering: where it lands, what it costs, what it buys

Tier	Write target	Blocking cost C	Recovery latency	Best for
Synchronous to durable	Parallel FS / object store	Full write (minutes at scale)	Full read from durable	Small clusters; final / milestone checkpoints
Async drain to durable	Local snapshot → background drain to FS/object	Seconds (snapshot only)	Read from durable (minutes)	The default working tier for large runs
Local NVMe scratch	Node-local NVMe (per-rank shard)	Seconds (local write)	Local re-read if node survives; else next tier	Fast restart when the fault is recoverable in place
In-memory / peer copy	Peer-node host DRAM (redundant shard)	Near-zero (every-step capable)	Memory-to-memory (seconds)	Hyperscale runs where MTBF approaches step time

Blocking cost C and recovery time are order-of-magnitude for a frontier-scale run; absolute numbers depend on model size, drain bandwidth, and parallelism layout. Most large runs combine the bottom three rows into a single multi-tier policy.

Multi-tier is the answer, not a single tier

The fork is not "which tier" — it is whether you run a tiered policy at all. The winning architecture keeps a frequent in-memory/local copy for the common case (a single node fails, restore from a peer or local NVMe in seconds) and a less-frequent durable copy for the catastrophic case (a rack, a power event, or the whole job dies and must be re-read from the filesystem or object store). Google reported a +6.59% goodput gain on a 35K-chip workload and a restore time falling from ~30 minutes to under a minute by layering in-cluster RAM, local replication, and Cloud Storage rather than checkpointing only to object. The single most expensive checkpoint mistake at scale is using one tier — durable-only is too slow to write frequently, memory-only is too volatile to survive a correlated failure. Tier them.

Sharded vs monolithic: the format fork

Independent of where the checkpoint lands is the question of how it is laid out, and this is a genuine fork with a sharp scaling consequence. A monolithic checkpoint gathers the full model and optimizer state onto one rank (historically rank 0), which serializes a single large file. It is easy to reason about and easy to load anywhere, but the all-gather is a bottleneck that does not scale: as the cluster grows, more ranks funnel state through one writer, and C grows with model size while the write bandwidth stays pinned to a single node's link. Monolithic checkpoints are a small-model convenience that breaks down past a few billion parameters.

A sharded / distributed checkpoint writes each rank's own shard in parallel — every GPU drains its slice of weights and optimizer state to storage simultaneously. Aggregate write bandwidth now scales with the cluster, which is the only way to keep C small at frontier scale. This is the format behind PyTorch Distributed Checkpoint (DCP), DeepSpeed/Megatron sharded checkpoints, and the Orbax format on the JAX/TPU side. The cost is complexity: a sharded checkpoint is a directory of many files plus metadata describing the parallelism layout that produced it, and reading it back is a distributed operation, not a single load.

The consequence that bites teams hardest is resharding. A sharded checkpoint written under one parallelism layout — say TP=8, PP=16, DP=64 — is not trivially loadable under a different layout. If a failure forces you to restart on fewer nodes, or you want to resume a run with a different tensor/pipeline split, a naive sharded format requires a slow offline reshard or cannot load at all. Modern distributed-checkpoint libraries solve this with a layout-independent logical format: the checkpoint records the global tensor shapes and each shard's coordinates, so the loader can re-partition on the fly to whatever parallelism the restart cluster has. Verify your checkpoint format survives a reshard before you need it to — discovering at 3 a.m. that your only checkpoint cannot load on the 12,000 GPUs you have left (instead of the 16,000 you started with) is a goodput catastrophe that elastic, reshard-capable formats exist specifically to prevent.

Deep dive: the metadata and consistency trap in sharded checkpoints

A sharded checkpoint is only valid if every shard corresponds to the same training step. That sounds obvious until you add asynchronous draining: rank 17 finishes its drain in 4 seconds, rank 4,002 takes 9, and a failure lands in between. Now you have a directory where most shards are from step N and a few are mid-write — a torn checkpoint that will load garbage. Distributed checkpoint libraries handle this with a commit protocol: shards write to a staging path, and only once a barrier confirms every rank's shard is durably written does the checkpoint atomically become the new "latest" (typically by writing a final metadata/manifest file or renaming the directory). Recovery always reads the last committed checkpoint, never a partial one.

The operational consequences: (1) you keep at least the last two committed checkpoints, because the very latest may be the one that was mid-commit when the job died; (2) the metadata file is the single point that defines validity, so it must be written last and fsync'd; and (3) the commit barrier adds a small synchronous tail to even an asynchronous checkpoint — one more reason C is never exactly zero. Teams that skip the commit protocol and just "write all the shards" eventually load a torn checkpoint, lose the step, and learn the lesson the expensive way.

Compression, sparsification, and when they pay

Because the checkpoint is dominated by optimizer state, an obvious lever is to make it smaller — compress it or save only part of it. The honest answer for most runs is: provision more drain bandwidth before you reach for compression, because lossless compression of BF16/FP32 tensors yields modest ratios at real CPU cost, and that CPU competes with the data loader (Chapter 9.5) for the same host cores. The square-root structure of Young/Daly also means halving checkpoint size only shrinks C by a factor of two, which moves the optimal interval by √2 — a smaller win than the bandwidth and CPU spend often justify.

There are specific cases where reducing checkpoint volume does pay. Optimizer-state quantization (8-bit Adam moments) roughly halves the 8 bytes of moment state to 4, cutting the per-parameter cost from 14 toward ~10 bytes with well-characterized convergence impact — a training-science decision with a storage payoff. Sparse / incremental checkpointing snapshots only the subset of state that changed (or a rotating subset of operators) each interval rather than the full state every time, which is especially effective for Mixture-of-Experts models where only a fraction of experts are touched per step — most of the optimizer state is unchanged between intervals, so re-writing all of it is pure waste. Lines of work like MoEtion exploit exactly this to get high-frequency, low-overhead checkpoints for MoE training. The fork is clear: for dense models, buy bandwidth; for MoE and 8-bit-optimizer regimes, the structure of the state itself makes reducing checkpoint volume the better lever.

Restart, recovery, and sizing the bandwidth

Writing checkpoints is half the system; recovery is the half that determines how long a failure actually costs. A complete recovery is: detect the fault, isolate and evict the bad node (ideally swapping in a hot spare so the parallelism layout is preserved), reload state to every rank, recompute the gradients the checkpoint did not store, and resume. Detection latency is its own discipline (Chapter 10.7) — silent data corruption and slow-degrading hardware can corrupt steps for a long time before a checkpoint is even triggered. The reload itself is where checkpoint architecture pays off: a peer/in-memory restore is seconds, a local-NVMe re-read is seconds-to-a-minute, and a cold read of a multi-terabyte sharded checkpoint from the durable tier is minutes — the same 30-minutes-to-under-a-minute spread the tiering decision controls.

The bandwidth you must provision falls out of the interval and the size. The async drain bandwidth floor is set by the 10%-overlap target: drain rate ≥ checkpoint_size / (0.10 × interval). For a 5.7 TB (405B) checkpoint on a 40-minute interval, that is ~5.7 GB/s of sustained durable-write bandwidth per concurrent checkpoint — modest. But the read bandwidth for recovery is the harder number, because recovery is an incast: every rank reads its shard at once, and you want the full multi-terabyte checkpoint back into GPU memory in well under the interval so recovery is cheap. That can demand hundreds of GB/s to multiple TB/s of aggregate read bandwidth from the storage tier (Chapter 9.2, Chapter 9.3), concentrated into a burst.

And that burst is the part that breaks naively-designed clusters. A synchronized, all-ranks-at-once checkpoint write or recovery read is a textbook incast that, if it shares the back-end fabric with the training collectives, will collide with the all-reduce traffic and tank the very goodput it exists to protect. The co-design rule is to isolate checkpoint I/O from the compute fabric — a dedicated storage rail, or strict QoS separation on a converged fabric — so a checkpoint storm never starves an all-gather. This is the fabric-and-storage co-design problem treated in Chapter 8.5 and revisited for storage sizing in Chapter 9.8; the point here is that the checkpoint interval you choose creates a periodic incast whose magnitude you computed above, and the fabric must be designed to absorb it without disturbing training.

The checkpoint incast will eat your collectives if you let it

A checkpoint is a synchronized burst: thousands of ranks drain (or read) at the same instant. Put that burst on the same fabric as your all-reduce, on the same interval as your training step, and you have engineered a periodic collision that shows up as mysterious goodput loss correlated with checkpoint cadence. The failure mode is subtle because the checkpoint succeeds — it just steals bandwidth from the collectives every time it runs. Provision the checkpoint path on a dedicated storage rail or behind hard QoS, size both the drain (write) and the recovery (read) incast, and confirm under load that a checkpoint storm does not perturb step time. Checkpointing exists to protect goodput; an un-isolated checkpoint path quietly destroys it.

Putting it together: the checkpoint policy as a goodput contract

A defensible checkpoint policy is not a single setting; it is a coupled set of decisions that together define how much of a failure is recoverable. Size the state (~14 bytes/param × your optimizer recipe). Measure MTBF from live fault telemetry and re-derive the Young/Daly interval as burn-in clears. Make C small with asynchronous drains and a sharded, reshard-capable format, and add an in-memory/peer tier so the common single-node failure recovers in seconds. Keep a less-frequent durable copy for correlated failures, with a commit protocol so you never load a torn checkpoint. Then size both the drain bandwidth and the recovery incast, and isolate them from the compute fabric. Do all of that and you are operating in the 96% goodput regime; skip the tiering or mis-size the interval and you are leaving 6–16 points of goodput — and the proportional fraction of a power-bound, depreciating GPU fleet — on the floor.

This chapter is the canonical home for the checkpoint math referenced across the guide. The training archetype whose interruption-tolerance this enables is built out in Chapter 1.2; the failure detection, lemon-node eviction, and autonomous recovery that surround checkpointing are in Chapter 10.7; the operational reliability discipline that tunes the interval in production is in Chapter 14.4; and the goodput-vs-availability reframing that justifies N/N+1 over 2N for checkpointable jobs is in Chapter 12.2. The storage tiers a checkpoint lands on are engineered in Chapter 9.2 (parallel file systems) and Chapter 9.3 (NVMe and GPUDirect); the checkpoint I/O personality in the broader storage picture is framed in Chapter 9.1; isolating the checkpoint incast from training collectives is the fabric co-design problem of Chapter 8.5 and the sizing problem of Chapter 9.8; and the host-CPU contention between checkpoint drains and the data loader is in Chapter 9.5.