Chapter 9.8

Sizing, Data Gravity & Resilience

Storage is sized to a per-GPU bandwidth budget, not a capacity number; gravity decides where the compute goes; and resilience is bought as goodput — get the ratio, the geography, or the isolation wrong and the cost shows up as idle accelerators on a depreciation clock you cannot stop.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Whether the hot tier is sized to a per-GPU bandwidth target (GB/s/GPU) or a capacity target (PB) — and the capacity-tier ratio behind it — because picking the wrong primary axis strands either accelerators or flash.
How you isolate the periodic checkpoint incast from the training collectives — dedicated storage rail, converged-with-QoS, or shared-and-pray — since this is a fabric-and-storage co-design problem, not a storage problem alone.
Whether you move data to compute or compute to data, given egress economics and gravity — the decision that quietly sets your multi-site strategy and your cloud lock-in.
How much of the storage budget buys resilience (durability, multi-tenancy QoS, security) versus raw speed — and whether that resilience spend returns more as goodput than as nines.
What you actually benchmark and accept on (MLPerf Storage, mixed-pipeline replay, checkpoint save/restore under load) before you sign — because datasheet peaks and pipeline reality diverge.

By the time you reach this chapter you have the pieces: the parallel filesystem (Chapter 9.2), the NVMe data path (Chapter 9.3), the checkpoint math (Chapter 9.4), the loader (Chapter 9.5), the object tier (Chapter 9.6), and the inference KV hierarchy (Chapter 9.7). This chapter assembles them into a sized, sited, survivable subsystem. The forks here are unusually unforgiving because storage failure is invisible in the worst way. A mis-sized fabric throws an error; a mis-sized storage tier does not fail — it simply starves the GPUs, and the loss appears as a utilization number a few points below where it should be, indistinguishable at a glance from a hundred other causes.

Four decisions structure the chapter. Sizing: the storage:compute ratio and the bandwidth budget that decide whether the accelerators stay fed. Network co-design: how the checkpoint incast is kept off the training collectives, which is the single most consequential placement decision in the storage subsystem. Data gravity: whether you move the corpus to the compute or the compute to the corpus, and the egress economics that govern multi-site strategy. Resilience and TCO: durability, multi-tenancy QoS, security, and the benchmarking-and-acceptance regime that turns a vendor claim into a contractual commitment. Each is a place where a strategist's choice and an engineer's number meet.

Sizing: the per-GPU bandwidth budget governs everything

The first and most consequential fork is the one Chapter 9.1 opened: is each tier bandwidth-sized or capacity-sized? Almost every AI hot tier is bandwidth-sized — its job is to keep accelerators fed, and the governing number is GB/s per GPU, not petabytes. Almost every AI capacity tier is capacity-sized — its job is to hold the corpus and the checkpoint history, and the governing number is cost per TB. Size the hot tier by capacity and you buy a vast array that cannot saturate the GPUs; size the capacity tier by bandwidth and you overpay for flash speed the cold corpus never touches. The error in either direction strands the asset you under-prioritized.

The hot-tier budget starts from a per-GPU read target. NVIDIA's reference floor is ~1 GB/s/GPU; vision and multimodal training references run ~4 GB/s/GPU read; checkpoint-heavy and shuffle-heavy pipelines push 4–10 GB/s/GPU. Multiply by GPU count and you have an aggregate read requirement; apply NVIDIA's write ≥ ½ read rule and you have the write floor that the checkpoint drain must clear. This is why the SuperPOD reference architectures publish bandwidth per scalable unit rather than per array: the B300 'Enhanced' tier lists ~250 GB/s read / ~124 GB/s write per SU, scaling to ~2,000 / ~992 GB/s at eight SUs. You size the tier to the SU count, not to a capacity quote.
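The arithmetic in this paragraph is small enough to sketch directly. What follows is a minimal illustration rather than a sizing tool: the function and class names are hypothetical, the 1,024-GPU cluster is an invented example, and linear scaling across scalable units is assumed from the 1-SU and 8-SU figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HotTierBudget:
    read_gb_s: float   # aggregate read requirement, GB/s
    write_gb_s: float  # write floor, GB/s


def hot_tier_budget(gpu_count: int, read_per_gpu_gb_s: float) -> HotTierBudget:
    """Aggregate read = GPUs x per-GPU target; write floor = half of read."""
    if gpu_count <= 0 or read_per_gpu_gb_s <= 0:
        raise ValueError("gpu_count and read_per_gpu_gb_s must be positive")
    read = gpu_count * read_per_gpu_gb_s
    return HotTierBudget(read_gb_s=read, write_gb_s=read / 2)


def su_tier_bandwidth(
    su_count: int,
    read_per_su_gb_s: float = 250.0,   # B300 'Enhanced' figure from the text
    write_per_su_gb_s: float = 124.0,  # B300 'Enhanced' figure from the text
) -> HotTierBudget:
    """Per-SU figures scaled linearly with SU count (an assumption)."""
    if su_count <= 0:
        raise ValueError("su_count must be positive")
    return HotTierBudget(su_count * read_per_su_gb_s, su_count * write_per_su_gb_s)


# Hypothetical cluster: 1,024 GPUs at the ~4 GB/s/GPU vision/multimodal target.
print(hot_tier_budget(gpu_count=1024, read_per_gpu_gb_s=4.0))
print(su_tier_bandwidth(su_count=8))
```

Comparing the first result against the second is the sizing question in miniature: the demand side comes from the GPU count, the supply side from the SU count, and neither mentions petabytes.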

Capacity sizing is the other axis, and it is governed by a small set of ratios rather than a single number. The corpus itself (raw plus tokenized plus versioned), the checkpoint retention depth (how many historical checkpoints you keep, times ~14 bytes/param/checkpoint), and a working-set multiplier for shuffled access together set the petabytes. Practitioner ratios for storage-to-compute capacity cluster in the single-digit-to-low-double-digit PB per thousand GPUs for large pre-training, but the honest answer is that the corpus drives it and the corpus is workload-specific — a code/text run and a frontier multimodal run differ by an order of magnitude. The discipline is to size the hot tier to bandwidth and the cold tier to corpus, and to never let a capacity datasheet talk you into buying the hot tier by the petabyte.
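A rough sketch of the capacity computation, under stated assumptions: the text names the three inputs but not how they combine, so this version applies the working-set multiplier to the corpus and adds the checkpoint history on top. The function names and the example workload are hypothetical; only the ~14 bytes/param figure comes from the text.

```python
PB = 1e15  # decimal petabyte, in bytes


def checkpoint_history_pb(
    params: float, retained: int, bytes_per_param: float = 14.0
) -> float:
    """Retained checkpoints x params x ~14 bytes/param, in PB."""
    if params < 0 or retained < 0:
        raise ValueError("params and retained must be non-negative")
    return params * bytes_per_param * retained / PB


def capacity_tier_pb(
    corpus_pb: float,
    params: float,
    retained_checkpoints: int,
    working_set_multiplier: float = 1.0,
) -> float:
    """Corpus (raw + tokenized + versioned) scaled by the working-set
    multiplier, plus checkpoint history. The combination rule is an assumption."""
    if working_set_multiplier < 1.0:
        raise ValueError("working_set_multiplier must be >= 1.0")
    return corpus_pb * working_set_multiplier + checkpoint_history_pb(
        params, retained_checkpoints
    )


# Hypothetical workload: 5 PB corpus, 70e9 parameters, 50 retained checkpoints.
total = capacity_tier_pb(
    corpus_pb=5.0, params=70e9, retained_checkpoints=50, working_set_multiplier=1.2
)
print(f"capacity tier: {total:.3f} PB")
```

In this invented case the corpus term dominates the checkpoint term by two orders of magnitude, which is the text's point that the corpus drives the petabytes.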

Sizing axis by tier and workload — what governs the spend

Tier	Primary axis	Governing number	Typical 2026 target	Failure mode if you size on the wrong axis
Local NVMe (per-node fast tier)	Bandwidth + latency	GB/s/node, checkpoint drain rate	Tens of GB/s/node; absorbs checkpoint burst	Capacity-sized: too few drives, drain bursts saturate, checkpoints stall the step
Hot parallel FS (shared)	Bandwidth	GB/s/GPU read; write ≥ ½ read	1–10 GB/s/GPU read; ~250 GB/s/SU read class	Capacity-sized: huge array, GPUs starve at the per-GPU feed target
Capacity / object (data lake)	Capacity ($/TB)	PB of corpus + checkpoint retention	Single-to-double-digit PB / 1k GPUs (corpus-driven)	Bandwidth-sized: overpay for flash speed the cold corpus never uses
Inference KV tier	Latency (tail) + capacity	p99 read latency; GB of reusable KV	Microsecond-class; Ethernet-flash for cold KV	Throughput-sized: great GB/s, but tail latency breaks the TTFT SLO

Per-GPU bandwidth targets are NVIDIA-class 2026 references; ratios are practitioner ranges and corpus-dependent. The point is which number governs, not a universal constant.

The master sizing fork: feed-rate first, capacity second

Derive the hot tier from the per-GPU feed rate, not the corpus size. Compute the aggregate read requirement (GPUs × per-GPU target), apply the write ≥ ½ read rule, and that is your hot-tier bandwidth floor — full stop. Capacity is a second, separate computation against the corpus and the retention policy, and it lands on a cheaper tier. Operators who collapse these two into one number — buying a single large array to do both jobs — overpay for capacity that is too slow and bandwidth that is too small, and they discover it only when GPU utilization sits stubbornly below target with no fabric error to explain it. Name the two budgets separately before you read a datasheet, because the datasheet will quote you the petabytes and stay quiet about the per-GPU GB/s.

Network co-design: isolating checkpoint incast from the collectives

The storage subsystem does not live on its own wire. It shares — or must be deliberately kept from sharing — the fabric that carries the training collectives, and the most damaging storage mistake in a training cluster is a placement error, not a sizing error. A checkpoint is a synchronized burst: thousands of ranks drain (write) or read at the same instant (Chapter 9.4). That is a textbook incast — many senders, converging traffic, transient congestion at the receiver's switch ports. Put that incast on the same fabric as your all-reduce, on the same cadence as your checkpoint interval, and you have engineered a periodic collision that shows up as goodput loss correlated with checkpoint timing. The checkpoint succeeds every time; it just steals bandwidth from the collectives whenever it runs.
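To put a size on that burst, here is a back-of-the-envelope sketch. It assumes the ~14 bytes/param checkpoint figure used earlier in the chapter, even sharding of the state across ranks, and ideal sustained bandwidth with no contention; the function names, the 1e12-parameter model, and the 8,192-rank count are hypothetical.

```python
def checkpoint_bytes(params: float, bytes_per_param: float = 14.0) -> float:
    """Total checkpoint state in bytes."""
    return params * bytes_per_param


def per_rank_bytes(size_bytes: float, ranks: int) -> float:
    """Share each rank writes, assuming the state is sharded evenly."""
    if ranks <= 0:
        raise ValueError("ranks must be positive")
    return size_bytes / ranks


def transfer_seconds(size_bytes: float, bandwidth_gb_s: float) -> float:
    """Ideal time to move size_bytes at a sustained aggregate rate (decimal GB)."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth_gb_s must be positive")
    return size_bytes / (bandwidth_gb_s * 1e9)


# Hypothetical run against the 8-SU figures quoted earlier
# (~992 GB/s write, ~2,000 GB/s read).
state = checkpoint_bytes(1e12)
print(f"checkpoint state: {state / 1e12:.1f} TB")
print(f"per-rank burst:   {per_rank_bytes(state, 8192) / 1e9:.2f} GB")
print(f"ideal drain:      {transfer_seconds(state, 992.0):.1f} s")
print(f"ideal restore:    {transfer_seconds(state, 2000.0):.1f} s")
```

These are best-case times. The chapter's concern is what the same seconds of traffic do to the collectives when both share a wire.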

The fork is a placement decision with three options, and it is a co-design problem spanning storage and network rather than either alone (the fabric side is engineered in Chapter 8.5). You can run a dedicated storage rail — a physically separate set of NICs and switches for storage traffic, so a checkpoint storm can never touch the back-end. You can converge storage onto the back-end fabric with strict QoS — cheaper in NICs and switches, but you are now relying on PFC/ECN priority classes and queue separation to keep the incast from starving the all-gather, and you must prove it holds under load. Or you can share with no isolation — which works until the first concurrent checkpoint-and-collective collision, at which point step time degrades mysteriously and intermittently.

Storage-fabric placement — the isolation fork

Placement	Incremental cost	Isolation guarantee	Operational risk	Best fit
Dedicated storage rail	Highest (separate NICs + switches)	Physical — checkpoint incast cannot touch collectives	Lowest; more cabling and ports to manage	Frontier synchronous training; goodput is the headline metric
Converged + strict QoS	Moderate (shared NICs, priority classes)	Logical — depends on PFC/ECN config holding under load	Mis-tuned QoS lets incast starve all-reduce; must be load-proven	Cost-sensitive clusters with disciplined fabric engineering
Converged, no isolation	Lowest	None	Highest; intermittent goodput loss on checkpoint cadence	Inference / loosely-coupled only; never frontier training

This is the recurring training-cluster decision. RoCE vs TCP and rail-vs-converged detail is covered in Chapters 9.3 and 8.5. 'Cost' is incremental NIC/switch cost over a converged baseline.

The un-isolated checkpoint path destroys the goodput it exists to protect

Checkpointing exists to protect goodput — it bounds how much work a failure can cost (Chapter 9.4). An un-isolated checkpoint path quietly does the opposite. Because the checkpoint succeeds, monitoring shows green; the only symptom is step time that is a few percent worse than it should be, every time a checkpoint runs, with no error anywhere. At frontier scale a few points of goodput is a few points of a power-bound, depreciating fleet — real money, invisible. The acceptance gate is explicit: provision the checkpoint path on a dedicated rail or behind hard QoS, size both the write drain and the recovery read incast, and confirm under concurrent load that a checkpoint storm does not perturb collective step time. If you cannot demonstrate that, you have not finished commissioning the storage subsystem.
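One way to make that last gate measurable is to compare step times recorded while a checkpoint is in flight against step times recorded while it is not. The sketch below is an assumption about how such a check could be scripted, not a method from the source: the function names are hypothetical, the median is chosen only for robustness to outliers, and the 1% tolerance is a placeholder for whatever "no measurable perturbation" means on your fabric.

```python
from statistics import median
from typing import Sequence


def checkpoint_perturbation(
    step_times_s: Sequence[float], checkpoint_active: Sequence[bool]
) -> float:
    """Relative increase in median step time while a checkpoint is in flight."""
    if len(step_times_s) != len(checkpoint_active):
        raise ValueError("inputs must be the same length")
    quiet = [t for t, busy in zip(step_times_s, checkpoint_active) if not busy]
    noisy = [t for t, busy in zip(step_times_s, checkpoint_active) if busy]
    if not quiet or not noisy:
        raise ValueError("need samples both with and without a checkpoint")
    return median(noisy) / median(quiet) - 1.0


def passes_isolation_gate(
    step_times_s: Sequence[float],
    checkpoint_active: Sequence[bool],
    tolerance: float = 0.01,  # hypothetical threshold
) -> bool:
    return checkpoint_perturbation(step_times_s, checkpoint_active) <= tolerance


# Invented samples: two steps overlap a checkpoint and run a few percent slow.
steps = [1.00, 1.00, 1.04, 1.00, 1.05, 1.00]
flags = [False, False, True, False, True, False]
print(f"perturbation: {checkpoint_perturbation(steps, flags):.1%}")
print("gate passed:", passes_isolation_gate(steps, flags))
```

The invented samples show the failure the callout describes: every checkpoint succeeds, nothing errors, and the gate still fails.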

Data gravity: move the data, or move the compute?

Data has gravity: the larger and more active a dataset, the harder and more expensive it is to move, and the more it pulls services and compute toward it. For an AI program this stops being a metaphor at the petabyte scale and becomes a hard economic constraint, because the cloud bill for moving data (egress) is asymmetric by design. Ingest is free; egress is metered. The hyperscalers price internet egress at roughly $0.087–$0.12/GB at the first tier in 2026 (Azure ~$0.087, AWS ~$0.09, GCP Premium ~$0.12), with cross-region and cross-AZ transfer adding their own per-GB charges on top. At petabyte scale these are not rounding errors: moving 1 PB (10^6 GB) out at ~$0.09/GB costs about $90,000, and a standard AI workload with weekly retraining can see egress reach 70–80% of the total cloud storage bill (EgressCost.com, 2026).
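The egress arithmetic can be reproduced in a few lines. This sketch applies the first-tier rates quoted above as a flat per-GB price, which is a simplification, and uses decimal units (1 PB = 10^6 GB). The function names and the weekly 1 PB transfer are hypothetical.

```python
GB_PER_PB = 1_000_000  # decimal units

# First-tier internet egress rates quoted in the text (USD per GB, 2026).
FIRST_TIER_USD_PER_GB = {"Azure": 0.087, "AWS": 0.09, "GCP Premium": 0.12}


def egress_cost_usd(data_pb: float, usd_per_gb: float) -> float:
    """One-time cost of moving data_pb out at a flat per-GB rate."""
    if data_pb < 0 or usd_per_gb < 0:
        raise ValueError("data_pb and usd_per_gb must be non-negative")
    return data_pb * GB_PER_PB * usd_per_gb


def recurring_egress_usd(data_pb: float, usd_per_gb: float, transfers: int) -> float:
    """Cost of repeating the same transfer a number of times."""
    if transfers < 0:
        raise ValueError("transfers must be non-negative")
    return egress_cost_usd(data_pb, usd_per_gb) * transfers


for provider, rate in FIRST_TIER_USD_PER_GB.items():
    print(f"{provider}: ${egress_cost_usd(1.0, rate):,.0f} per PB")

# Hypothetical: the same 1 PB streamed out weekly for a year at the AWS rate.
print(f"weekly for a year: ${recurring_egress_usd(1.0, 0.09, 52):,.0f}")
```

The one-time figure is a line item; the recurring figure is what makes access frequency a first-order input to the decision.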

That asymmetry forces the central gravity fork: move the data to the compute, or move the compute to the data? Move-data-to-compute is the default and is fine while the data is small or born where the GPUs are. But once the corpus is large, born in one place (a customer's region, a regulated jurisdiction, an on-prem lake), and accessed repeatedly, gravity flips the economics: it is cheaper and faster to bring the accelerators to the data than to repeatedly pay egress to stream the data to a distant cluster. This is why frontier operators co-locate the prep supercomputer (Chapter 9.9) with the corpus, why regulated workloads pin compute to the jurisdiction where the data must legally remain (Chapter 10.10), and why multi-site strategy is downstream of gravity rather than the other way around.

The egress structure also creates a deliberate lock-in: because leaving costs money the incumbent keeps, gravity is a moat. The 2026 wrinkle is regulatory — under the EU Data Act the major clouds now waive egress for customers who are leaving the platform, which dents the moat for exits but not for the day-to-day cross-region streaming that dominates an active AI pipeline. The strategic move is to design the data's birthplace and residence deliberately: land the corpus where the compute will live, replicate selectively rather than stream repeatedly, and treat every cross-region copy as a recurring egress liability, not a one-time cost.

Data gravity — when to move data vs move compute

Condition	Move data to compute	Move compute to data	Why
Small corpus, born near GPUs	Yes (default)	—	Egress is negligible; gravity is weak
Large corpus, accessed once	Maybe (one-time egress)	Maybe	Single move may beat standing up remote compute
Large corpus, accessed repeatedly	—	Yes	Repeated egress dominates; co-locate compute with the lake
Data residency / sovereignty constraint	—	Yes (mandatory)	Data legally cannot leave the jurisdiction (Ch. 10.10)
On-prem lake, cloud burst desired	Selective replication	Hybrid	Replicate the hot working set; keep cold corpus put

2026 hyperscaler first-tier internet egress ~$0.087–$0.12/GB; cross-region/cross-AZ extra. The fork is governed by corpus size, access frequency, and residency, not by habit.
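The table above can be read as a small decision function. The sketch below encodes its rows; the names are hypothetical, the inputs are left as booleans because the text gives no numeric threshold for "large" or "repeated", and the order in which the conditions are checked (residency first, then the hybrid case) is an assumption.

```python
from enum import Enum


class Placement(Enum):
    MOVE_DATA_TO_COMPUTE = "move data to compute"
    MOVE_COMPUTE_TO_DATA = "move compute to data"
    EITHER = "either: compare one-time egress with standing up remote compute"
    HYBRID = "replicate the hot working set; keep the cold corpus in place"


def gravity_decision(
    large_corpus: bool,
    repeated_access: bool,
    residency_constrained: bool = False,
    on_prem_cloud_burst: bool = False,
) -> Placement:
    """Map the table's conditions to a placement."""
    if residency_constrained:
        # Mandatory: the data legally cannot leave the jurisdiction.
        return Placement.MOVE_COMPUTE_TO_DATA
    if on_prem_cloud_burst:
        return Placement.HYBRID
    if not large_corpus:
        # Egress is negligible; gravity is weak.
        return Placement.MOVE_DATA_TO_COMPUTE
    if repeated_access:
        # Repeated egress dominates; co-locate compute with the lake.
        return Placement.MOVE_COMPUTE_TO_DATA
    return Placement.EITHER


print(gravity_decision(large_corpus=True, repeated_access=True).value)
print(gravity_decision(large_corpus=True, repeated_access=False).value)
```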

Resilience, multi-tenancy, QoS & security

Resilience in an AI storage subsystem is not the same problem as resilience in enterprise IT, because the workload values a different thing. Traditional storage optimizes durability and availability — never lose a byte, never be unreachable. An AI training storage tier optimizes goodput: keep the GPUs fed and let a failure cost as few GPU-hours as possible (Chapter 12.2). The two diverge sharply on the hot tier. A scratch/checkpoint hot tier does not need eleven-nines durability — its contents are reproducible from the last durable checkpoint and the corpus — so spending on heavy erasure coding and cross-site replication there buys durability the workload does not value, at the cost of the write bandwidth it does. The capacity tier is the inverse: the corpus and the canonical checkpoint history are the irreplaceable assets, and that is where durability spend belongs (Chapter 9.6).

Multi-tenancy and QoS is the next axis, and it is where the shared storage tier earns or loses its keep. A storage fabric serving many tenants or many jobs is a noisy-neighbor problem in slow motion: one tenant's metadata storm or checkpoint burst can starve another's ingestion reads, and the victim sees it as unexplained GPU idle. The controls are tenant-level bandwidth and IOPS quotas, metadata-rate fairness, and priority classes that protect the latency-sensitive (KV, ingestion) from the bursty (checkpoint). This is the storage-tier mirror of the compute-side sharing spectrum in Chapter 10.3 and the congestion engineering in Chapter 8.6. Security closes the set: encryption at rest and in flight, tenant isolation strong enough that one tenant cannot read another's corpus or checkpoints, and — for confidential workloads — attestation-gated access consistent with the TEE model in Chapter 11.5. Weights and training data are among the most valuable and most regulated assets in the building; the storage tier is where they sit at rest.

1–10 GB/s/GPU

hot-tier read target: ~1 GB/s/GPU floor, ~4 GB/s/GPU vision and multimodal, 4–10 GB/s/GPU checkpoint-heavy and shuffle-heavy

2025 · NVIDIA DGX SuperPOD storage architecture; Introl

250 / 124 GB/s

per scalable unit read / write, B300 SuperPOD 'Enhanced' tier (8-SU: 2,000 / 992 GB/s); write ≥ ½ read rule

2025 · NVIDIA DGX B300 SuperPOD Storage Architecture

~14 bytes/param

checkpoint state per parameter; the incast you must isolate from the collectives is sized from this × rank count

2025 · VAST Data (85k+ checkpoint survey); NVIDIA guidance

$0.087–0.12/GB

first-tier internet egress (Azure ~$0.087, AWS ~$0.09, GCP Premium ~$0.12); cross-region/AZ extra; ~$90k to move 1 PB

2026 · EgressCost.com / SpendArk egress comparison

70–80%

egress share of total cloud storage bill for a standard weekly-retrain AI workload (74% AWS / 78% Azure / 79% GCP)

2026 · EgressCost.com

>200 results / 26 orgs

MLPerf Storage v2.0 submissions across 7 countries; v2.0 added real-world checkpoint save/restore tests

2025 · MLCommons (MLPerf Storage v2.0)

~2x accelerators

accelerators served per storage system in MLPerf Storage v2.0 vs v1.0 — storage scaling tracking compute scaling

2025 · MLCommons (MLPerf Storage v2.0)

~7 days / 512 GPUs

best-in-class MTBF — sets the failure cadence the checkpoint interval and storage resilience are sized against

2025 · SemiAnalysis (100k H100 clusters)

TCO: what storage actually costs the program

Storage is a small line on the bill of materials and a large lever on the outcome, which is exactly the asymmetry that gets it under-funded. The capex is real but bounded — an all-flash parallel filesystem sized to the per-GPU feed of a large cluster is a single-digit percentage of the GPU capex it serves. The consequential cost is not the storage; it is the GPUs the storage strands. A hot tier sized one notch too small to keep a fleet fed, or a checkpoint path that perturbs collectives a few percent, costs goodput on the entire accelerator fleet — and on a power-bound, depreciating fleet, every point of goodput is a proportional fraction of the most expensive asset in the building. The TCO calculation that matters is not the price of the array; it is the price of the array plus the GPU-hours a wrong array burns.
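That closing sentence is a formula, and a minimal sketch makes the asymmetry visible. Every number in the example is invented for illustration (array price, fleet size, GPU-hour cost, goodput loss), and the function names are hypothetical; only the structure, array price plus the GPU-hours a wrong array burns, comes from the text.

```python
def stranded_gpu_cost_usd(
    gpu_count: int, hours: float, usd_per_gpu_hour: float, goodput_loss: float
) -> float:
    """Value of GPU-hours lost to a given fractional goodput loss."""
    if not 0.0 <= goodput_loss <= 1.0:
        raise ValueError("goodput_loss must be a fraction between 0 and 1")
    return gpu_count * hours * usd_per_gpu_hour * goodput_loss


def effective_storage_tco_usd(
    array_cost_usd: float,
    gpu_count: int,
    hours: float,
    usd_per_gpu_hour: float,
    goodput_loss: float,
) -> float:
    """Array price plus the GPU-hours the array's shortfall burns."""
    return array_cost_usd + stranded_gpu_cost_usd(
        gpu_count, hours, usd_per_gpu_hour, goodput_loss
    )


# Invented comparison over one year (8,760 hours) on a 10,000-GPU fleet:
# a cheaper array that costs 3% goodput vs a pricier one that costs none.
fleet = dict(gpu_count=10_000, hours=8_760, usd_per_gpu_hour=2.0)
cheap = effective_storage_tco_usd(20e6, goodput_loss=0.03, **fleet)
sized = effective_storage_tco_usd(24e6, goodput_loss=0.0, **fleet)
print(f"undersized array: ${cheap:,.0f}")
print(f"correctly sized:  ${sized:,.0f}")
```

With these invented inputs the cheaper array is the more expensive choice once the stranded GPU-hours are counted, which is the comparison the paragraph argues should drive the purchase.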

The recurring costs split by tier and by venue. On-prem, the hot tier is dominated by flash media and the parallel-FS software/support; the capacity tier by $/TB and the power-and-space of dense flash or remaining HDD. In the cloud, the headline storage rate is rarely the largest number — egress and cross-region transfer are, as the gravity section showed, which is why a cloud storage TCO that ignores data movement is fiction. The honest model prices three things together: the media, the bandwidth (and the fabric it rides), and the movement. Optimize any one in isolation and the other two punish you.

Deep dive: benchmarking and acceptance — MLPerf Storage and the mixed-pipeline replay

The single most common storage-procurement error is benchmarking the wrong primitive (Chapter 9.1). A vendor quoting 10 TB/s of sequential read has told you nothing about whether their metadata service survives a 50-million-file dataset, whether their write path sustains a checkpoint drain, or whether p99 read latency holds under a KV-offload load. Datasheet peaks and mixed-pipeline reality diverge, and that divergence is precisely why MLPerf Storage exists. The v2.0 round (2025) drew >200 results from 26 organizations across seven countries, and — critically for this chapter — added real-world checkpoint save/restore tests, because the benchmark community recognized that the checkpoint incast is a first-class storage workload, not an afterthought. The headline finding was that tested systems served roughly twice the accelerators of the v1.0 round: storage scaling is tracking compute scaling, but only on systems that were actually designed for it.

MLPerf is the neutral reference, but it is not your acceptance test. Acceptance is a replay of your pipeline: your file-size distribution, your shuffle pattern, your per-GPU feed target sustained across the real GPU count, and your checkpoint cadence run concurrently with synthetic collective traffic to prove the isolation holds. The gates are explicit and should be contractual: (1) sustained per-GPU read at the design target across the full fleet, not a sub-cluster; (2) write drain that clears a real checkpoint within the overlap budget; (3) recovery read incast that reloads a multi-terabyte checkpoint in well under the checkpoint interval; (4) p99 latency under mixed load for any KV/inference tier; and (5) no measurable perturbation of collective step time when checkpoints fire. A system that passes MLPerf and fails (5) on your fabric is a system you have not finished commissioning. See Chapter 13.5 for commissioning and acceptance discipline.
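The five gates lend themselves to a mechanical checklist. The sketch below is one hypothetical encoding: all field names are invented, and two thresholds are assumptions standing in for the text's qualitative wording (`restore_fraction` for "well under the checkpoint interval", `perturbation_tolerance` for "no measurable perturbation").

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AcceptanceRun:
    """Measurements from one pipeline replay (hypothetical field names)."""
    read_per_gpu_gb_s: float
    read_target_gb_s: float
    gpus_tested: int
    gpus_in_fleet: int
    drain_s: float
    overlap_budget_s: float
    restore_s: float
    checkpoint_interval_s: float
    step_time_perturbation: float
    kv_p99_us: Optional[float] = None         # None if there is no KV tier
    kv_p99_target_us: Optional[float] = None


def evaluate_gates(
    run: AcceptanceRun,
    restore_fraction: float = 0.5,         # assumed meaning of "well under"
    perturbation_tolerance: float = 0.01,  # assumed meaning of "no measurable"
) -> Dict[str, bool]:
    """Return a pass/fail flag for each of the five gates."""
    has_kv = run.kv_p99_us is not None and run.kv_p99_target_us is not None
    return {
        "1_sustained_read_full_fleet": (
            run.gpus_tested >= run.gpus_in_fleet
            and run.read_per_gpu_gb_s >= run.read_target_gb_s
        ),
        "2_write_drain_in_overlap_budget": run.drain_s <= run.overlap_budget_s,
        "3_restore_well_under_interval": (
            run.restore_s <= restore_fraction * run.checkpoint_interval_s
        ),
        "4_kv_p99_under_mixed_load": (
            not has_kv or run.kv_p99_us <= run.kv_p99_target_us
        ),
        "5_no_collective_perturbation": (
            run.step_time_perturbation <= perturbation_tolerance
        ),
    }


# Invented run: the target was met, but only on half the fleet.
run = AcceptanceRun(4.1, 4.0, 512, 1024, 20.0, 30.0, 60.0, 1800.0, 0.002)
for gate, ok in evaluate_gates(run).items():
    print(f"{gate}: {'PASS' if ok else 'FAIL'}")
```

The invented run fails gate 1 despite exceeding the per-GPU target, because the text requires the full fleet rather than a sub-cluster.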

Deep dive: multi-site strategy as a consequence of gravity

Multi-site is usually framed as a resilience or capacity decision; for AI it is more often a gravity decision, and getting the causality right changes the design. The three drivers that force more than one site are (1) power — a single interconnection cannot energize the fleet, so capacity spills to a second campus (Chapter 3.4); (2) residency — the data legally cannot leave a jurisdiction, so compute must follow it (Chapter 10.10); and (3) proximity — inference must sit near users (Chapter 1.3). In every case the storage question is the same: where does the corpus live, and what gets replicated versus streamed?

The expensive mistake is treating multi-site as symmetric replication of everything. Egress economics forbid it: replicating a petabyte-scale corpus to every site, and re-replicating on every update, is a recurring egress bill that dwarfs the storage. The disciplined pattern is asymmetric — a canonical corpus home (where prep and the bulk of training run), selective replication of only the hot working set to satellite sites, and checkpoints written locally with a less-frequent durable copy to the canonical home for correlated-failure protection (Chapter 9.4). Cross-site failover for training is rarely worth the bandwidth; cross-site failover for inference often is, because inference is loosely coupled and latency-driven (Chapter 12.3). Decide which workload actually needs geographic redundancy before you pay to replicate the data that feeds it.
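A rough comparison of the two replication patterns, as a sketch under simplifying assumptions: each satellite receives one initial copy plus a changed fraction of the replicated data on every update, and all transfer is billed at a single flat per-GB rate. The function name and every input value are hypothetical, including the $0.02/GB transfer rate, which is a placeholder and not a figure from the text.

```python
GB_PER_PB = 1_000_000  # decimal units


def annual_replication_usd(
    replicated_pb: float,
    satellite_sites: int,
    updates_per_year: int,
    changed_fraction: float,
    usd_per_gb: float,
) -> float:
    """Initial copy plus per-update re-replication to every satellite site."""
    if not 0.0 <= changed_fraction <= 1.0:
        raise ValueError("changed_fraction must be between 0 and 1")
    per_site_gb = replicated_pb * GB_PER_PB * (1 + updates_per_year * changed_fraction)
    return satellite_sites * per_site_gb * usd_per_gb


# Invented scenario: 5 PB corpus, 3 satellites, weekly updates touching 5%.
common = dict(satellite_sites=3, updates_per_year=52,
              changed_fraction=0.05, usd_per_gb=0.02)
symmetric = annual_replication_usd(replicated_pb=5.0, **common)
# Asymmetric: replicate only a hot working set assumed to be 10% of the corpus.
asymmetric = annual_replication_usd(replicated_pb=0.5, **common)
print(f"symmetric:  ${symmetric:,.0f} per year")
print(f"asymmetric: ${asymmetric:,.0f} per year")
```

In this model the bill scales linearly with the replicated fraction, so the saving from the asymmetric pattern is simply the ratio of corpus to hot working set.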

Where this is heading (2026 forward pointer)

Four trends are reshaping the sizing-and-resilience calculus, and each is treated in the consolidated roadmap (Chapter 16.2) rather than here. GPU- and DPU-initiated I/O moves the data path off the host CPU entirely — the accelerator (or the BlueField-class DPU) issues storage I/O directly, collapsing a CPU bottleneck that has capped the per-GPU feed and shifting where the storage rail terminates (Chapter 9.3). All-flash everywhere — QLC-backed object and capacity tiers displacing HDD — changes the capacity-tier TCO from a $/TB-and-spindle calculation to a $/TB-and-watt one, and narrows the bandwidth gap between hot and cold tiers. File/object convergence erodes the hard line between the parallel-FS hot tier and the object capacity tier, letting a single namespace span both and simplifying placement. And deeper CXL tiering (Chapter 9.7) inserts a memory-class tier between DRAM and NVMe that reshapes the inference KV hierarchy and, increasingly, the training scratch tier. None of these repeals the chapter's logic: you still size to a feed rate, still isolate the incast, still pay for gravity. They change the numbers, not the forks.

This chapter assembles the Part 9 components into a sized, sited, survivable subsystem. The per-GPU bandwidth framing and the bandwidth-vs-capacity fork originate in Chapter 9.1; the parallel filesystem that serves the hot tier is Chapter 9.2; the CPU-bypass data path and storage-rail placement are Chapter 9.3; the checkpoint math that sizes the incast is canonical in Chapter 9.4; the loader path is in Chapter 9.5; the object/capacity tier in Chapter 9.6; the inference KV hierarchy in Chapter 9.7; and the prep supercomputer that gravity co-locates with the corpus in Chapter 9.9. The fabric isolation that keeps checkpoint incast off the collectives is engineered in Chapter 8.5 and Chapter 8.6; multi-tenancy and QoS mirror the compute-side spectrum in Chapter 10.3; storage security and confidential access tie to Chapter 11.5; the goodput-vs-availability reframing that makes resilience a GPU-efficiency problem is Chapter 12.2; geographic failover is Chapter 12.3; acceptance discipline is Chapter 13.5; the energy-supply driver of multi-site is Chapter 3.4; the legal-residency driver is Chapter 10.10; and the 2026 storage roadmap is consolidated in Chapter 16.2.