Chapter 13.9

Cluster-Scale Benchmarking, Reference Training & Storage/Scheduler Validation

The cluster is not accepted when every component passes its own test — it is accepted when ten thousand GPUs, a non-blocking fabric, a parallel filesystem, and a gang-scheduler together hold a measured goodput number under a real training load, and that number, not a sum of green checkmarks, is what you sign.

GOODPUTPOWER-BOUNDDENSITY-RAMP

What you'll decide here

What busbw floor (per collective, per message size) the back-end fabric must clear before the cluster is declared performant — and what you do with the 5-15% of nodes that pass node-level burn-in but drag the collective.
Whether acceptance is gated on a synthetic benchmark suite (NCCL/OSU/MLPerf) alone, or on a reference/proxy training run that exercises storage, scheduler, and fabric together at scale for 24-72 hours.
The contractual goodput SLA — the single number (effective-training-time fraction) the operator commits to — and the badput taxonomy that decomposes every percentage point you lose below it.
Where storage acceptance sets its bar: sustained checkpoint write bandwidth (the failure-recovery path), data-loader read throughput (the steady-state path), and the metadata/small-file rate that LOSF workloads quietly destroy.
Whether the scheduler is validated for gang/topology-aware placement and preemption-without-corruption before handover — because a scheduler that fragments NVLink domains strands bandwidth no benchmark will catch.

By the time a cluster reaches this chapter it has been heavily tested in pieces. The power chain demonstrated its failure modes under load (Chapter 13.6), the fabric was commissioned link-by-link with point-to-point bandwidth and PTP time-sync gates (Chapter 13.7), and every node was burned in, diagnosed, and screened for silent data corruption over a 72-168 hour soak (Chapter 13.8). All of it passed. And none of it tells you whether the machine you actually bought — a single, tightly-coupled supercomputer — works. That is the gap this chapter closes: cluster-scale validation, where the unit under test is the whole, and the acceptance criterion is no longer "does each part meet spec" but "does the assembled system deliver the goodput the contract assumes."

The discipline here is to validate in three widening rings, and to refuse to advance a ring until the prior one clears. Ring one is synthetic collective benchmarking — NCCL and OSU at scale — which isolates the fabric and the communication libraries from any application noise and gives you a hard, repeatable busbw number to gate on. Ring two is subsystem acceptance — storage (checkpoint write, data-loader read, metadata) and the scheduler/orchestrator (gang placement, topology-awareness, preemption) — each validated against its own bandwidth and correctness bar. Ring three is a reference (proxy) training run that fuses all of it: a real model, real optimizer state, real checkpoints to real storage, scheduled the way production will schedule, run long enough to surface stragglers, thermal drift, and the slow leaks that a five-minute benchmark never sees. The number that comes out of ring three — measured goodput — is the number you put in the SLA. Everything before it is instrumentation.

Ring one: NCCL collective benchmarking and the busbw gate

The fabric was already proven point-to-point in Chapter 13.7 — every link runs at line rate, every cable seated, no errored symbols. That proves the wires. It does not prove the collective, and the collective is what training actually runs: all-reduce, all-gather, reduce-scatter, and broadcast across the full job on every step. A fabric that is flawless link-by-link can still collapse on an all-reduce because of a single mis-seated transceiver three hops away, a routing imbalance that congests one rail, an adaptive-routing misconfiguration, or a SHARP/NVLS offload that failed to engage. The collective benchmark is the only test that exercises the fabric the way the workload will.

The canonical tool is the NCCL tests suite (all_reduce_perf, all_gather_perf, and relatives), run as a multi-node MPI/SLURM job across the full cluster and a sweep of message sizes from a few KB to multiple GB. The metric that matters is bus bandwidth (busbw) — not algorithm bandwidth — because busbw normalizes for the collective's data-movement factor and is directly comparable to the hardware's theoretical peak. You read it two ways: the plateau busbw at large message sizes (does the fabric reach its bandwidth ceiling) and the small-message latency floor (does the collective's fixed cost stay bounded, which governs strong-scaling efficiency). On a healthy GB200 NVL72, in-domain all-reduce busbw runs roughly 870-928 GB/s across large buffers against the 900 GB/s/GPU NVLink5 unidirectional ceiling (NCCL tests on GB200 NVL72, 2025); the scale-out, multi-rack number over InfiniBand or Spectrum-X is the gate that matters for the full job, and it is set as a percentage of the in-domain figure.

Set the busbw gate as a floor, then hunt the slow nodes

The acceptance gate is not "the benchmark ran." It is a numeric floor: every full-scale collective must clear, say, ≥90% of the reference busbw for that topology, message size, and collective, with no individual rail or sub-domain more than a few percent below the mean. The consequence of a soft gate is the most expensive failure mode in the building — a cluster that passes acceptance, enters production, and then runs every training job at 85% of the fabric it paid for because nobody isolated the 3% of nodes that drag the all-reduce. Run the sweep, then run it again with nodes systematically excluded (the classic ring-bisection hunt) to localize stragglers. A node that burned in clean (Chapter 13.8) but pulls the collective down is a real defect — a marginal optic, a PCIe link negotiating down, a NIC on the wrong NUMA node — and it must be fixed before sign-off, not discovered in week three of a customer's run.

Two companion suites round out ring one. OSU Micro-Benchmarks (osu_bw, osu_latency, osu_allreduce) give a vendor-neutral, MPI-level cross-check on NCCL's numbers — useful precisely because they are a different code path; if NCCL looks healthy but OSU does not, the problem is in the collective library configuration, not the fabric. MLPerf Training (and, for storage, MLPerf Storage, below) provides an industry-comparable, full-workload reference if the operator wants a number that benchmarks against published results rather than only against the cluster's own reference. The fork is whether MLPerf is worth the engineering cost: it is heavyweight to run correctly, but for a neocloud selling capacity, a clean MLPerf submission is a marketing-grade external attestation that an internal NCCL number is not. → external-rating context in Chapter 12.2.

Cluster-scale validation rings — what each tests, the tool, and the gate

Ring / target	Primary tool	Metric gated on	Acceptance gate	What it catches that lower rings miss
Ring 1 — fabric collective	NCCL tests; OSU; (opt.) MLPerf Training	Bus bandwidth (busbw) + small-msg latency	≥90% of reference busbw at scale; no rail >few % low	Routing imbalance, congestion, failed SHARP/NVLS, straggler nodes
Ring 2 — storage (write path)	MLPerf Storage (checkpoint); fio; mdtest	Sustained checkpoint write GB/s; drain time	Write ≥ ½ read; checkpoint stall <10% of step time	Aggregate write collapse, metadata bottleneck, LOSF
Ring 2 — storage (read path)	MLPerf Storage (training); fio; loader replay	Sustained data-loader read GB/s; per-GPU target	Meets per-GPU read target at full node count	Data-loader starvation, cache thrash, small-file IOPS wall
Ring 2 — scheduler / orchestration	SLURM block / topology.yaml; KAI/Run:ai; gang tests	Gang placement correctness; topology fidelity	Jobs land NVLink-domain-aligned; preempt w/o corruption	Fragmented scale-up domains, broken gang semantics, quota leaks
Ring 3 — reference training run	Real/proxy model (e.g. GPT/Llama-class) at scale	Measured goodput (effective-training-time fraction)	Sustained goodput ≥ contractual SLA over 24-72 hr	Thermal drift, slow leaks, real badput, end-to-end interactions

The discipline is to advance rings only when the prior gate clears. Busbw and bandwidth figures are 2026-current reference points for a GB200 NVL72-class cluster; see keynumbers for sources.

Ring two: storage acceptance — three different bars, not one

Storage is where cluster acceptance most often goes wrong, because operators test it as one thing when it is three, with three different failure modes and three different bandwidth bars. → architecture in Chapter 9.5 and Chapter 9.6; checkpoint math in Chapter 9.4.

The write path is checkpointing, and it gates failure recovery. When a synchronous training job checkpoints, every rank drains optimizer and model state to the parallel filesystem in a burst; the whole job stalls until the slowest writer finishes (unless async/multi-tier checkpointing overlaps it). The acceptance question is sustained aggregate write under a realistic concurrent burst, not peak write bandwidth, because that is what sets how long the job is frozen every checkpoint interval. The reference rule from production checkpoint surveys is ~14 bytes per parameter of checkpoint state and a target of keeping checkpoint stall under ~10% of step time; a write subsystem that benchmarks beautifully on a single stream but collapses when 128 nodes write simultaneously fails the only test that matters. MLPerf Storage v2.0 (Aug 2025) added a dedicated checkpoint benchmark precisely because, in a 100,000-accelerator cluster at full utilization, failures can land roughly every 30 minutes (MLCommons), and checkpoint write/restore bandwidth is what converts those failures from minutes of lost work into hours.

The read path is the data loader, and it gates steady-state goodput. If the loader cannot feed the GPUs, they idle — invisible to a fabric benchmark, fatal to MFU. The bar is a sustained per-GPU read throughput target met at full node count (not a four-node extrapolation), validated by replaying the actual loader or by MLPerf Storage's training workloads. The DGX SuperPOD reference points put aggregate storage in the ~250-400 GB/s range per ~1,024 GPUs with the write ≥ ½ read design rule; the acceptance run must hit the per-GPU read target, not just the aggregate, because aggregate numbers hide hot-spotting.

The metadata path is the silent killer. Lots-of-small-files (LOSF) workloads — image datasets, sharded tokenized corpora — stress metadata operations and small-file IOPS, not bandwidth, and a parallel filesystem that delivers terabytes-per-second of sequential read can still fall over on a million-file-per-second metadata storm. mdtest and a representative LOSF replay belong in acceptance, because LOSF degrades scaling in a way no large-block fio run will reveal.

Deep dive: why you put the storage fabric where the rebuild traffic cannot collide with the collective

A subtle acceptance failure that only a full-cluster run exposes: storage traffic and collective traffic sharing a fabric. The reference architectures are explicit that bulk storage and checkpoint I/O should ride the front-end Ethernet, not the back-end InfiniBand/Spectrum-X fabric that carries all-reduce — and the reason is a correlated-failure interaction. When a node fails mid-run, two things happen at once: the parallel filesystem may begin rebuilding/rebalancing (a large, sustained read/write storm), and the surviving training job restarts from checkpoint (a large read storm to reload state). If both of those land on the same fabric the collective uses, the restart all-reduce contends with rebuild traffic exactly when the cluster is trying to recover, and goodput craters at the worst possible moment.

The acceptance implication is that you cannot validate storage and fabric in isolation and call it done. The ring-three reference run must include at least one injected node failure under checkpoint load, so you observe what the restart actually costs when storage and fabric interact under stress. A cluster that passes every static benchmark but has co-mingled storage and collective traffic will show a clean acceptance report and a punishing real-world recovery profile. This is the canonical argument for physically separating the storage and compute fabrics, and the canonical reason ring three exists. → fabric topology in Chapter 8.5; checkpoint/restart economics in Chapter 9.4 and operational tuning in Chapter 14.4.

Ring two: scheduler and orchestration acceptance

The scheduler is the subsystem most likely to be hand-waved at acceptance — "SLURM works, ship it" — and the most likely to silently strand the fabric you just spent a chapter validating. On rack-scale, NVLink-domain hardware, a scheduler that places a job's ranks without respecting the topology will scatter a tightly-coupled job across NVLink domains and force collectives that should have stayed on the 130 TB/s in-rack fabric out onto the comparatively narrow scale-out fabric. The benchmark busbw is fine; the delivered bandwidth to a real job is not, because the scheduler fragmented the domain. Topology-aware, gang-scheduled placement is therefore an acceptance criterion, not an optimization.

The 2026 landscape forks along a well-worn line. SLURM dominates dedicated training clusters (roughly ~70% share by practitioner estimates) and brings mature gang scheduling, fair-share/QoS priority, preemption, and — critically for Blackwell — block scheduling that allocates whole NVLink domains via a topology.yaml so coherent-memory jobs land aligned. Kubernetes (with KAI Scheduler, Run:ai, Volcano, or the SLURM-on-K8s bridge) brings multi-tenancy, fractional GPUs, and a cloud-native control plane, at the cost of needing an explicit gang-scheduler bolted on because vanilla K8s will happily start half a gang and deadlock. The acceptance fork is which control plane the operator validates against — and for a multi-tenant neocloud the answer is increasingly both, converged. → the build-out treatment lives in the orchestration chapters; here the bar is correctness, not architecture.

Scheduler acceptance — SLURM vs Kubernetes-native, what each must demonstrate

Scheduler model	Gang / topology	Multi-tenancy	Acceptance must prove	Failure mode if skipped
SLURM (block scheduling)	Native gang; topology.yaml NVLink-domain blocks	Fair-share, QoS, accounts; coarser tenancy	Jobs land domain-aligned; preempt/requeue w/o state loss	Fragmented NVLink domains; stranded scale-up bandwidth
Kubernetes + gang scheduler (KAI/Volcano/Run:ai)	Gang via add-on; topology-aware via DRA/labels	Strong: namespaces, quotas, fractional GPU	All-or-nothing gang admission; no partial-gang deadlock	Half-started gangs deadlock; quota leaks across tenants
Converged (SLURM-on-K8s bridge)	SLURM semantics over K8s pods	K8s tenancy + SLURM job model	Both paths schedule the same hardware without conflict	Two control planes fight over the same GPUs

Acceptance is about correctness under load, not feature lists. Share figures are 2026 practitioner estimates (HPCwire).

Preemption that corrupts is worse than no preemption

The single most dangerous scheduler defect to miss at acceptance is preemption or requeue that does not cleanly quiesce a job before reclaiming its GPUs. On a multi-tenant cluster, a high-priority job preempting a lower one must signal the victim (SIGTERM), let it checkpoint, and only then reclaim — otherwise it tears down a job mid-collective, can leave the fabric or NVLink domain in an inconsistent state, and in the worst case corrupts a checkpoint that another job is mid-write to. Validate the full preempt-checkpoint-requeue-resume cycle under concurrent load, including the case where the preempted job is checkpointing at the moment it is killed. A scheduler that has never been tested through that exact sequence will pass acceptance and then, weeks later, eat a customer's training run.

Ring three: the reference (proxy) training run

This is the acceptance test that the whole chapter builds toward, because it is the only one that runs the machine as a machine. A reference training run — a real model of representative scale (a GPT- or Llama-class transformer, often a deliberately-sized proxy rather than a frontier model) — is launched across the full cluster, through the production scheduler, checkpointing real optimizer state to the production storage, for a sustained window of typically 24-72 hours. It is not run to convergence; it is run to measure goodput and surface the slow failures that short benchmarks structurally cannot see: thermal drift as the room heats, a single GPU that clocks down after six hours, a memory leak in the loader, a checkpoint that gets slower as the filesystem fills, a straggler that only manifests under sustained collective pressure.

The output is a single, defensible number: measured goodput, the fraction of wall-clock time the cluster spent making real forward training progress. That number — not the busbw, not the storage GB/s, not the green scheduler dashboard — is what the operator signs into the SLA, because it is the only metric that already prices in every interaction the lower rings tested in isolation. Industry-average goodput sits around ~90%, with best-in-class operators marketing ~96% (SemiAnalysis ClusterMAX / CoreWeave). The acceptance gate is whether the sustained goodput over the reference window clears the contractual floor, and whether the residual badput decomposes into causes you understand and can attribute.

Goodput / badput accounting and the contractual SLA

Goodput is only useful as an acceptance metric if you can decompose what you lost. The formal model (Google Cloud's ML Productivity Goodput) factors it into three multiplicative layers, and the acceptance report should attribute every point below 100% to one of them:

Scheduling goodput — the fraction of time all required resources were actually available. On a freshly-accepted dedicated cluster this should be ~100%; it drops in preemptible/on-demand consumption, and a low number here at acceptance points at the scheduler (ring two), not the hardware.
Runtime goodput — forward progress as a fraction of time resources were available, i.e. the badput from failures, restarts, and checkpoint reload. This is the layer the reliability of the cluster shows up in: every node failure costs the work since the last checkpoint (t_ch) plus the time to resume (t_rm). With a top-tier H100 operator seeing MTBF on the order of ~7 days per 512 GPUs (SemiAnalysis), a 10,000-GPU job fails often enough that checkpoint cadence and restart speed dominate this term.
Program goodput (MFU) — the fraction of peak FLOPs the job actually extracts, governed by compute/communication overlap, kernels, and the fabric. Hopper-class MFU of ~35-50% is typical; this is where ring-one fabric health and ring-two data-loader throughput cash out.

Multiply the three and you have goodput. The reason the decomposition matters contractually is that it tells you who owns each shortfall: scheduling badput is the orchestration layer, runtime badput is reliability and checkpoint design, program badput is fabric and kernels. An SLA that commits to a goodput number without the decomposition is unenforceable — when goodput misses, nobody can say why. The 2026 best practice is to write the SLA as a goodput floor plus a badput attribution model, so a miss is diagnosable and the remedy is assignable. → reliability-economics framing in Chapter 12.2 and the operational goodput stack in Chapter 14.1.

870-928 GB/s

in-domain all-reduce busbw on GB200 NVL72 (vs 900 GB/s/GPU NVLink5 ceiling); scale-out gate set as % of this

2025NCCL tests on GB200 NVL72 (Crusoe / Nebius / NVIDIA tuning guide)

~14 bytes/param

checkpoint state to size the write path; keep checkpoint stall <10% of step time

2025VAST Data checkpoint survey (85k+ checkpoints)

~every 30 min

failure cadence in a 100k-accelerator cluster at full utilization — why checkpoint bandwidth is an acceptance gate

2025MLCommons MLPerf Storage v2.0

250-400 GB/s

aggregate storage bandwidth per ~1,024 GPUs (write ≥ ½ read design rule)

2025NVIDIA DGX SuperPOD reference architecture

~90% / ~96%

industry-average vs best-in-class measured training goodput — the number the SLA is set against

2025SemiAnalysis ClusterMAX 2.0 / CoreWeave

~7 days

MTBF per 512 GPUs at a top-tier H100 operator — drives the runtime-goodput (restart) term

2025SemiAnalysis (100k H100 clusters)

~70% / ~20%

SLURM vs Kubernetes scheduler share on AI training clusters in 2026 (rest: converged/other)

2026HPCwire (Slurm vs Kubernetes)

~35-50%

typical Hopper-class MFU (program goodput); best-in-class >50%

2025SemiAnalysis; domain reliability surveys

The acceptance package: what ring three produces

The deliverable from cluster-scale validation is a signed acceptance package that becomes the contractual baseline and the seed of the day-2 reliability program, not a pass/fail. It contains: the full-scale NCCL/OSU busbw sweep with the straggler-hunt log and the final node-exclusion list; the storage acceptance results across write, read, and metadata paths with the per-GPU targets met; the scheduler validation including the preempt-checkpoint-resume trace; and the reference training run's sustained goodput curve with its badput decomposition. Crucially it also captures the fabric and node baseline — the busbw and per-node performance fingerprints — so that day-2 operations can detect drift against a known-good reference rather than guessing. → the baseline handoff into operations and the operational-readiness gate live in Chapter 13.10; the failure-rate data this seeds feeds Chapter 14.3.

Acceptance is the first measurement of a curve, not a one-time gate

The most useful reframe for cluster-scale validation: the goodput, busbw, and per-node numbers you capture at acceptance are not a gate you pass once — they are the t=0 point of a curve that operations will track for the cluster's whole life. A cluster degrades: optics fail, nodes go lemon, the fabric accumulates errors, storage fills and slows. The only way to know whether month-six goodput of 88% is normal aging or a real regression is to have a clean, instrumented, full-scale baseline from acceptance to compare against. Operators who treat acceptance as a checkbox throw that baseline away and spend the next two years arguing about whether the cluster was ever as good as the brochure. Operators who treat it as the first data point on a reliability curve can attribute every later drop. The power-bound reality makes this non-negotiable: you cannot easily get more megawatts, so the goodput you preserve on the megawatts you have is what determines the cluster's return. → day-2 economics in Chapter 14.1.

Anti-patterns

The recurring cluster-acceptance failures all share a root cause: validating components instead of the system.

Summing green checkmarks. Every node passed burn-in, every link passed point-to-point, the storage passed single-stream fio, the scheduler started a hello-world job — and the cluster is declared accepted without a single full-scale collective or a reference training run. The interactions (straggler-on-collective, storage-rebuild-vs-restart, scheduler-fragmenting-domains) are exactly the failures that only appear at the system level, and they are the ones that bite in production.
Peak bandwidth instead of sustained-under-burst. Accepting storage on a hero single-stream number, then watching the checkpoint freeze the whole job for minutes when 128 nodes write concurrently. The write path must be validated under realistic concurrent burst, not best-case streaming.
A goodput SLA with no badput decomposition. Committing to a goodput floor without the three-layer attribution model, so when the number misses there is no way to assign the shortfall to scheduling, runtime, or program causes — and the SLA becomes unenforceable.
Throwing away the baseline. Treating acceptance as a one-time gate and capturing no instrumented full-scale reference, leaving day-2 operations with no known-good to detect drift against. → Chapter 14.3.

This chapter is the convergence point of Part 13's earlier validation: power-chain failure demonstration in Chapter 13.6, link-level fabric commissioning and PTP time-sync in Chapter 13.7, and node burn-in / SDC hunting in Chapter 13.8 all feed the cluster-scale rings here; the accepted baseline then hands off through the staged ramp and operational-readiness gate in Chapter 13.10. The fabric topology and oversubscription that ring-one busbw validates is engineered in Chapter 8.5; the storage architecture behind ring two lives in Chapter 9.5 and Chapter 9.6, with checkpoint math in Chapter 9.4; the goodput-vs-availability reliability rethink that justifies a goodput SLA is in Chapter 12.2; quantitative availability modeling in Chapter 12.5; and the operational goodput stack, failure-rate data, and training-reliability tuning that this acceptance baseline seeds are in Chapter 14.1, Chapter 14.3, and Chapter 14.4.