Chapter 13.9
Cluster-Scale Benchmarking, Reference Training & Storage/Scheduler Validation
The cluster is not accepted when every component passes its own test — it is accepted when ten thousand GPUs, a non-blocking fabric, a parallel filesystem, and a gang-scheduler together hold a measured goodput number under a real training load, and that number, not a sum of green checkmarks, is what you sign.
What you'll decide here
- What busbw floor (per collective, per message size) the back-end fabric must clear before the cluster is declared performant — and what you do with the 5-15% of nodes that pass node-level burn-in but drag the collective.
- Whether acceptance is gated on a synthetic benchmark suite (NCCL/OSU/MLPerf) alone, or on a reference/proxy training run that exercises storage, scheduler, and fabric together at scale for 24-72 hours.
- The contractual goodput SLA — the single number (effective-training-time fraction) the operator commits to — and the badput taxonomy that decomposes every percentage point you lose below it.
- Where storage acceptance sets its bar: sustained checkpoint write bandwidth (the failure-recovery path), data-loader read throughput (the steady-state path), and the metadata/small-file rate that LOSF workloads quietly destroy.
- Whether the scheduler is validated for gang/topology-aware placement and preemption-without-corruption before handover — because a scheduler that fragments NVLink domains strands bandwidth no benchmark will catch.
By the time a cluster reaches this chapter it has been heavily tested in pieces. The power chain demonstrated its failure modes under load (Chapter 13.6), the fabric was commissioned link-by-link with point-to-point bandwidth and PTP time-sync gates (Chapter 13.7), and every node was burned in, diagnosed, and screened for silent data corruption over a 72-168 hour soak (Chapter 13.8). All of it passed. And none of it tells you whether the machine you actually bought — a single, tightly-coupled supercomputer — works. That is the gap this chapter closes: cluster-scale validation, where the unit under test is the whole, and the acceptance criterion is no longer "does each part meet spec" but "does the assembled system deliver the goodput the contract assumes."
The discipline here is to validate in three widening rings, and to refuse to advance a ring until the prior one clears. Ring one is synthetic collective benchmarking — NCCL and OSU at scale — which isolates the fabric and the communication libraries from any application noise and gives you a hard, repeatable busbw number to gate on. Ring two is subsystem acceptance — storage (checkpoint write, data-loader read, metadata) and the scheduler/orchestrator (gang placement, topology-awareness, preemption) — each validated against its own bandwidth and correctness bar. Ring three is a reference (proxy) training run that fuses all of it: a real model, real optimizer state, real checkpoints to real storage, scheduled the way production will schedule, run long enough to surface stragglers, thermal drift, and the slow leaks that a five-minute benchmark never sees. The number that comes out of ring three — measured goodput — is the number you put in the SLA. Everything before it is instrumentation.
Ring one: NCCL collective benchmarking and the busbw gate
The fabric was already proven point-to-point in Chapter 13.7 — every link runs at line rate, every cable seated, no errored symbols. That proves the wires. It does not prove the collective, and the collective is what training actually runs: all-reduce, all-gather, reduce-scatter, and broadcast across the full job on every step. A fabric that is flawless link-by-link can still collapse on an all-reduce because of a single mis-seated transceiver three hops away, a routing imbalance that congests one rail, an adaptive-routing misconfiguration, or a SHARP/NVLS offload that failed to engage. The collective benchmark is the only test that exercises the fabric the way the workload will.
The canonical tool is the NCCL tests suite (all_reduce_perf, all_gather_perf, and relatives), run as a multi-node MPI/SLURM job across the full cluster and a sweep of message sizes from a few KB to multiple GB. The metric that matters is bus bandwidth (busbw) — not algorithm bandwidth — because busbw normalizes for the collective's data-movement factor and is directly comparable to the hardware's theoretical peak. You read it two ways: the plateau busbw at large message sizes (does the fabric reach its bandwidth ceiling) and the small-message latency floor (does the collective's fixed cost stay bounded, which governs strong-scaling efficiency). On a healthy GB200 NVL72, in-domain all-reduce busbw runs roughly 870-928 GB/s across large buffers against the 900 GB/s/GPU NVLink5 unidirectional ceiling (NCCL tests on GB200 NVL72, 2025); the scale-out, multi-rack number over InfiniBand or Spectrum-X is the gate that matters for the full job, and it is set as a percentage of the in-domain figure.
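The relationship between algorithm bandwidth and bus bandwidth, and the gate applied to it, can be sketched as follows. This is a minimal illustration that assumes the correction factors documented with nccl-tests (2(n-1)/n for all-reduce, (n-1)/n for all-gather and reduce-scatter, 1 for broadcast). The function names and the 0.90 default floor (taken from the acceptance table in this chapter) are illustrative choices, not part of any tool.

```python
def busbw_factor(collective: str, n_ranks: int) -> float:
    """Return busbw / algbw for a collective across n_ranks (nccl-tests convention)."""
    if n_ranks < 2:
        raise ValueError("a collective needs at least two ranks")
    factors = {
        "all_reduce": 2 * (n_ranks - 1) / n_ranks,
        "all_gather": (n_ranks - 1) / n_ranks,
        "reduce_scatter": (n_ranks - 1) / n_ranks,
        "broadcast": 1.0,
    }
    if collective not in factors:
        raise ValueError(f"unknown collective: {collective}")
    return factors[collective]


def busbw_gbps(collective: str, n_ranks: int, size_bytes: float, seconds: float) -> float:
    """Convert a timed collective into bus bandwidth in GB/s."""
    algbw = size_bytes / seconds / 1e9  # algorithm bandwidth: size over elapsed time
    return algbw * busbw_factor(collective, n_ranks)


def passes_gate(measured_gbps: float, reference_gbps: float, floor: float = 0.90) -> bool:
    """True when the measured plateau busbw clears the floor (a fraction of reference)."""
    return measured_gbps >= floor * reference_gbps


if __name__ == "__main__":
    # Hypothetical sample: an 8 GB all-reduce across 8 ranks that took 20 ms.
    bw = busbw_gbps("all_reduce", 8, 8e9, 0.020)
    print(f"busbw = {bw:.1f} GB/s, gate passed: {passes_gate(bw, 750.0)}")
```

Because the factor removes the dependence on rank count, the same floor can be applied to sweeps taken at different job sizes.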
Two companion suites round out ring one. OSU Micro-Benchmarks (osu_bw, osu_latency, osu_allreduce) give a vendor-neutral, MPI-level cross-check on NCCL's numbers — useful precisely because they are a different code path; if NCCL looks healthy but OSU does not, the problem is in the collective library configuration, not the fabric. MLPerf Training (and, for storage, MLPerf Storage, below) provides an industry-comparable, full-workload reference if the operator wants a number that benchmarks against published results rather than only against the cluster's own reference. The fork is whether MLPerf is worth the engineering cost: it is heavyweight to run correctly, but for a neocloud selling capacity, a clean MLPerf submission is a marketing-grade external attestation that an internal NCCL number is not. → external-rating context in Chapter 12.2.
| Ring / target | Primary tool | Metric gated on | Acceptance gate | What it catches that lower rings miss |
|---|---|---|---|---|
| Ring 1 — fabric collective | NCCL tests; OSU; (opt.) MLPerf Training | Bus bandwidth (busbw) + small-msg latency | ≥90% of reference busbw at scale; no rail >few % low | Routing imbalance, congestion, failed SHARP/NVLS, straggler nodes |
| Ring 2 — storage (write path) | MLPerf Storage (checkpoint); fio; mdtest | Sustained checkpoint write GB/s; drain time | Write ≥ ½ read; checkpoint stall <10% of step time | Aggregate write collapse, metadata bottleneck, LOSF |
| Ring 2 — storage (read path) | MLPerf Storage (training); fio; loader replay | Sustained data-loader read GB/s; per-GPU target | Meets per-GPU read target at full node count | Data-loader starvation, cache thrash, small-file IOPS wall |
| Ring 2 — scheduler / orchestration | SLURM block / topology.yaml; KAI/Run:ai; gang tests | Gang placement correctness; topology fidelity | Jobs land NVLink-domain-aligned; preempt w/o corruption | Fragmented scale-up domains, broken gang semantics, quota leaks |
| Ring 3 — reference training run | Real/proxy model (e.g. GPT/Llama-class) at scale | Measured goodput (effective-training-time fraction) | Sustained goodput ≥ contractual SLA over 24-72 hr | Thermal drift, slow leaks, real badput, end-to-end interactions |
Ring two: storage acceptance — three different bars, not one
Storage is where cluster acceptance most often goes wrong, because operators test it as one thing when it is three, with three different failure modes and three different bandwidth bars. → architecture in Chapter 9.5 and Chapter 9.6; checkpoint math in Chapter 9.4.
The write path is checkpointing, and it gates failure recovery. When a synchronous training job checkpoints, every rank drains optimizer and model state to the parallel filesystem in a burst; the whole job stalls until the slowest writer finishes (unless async/multi-tier checkpointing overlaps it). The acceptance question is sustained aggregate write under a realistic concurrent burst, not peak write bandwidth, because that is what sets how long the job is frozen every checkpoint interval. The reference rule from production checkpoint surveys is ~14 bytes per parameter of checkpoint state and a target of keeping checkpoint stall under ~10% of step time; a write subsystem that benchmarks beautifully on a single stream but collapses when 128 nodes write simultaneously fails the only test that matters. MLPerf Storage v2.0 (Aug 2025) added a dedicated checkpoint benchmark precisely because, in a 100,000-accelerator cluster at full utilization, failures can land roughly every 30 minutes (MLCommons), and checkpoint write/restore bandwidth is what determines whether each of those failures costs minutes of lost work or hours.
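The checkpoint arithmetic above can be worked through in a few lines. This sketch uses the ~14 bytes per parameter rule from the text. The model size, bandwidth, and cadence in the example are hypothetical. Reading the 10% bar as the synchronous drain time amortized over the steps between checkpoints is an assumption made here, not something the rule itself specifies.

```python
BYTES_PER_PARAM = 14  # rule of thumb quoted in the text


def checkpoint_bytes(n_params: int) -> int:
    """Approximate checkpoint size (model plus optimizer state)."""
    return n_params * BYTES_PER_PARAM


def drain_seconds(n_params: int, sustained_write_gbps: float) -> float:
    """Time the job is frozen for one synchronous checkpoint at the sustained aggregate rate."""
    return checkpoint_bytes(n_params) / (sustained_write_gbps * 1e9)


def stall_fraction(n_params: int, sustained_write_gbps: float,
                   step_seconds: float, steps_per_checkpoint: int) -> float:
    """Drain time amortized per step, as a fraction of step time (assumed interpretation)."""
    amortized = drain_seconds(n_params, sustained_write_gbps) / steps_per_checkpoint
    return amortized / step_seconds


if __name__ == "__main__":
    params = 70_000_000_000  # hypothetical 70B-parameter proxy model
    print(f"checkpoint size: {checkpoint_bytes(params) / 1e9:.0f} GB")
    print(f"drain at 200 GB/s sustained: {drain_seconds(params, 200.0):.1f} s")
    frac = stall_fraction(params, 200.0, step_seconds=10.0, steps_per_checkpoint=100)
    print(f"amortized stall: {frac:.2%} of step time, under 10%: {frac < 0.10}")
```

The input that matters is the sustained aggregate figure measured under concurrent burst. Substituting a single-stream peak number here reproduces the acceptance error the paragraph warns about.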
The read path is the data loader, and it gates steady-state goodput. If the loader cannot feed the GPUs, they idle — invisible to a fabric benchmark, fatal to MFU. The bar is a sustained per-GPU read throughput target met at full node count (not a four-node extrapolation), validated by replaying the actual loader or by MLPerf Storage's training workloads. The DGX SuperPOD reference points put aggregate storage in the ~250-400 GB/s range per ~1,024 GPUs with the write ≥ ½ read design rule; the acceptance run must hit the per-GPU read target, not just the aggregate, because aggregate numbers hide hot-spotting.
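To see why the aggregate figure can hide hot-spotting, consider this small sketch. It derives a per-GPU target from the aggregate range quoted above and then flags individual GPUs that fall below it even when the total looks healthy. The measurement dictionary and GPU names are hypothetical; in practice the values would come from the loader replay.

```python
def per_gpu_target_gbps(aggregate_gbps: float, n_gpus: int) -> float:
    """Per-GPU read target implied by an aggregate design point."""
    return aggregate_gbps / n_gpus


def starved_gpus(measured_gbps: dict[str, float], target_gbps: float) -> list[str]:
    """GPUs whose sustained read rate is below target, sorted by name."""
    return sorted(g for g, bw in measured_gbps.items() if bw < target_gbps)


def aggregate_ok(measured_gbps: dict[str, float], target_gbps: float) -> bool:
    """The aggregate-only check that hides hot-spots."""
    return sum(measured_gbps.values()) >= target_gbps * len(measured_gbps)


if __name__ == "__main__":
    low = per_gpu_target_gbps(250.0, 1024)
    high = per_gpu_target_gbps(400.0, 1024)
    print(f"per-GPU target range: {low:.3f} to {high:.3f} GB/s")

    # Hypothetical replay result: three healthy GPUs mask one starved GPU.
    sample = {"gpu0": 0.5, "gpu1": 0.5, "gpu2": 0.1, "gpu3": 0.5}
    print("aggregate passes:", aggregate_ok(sample, high))
    print("starved:", starved_gpus(sample, high))
```

The acceptance check should therefore be the second function, not the third: an empty starved list at full node count.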
The metadata path is the silent killer. Lots-of-small-files (LOSF) workloads — image datasets, sharded tokenized corpora — stress metadata operations and small-file IOPS, not bandwidth, and a parallel filesystem that delivers terabytes-per-second of sequential read can still fall over on a million-file-per-second metadata storm. mdtest and a representative LOSF replay belong in acceptance, because LOSF degrades scaling in a way no large-block fio run will reveal.
Deep dive: why you put the storage fabric where the rebuild traffic cannot collide with the collective
A subtle acceptance failure that only a full-cluster run exposes: storage traffic and collective traffic sharing a fabric. The reference architectures are explicit that bulk storage and checkpoint I/O should ride the front-end Ethernet, not the back-end InfiniBand/Spectrum-X fabric that carries all-reduce — and the reason is a correlated-failure interaction. When a node fails mid-run, two things happen at once: the parallel filesystem may begin rebuilding/rebalancing (a large, sustained read/write storm), and the surviving training job restarts from checkpoint (a large read storm to reload state). If both of those land on the same fabric the collective uses, the restart all-reduce contends with rebuild traffic exactly when the cluster is trying to recover, and goodput craters at the worst possible moment.
The acceptance implication is that you cannot validate storage and fabric in isolation and call it done. The ring-three reference run must include at least one injected node failure under checkpoint load, so you observe what the restart actually costs when storage and fabric interact under stress. A cluster that passes every static benchmark but has co-mingled storage and collective traffic will show a clean acceptance report and a punishing real-world recovery profile. This is the canonical argument for physically separating the storage and compute fabrics, and the canonical reason ring three exists. → fabric topology in Chapter 8.5; checkpoint/restart economics in Chapter 9.4 and operational tuning in Chapter 14.4.
Ring two: scheduler and orchestration acceptance
The scheduler is the subsystem most likely to be hand-waved at acceptance — "SLURM works, ship it" — and the most likely to silently strand the fabric you just spent a chapter validating. On rack-scale, NVLink-domain hardware, a scheduler that places a job's ranks without respecting the topology will scatter a tightly-coupled job across NVLink domains and force collectives that should have stayed on the 130 TB/s in-rack fabric out onto the comparatively narrow scale-out fabric. The benchmark busbw is fine; the delivered bandwidth to a real job is not, because the scheduler fragmented the domain. Topology-aware, gang-scheduled placement is therefore an acceptance criterion, not an optimization.
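A placement check of this kind can be expressed very simply. The sketch below compares the number of NVLink domains a job actually touches against the minimum it needs. The node names, domain labels, and toy domain size of 8 GPUs are hypothetical (an NVL72 domain would use 72), and the rank-to-node list is assumed to come from whatever placement record the scheduler exposes.

```python
import math


def domains_used(rank_to_node: list[str], node_to_domain: dict[str, str]) -> set[str]:
    """Set of NVLink domains touched by a job, given the node hosting each rank."""
    return {node_to_domain[node] for node in rank_to_node}


def is_domain_aligned(rank_to_node: list[str], node_to_domain: dict[str, str],
                      gpus_per_domain: int) -> bool:
    """True when the job spans no more domains than its rank count requires."""
    minimum = math.ceil(len(rank_to_node) / gpus_per_domain)
    return len(domains_used(rank_to_node, node_to_domain)) <= minimum


if __name__ == "__main__":
    # Toy topology: two domains of two 4-GPU nodes each.
    topo = {"a0": "rackA", "a1": "rackA", "b0": "rackB", "b1": "rackB"}
    packed = ["a0"] * 4 + ["a1"] * 4      # 8 ranks inside one domain
    scattered = ["a0"] * 4 + ["b0"] * 4   # same 8 ranks split across two domains
    print("packed aligned:", is_domain_aligned(packed, topo, gpus_per_domain=8))
    print("scattered aligned:", is_domain_aligned(scattered, topo, gpus_per_domain=8))
```

Running such a check against every job in the scheduler acceptance suite turns "jobs land domain-aligned" into a recorded pass/fail line rather than an impression.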
The 2026 landscape forks along a well-worn line. SLURM dominates dedicated training clusters (roughly 70% share by practitioner estimates) and brings mature gang scheduling, fair-share/QoS priority, preemption, and, critically for Blackwell, block scheduling that allocates whole NVLink domains via a topology.yaml so coherent-memory jobs land aligned. Kubernetes (with KAI Scheduler, Run:ai, Volcano, or the SLURM-on-K8s bridge) brings multi-tenancy, fractional GPUs, and a cloud-native control plane, at the cost of needing an explicit gang-scheduler bolted on because vanilla K8s will happily start half a gang and deadlock. The acceptance fork is which control plane the operator validates against; for a multi-tenant neocloud the answer is increasingly both, converged. → the build-out treatment lives in the orchestration chapters; here the bar is correctness, not architecture.
| Scheduler model | Gang / topology | Multi-tenancy | Acceptance must prove | Failure mode if skipped |
|---|---|---|---|---|
| SLURM (block scheduling) | Native gang; topology.yaml NVLink-domain blocks | Fair-share, QoS, accounts; coarser tenancy | Jobs land domain-aligned; preempt/requeue w/o state loss | Fragmented NVLink domains; stranded scale-up bandwidth |
| Kubernetes + gang scheduler (KAI/Volcano/Run:ai) | Gang via add-on; topology-aware via DRA/labels | Strong: namespaces, quotas, fractional GPU | All-or-nothing gang admission; no partial-gang deadlock | Half-started gangs deadlock; quota leaks across tenants |
| Converged (SLURM-on-K8s bridge) | SLURM semantics over K8s pods | K8s tenancy + SLURM job model | Both paths schedule the same hardware without conflict | Two control planes fight over the same GPUs |
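The "all-or-nothing gang admission" property in the table can be illustrated with a toy admission function. This is a sketch of the semantics only, not the algorithm used by KAI, Volcano, Run:ai, or SLURM. The node names and GPU counts are hypothetical.

```python
from __future__ import annotations


def admit_gang(free_gpus_by_node: dict[str, int], workers: int,
               gpus_per_worker: int) -> dict[str, int] | None:
    """Return {node: workers placed} only if every worker fits; otherwise place nothing."""
    if workers <= 0 or gpus_per_worker <= 0:
        raise ValueError("workers and gpus_per_worker must be positive")
    plan: dict[str, int] = {}
    remaining = workers
    for node, free in sorted(free_gpus_by_node.items()):
        fit = min(remaining, free // gpus_per_worker)
        if fit:
            plan[node] = fit
            remaining -= fit
        if remaining == 0:
            return plan   # the whole gang fits: admit it atomically
    return None           # a partial fit is rejected, so no half-started gang holds GPUs


if __name__ == "__main__":
    free = {"n1": 8, "n2": 4}
    print(admit_gang(free, workers=3, gpus_per_worker=4))  # fits across n1 and n2
    print(admit_gang(free, workers=4, gpus_per_worker=4))  # one worker short: nothing is placed
```

An acceptance test for this property submits two gangs that each fit alone but not together and confirms that one runs to completion while the other waits with zero GPUs held.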
Ring three: the reference (proxy) training run
This is the acceptance test that the whole chapter builds toward, because it is the only one that runs the machine as a machine. A reference training run — a real model of representative scale (a GPT- or Llama-class transformer, often a deliberately-sized proxy rather than a frontier model) — is launched across the full cluster, through the production scheduler, checkpointing real optimizer state to the production storage, for a sustained window of typically 24-72 hours. It is not run to convergence; it is run to measure goodput and surface the slow failures that short benchmarks structurally cannot see: thermal drift as the room heats, a single GPU that clocks down after six hours, a memory leak in the loader, a checkpoint that gets slower as the filesystem fills, a straggler that only manifests under sustained collective pressure.
The output is a single, defensible number: measured goodput, the fraction of wall-clock time the cluster spent making real forward training progress. That number (not the busbw, not the storage GB/s, not the green scheduler dashboard) is what the operator signs into the SLA, because it is the only metric that already prices in every interaction the lower rings tested in isolation. Industry-average goodput sits around 90%, with best-in-class operators marketing roughly 96% (SemiAnalysis ClusterMAX / CoreWeave). The acceptance gate is whether the sustained goodput over the reference window clears the contractual floor, and whether the residual badput decomposes into causes you understand and can attribute.
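Measured goodput over the reference window reduces to a ratio of labelled time intervals. The sketch below assumes the run's timeline has already been classified into intervals. The interval labels, the 72-hour sample, and the 0.90 floor are hypothetical illustration values, not a prescribed schema.

```python
from collections import defaultdict


def measured_goodput(intervals: list[tuple[str, float]]) -> float:
    """Fraction of wall-clock hours labelled 'train' (forward training progress)."""
    total = sum(hours for _, hours in intervals)
    if total <= 0:
        raise ValueError("empty measurement window")
    productive = sum(hours for kind, hours in intervals if kind == "train")
    return productive / total


def badput_by_cause(intervals: list[tuple[str, float]]) -> dict[str, float]:
    """Share of the window lost to each non-training cause."""
    total = sum(hours for _, hours in intervals)
    lost: dict[str, float] = defaultdict(float)
    for kind, hours in intervals:
        if kind != "train":
            lost[kind] += hours / total
    return dict(lost)


if __name__ == "__main__":
    # Hypothetical 72-hour reference window, in hours.
    window = [("train", 68.0), ("checkpoint_stall", 1.5),
              ("restart", 2.0), ("scheduler_idle", 0.5)]
    g = measured_goodput(window)
    print(f"goodput = {g:.3f}, clears 0.90 floor: {g >= 0.90}")
    for cause, share in badput_by_cause(window).items():
        print(f"  {cause}: {share:.3%}")
```

Keeping the per-cause breakdown alongside the headline ratio is what lets the residual badput be attributed rather than merely reported.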
Goodput / badput accounting and the contractual SLA
Goodput is only useful as an acceptance metric if you can decompose what you lost. The formal model (Google Cloud's ML Productivity Goodput) factors it into three multiplicative layers, and the acceptance report should attribute every point below 100% to one of them:
- Scheduling goodput: the fraction of time all required resources were actually available. On a freshly-accepted dedicated cluster this should be ~100%; it drops in preemptible/on-demand consumption, and a low number here at acceptance points at the scheduler (ring two), not the hardware.
- Runtime goodput: forward progress as a fraction of time resources were available, i.e. the badput from failures, restarts, and checkpoint reload. This is the layer the reliability of the cluster shows up in: every node failure costs the work since the last checkpoint (t_ch) plus the time to resume (t_rm). With a top-tier H100 operator seeing MTBF on the order of ~7 days per 512 GPUs (SemiAnalysis), a 10,000-GPU job fails often enough that checkpoint cadence and restart speed dominate this term.
- Program goodput (MFU): the fraction of peak FLOPs the job actually extracts, governed by compute/communication overlap, kernels, and the fabric. Hopper-class MFU of ~35-50% is typical; this is where ring-one fabric health and ring-two data-loader throughput cash out.
Multiply the three and you have goodput. The reason the decomposition matters contractually is that it tells you who owns each shortfall: scheduling badput is the orchestration layer, runtime badput is reliability and checkpoint design, program badput is fabric and kernels. An SLA that commits to a goodput number without the decomposition is unenforceable — when goodput misses, nobody can say why. The 2026 best practice is to write the SLA as a goodput floor plus a badput attribution model, so a miss is diagnosable and the remedy is assignable. → reliability-economics framing in Chapter 12.2 and the operational goodput stack in Chapter 14.1.
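A short worked sketch shows how the runtime term behaves at scale and how the three layers combine. It rests on two labelled assumptions: failures are independent, so MTBF scales inversely with GPU count, and a failure lands on average halfway through a checkpoint interval, so the expected lost work is half the interval plus the resume time. The checkpoint cadence, resume time, and layer values in the example are hypothetical.

```python
def job_mtbf_hours(ref_mtbf_days: float, ref_gpus: int, job_gpus: int) -> float:
    """Scale a reference MTBF to a larger job, assuming independent failures."""
    return ref_mtbf_days * 24.0 * ref_gpus / job_gpus


def runtime_goodput(mtbf_hours: float, checkpoint_interval_hours: float,
                    resume_hours: float) -> float:
    """1 minus the expected time lost per failure divided by the time between failures."""
    lost_per_failure = checkpoint_interval_hours / 2.0 + resume_hours
    return max(0.0, 1.0 - lost_per_failure / mtbf_hours)


def goodput(scheduling: float, runtime: float, program: float) -> float:
    """Product of the three layers described in the text."""
    return scheduling * runtime * program


if __name__ == "__main__":
    mtbf = job_mtbf_hours(7.0, 512, 10_000)   # ~7 days per 512 GPUs, from the text
    print(f"10,000-GPU job MTBF: {mtbf:.1f} h")
    rt = runtime_goodput(mtbf, checkpoint_interval_hours=0.5, resume_hours=0.25)
    print(f"runtime goodput: {rt:.3f}")
    print(f"scheduling x runtime: {1.0 * rt:.3f}")
    print(f"all three layers at 45% MFU: {goodput(1.0, rt, 0.45):.3f}")
```

Under these assumptions the 10,000-GPU job sees a failure roughly every 8.6 hours, which is why checkpoint cadence and restart speed dominate the runtime term. One reading worth stating in the SLA is that the time-based fraction corresponds to the first two factors, while multiplying in MFU yields a FLOP-level figure on a different scale.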
The acceptance package: what ring three produces
The deliverable from cluster-scale validation is not a bare pass/fail verdict but a signed acceptance package that becomes the contractual baseline and the seed of the day-2 reliability program. It contains: the full-scale NCCL/OSU busbw sweep with the straggler-hunt log and the final node-exclusion list; the storage acceptance results across write, read, and metadata paths with the per-GPU targets met; the scheduler validation including the preempt-checkpoint-resume trace; and the reference training run's sustained goodput curve with its badput decomposition. Crucially, it also captures the fabric and node baseline (the busbw and per-node performance fingerprints) so that day-2 operations can detect drift against a known-good reference rather than guessing. → the baseline handoff into operations and the operational-readiness gate live in Chapter 13.10; the failure-rate data this seeds feeds Chapter 14.3.
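How a captured baseline gets used on day 2 can be sketched as a comparison of per-node fingerprints against the accepted reference. The node names, the busbw values, and the 5% tolerance below are hypothetical; the appropriate tolerance is something the operator would set from the run-to-run variance seen during acceptance.

```python
from __future__ import annotations


def drifted_nodes(baseline_gbps: dict[str, float], current_gbps: dict[str, float],
                  tolerance: float = 0.05) -> list[str]:
    """Nodes missing from the current sweep or more than `tolerance` below baseline."""
    flagged = []
    for node, reference in baseline_gbps.items():
        now = current_gbps.get(node)
        if now is None or now < reference * (1.0 - tolerance):
            flagged.append(node)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical per-node busbw fingerprints (GB/s): acceptance vs. a later sweep.
    accepted = {"node001": 380.0, "node002": 382.0, "node003": 379.0}
    later = {"node001": 378.0, "node002": 340.0}  # node003 did not report
    print(drifted_nodes(accepted, later))
```

Without the accepted dictionary there is nothing to compare the later sweep against, which is the practical cost of the "throwing away the baseline" anti-pattern.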
Anti-patterns
The recurring cluster-acceptance failures all share a root cause: validating components instead of the system.
- Summing green checkmarks. Every node passed burn-in, every link passed point-to-point, the storage passed single-stream fio, the scheduler started a hello-world job — and the cluster is declared accepted without a single full-scale collective or a reference training run. The interactions (straggler-on-collective, storage-rebuild-vs-restart, scheduler-fragmenting-domains) are exactly the failures that only appear at the system level, and they are the ones that bite in production.
- Peak bandwidth instead of sustained-under-burst. Accepting storage on a hero single-stream number, then watching the checkpoint freeze the whole job for minutes when 128 nodes write concurrently. The write path must be validated under realistic concurrent burst, not best-case streaming.
- A goodput SLA with no badput decomposition. Committing to a goodput floor without the three-layer attribution model, so when the number misses there is no way to assign the shortfall to scheduling, runtime, or program causes — and the SLA becomes unenforceable.
- Throwing away the baseline. Treating acceptance as a one-time gate and capturing no instrumented full-scale reference, leaving day-2 operations with no known-good to detect drift against. → Chapter 14.3.