
Chapter 10.2

Topology-Aware & Rack-Scale Scheduling

On a rack-scale machine the placement of a job is a hard performance contract: land a tightly-coupled job inside one NVLink domain and it runs at full bandwidth; split it across the domain boundary and you fall off an order-of-magnitude bandwidth cliff that no amount of tuning recovers. The scheduler's real job is to defend topology, not to pack nodes.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Whether the atomic unit your scheduler allocates is the node or the NVLink domain (the rack) — and therefore whether a job can ever be split across the bandwidth cliff in the first place.
How you express topology to the scheduler: Slurm block/segment plus the IMEX switch plugin, or Kubernetes ComputeDomains via DRA with clique labels — and who owns the IMEX-channel-per-node constraint.
How you map each parallelism dimension (TP, EP, PP, DP) onto the fabric tiers, so the most bandwidth-hungry collectives stay inside the scale-up domain and only DP crosses the scale-out fabric.
Your fragmentation policy: whether you accept stranded GPUs to keep domains whole, defragment with preemption, or relax locality with segments — and what that costs in goodput versus utilization.
Whether discovery is trusted (vendor labels, clique IDs) or verified (NCCL/nvbandwidth probes at admission), because a mislabeled or degraded NVLink link places jobs onto a cliff the scheduler thinks isn't there.

Chapter 10.1 established the scheduling plane as the control loop that turns a fleet of accelerators into a service. This chapter is about the single hardest constraint that loop has to respect on 2026-era hardware: the network is not flat. Between any two GPUs in a modern AI cluster there are three distinct classes of link, and the drop from the scale-up classes to the scale-out class is roughly an order of magnitude in per-GPU bandwidth. Inside a node, GPUs talk over NVLink at terabytes per second. Inside a rack-scale NVLink domain (a GB200 NVL72 is 72 GPUs across 18 compute trays sharing 130 TB/s of aggregate NVLink) they still talk at scale-up speeds, as if they were on one giant board. The moment a collective has to leave that domain, it drops onto the scale-out fabric: a ~400G or 800G NIC, an order of magnitude less per-GPU bandwidth, and a different failure and congestion regime. A scheduler that ignores this hierarchy will cheerfully place a tightly-coupled job with eight GPUs in one rack and eight in the next, and the job will run at a fraction of its rightful throughput, forever, with no error logged.

The fork is topology-blind vs topology-aware placement, and the downstream cost of choosing blind is paid in goodput every step of every job for the life of the run. We walk the bandwidth cliffs and how you discover them; the two production mechanisms for defending them — Slurm block scheduling with IMEX, and Kubernetes ComputeDomains over Dynamic Resource Allocation; how to map a parallelism strategy onto the fabric so the right collectives stay local; and the fragmentation problem that makes the rack, not the node, the natural unit of allocation on rack-scale systems.

Why topology matters: the bandwidth cliffs

Hold three numbers in mind, because everything in this chapter is a consequence of the gaps between them. NVLink (scale-up, within the coherent domain): NVLink 5 on Blackwell delivers 1.8 TB/s bidirectional per GPU; an NVL72 rack aggregates to 130 TB/s. PCIe (host-local, GPU-to-CPU or GPU-to-NIC): Gen5 x16 is ~64 GB/s each way — the path KV-cache and host staging traverse, and a bottleneck the moment data must touch the CPU. Scale-out NIC (inter-node, over InfiniBand or RoCE Ethernet): a single ~400 Gb/s port is 50 GB/s; even an 8-rail node at 3,200 Gb/s is ~400 GB/s aggregate. The rule of thumb practitioners carry is that scale-up per-GPU bandwidth is roughly 5–18x scale-out, and the consequence is unambiguous: the most bandwidth-hungry collectives must be fit inside the scale-up domain, and the scheduler is the thing that decides whether they are.

Bandwidth does not degrade gracefully as a job spreads; it steps down discontinuously at each boundary. A tensor-parallel all-reduce that fits in one NVLink domain runs at NVLink speed; the same all-reduce split 9-and-9 across two domains runs at NIC speed for the cross-domain portion and stalls every other rank waiting on it. NVIDIA's own Slurm guidance states the case flatly: when a job crosses the NVLink-domain boundary, performance "drops sharply," which is why their block plugin treats the domain as a hard constraint rather than a best-effort preference. The failure mode is insidious because it is silent — the job completes, the loss curve descends, nothing alarms — but MFU sits 20–40% below where the hardware should deliver, and the bill for that gap compounds across the entire run.
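The size of the step can be sketched with a back-of-envelope cost model. The snippet below uses the standard bandwidth term of a ring all-reduce and assumes the whole ring is paced by its slowest link, which is a simplification: it ignores latency, overlap, and the hierarchical algorithms a real collective library may choose. The 900 GB/s figure is an assumption (1.8 TB/s bidirectional taken as 900 GB/s per direction); the 50 GB/s figure is the single 400 Gb/s port from the text. The function name and the 1 GB message size are illustrative only.

```python
def ring_allreduce_seconds(message_bytes: float, ranks: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each rank sends and receives 2*(N-1)/N of the message, and the ring
    advances at the pace of its slowest link.
    """
    if ranks < 2:
        return 0.0
    traffic_bytes = 2 * (ranks - 1) / ranks * message_bytes
    return traffic_bytes / (link_gb_per_s * 1e9)


NVLINK_GB_PER_S = 900.0  # assumption: 1.8 TB/s bidirectional -> 900 GB/s per direction
NIC_GB_PER_S = 50.0      # one 400 Gb/s port, as stated in the text

message = 1e9  # 1 GB payload, purely illustrative
in_domain = ring_allreduce_seconds(message, 16, NVLINK_GB_PER_S)
straddling = ring_allreduce_seconds(message, 16, NIC_GB_PER_S)
slowdown = straddling / in_domain

print(f"in-domain:  {in_domain * 1e3:.2f} ms")
print(f"straddling: {straddling * 1e3:.2f} ms")
print(f"slowdown:   {slowdown:.0f}x")
```

Under these assumptions the same 16-rank collective takes about 2 ms inside the domain and about 37.5 ms once one hop of the ring crosses it. The job is identical; only the placement changed.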

Placement is a performance contract, not a hint

The mental shift this chapter demands: in classic HPC and in CPU-cloud scheduling, locality is an optimization — nice to have, worth a few percent. On a rack-scale AI machine it is a contract. The difference between a job placed inside one NVLink domain and the same job straddling two is not a few percent; it is the difference between scale-up and scale-out bandwidth on the hot collective, which sets the throughput of the whole synchronous job to its slowest cross-domain link. A scheduler that exposes the GPU as a fungible, location-free resource — the default in most container orchestrators until recently — is structurally incapable of honoring that contract. Topology-awareness is a precondition for the rack-scale hardware to deliver the FLOPS you paid for, not a feature you bolt on afterward.

Discovery: how the scheduler learns the fabric

A scheduler can only defend a topology it can see, and discovery is where most topology-aware deployments quietly fail. There are two postures, and they correspond to a trust decision. Trusted discovery reads the fabric from labels the platform asserts: NVIDIA exposes an NVLink clique identifier (the nvidia.com/gpu.clique node label in Kubernetes; equivalent block definitions in Slurm's topology.yaml) that tells the scheduler which nodes share a coherent NVLink partition. PCIe and NUMA affinity come from the driver and from the node's hardware topology. Scale-out topology — which leaf switch a node hangs off, which rail a NIC belongs to — comes from cabling databases, LLDP, or a topology file the operator maintains. Trusted discovery is fast and is how production schedulers run day-to-day.
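Trusted discovery amounts to grouping nodes by an asserted label. A minimal sketch, assuming node labels have already been fetched into a plain dictionary: the label key nvidia.com/gpu.clique is the one named above, while the node names, the label values, and the group_by_clique helper are hypothetical.

```python
from collections import defaultdict

CLIQUE_LABEL = "nvidia.com/gpu.clique"


def group_by_clique(node_labels: dict) -> dict:
    """Group node names by their asserted NVLink clique label.

    Nodes without the label land in an 'unlabeled' bucket so they are
    never silently treated as members of a domain.
    """
    domains = defaultdict(list)
    for name in sorted(node_labels):
        clique = node_labels[name].get(CLIQUE_LABEL, "unlabeled")
        domains[clique].append(name)
    return dict(domains)


# Hypothetical inventory; real label values come from the platform.
node_labels = {
    "node-01": {CLIQUE_LABEL: "clusterA.0"},
    "node-02": {CLIQUE_LABEL: "clusterA.0"},
    "node-19": {CLIQUE_LABEL: "clusterA.1"},
    "node-20": {},
}

domains = group_by_clique(node_labels)
for clique, members in domains.items():
    print(clique, members)
```

Note what this does not do: it takes every label at face value. Nothing here measures a link.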

Verified discovery distrusts the labels and measures. At node admission and after every repair you run point-to-point bandwidth probes (nvbandwidth, NCCL all-reduce/all-gather benchmarks, p2pBandwidthLatencyTest) and compare against the expected matrix for the asserted topology. This matters because the labels can be right while the physical link is degraded: a marginal NVLink lane that has fallen back to a lower width, a transceiver running hot, a cable seated badly. The label still says "one clique, full bandwidth"; the hardware delivers half. A scheduler trusting that label places a tightly-coupled job onto a cliff it believes isn't there. The discipline that separates a 96%-goodput fleet from a 90% one is verifying the topology the scheduler will trust — gating nodes into the schedulable pool only after they pass the bandwidth matrix, not merely after they boot. Verification belongs to the burn-in and health pipeline of Chapter 10.6; the scheduler consumes its verdict.
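The admission gate described above can be sketched as a comparison between a measured bandwidth matrix and the value the asserted topology implies. This assumes the probe output (from nvbandwidth or an NCCL benchmark) has already been parsed into per-pair GB/s figures; the parsing is out of scope. The 900 GB/s expectation, the 0.9 threshold, the sample measurements, and both function names are illustrative assumptions, not vendor guidance.

```python
def degraded_links(measured: dict, expected_gb_per_s: float, min_fraction: float = 0.9) -> list:
    """Return (src, dst, gb_per_s) for every pair below the acceptance floor."""
    floor = min_fraction * expected_gb_per_s
    return [
        (src, dst, gbps)
        for (src, dst), gbps in sorted(measured.items())
        if gbps < floor
    ]


def admit_to_pool(measured: dict, expected_gb_per_s: float, min_fraction: float = 0.9) -> bool:
    """A node or domain is schedulable only if no link falls below the floor."""
    return not degraded_links(measured, expected_gb_per_s, min_fraction)


# Hypothetical probe results in GB/s, keyed by (source GPU, destination GPU).
measured = {
    (0, 1): 880.0,
    (0, 2): 450.0,  # a link that fell back to roughly half width
    (1, 2): 875.0,
}

bad = degraded_links(measured, expected_gb_per_s=900.0)
print("degraded links:", bad)
print("admit:", admit_to_pool(measured, expected_gb_per_s=900.0))
```

The label on this domain would still read "one clique"; the gate refuses it because one pair delivers half the expected bandwidth.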

The interconnect hierarchy a scheduler must respect

Tier	Reach	Per-GPU bandwidth (order of magnitude)	What rides it	Scheduling implication
NVLink (scale-up)	Within the coherent NVLink domain (node → NVL72 rack)	~1.8 TB/s (NVLink 5); 130 TB/s aggregate per NVL72	TP and EP all-reduce / all-to-all; KV-cache over NVLink	Keep the hot collective inside one domain — the domain is the allocation unit
PCIe (host-local)	GPU ↔ CPU / GPU ↔ NIC inside a node	~64 GB/s (Gen5 x16, each direction)	Host staging, GPUDirect setup, KV spill to host	NUMA/PCIe affinity: pin GPU to its local NIC and CPU socket
Scale-out NIC (inter-node)	Across the back-end fabric (IB / RoCE), domain to domain	~50 GB/s per 400G port; ~400 GB/s per 8-rail node	Data-parallel all-reduce; pipeline-stage point-to-point	Only DP/PP should cross here; honor rail alignment and leaf locality

Per-GPU bandwidths are 2026-current Blackwell-class reference points; see the key numbers for sources and vintages. The scheduling implication column is the load-bearing one.

Slurm block scheduling: the rack as an atomic block

Slurm remains the dominant scheduler for large training fleets, and its answer to rack-scale topology is block scheduling (the topology/block plugin, hardened for NVL72 in 2025). The model is simple and powerful: you declare, in topology.yaml, that a set of nodes forms a block corresponding to one NVLink domain — for an NVL72 that is 18 compute nodes carrying 72 GPUs. The scheduler then treats the block as an atomic locality unit. A job requesting up to 18 nodes is consolidated into a single block, so its tensor- and expert-parallel collectives never leave NVLink. A job larger than one block spans whole blocks rather than being smeared arbitrarily, so the cross-domain traffic is confined to the parallelism dimensions that tolerate scale-out bandwidth.

Two refinements make this practical rather than rigid. The --segment argument lets a user declare the atomic group size their job needs — for example --segment=4 lets a 12-node job be split into three 4-node segments that the scheduler can pack flexibly across blocks while still keeping each segment's locality intact. This is the release valve for fragmentation: instead of demanding one contiguous 12-node hole, the scheduler can fill three smaller holes, lifting utilization without abandoning topology. The second refinement is the switch/nvidia_imex plugin (Slurm 24.05+), which manages IMEX channels — the driver-level memory-export/import access control that lets GPUs in a multi-node NVLink domain address each other's memory. Slurm provisions and tears down the IMEX channel per job, so two jobs sharing a physical NVLink domain cannot accidentally read each other's GPU memory. Block scheduling without IMEX management is a security and correctness gap on shared NVL72 hardware, not merely a tuning omission.

Deep dive: what topology.yaml actually encodes, and why segments beat raw node requests

The topology.yaml file is the scheduler's map of the bandwidth cliffs. For a two-rack NVL72 cluster it declares two blocks — say block01 and block02 — each binding a named range of 18 nodes to a block size of 18. That single declaration changes the scheduler's behavior from "find me 16 free nodes anywhere" to "find me 16 free nodes that share an NVLink domain, and if you can't, span whole domains." The plugin, introduced in Slurm 23.11 and matured for rack-scale systems through 2025, enforces this as a hard constraint: jobs at or below the block size stay consolidated; larger jobs grow in block-sized increments.

The reason --segment matters is the difference between locality and contiguity. A naive topology scheduler will only place a job if it can find one contiguous hole large enough — which, on a busy fleet, strands capacity because the holes are the wrong shape. Segments decouple the two: the user tells the scheduler "my job needs groups of N that each stay NVLink-local, but the groups themselves can be anywhere." A 12-node job with --segment=4 becomes three independent 4-node locality requirements, far easier to satisfy from a fragmented fleet than one 12-node block. The cost is that cross-segment traffic now rides scale-out, so you only request segments smaller than your job when the parallelism crossing the segment boundary (typically DP or PP) can tolerate it. Choosing the segment size is choosing which parallelism dimension you are willing to push onto the scale-out fabric — a mapping decision dressed up as a scheduler flag. → parallelism mapping below; oversubscription budget in Chapter 8.5.
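The difference between locality and contiguity can be made concrete with a toy placer. This is a sketch of the idea, not Slurm's algorithm: it takes a count of free nodes per block and places each segment wholly inside one block, first-fit in block-name order. The block names, the free counts, and the requirement that the job size be a multiple of the segment size are assumptions of the sketch.

```python
from typing import Optional


def place_segments(free_nodes: dict, job_nodes: int, segment: int) -> Optional[dict]:
    """Place a job as whole segments, each inside a single block.

    Returns {block: nodes_taken}, or None if the fleet cannot host
    every segment without splitting one across blocks.
    """
    if segment <= 0 or job_nodes % segment != 0:
        raise ValueError("job_nodes must be a multiple of a positive segment size")
    remaining = job_nodes // segment  # segments still to place
    placement = {}
    for block in sorted(free_nodes):
        if remaining == 0:
            break
        take = min(free_nodes[block] // segment, remaining)
        if take > 0:
            placement[block] = take * segment
            remaining -= take
    return placement if remaining == 0 else None


# Hypothetical fragmented fleet: 15 free nodes, but no block has 12 free.
free = {"block01": 5, "block02": 4, "block03": 6}

whole = place_segments(free, job_nodes=12, segment=12)
split = place_segments(free, job_nodes=12, segment=4)
print("segment=12:", whole)
print("segment=4: ", split)
```

With a 12-node segment the job cannot start even though 15 nodes are free. With 4-node segments it fits, at the price of pushing whatever crosses the segment boundary onto the scale-out fabric.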

Kubernetes: ComputeDomains, DRA, and the IMEX-per-node constraint

Kubernetes arrived at rack-scale topology from the opposite direction. Its original device-plugin model exposed a GPU as a flat, countable, location-free resource — nvidia.com/gpu: 8 — which is structurally blind to NVLink domains. The fix is Dynamic Resource Allocation (DRA), the resource API that lets a workload request a structured, parameterized resource rather than an opaque count, and on top of it NVIDIA's ComputeDomains abstraction. A ComputeDomain represents reachability between the distributed workers of a multi-node job: when its pods are scheduled, the platform dynamically creates the IMEX domain that lets those pods' GPUs address one another's memory over NVLink, and tears it down when the job ends. The IMEX channel surfaces inside each container as a device file, so a CUDA or NCCL application behaves as if all the GPUs sat on one board — the rack-scale illusion, delivered through the scheduler.

Two constraints define how you design around this. First, placement still needs an affinity rule: pods must be steered onto nodes that share an NVLink partition, expressed by matching the nvidia.com/gpu.clique node label, or the ComputeDomain spans a boundary it cannot bridge. The clique label is the Kubernetes analogue of the Slurm block. Second — and this is the sharp edge of the 2026 implementation — there is at most one ComputeDomain (one IMEX channel) per node, because only one IMEX daemon runs per node. If a ComputeDomain claims only part of a node's GPUs, the remaining GPUs on that node cannot join a different ComputeDomain and are effectively stranded. That single constraint pushes you hard toward whole-node, and in practice whole-rack, allocation for multi-node NVLink jobs: partial-node sharing across distinct multi-node domains is not expressible today. The platform stack to run this is specific — Kubernetes 1.32+ with the DRA APIs enabled, recent GPU Operator and driver — and getting the versions wrong yields silent fallback to flat scheduling, which is exactly the cliff you were trying to avoid.
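The cost of the one-ComputeDomain-per-node constraint is simple arithmetic. The sketch below counts GPUs that a partial claim leaves unable to join any other multi-node NVLink domain. It assumes 4 GPUs per node (72 GPUs across 18 nodes, per the figures above); the node names and the stranded_gpus helper are hypothetical.

```python
GPUS_PER_NODE = 4  # assumption: 72 GPUs / 18 nodes in one NVL72


def stranded_gpus(claims: dict, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Count GPUs left behind on nodes that a ComputeDomain only partly claims.

    claims maps node name -> GPUs claimed by the single ComputeDomain on
    that node. Untouched nodes (0 claimed) strand nothing.
    """
    stranded = 0
    for node, claimed in claims.items():
        if not 0 <= claimed <= gpus_per_node:
            raise ValueError(f"{node}: claim {claimed} outside 0..{gpus_per_node}")
        if claimed > 0:
            stranded += gpus_per_node - claimed
    return stranded


# Hypothetical claims from one multi-node job.
claims = {"node-01": 4, "node-02": 2, "node-03": 1, "node-04": 0}
print("stranded GPUs:", stranded_gpus(claims))
```

Whole-node claims strand nothing, which is the arithmetic behind the push toward whole-node and whole-rack allocation.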

Decision fork: Slurm block scheduling vs Kubernetes ComputeDomains for rack-scale jobs

Dimension	Slurm block scheduling	Kubernetes + ComputeDomains (DRA)
Topology unit	Block in topology.yaml = one NVLink domain (18 nodes / 72 GPUs)	ComputeDomain over nodes sharing the gpu.clique label
Locality enforcement	Hard constraint; --segment relaxes contiguity, not locality	Pod affinity on gpu.clique; scheduler must keep pods in-partition
IMEX management	switch/nvidia_imex plugin provisions channel per job	IMEX domain created/torn down dynamically per ComputeDomain
Partial-node sharing	Supported within a node; gang semantics per job	Blocked: one IMEX channel/node strands unused GPUs in a partial claim
Native gang scheduling	Yes — all-or-nothing allocation is intrinsic	Needs a gang-aware scheduler (KAI, Volcano, Kueue) atop DRA
Best fit	Large synchronous training; HPC-heritage fleets	Mixed train+serve estates; cloud-native, multi-tenant platforms

Both are valid in 2026; the choice usually follows the existing control plane (HPC-heritage training fleet vs cloud-native platform). Convergence (Slurm-on-K8s) is blurring the line — see Chapter 10.1.

Gang scheduling is not optional for synchronous jobs

A topology-aware scheduler that places pods or tasks incrementally will deadlock a fleet. A synchronous training job needs all N workers running simultaneously or none of them makes progress; yet a greedy scheduler will happily admit 6 of your 8 ranks and hold those GPUs idle waiting for the last 2 while a neighboring job does the same, until both are stuck holding partial allocations forever. Gang scheduling (all-or-nothing admission of the whole job) is the fix, and it is intrinsic to Slurm but must be added to Kubernetes via KAI Scheduler, Volcano, or Kueue. Topology-awareness and gang scheduling are a pair: gang guarantees the job runs at all, topology guarantees it runs at full bandwidth. Ship one without the other and you get either deadlock or a silent cliff.
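The deadlock is easy to reproduce in a toy model. The sketch below contrasts incremental (greedy, one worker at a time) admission with all-or-nothing admission on a pool of 8 GPUs and two jobs that each need 6. The pool size, job sizes, and function names are illustrative; real schedulers add queues, priorities, and timeouts that this ignores.

```python
def greedy_admit(free: int, jobs: dict) -> dict:
    """Hand out GPUs one worker at a time, round-robin across jobs."""
    held = {name: 0 for name in jobs}
    progressed = True
    while free > 0 and progressed:
        progressed = False
        for name, need in jobs.items():
            if free > 0 and held[name] < need:
                held[name] += 1
                free -= 1
                progressed = True
    return held


def gang_admit(free: int, jobs: dict) -> dict:
    """Admit a job only if its whole allocation fits; otherwise give it nothing."""
    held = {}
    for name, need in jobs.items():
        if need <= free:
            held[name] = need
            free -= need
        else:
            held[name] = 0
    return held


def running(held: dict, jobs: dict) -> list:
    """A synchronous job runs only when every one of its workers is placed."""
    return [name for name, need in jobs.items() if held[name] == need]


jobs = {"A": 6, "B": 6}
greedy = greedy_admit(8, jobs)
gang = gang_admit(8, jobs)
print("greedy:", greedy, "running:", running(greedy, jobs))
print("gang:  ", gang, "running:", running(gang, jobs))
```

Greedy admission ends with both jobs holding 4 GPUs and neither running; gang admission runs job A and leaves B queued with nothing held.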

Mapping parallelism onto the fabric

Topology-aware scheduling is only half the bargain; the other half is the job declaring its parallelism so the scheduler can place the right dimension on the right tier. 3D (and 4D) parallelism uses four collective patterns with sharply different bandwidth appetites, and the entire art is matching appetite to tier. Tensor parallelism (TP) all-reduces activations on every layer — the most bandwidth-hungry pattern, and the one that must live inside the NVLink domain. Expert parallelism (EP) for MoE models does all-to-all token routing; wide EP (EP32 and beyond) is precisely what a 72-GPU NVLink domain unlocks, because the all-to-all stays on NVLink instead of hitting the scale-out fabric. Pipeline parallelism (PP) passes activations point-to-point between stages — low volume, latency-sensitive, tolerant of crossing the domain boundary. Data parallelism (DP) all-reduces gradients once per step — high volume but infrequent and overlappable, the canonical dimension to push onto scale-out.

The placement rule that falls out: TP and EP inside the scale-up domain; PP and DP across the scale-out fabric. A correctly mapped GB200 NVL72 job sizes its TP and EP groups to fit within the 72-GPU domain, runs PP across racks, and lets DP all-reduce ride the rail-optimized fat-tree where SHARP in-network reduction can absorb it. Get the mapping backwards (DP inside the domain, TP across racks) and you have inverted the bandwidth hierarchy: the cheap-to-cross dimension hogs NVLink while the expensive-to-cross dimension chokes on NIC bandwidth. This is why the scheduler needs more than a node count; it needs the parallelism shape, so block sizes and segment sizes line up with the TP/EP group rather than slicing through it. The framework that emits this shape is the subject of Chapter 10.8; the fabric that carries it is Chapter 8.5.
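Whether a group size "lines up" with the domain can be checked mechanically. The sketch below assumes ranks are packed contiguously onto GPUs, with the bandwidth-hungry group (TP, or a combined TP/EP group) as the innermost dimension, and reports which groups would straddle a domain boundary. The layout assumption, the 144-GPU example, and the function name are illustrative; a real framework's rank ordering may differ.

```python
def straddling_groups(world_size: int, group_size: int, domain_size: int) -> list:
    """Return indices of rank groups that span more than one NVLink domain.

    Assumes rank r sits in domain r // domain_size and group g covers
    ranks [g * group_size, (g + 1) * group_size).
    """
    if world_size % group_size != 0:
        raise ValueError("world_size must be a multiple of group_size")
    bad = []
    for g in range(world_size // group_size):
        first = g * group_size
        last = first + group_size - 1
        if first // domain_size != last // domain_size:
            bad.append(g)
    return bad


DOMAIN = 72  # GPUs per NVL72 domain
WORLD = 144  # two racks, illustrative

print("group of 8: ", straddling_groups(WORLD, 8, DOMAIN))
print("group of 16:", straddling_groups(WORLD, 16, DOMAIN))
```

A group size of 8 divides 72, so no group crosses the boundary. A group size of 16 does not, and the group covering ranks 64 to 79 lands half in each rack, which is exactly the slicing the paragraph above warns against.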

Key numbers

Key number	What it measures	Source (year)
72 GPUs / 130 TB/s	GB200 NVL72 coherent NVLink domain: the rack-scale block the scheduler treats as atomic	NVIDIA GB200 NVL72 / NVLink product page (2025)
~1.8 TB/s	NVLink 5 per-GPU bidirectional bandwidth (scale-up); ~3.6 TB/s on Rubin (roadmap)	NVIDIA NVLink (2026)
~5–18x	Scale-up (NVLink) vs scale-out (~400G NIC) per-GPU bandwidth: the cliff the scheduler defends	NVIDIA / SemiAnalysis (2025)
18 nodes	Slurm block size for one NVL72 NVLink domain in topology.yaml (topology/block plugin)	NVIDIA Developer, Slurm block scheduling on GB200 NVL72 (2025)
1 / node	Max IMEX channels (ComputeDomains) per node in Kubernetes DRA; strands partial-node GPUs	NVIDIA Developer, MNNVL on Kubernetes (2025)
K8s 1.32+	Minimum Kubernetes with DRA APIs enabled for ComputeDomains; GPU Operator 25.3+	NVIDIA / AWS EKS GB200 guidance (2025)
~10.7%	Share of Llama-3 training job interruptions traced to network/config issues: the cost of getting topology wrong	Meta, via Introl topology analysis (2024)
~90% → ~96%	Training goodput, industry average vs best-in-class; topology-aware placement is a lever on the gap	SemiAnalysis ClusterMAX / CoreWeave (2025)

Fragmentation and the rack as the scheduling unit

Topology-awareness creates a tension that does not exist in flat scheduling: the better you defend locality, the worse your bin-packing gets. If a job can only land inside one NVLink domain, then a domain with 6 of 72 GPUs busy has 66 GPUs that are useless to any job needing more than 66 NVLink-local — they are stranded by fragmentation, not by lack of capacity. Flat schedulers never see this problem because they will place a job anywhere; they pay for it instead in silent cliffs. Topology schedulers make the fragmentation visible and force a policy choice, which is the honest trade.

There are three levers, and a 2026 fleet uses all three. Accept the strand: keep domains whole, leave the 66 GPUs idle until a job that wants them arrives, and price the lost utilization as the cost of full-bandwidth placement — the right call when goodput per job dominates, i.e. large training. Defragment: use preemption and migration to compact small jobs out of partially-used domains, freeing whole domains for large ones — effective but expensive on synchronous jobs because preemption forces a checkpoint-and-restart. Relax locality with segments: let jobs declare smaller atomic groups (Slurm --segment, or sizing the ComputeDomain below a full rack) so they fit the holes you have, accepting scale-out traffic on the boundary the segment crosses. On rack-scale hardware the rack — the NVLink domain — becomes the natural unit of allocation, not the GPU and not the node. You schedule, bill, drain, repair, and reserve in units of domains, because any policy that subdivides a domain has to reckon with the IMEX-per-node constraint, the bandwidth cliff, and the fragmentation it creates.
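The gap between "free" and "usable" capacity is what the strand looks like in numbers. The sketch below counts total free GPUs across domains and lists which domains could host a job NVLink-locally. The rack names and busy counts are hypothetical; the 72-GPU domain size is from the text.

```python
DOMAIN_GPUS = 72


def local_fit(busy_per_domain: dict, job_gpus: int) -> tuple:
    """Return (total free GPUs, domains able to host the job inside one domain)."""
    free_total = sum(DOMAIN_GPUS - busy for busy in busy_per_domain.values())
    fits = [
        name
        for name, busy in sorted(busy_per_domain.items())
        if DOMAIN_GPUS - busy >= job_gpus
    ]
    return free_total, fits


# Hypothetical fleet state: busy GPUs per NVLink domain.
busy = {"rack1": 6, "rack2": 40, "rack3": 30}

free_total, full_rack = local_fit(busy, 72)
_, fits_64 = local_fit(busy, 64)
print("free GPUs:", free_total)
print("domains that fit a 72-GPU job:", full_rack)
print("domains that fit a 64-GPU job:", fits_64)
```

Here 140 GPUs are free yet no domain can take a full-rack job, and only one can take a 64-GPU job. Each of the three levers is a different answer to that mismatch.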

The fragmentation fork: utilization vs goodput

This is the fork that defines your scheduler's personality. Optimize utilization and you pack aggressively, subdivide domains, tolerate cross-cliff placement, and keep the GPUs hot — every accelerator earning, fragmentation minimized, some jobs quietly running below peak. Optimize goodput and you defend domains whole, accept stranded GPUs, preempt to defragment, and guarantee every job runs at full bandwidth — peak per-job throughput, lower fleet utilization. The two pull in opposite directions and there is no universal answer: a training fleet running a handful of enormous synchronous jobs lives on the goodput side (one cliff taxes the whole run); a serving or batch fleet running thousands of small loosely-coupled jobs lives on the utilization side (no single job spans a domain anyway). Most real estates run both, which means the scheduler needs per-queue policy — goodput-defending for the training pool, utilization-packing for the inference pool — not one global knob. → multi-tenant fairness and queue policy in Chapter 10.3; goodput as the headline metric in Chapter 10.6.

Deep dive: why discovery errors are worse than placement errors

A placement error — putting a job on a cliff — is at least diagnosable: the bandwidth is wrong, a benchmark reveals it, and a better policy fixes the next run. A discovery error is worse, because it makes the scheduler confidently wrong. If a node's gpu.clique label or Slurm block assignment says it shares an NVLink domain with neighbors it does not actually reach at full bandwidth — a degraded NVLink lane that fell back to half-width, a partially-seated backplane connector, a transceiver throttling on heat — then the scheduler will place a tightly-coupled job into what it believes is one coherent domain, and the job will run on a hidden cliff that no policy can see. The scheduler did everything right against a map that was wrong.

This is why the 2026 discipline is to verify the topology before trusting it, and to make the schedulable pool a function of measured bandwidth, not asserted labels. The mechanics: at admission and after every repair, run an NCCL all-reduce or nvbandwidth probe across the asserted domain and compare against the expected matrix; if a link is below threshold, mark the node degraded and pull it from the topology the scheduler trusts, even though it boots and passes basic health. The Meta figure in the key numbers (roughly a tenth of large-training job interruptions traced to network and configuration issues) is in large part this class of problem: the fabric was not what the control plane thought it was. The scheduler is only as good as its map, so the map has to be measured, not declared. The verification pipeline lives in Chapter 10.6; the fault-domain and recovery framing in Chapter 10.7.

Anti-patterns

The same mistakes recur because each comes from treating the GPU as fungible when the fabric says it is not:

Flat scheduling on rack-scale hardware. Exposing nvidia.com/gpu: 8 with no clique or block awareness, then wondering why a 16-GPU job that landed 8-and-8 across two racks runs at 60% MFU. The hardware is non-blocking inside the domain and an order of magnitude slower across it; a location-free resource model cannot see the difference and so cannot avoid the cliff.
Topology without gang scheduling. Placing the right pods on the right nodes but admitting them incrementally, so synchronous jobs deadlock holding partial allocations. Locality and all-or-nothing admission are a pair; shipping one without the other trades a silent cliff for an outright stall.
Inverted parallelism mapping. Putting data-parallel all-reduce inside the NVLink domain and tensor-parallel across racks — using the scarcest bandwidth on the dimension that least needs it. The mapping must follow the appetite: TP/EP local, PP/DP remote.
Trusting labels over measurements. Letting a node into the schedulable pool because it booted and its clique label is present, without verifying the NVLink bandwidth matrix — placing jobs onto a degraded domain the scheduler believes is healthy.
Subdividing domains under the IMEX-per-node limit. Designing a Kubernetes multi-tenant scheme that hands partial nodes to distinct ComputeDomains, then discovering the one-IMEX-channel-per-node constraint strands the remaining GPUs. On rack-scale NVLink hardware, allocate in whole domains and bill in racks.

This chapter sits inside the scheduling plane defined in Chapter 10.1 and feeds the multi-tenancy and isolation policy of Chapter 10.3, where the fragmentation-vs-utilization fork becomes a fairness and quota question. The bandwidth cliffs it defends are engineered in the fabric chapters: scale-out topology, sizing, and oversubscription in Chapter 8.5, transport and protocols in Chapter 8.4, and the congestion control that governs cross-domain traffic in Chapter 8.6. The parallelism strategy that the scheduler must be told about is the province of Chapter 10.8; the topology verification and goodput telemetry that decide which nodes the scheduler may trust live in Chapter 10.6; fault domains and recovery in Chapter 10.7; and the goodput-vs-availability reframe that underlies the whole fragmentation trade in Chapter 12.2. Why training is the archetype that most rewards full-bandwidth placement is established in Chapter 1.2.