
Chapter 10.1

Orchestration Architecture & the Scheduling Plane

The scheduler is the layer that decides whether your GPUs run jobs or sit idle, and the choice between an HPC batch scheduler and a cloud-native one is a bet on what your fleet actually does, how much it costs to run two control planes, and whether you can place work on the fabric without leaving bandwidth on the floor.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Whether your workload-scheduling plane is Slurm (HPC-native gang scheduling), Kubernetes plus a batch scheduler (Volcano / KAI / Run:ai), or a deliberate hybrid via Slinky — and the operational cost of running two control planes if you cannot pick one.
Whether to adopt Kubernetes Dynamic Resource Allocation now (GA since 1.34, Sept 2025) or stay on the device-plugin + custom-scheduler model for one more cycle.
Where the boundary sits between the three planes — workload scheduler, node runtime, fleet/control plane — and which team owns each, because blurred ownership is where reliability incidents hide.
Build-vs-buy on the platform: assemble open-source (Slurm/K8s + Volcano/KAI + GPU Operator + Prometheus) yourself, or buy a managed fleet product (Mission Control, Base Command Manager, a neocloud's managed Slurm/K8s).
Which scheduling decisions are reversible (queue policy, fair-share weights, oversubscription) versus baked-in (the scheduler itself, the topology model, the isolation boundary) and must be chosen for the fleet you will have in three years, not the one you have today.

You can build a flawless data center — perfect PUE, a non-blocking fabric, golden-image nodes — and still ship 60% goodput because the layer that places work on the hardware is mis-designed. The orchestration stack is where megawatts of energized, depreciating silicon become either useful tokens or idle heat. It is the most software-defined decision in the facility and, perversely, the one most often inherited by accident: the HPC team brings Slurm because that is what they know, the platform team brings Kubernetes because that is what the rest of the company runs, and the two never reconcile until a frontier training run and an inference fleet are fighting over the same GPUs through two schedulers that cannot see each other.

This chapter establishes the architecture of that layer. We define the three planes every fleet must design coherently — the workload scheduler, the node runtime, and the fleet/control plane — and then take up the central fork: Slurm versus Kubernetes, the batch schedulers that retrofit gang-scheduling onto K8s (Volcano, KAI, Run:ai), the Slinky hybrids that try to run both at once, and Dynamic Resource Allocation (DRA), the substrate change that finally lets Kubernetes describe a GPU as something richer than an opaque countable unit. Each fork is scored by what it costs downstream in goodput, utilization, operational headcount, and lock-in. The deeper mechanics — topology-aware placement, multi-tenant isolation, the node software stack, autonomous recovery — each get their own chapter in Part 10; this one is the map that tells you which plane owns which problem.

The three planes

A working AI fleet is not one scheduler — it is three loosely-coupled control loops, each operating at a different timescale and owned, ideally, by a different on-call rotation. Conflate them and you get the recurring failure mode of GPU operations: a node-health problem that looks like a scheduler bug, a fabric event that looks like a job failure, a driver mismatch that surfaces as a mysterious collective hang three layers up.

The workload-scheduling plane answers "which job runs on which GPUs, and when?" It owns queues, priority, fair-share, gang/co-scheduling, preemption, and backfill. This is Slurm, or Kubernetes-plus-a-batch-scheduler. It operates at the timescale of seconds-to-minutes and is the plane this chapter is principally about.
The node runtime plane answers "can this node actually run the work correctly?" It owns the GPU driver, CUDA/ROCm, NCCL/RCCL, firmware, the container runtime, and the device exposure (device plugin or DRA driver). It operates at the timescale of a node's lifecycle and is treated canonically in Chapter 10.4. The scheduler trusts this plane; when the trust is misplaced (a single node with a mismatched NCCL or a degrading NVLink), the scheduler happily places a 1,024-GPU job across a set of nodes that includes the rotten one, and the whole gang hangs.
The fleet / control plane answers "is the fleet healthy, and how do we keep it that way?" It owns provisioning and bring-up (Chapter 10.5), observability and health telemetry (Chapter 10.6), and autonomous fault detection and recovery (Chapter 10.7). It operates at the timescale of hours-to-days and is where products like NVIDIA Mission Control and Base Command Manager live.

The seam that matters most is between the workload plane and the fleet plane, and it is mediated by a single verb: cordon-and-drain. When the fleet plane detects a sick node (via DCGM, an XID error, or a fabric counter), it must be able to tell the workload plane to stop scheduling onto it and to migrate or checkpoint the work that is already there, without a human in the loop. A fleet whose three planes cannot perform that handoff autonomously cannot hit high goodput at a ~3-hour cluster MTBF, full stop. That number is not hyperbole. Meta's Llama 3 run logged 419 unplanned interruptions over 54 days, about 7.8 per day or one roughly every three hours, yet still sustained >90% effective training time, precisely because the planes were wired to recover without paging a human for all but three of those events. Simple reliability arithmetic lands in the same range: with a per-GPU MTBF around 80,000 hours, GPU faults alone hit a 16,000-GPU cluster roughly every five hours (80,000 / 16,000), and host, fabric, and software failures shorten the interval further.
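The failure-interval figures above can be sanity-checked in a few lines. This is a minimal sketch, assuming independent failures so that the cluster-level interval is simply the per-GPU MTBF divided by the GPU count; the function names are illustrative, and the inputs are the figures quoted in the text. The two estimates need not coincide, because the modeled one counts only GPU faults while the observed one counts every unplanned interruption.

```python
def modeled_failure_interval_hours(per_gpu_mtbf_hours: float, gpu_count: int) -> float:
    """Cluster-level mean time between GPU failures, assuming independent failures."""
    return per_gpu_mtbf_hours / gpu_count


def observed_interruption_interval_hours(interruptions: int, days: float) -> float:
    """Average wall-clock hours between interruptions over an observation window."""
    return days * 24.0 / interruptions


# Inputs are the figures quoted in the text.
modeled = modeled_failure_interval_hours(80_000, 16_000)
observed = observed_interruption_interval_hours(419, 54)
per_day = 419 / 54

print(f"modeled GPU-only interval: {modeled:.1f} h")
print(f"observed interval:         {observed:.1f} h")
print(f"interruptions per day:     {per_day:.1f}")
```

Either way the interval is a handful of hours, which is shorter than any human-in-the-loop recovery process can reliably keep up with.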

Goodput is the SLO, not GPU utilization

The headline metric for a scheduling plane is not "GPU utilization" — a number that can read 100% while a synchronous job spins on a hung collective, accomplishing nothing. It is goodput: the fraction of wall-clock GPU time spent on useful, non-wasted, non-recomputed forward/backward progress. Google formalized it as effective-training-time divided by total time, with everything else accounted as badput — scheduling gaps, failed steps, checkpoint stalls, recovery, and stragglers. Industry-average goodput sits near 90%; best-in-class operators market ~96%. Every scheduling decision in this chapter — gang semantics, topology-awareness, preemption policy, the recovery handoff — is ultimately a lever on goodput, and the reliability overhead of hitting it runs 6–21% of TCO. Design the plane to maximize goodput, instrument it to measure goodput, and write your SLAs against goodput. Raw utilization is a vanity number.
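To make the goodput definition concrete, here is a minimal accounting sketch. It assumes you can attribute GPU-hours to the badput categories named above; the `GpuTimeLedger` class, its field names, and the sample hours are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuTimeLedger:
    """GPU-hours for one job or one fleet, split into useful time and badput."""
    productive_hours: float        # forward/backward progress that was kept
    scheduling_gap_hours: float    # GPUs allocated or idle while waiting for placement
    failed_step_hours: float       # work lost to failures and later recomputed
    checkpoint_stall_hours: float  # time blocked on checkpoint writes
    recovery_hours: float          # restart, reload, and re-warm time
    straggler_hours: float         # time ranks spent waiting on slow peers

    def badput_hours(self) -> float:
        return (self.scheduling_gap_hours + self.failed_step_hours
                + self.checkpoint_stall_hours + self.recovery_hours
                + self.straggler_hours)

    def total_hours(self) -> float:
        return self.productive_hours + self.badput_hours()

    def goodput(self) -> float:
        """Effective training time divided by total time."""
        return self.productive_hours / self.total_hours()


# Hypothetical sample: 1,000 GPU-hours, of which 900 were productive.
ledger = GpuTimeLedger(900, 30, 20, 15, 25, 10)
print(f"goodput = {ledger.goodput():.0%}, badput = {ledger.badput_hours():.0f} GPU-hours")
```

A utilization counter would report most of those badput hours as busy time, which is why the ledger, not the utilization gauge, is the thing to write an SLA against.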

The master fork: Slurm vs Kubernetes

Two schedulers dominate the workload plane, and they were born to solve different problems. Slurm comes from the supercomputing world: it assumes a long-lived, tightly-coupled batch job that wants all of its nodes at once, communicates over a fast fabric, runs to completion or fails as a unit, and is submitted by a researcher who thinks in sbatch scripts and partitions. Kubernetes comes from the web-services world: it assumes a fleet of independent, restartable, horizontally-scalable containers behind a load balancer, declaratively reconciled toward a desired state, with rolling upgrades and self-healing as first-class verbs. Neither was designed for the other's workload, and the friction at the boundary is the defining operational story of AI orchestration in 2026.

The rule of thumb the market settled on: roughly 70% of AI/HPC scheduling runs on Slurm, ~20% on Kubernetes, ~10% in-house. Training — synchronous, gang-scheduled, fabric-bound — gravitates to Slurm because gang scheduling is native: sbatch -N 256 gets you all 256 nodes simultaneously or the job waits, which is exactly the all-or-nothing semantics a synchronous all-reduce job needs. Inference — bursty, loosely-coupled, autoscaled, behind a service mesh — gravitates to Kubernetes because that is what Kubernetes was built for. A widely-cited operator data point captures the split cleanly: roughly 90% of GPU-cloud customers use Kubernetes for inference, while ~50% use Slurm for training. The two workloads pull toward opposite schedulers, which is why so many fleets end up running both.

The real fork is not "which scheduler is better." It is whether your fleet's workload mix justifies one plane or two, and if two, whether you federate them or converge them. That is the decision with real downstream cost: headcount, stranded capacity at the boundary, and the blast radius of a control-plane outage.

Workload-scheduling plane → the three real options

Dimension	Slurm (HPC-native)	Kubernetes + batch scheduler	Slinky / hybrid
Native model	Batch jobs, partitions, gang allocation by default	Declarative pods/services; gang requires Volcano / KAI / Run:ai	Both: Slurm semantics atop a K8s substrate (slurm-bridge / slurm-operator)
Best-fit workload	Synchronous pre-training, large HPC, tightly-coupled jobs	Inference serving, post-training pipelines, CI, cloud-native apps	Mixed fleets that must run training and inference side by side
Gang scheduling	Built-in; all-or-nothing is the default semantic	Add-on; Volcano/KAI provide gang + co-scheduling plugins	Inherited from Slurm for batch; K8s gang for services
Topology-awareness	topology.yaml, block scheduling, switch/nvidia_imex (→ NVL72)	Scheduler plugins + DRA attributes; maturing fast in 2026	Best of both, at the cost of two topology models to keep in sync
Multi-tenancy / fairness	Fair Tree, QoS, preemption — mature, fine-grained	Namespaces, quotas, RBAC + scheduler fair-share (KAI/Run:ai)	Two quota systems to reconcile
Ecosystem / elasticity	Thin; great at batch, weak at services, autoscaling, rollouts	Vast; service mesh, autoscaling, GitOps, operators	Full K8s ecosystem plus Slurm batch
Operational cost	Lightweight to run; familiar to HPC staff	Heavier control plane; needs K8s + GPU platform expertise	Highest — you operate and debug both stacks at once

A decision table, not a feature checklist. Market-share figures are 2026 practitioner rules of thumb (HPCwire, ClusterMAX); capabilities reflect Slurm 25.05 and Kubernetes 1.34-era batch schedulers.

The honest reading of that table: Slurm wins on training simplicity, Kubernetes wins on everything that isn't training, and the hybrid wins on flexibility at the price of complexity you must staff for. The trap is choosing the hybrid by default — "we'll run both, best of both worlds" — without budgeting for the fact that you now have two schedulers, two notions of a node, two quota systems, two failure surfaces, and two on-call rotations that must agree on what "this node is unhealthy" means. The convergence tooling has gotten genuinely good, but convergence is not free; it is a deliberate investment justified only when the workload mix truly demands both planes on the same physical fleet.

Retrofitting gang scheduling onto Kubernetes: Volcano, KAI, Run:ai

Vanilla Kubernetes schedules pods independently — which is fatal for a distributed training job. If a 64-pod job gets 60 pods placed and then blocks waiting on the last 4, those 60 GPUs sit idle holding the gang hostage, and two such half-placed jobs can deadlock each other indefinitely. This is the partial-gang / resource-deadlock problem, and solving it is the entire reason the K8s batch-scheduler ecosystem exists. Each of the three main options solves it with gang (co-)scheduling — place all pods of a job atomically or none — and then layers AI-specific scheduling on top.

Volcano is the CNCF-incubating batch system for Kubernetes — the de facto open-source default. It brings gang scheduling, queue/fair-share, preemption, backfill, and topology-aware plugins, and integrates with the training operators (PyTorch, MPI, Ray). It is the safe, vendor-neutral choice and the one you reach for if you want to avoid lock-in.
KAI Scheduler is NVIDIA's Kubernetes-native AI scheduler, open-sourced under Apache-2.0 in April 2025 — the engine extracted from the Run:ai acquisition. It adds gang scheduling, hierarchical fair-share quotas, bin-packing and spread strategies, GPU sharing/fractioning, and native DRA and topology-awareness. Open-sourcing it pushed these capabilities toward a commodity baseline.
Run:ai is the commercial platform (now NVIDIA, bundled into NVIDIA AI Enterprise) layered on KAI: fractional GPU, policy/quota governance, a multi-tenant UI, and integration with the broader NVIDIA fleet stack. You buy it when you want the managed product and the support contract, not the assembly job.

The decision here is a build-vs-buy in miniature. Volcano is the open, neutral baseline you operate yourself. KAI is the open NVIDIA-blessed engine — more AI-specific, but you are now standing on NVIDIA's roadmap. Run:ai is the supported product on top of KAI. Picking KAI or Run:ai buys you fractional-GPU and quota machinery out of the box (relevant the moment you have multi-tenancy — Chapter 10.3); picking Volcano keeps you maximally portable across vendors and avoids deepening NVIDIA platform dependence at the orchestration layer.

The vendor-gravity problem at the scheduling layer

In 2025–2026 NVIDIA assembled a near-complete vertical stack at the orchestration layer: it open-sourced KAI, folded Run:ai into NVIDIA AI Enterprise, ships Mission Control and Base Command Manager for fleet operations — and in December 2025 acquired SchedMD, the developer of Slurm itself. NVIDIA has committed to keeping Slurm open-source and vendor-neutral, but supercomputing practitioners have openly flagged the risk that roadmap and code prioritization could tilt toward NVIDIA hardware over time. The strategic point for an operator: the scheduling plane is increasingly a place where you can sleepwalk into single-vendor gravity. If a credible second-source GPU path (AMD ROCm 7.x — Chapter 10.4) matters to your procurement leverage, weight your scheduler choice toward genuinely vendor-neutral options (Slurm-the-open-project, Volcano) and treat NVIDIA-specific orchestration as a convenience you can exit, not a foundation you cannot.

Dynamic Resource Allocation: the substrate change

For a decade, Kubernetes treated a GPU as a countable integer: nvidia.com/gpu: 8, an opaque unit, all alike, node-local. That model breaks the moment GPUs stop being fungible — when you care which GPUs (two on the same NVLink island), in what shape (a specific MIG geometry), with what attributes (≥40 GB free, a particular topology), or shared across pods. Dynamic Resource Allocation (DRA) replaces counting with claiming: a workload describes the properties of the devices it needs via a ResourceClaim, and a vendor-supplied DRA driver plus the scheduler resolve the actual allocation. It is the same conceptual move as PersistentVolumeClaims for storage, applied to accelerators.

The timing is the news. DRA graduated to GA in Kubernetes 1.34 (September 2025) — the core resource.k8s.io APIs are now stable — and downstream distributions followed (DRA GA in OpenShift 4.21, March 2026). This matters because it changes what the scheduling plane can express natively: attribute-based selection ("2 GPUs on the same node interconnected by NVLink, min 40 GB VRAM"), fine-grained partitioning (MIG, time-slicing, fractional GPU) without static 1:1 pod-to-GPU mapping, and network-attached devices. Vendor reports cite utilization moving from the 45–60% range under device plugins toward 70–85% under DRA through better packing and sharing — a direct goodput-and-economics lever.
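The difference between counting and claiming can be shown with a toy model. The sketch below is not the Kubernetes API and does not use ResourceClaim syntax; the `Gpu` record, both allocator functions, and the four-device inventory are hypothetical, and exist only to show why "any two GPUs" and "two GPUs in one NVLink domain with at least 40 GB free" can resolve to different devices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    node: str
    nvlink_domain: str
    free_memory_gb: int


def allocate_by_count(gpus: list[Gpu], count: int) -> Optional[list[Gpu]]:
    """Counting model: devices are fungible, so the first `count` will do."""
    return gpus[:count] if len(gpus) >= count else None


def allocate_by_claim(gpus: list[Gpu], count: int,
                      min_free_memory_gb: int) -> Optional[list[Gpu]]:
    """Claiming model: `count` devices on one node, in one NVLink domain,
    each meeting the memory floor."""
    groups: dict[tuple[str, str], list[Gpu]] = {}
    for gpu in gpus:
        if gpu.free_memory_gb >= min_free_memory_gb:
            groups.setdefault((gpu.node, gpu.nvlink_domain), []).append(gpu)
    for members in groups.values():
        if len(members) >= count:
            return members[:count]
    return None  # no group satisfies the claim; the workload stays pending


# Hypothetical inventory.
inventory = [
    Gpu("gpu0", "node-a", "nvl-0", 24),
    Gpu("gpu1", "node-a", "nvl-1", 80),
    Gpu("gpu2", "node-b", "nvl-0", 80),
    Gpu("gpu3", "node-b", "nvl-0", 48),
]

by_count = allocate_by_count(inventory, 2)
by_claim = allocate_by_claim(inventory, 2, min_free_memory_gb=40)
print("count model:", [g.name for g in by_count])
print("claim model:", [g.name for g in by_claim])
```

In this inventory the counting model hands back a pair that spans two NVLink domains and includes a device below the memory floor, while the claim model skips to the only pair that satisfies every constraint.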

The decision: adopt DRA now, or run one more cycle on device-plugin + custom-scheduler? DRA is GA, but the surrounding ecosystem — driver maturity across vendors, scheduler-plugin integration, operational tooling — is still settling in mid-2026. For a greenfield K8s cluster being stood up today, DRA is the forward-looking bet and the model the topology-aware work in Chapter 10.2 increasingly assumes. For an existing production fleet with a working device-plugin setup, the conservative path is to pilot DRA on a partition while keeping the proven model in production until the driver/tooling story for your specific hardware is boring. Betting a brand-new revenue cluster on a still-stabilizing path is the kind of avoidable risk that shows up as goodput loss in month one.

Slinky: converging the two planes

If you genuinely need both Slurm's batch semantics and Kubernetes' ecosystem on the same physical fleet, the answer the industry converged on is Slinky — the open-source project (from SchedMD, now NVIDIA) that bridges the two rather than forcing a choice. It has two complementary pieces. The slurm-operator runs Slurm on Kubernetes: Slurm controller and compute daemons as managed K8s workloads, so you get Slurm's batch/gang semantics with Kubernetes lifecycle management underneath. The slurm-bridge goes the other way: it lets Slurm act as a scheduler for Kubernetes, so Slurm's allocation logic governs placement of K8s-submitted work. NVIDIA has published this pattern running HPC gang-scheduled training and cloud-native services side by side at 8,000+ GPU scale.

The promise is one physical fleet, one pool of GPUs, both workload styles — no more carving the cluster into a static "Slurm half" and "K8s half" that strand capacity at the boundary. The cost is real and must be named: you are now operating, patching, and debugging two scheduling systems wired together, with two topology models, two quota regimes, and a non-trivial integration surface where the interesting failures live. Convergence pays off for operators whose workload mix is genuinely bimodal and whose utilization losses from static partitioning exceed the operational cost of the hybrid. For a fleet that is ~90% training or ~90% inference, the simpler answer — pick the one scheduler that fits the dominant workload and serve the minority case with a small dedicated partition — is usually the better engineering call.

~70% / ~20% / ~10%: AI/HPC scheduler share, Slurm / Kubernetes / in-house (rule of thumb). Source: HPCwire, ‘Slurm vs Kubernetes in the Age of AI’; ClusterMAX (2026).

~90% K8s / ~50% Slurm: GPU-cloud customers using K8s for inference vs Slurm for training. Source: SemiAnalysis ClusterMAX (2025).

GA in 1.34: Kubernetes Dynamic Resource Allocation graduated to stable (Sept 2025). Source: Kubernetes blog, ‘v1.34: DRA has graduated to GA’ (2025).

45–60% → 70–85%: reported GPU utilization, device plugins vs DRA (better packing/sharing). Source: Red Hat / vendor DRA analyses (2026).

~90% / ~96%: goodput (effective-training-time), industry average vs best-in-class. Source: SemiAnalysis ClusterMAX / CoreWeave (2025).

~every 3 hr: observed interruption interval on Meta's 16k-GPU Llama 3 run (419 interruptions in 54 days); a ~80,000-hr per-GPU MTBF alone implies roughly every 5 hr. Source: Meta Llama 3 / domain reliability math (2025).

8,000+ GPUs: demonstrated scale of Slurm-on-Kubernetes (Slinky) side-by-side workloads. Source: NVIDIA Developer, Slinky / slurm-bridge blog (2025).

Apache-2.0, Apr 2025: NVIDIA open-sourced KAI Scheduler (gang, fair-share, DRA, topology). Source: NVIDIA / KAI-Scheduler GitHub (2025).

A reference control-plane topology

Stitched together, a 2026-current reference fleet looks like this — read it bottom-up, as planes stacked by timescale. At the base, bare metal is provisioned as code: BMC/Redfish bring-up, PXE, golden images, and a join step that registers each node with the control plane (Chapter 10.5). On every node sits the node runtime: a pinned, identical stack of GPU driver, CUDA/ROCm, NCCL/RCCL, and firmware, exposed to the scheduler through the GPU Operator's device plugin or a DRA driver (Chapter 10.4). The hard rule lives here — driver, CUDA, cuDNN, and NCCL pinned byte-identical across every node in a synchronous job, because a single mismatch causes collective hangs, silent bandwidth degradation, or wrong results.

Above the nodes sits the workload-scheduling plane — Slurm, or K8s + Volcano/KAI, or both via Slinky — which must be topology-aware: it has to know that the 72 GPUs in a GB200 NVL72 form one coherent NVLink domain and schedule the rack as a block, not as 72 interchangeable units (Chapter 10.2). Wrapping everything is the fleet/control plane: DCGM/NVML telemetry and XID/SXID parsing feed an observability stack (Chapter 10.6); a health-and-recovery loop cordons sick nodes, drains their work, swaps in hot spares, and triggers break-fix without a human (Chapter 10.7); and an IaC/GitOps layer keeps the whole thing declaratively reconciled. The architectural test of this topology is simple: can a node fail and the fleet detect, drain, recover, and resume — autonomously — fast enough to hold >90% goodput at a 3-hour MTBF? If the answer requires a human in the critical path, the topology is wrong regardless of which scheduler you picked.
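The detect, drain, recover handoff described above can be sketched as a small state machine. This is a toy model under stated assumptions, not any product's API: the `Node` and `Fleet` classes, the state names, and the requeue list are hypothetical stand-ins for what the fleet plane and the scheduler would do between them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class NodeState(Enum):
    SCHEDULABLE = "schedulable"
    CORDONED = "cordoned"   # no new work is placed here
    DRAINED = "drained"     # existing work has been moved off


@dataclass
class Node:
    name: str
    state: NodeState = NodeState.SCHEDULABLE
    jobs: list[str] = field(default_factory=list)


@dataclass
class Fleet:
    nodes: dict[str, Node]
    spares: list[Node]
    requeued: list[str] = field(default_factory=list)

    def handle_health_event(self, node_name: str) -> Optional[str]:
        """Cordon the sick node, drain its jobs, and swap in a spare if one exists."""
        node = self.nodes[node_name]
        node.state = NodeState.CORDONED
        # Stand-in for checkpoint-and-requeue: jobs go back to the queue.
        self.requeued.extend(node.jobs)
        node.jobs = []
        node.state = NodeState.DRAINED
        if not self.spares:
            return None
        spare = self.spares.pop(0)
        self.nodes[spare.name] = spare
        return spare.name


fleet = Fleet(
    nodes={"n1": Node("n1", jobs=["train-a"]), "n2": Node("n2", jobs=["train-a"])},
    spares=[Node("spare1")],
)
replacement = fleet.handle_health_event("n2")
print(replacement, fleet.nodes["n2"].state.value, fleet.requeued)
```

The point of the sketch is the shape of the loop: every step is a function call, none is a ticket. If any step in a real fleet requires a human, that step is where the goodput goes.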

Deep dive: why partial-gang scheduling deadlocks, and what gang scheduling actually guarantees

The single most important property a training scheduler must provide is gang (co-)scheduling, and it is worth understanding precisely what it buys you. A synchronous data-/tensor-/pipeline-parallel job is one logical process spread across N GPUs that must all start together and stay together; every training step ends in a collective (all-reduce, all-gather) that blocks until every rank arrives. If even one rank is missing, the entire job stalls. So a scheduler that places ranks independently — Kubernetes' default behavior — creates two pathologies.

First, partial placement waste: a 64-GPU job that gets 60 GPUs and blocks on the last 4 is holding 60 GPUs idle, producing nothing, until the gang completes. Second, and worse, resource deadlock: two 64-GPU jobs each grab 60 GPUs on a 120-GPU pool, each waits for 4 more that the other is holding, and neither ever runs. This is a classic deadly embrace that only an external timeout or a human breaks. Gang scheduling eliminates both by making placement atomic: the scheduler reserves all N slots and commits the job only when all N are simultaneously available; otherwise it places nothing and the job waits in the queue without consuming GPUs. Slurm does this by default; it is the native HPC model. Kubernetes does not, which is exactly why Volcano, KAI, and Run:ai exist: they add the co-scheduling plugin that gives K8s the all-or-nothing semantics a gang job requires. When you read "K8s needs a batch scheduler for AI," this deadlock is the concrete reason: not a nice-to-have, but a correctness requirement for any fleet running synchronous distributed jobs.
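The deadlock is easy to reproduce in a toy simulation. The sketch below assumes a hypothetical 120-GPU pool and two 64-GPU jobs, and models independent placement as a round-robin grant of one GPU at a time; it is an illustration of the two placement policies, not the algorithm any particular scheduler uses.

```python
def place_independently(pool_size: int, demands: list[int]) -> tuple[list[int], list[bool]]:
    """Grant GPUs one at a time, round-robin, with no notion of a gang."""
    held = [0] * len(demands)
    free = pool_size
    progressed = True
    while free > 0 and progressed:
        progressed = False
        for i, need in enumerate(demands):
            if free > 0 and held[i] < need:
                held[i] += 1
                free -= 1
                progressed = True
    running = [h == n for h, n in zip(held, demands)]
    return held, running


def place_as_gangs(pool_size: int, demands: list[int]) -> tuple[list[int], list[bool]]:
    """Admit a job only if its full demand fits; otherwise it holds nothing."""
    held = []
    free = pool_size
    for need in demands:
        if need <= free:
            held.append(need)
            free -= need
        else:
            held.append(0)  # job waits in the queue without consuming GPUs
    running = [h == n for h, n in zip(held, demands)]
    return held, running


independent = place_independently(120, [64, 64])
gang = place_as_gangs(120, [64, 64])
print("independent:", independent)
print("gang:       ", gang)
```

Under independent placement both jobs end up holding 60 GPUs and neither runs, so all 120 GPUs are allocated and idle. Under gang placement one job runs on 64 GPUs and the other waits holding none, leaving 56 GPUs free for backfill.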

Build vs buy the platform

The last fork in this chapter is whether to assemble the orchestration platform from open-source parts or buy a managed product. The assemble path — Slurm or K8s, plus Volcano/KAI, plus the GPU Operator, plus a Prometheus/Grafana/DCGM observability stack, plus your own IaC and recovery automation — gives maximal control and no per-node licensing, at the cost of a standing platform-engineering team that owns the integration, the upgrades, and the 3 a.m. pages when a driver bump breaks a collective. The buy path — NVIDIA Mission Control / Base Command Manager, a hyperscaler's managed Slurm/K8s, or a neocloud's turnkey cluster — compresses time-to-first-job and hands you autonomous recovery and validated golden stacks out of the box, at the price of a license, a vendor's roadmap, and less ability to customize the deep internals.

The deciding variables are the same three that govern most build-vs-buy calls in this guide: scale (a large, durable fleet amortizes a platform team; a small or short-lived one cannot), differentiation (if orchestration is a competitive edge — a neocloud selling goodput — you build; if it is undifferentiated plumbing, you buy), and time-to-power (buying collapses months of integration into days). The recurring mistake is the mid-size operator who builds a bespoke platform to save license fees, then spends three senior engineers' salaries maintaining it — burning far more than the license, and shipping lower goodput than the managed product would have, because they are re-deriving autonomous recovery that a vendor already shipped. The productization view of this same fork — what operators actually sell on top of these platforms — is the subject of Chapter 10.9.

Reversible vs irreversible scheduling decisions

Irreversible-ish (choose for the fleet you'll have in three years): the workload scheduler itself (migrating a production fleet from Slurm to K8s or back is a re-platforming project, not a config change); the topology model and how the rack maps to a scheduling unit; and the multi-tenant isolation boundary (soft namespaces vs hard per-tenant clusters — Chapter 10.3) which entangles security and billing. Reversible (defer, tune in production): queue and partition layout, fair-share weights and QoS tiers, preemption and backfill policy, oversubscription ratios, and the inference autoscaling signal. The strategic move, as everywhere in this guide, is to make the irreversible substrate accommodate the workload you are ramping toward — choose a topology-aware, DRA-capable plane now even if today's jobs don't need it — while keeping the policy knobs reversible and tuned against goodput as the fleet evolves.

Deep dive: the operational cost of running two control planes

"We'll just run both Slurm and Kubernetes" is the most under-priced decision in AI orchestration, so it is worth itemizing what "both" actually costs once you are past the demo. You now maintain two definitions of a node — a Slurm node and a K8s node — and a node that is healthy to one can be cordoned by the other, so your health-and-recovery loop must reconcile both views or risk scheduling onto hardware the other plane already condemned. You maintain two quota and fairness systems — Slurm Fair Tree/QoS and K8s namespace quotas / scheduler fair-share — and a tenant's entitlement must be expressed, and enforced, consistently across both, or a user games the seam. You maintain two topology models, and if they drift, one plane places a job badly across an NVLink boundary the other plane would have respected (Chapter 10.2). You carry two upgrade cycles with their own breakage modes, and two failure surfaces whose interaction is where the genuinely confusing incidents originate.

Slinky exists precisely to collapse some of this — one substrate, bridged scheduling — but it does not make the complexity vanish; it relocates it into the bridge, which is itself software you operate and debug. The defensible reasons to pay this cost are specific and bimodal: a research org that must run frontier Slurm training and productionize inference on K8s on the same capital-constrained fleet, where static partitioning would strand 20–30% of GPUs at the boundary. The indefensible reason is inertia — two teams who each brought their own scheduler and never reconciled. Before committing to two planes, compute the stranded-capacity cost of simply partitioning the fleet and giving each workload its own scheduler; very often that number is smaller than a year of two-plane operational overhead, and the simpler architecture wins.
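The comparison recommended above reduces to two annual cost estimates. The sketch below is a back-of-envelope model with entirely hypothetical inputs (fleet size, GPU-hour cost, stranded fraction, headcount, and loaded salary are placeholders, not figures from this guide); substitute your own numbers before drawing any conclusion.

```python
HOURS_PER_YEAR = 8760


def stranded_capacity_cost(gpu_count: int, stranded_fraction: float,
                           cost_per_gpu_hour: float) -> float:
    """Annual cost of GPUs left idle at the boundary of a statically partitioned fleet."""
    return gpu_count * stranded_fraction * cost_per_gpu_hour * HOURS_PER_YEAR


def two_plane_overhead_cost(extra_engineers: int, loaded_cost_per_engineer: float,
                            integration_cost: float = 0.0) -> float:
    """Annual cost of operating and debugging two converged scheduling planes."""
    return extra_engineers * loaded_cost_per_engineer + integration_cost


def cheaper_option(partition_cost: float, hybrid_cost: float) -> str:
    return "partition" if partition_cost <= hybrid_cost else "hybrid"


# Hypothetical fleet: 1,024 GPUs at $2.00 per GPU-hour, 4 extra engineers at $300k loaded.
hybrid = two_plane_overhead_cost(4, 300_000)
low_stranding = stranded_capacity_cost(1024, 0.05, 2.0)
high_stranding = stranded_capacity_cost(1024, 0.25, 2.0)

print(f"5% stranded:  ${low_stranding:,.0f} vs ${hybrid:,.0f} -> {cheaper_option(low_stranding, hybrid)}")
print(f"25% stranded: ${high_stranding:,.0f} vs ${hybrid:,.0f} -> {cheaper_option(high_stranding, hybrid)}")
```

With these placeholder inputs, partitioning wins when little capacity is stranded and the hybrid wins when stranding reaches the 20–30% range cited above; the crossover point moves with fleet size and GPU-hour cost, which is why the calculation has to be redone per fleet.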

This chapter is the map; the rest of Part 10 is the territory. Topology-aware and rack-scale scheduling — the NVL72-as-block problem this chapter only names — is engineered in Chapter 10.2. The multi-tenancy, quota, and isolation machinery that sits on top of the scheduler is Chapter 10.3. The node runtime plane the scheduler trusts — drivers, CUDA/ROCm, NCCL/RCCL, the golden stack — is Chapter 10.4; provisioning and bring-up is Chapter 10.5; observability and GPU health is Chapter 10.6; and the autonomous detect-drain-recover loop that makes high goodput possible is Chapter 10.7. The training frameworks the scheduler ultimately serves live in Chapter 10.8, and inference-serving's distinct control plane in Chapter 10.11. The fabric-side oversubscription and topology decisions that the scheduling plane must respect are in Chapter 8.5; the checkpoint math behind interruption tolerance is in Chapter 9.4; and the confidential-computing isolation boundary referenced in multi-tenancy is canonical in Chapter 11.5.