
Chapter 10.4

Node Software Stack: Drivers, CUDA/ROCm, NCCL & Firmware

The node software stack is a single versioned organism — driver, CUDA/ROCm runtime, NCCL/RCCL, and firmware must move together as one pinned, attested artifact across every node, because a one-line version skew in a synchronous cluster does not slow the job, it hangs it.

GOODPUT · DENSITY-RAMP

What you'll decide here

Whether you run a single fleet-wide golden stack (one driver+CUDA+NCCL+firmware tuple, pinned and attested on every node) or allow per-job stacks — and who owns the blast radius when the two diverge.
Driver injection model: NVIDIA GPU Operator (or AMD GPU Operator) managing containerized drivers and the toolkit, versus host-baked drivers in a golden image — the trade is rolling-upgrade agility against bare-metal determinism.
Whether you commit to a CUDA-only fleet or build a genuinely dual-vendor node stack (CUDA + ROCm), and therefore whether you pay the ROCm engineering tax to buy second-source leverage and ~15-30% lower hardware cost.
NCCL/RCCL tuning posture: ship NVIDIA's auto-tuned defaults, or invest in per-topology tuning (algorithm/protocol selection, SHARP, rail awareness) that recovers the last 10-20% of bus bandwidth on your specific fabric.
Firmware lifecycle ownership: rolling/canary updates over Redfish/PLDM with attestation and rollback, versus deferring firmware until something breaks — a deferral that quietly accumulates the gray failures that destroy multi-week runs.

Above the rack and below the scheduler sits a layer that almost no slide deck shows and almost every multi-week training failure traces back to: the node software stack. It is the vertical column of software that turns a powered, networked GPU server into a node a distributed job can actually use: the kernel driver that talks to the silicon, the CUDA or ROCm runtime the framework links against, the collective-communication library (NCCL on NVIDIA, RCCL on AMD) that carries every all-reduce, and the firmware estate (GPU VBIOS, NVSwitch, NICs such as ConnectX or Thor, BMC, PSU) that sits beneath all of it. These are not independent knobs. They are a single coupled organism with a compatibility matrix, and the central engineering fact of this chapter is that the matrix is unforgiving at scale: a synchronous training job runs at the speed of its slowest, most-skewed node, and a mismatched NCCL version across two nodes out of two thousand does not degrade the run; it hangs the collective and stalls every GPU behind it.

Most operators treat this layer as plumbing until it costs them a run. We trace the vertical stack and the driver lifecycle; the injection model (GPU Operator vs golden image) and the container-runtime layer beneath it; the AMD ROCm path as a real but taxed second source; NCCL/RCCL tuning as the difference between paper bandwidth and realized bus bandwidth; and the discipline that holds it all together — fleet-wide version pinning, the golden stack, and firmware update mechanics under attestation. The recurring fork is the same: uniformity buys goodput, per-job flexibility buys agility, you cannot maximize both, and choosing wrong shows up not as a config error but as a stalled $50M run.

The vertical stack: one column, many version constraints

Read the node stack bottom-to-top and the coupling becomes obvious. At the base is firmware: GPU VBIOS/InfoROM, NVSwitch/UALink-switch firmware, NIC firmware (ConnectX/BlueField or Broadcom Thor on NVIDIA-GPU nodes; Pollara/Pensando on AMD), BMC, BIOS/UEFI, PSU and CDU controllers, each with its own version, its own update path, and its own dependency on the layers above. On top of firmware sits the kernel driver (the NVIDIA data-center driver branch, or AMDGPU/amdkfd for ROCm), which must match the kernel ABI and the GPU's firmware. The driver exposes a user-space runtime: the CUDA driver API and toolkit (CUDA 13.3 as of mid-2026), or the ROCm runtime (HIP). Above that are the math and communication libraries (cuDNN, cuBLAS, NCCL on NVIDIA; MIOpen, rocBLAS, RCCL on AMD) and finally the framework (PyTorch, JAX, vLLM) the user actually writes against.

The reason this matters is that the constraints run both ways. A given CUDA toolkit requires a minimum driver; a given NCCL build is compiled against a CUDA version; a given firmware revision is validated against a driver branch; and the framework pins a cuDNN/NCCL it was built and tested with. NVIDIA's compatibility features loosen this slightly: minor-version compatibility lets an application built with a newer toolkit run on an older driver within the same CUDA major version, and the forward-compatibility (cuda-compat) package lets a newer CUDA major version run on an older data-center driver branch. Those are escape hatches, not a license to let the fleet drift. The operative posture for a production cluster is the opposite: pin everything, identically, everywhere, and treat a deviation as a defect.
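The two-way constraints above can be written down as a small consistency check. What follows is a minimal sketch, not a vendor tool: the `StackTuple` fields, the `violations` helper, and above all the driver floors in `MIN_DRIVER_FOR_CUDA_MAJOR` are illustrative assumptions, and the real values have to come from the vendor's release notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackTuple:
    driver: tuple[int, int]     # (branch, minor), e.g. (580, 65)
    cuda: tuple[int, int]       # toolkit (major, minor)
    nccl_built_for_cuda: int    # CUDA major version the NCCL build targets
    framework_nccl: str         # NCCL version the framework was tested with
    nccl: str                   # NCCL version actually installed


# HYPOTHETICAL driver floors keyed by CUDA major version; placeholders only.
MIN_DRIVER_FOR_CUDA_MAJOR = {12: (525, 0), 13: (580, 0)}


def violations(stack: StackTuple) -> list[str]:
    """Return every constraint the candidate tuple breaks (empty list = consistent)."""
    problems = []
    floor = MIN_DRIVER_FOR_CUDA_MAJOR.get(stack.cuda[0])
    if floor is None:
        problems.append(f"no validated driver floor for CUDA {stack.cuda[0]}.x")
    elif stack.driver < floor:
        problems.append(f"driver {stack.driver} is below the floor {floor}")
    if stack.nccl_built_for_cuda != stack.cuda[0]:
        problems.append("NCCL build targets a different CUDA major version")
    if stack.nccl != stack.framework_nccl:
        problems.append("installed NCCL differs from the framework's tested NCCL")
    return problems
```

Run against a candidate tuple in the lab, an empty list means the combination is internally consistent; it says nothing about whether the combination performs well, which still needs the acceptance suite.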

The hard rule: identical NCCL across every node, or the collective hangs

This is the single most important operational invariant in the chapter, and it is not advisory. In a synchronous-training cluster, the driver + CUDA + cuDNN + NCCL tuple must be byte-for-byte identical on every participating node. A mismatched NCCL version between even two nodes does not produce a clean error — it produces a hang: the collective waits forever for a peer speaking a slightly different wire protocol, the watchdog eventually fires, the job dies, and you restart from the last checkpoint having burned hours of cluster-time on a one-line skew. Bandwidth-only mismatches (a tuning-table difference) are subtler and worse: the run completes but at degraded bus bandwidth, and you lose 10-20% of throughput silently for weeks. Version skew is the most common self-inflicted goodput wound in the fleet, and it is entirely preventable with a pinned golden stack and an admission check that refuses to schedule a job onto a node whose stack hash does not match.
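The invariant can be checked mechanically before a job is placed. The sketch below assumes each node reports its (driver, CUDA, cuDNN, NCCL) versions as a tuple of strings; the function names and the reporting format are hypothetical, and the version strings used with it would be whatever the fleet has pinned.

```python
def find_skewed_nodes(reported: dict[str, tuple[str, ...]],
                      reference: tuple[str, ...]) -> dict[str, tuple[str, ...]]:
    """Return the nodes whose reported stack tuple differs from the pinned reference."""
    return {node: stack for node, stack in reported.items() if stack != reference}


def admit(job_nodes: list[str],
          reported: dict[str, tuple[str, ...]],
          reference: tuple[str, ...]) -> bool:
    """Admission check: allow scheduling only if every requested node matches."""
    requested = {node: reported[node] for node in job_nodes}
    return not find_skewed_nodes(requested, reference)
```

The comparison is deliberately exact: a node that differs only in an NCCL patch release is rejected just like one on a different driver branch, because either can hang the collective.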

The driver lifecycle and the injection fork

NVIDIA ships data-center drivers in two cadences, and choosing between them is the first real fork. Production Branches rotate roughly annually with about a year of full support; Long-Term Support Branches (LTSB) get a three-year lifecycle prioritizing stability over features. As of mid-2026 the R580 branch is the current LTSB, with support running to roughly August 2028, paired against CUDA 13.x. A fleet that values determinism over feature velocity pins to an LTSB and upgrades on a deliberate cadence; a fleet chasing the newest accelerator generation or a kernel feature may have to ride a Production Branch and accept the shorter support window. The consequence of choosing wrong is asymmetric: ride too-new a branch and you inherit regressions on hardware you cannot afford to debug under a live run; ride too-old a branch and you cannot enable the next density generation when the ramp arrives.

The second fork is how the driver lands on the node. Two models dominate, and they trade agility against bare-metal determinism.

Driver/stack injection model — GPU Operator vs golden image

Dimension	GPU Operator (containerized)	Golden image (host-baked)
Where the driver lives	Containerized, deployed/managed by the Operator on Kubernetes	Baked into the host OS image at provisioning time
Upgrade mechanics	Rolling/cordon-drain node upgrades via Operator CRDs	Re-image or in-place package update via config-mgmt (Ansible/etc.)
Best fit	Kubernetes-native, inference, heterogeneous, fast-moving fleets	Slurm/HPC training, bare-metal, change-controlled fleets
Determinism	Good, but more moving parts (Operator + container runtime)	Highest — node boots to a known, frozen, attested state
Blast radius of a bad version	Bounded by rollout policy; can canary one node pool	Bounded by image promotion; rollback = re-image
Component coverage	Driver, toolkit, device plugin, DCGM, MIG mgr, container toolkit	Driver + toolkit baked; plugins/DCGM often still Operator-managed

The two dominant ways to land and manage the node stack. 'Both' (host-baked driver + Operator-managed toolkit/plugins) is increasingly common at large fleets. NVIDIA GPU Operator and AMD GPU Operator are the reference implementations.

The GPU Operator (NVIDIA's, and AMD's equivalent) is the Kubernetes-native answer: it deploys and lifecycles the containerized driver, the container toolkit, the Kubernetes device plugin, DCGM/DCGM-exporter, the MIG manager, and node-feature discovery as a coordinated set, and it performs rolling driver upgrades by cordoning and draining a node before swapping its stack. That is exactly the right tool for a cloud-native, inference-heavy, or heterogeneous fleet where you want to roll a driver across thousands of nodes without re-imaging. The cost is more moving parts in the critical path and a dependency on the container runtime being correctly wired (the NVIDIA Container Toolkit injecting /dev/nvidia*, libraries, and the right capabilities into the container; the equivalent CDI-based device injection on the AMD side). For change-controlled Slurm/HPC training fleets the opposite instinct usually wins: bake the driver and toolkit into a golden OS image, attest it at boot, and treat every node as a frozen, identical artifact — the determinism that synchronous training rewards. Many large operators run a hybrid: host-baked driver for determinism, Operator-managed plugins/DCGM/MIG for the layers that benefit from declarative lifecycle. The provisioning machinery that produces and promotes those images is the subject of Chapter 10.5.

The AMD ROCm path: a real second source, with a tax

For most of the last decade the node stack was a single-vendor question because the alternative was not credible at production scale. That has changed. ROCm 7.x (7.0 shipped in September 2025 with support for the MI350X/MI355X; the line has since moved to 7.1/7.2) is the first AMD software stack on which a serious operator can stand up a multi-thousand-GPU training or inference fleet without heroics. The strategic prize is real: a credible second source breaks single-vendor allocation pain and lands roughly 15-30% lower hardware cost per unit of compute, plus negotiating leverage that is worth more than the sticker discount. The reason it is still a fork and not a free lunch is the ROCm tax: the engineering cost of the remaining ecosystem gaps.

Two things make ROCm viable today. First, RCCL maintains NCCL-API parity: it implements the same collective API surface NCCL exposes, so frameworks that call NCCL can call RCCL with minimal change, and the tuning concepts (algorithm/protocol selection, rail awareness, LL/LL128 protocols) carry over. Second, the framework story has matured — upstream PyTorch and JAX run on ROCm, and inference engines like vLLM support AMD targets. Where the tax still bites is in the long tail: custom CUDA kernels and Triton paths that assume NVIDIA intrinsics, profiling and debugging tooling that is less mature than Nsight/CUPTI, library coverage gaps (a kernel that exists tuned in cuDNN but not yet in MIOpen), and — the number that decides TCO — the realized-MFU gap. Independent benchmarking has shown AMD parts delivering a meaningful but workload-dependent fraction of NVIDIA's realized throughput on identical models; on some inference workloads the gap has closed substantially, on others it persists. The node-stack consequence is concrete: a dual-vendor fleet means two golden stacks, two firmware estates, two sets of NCCL/RCCL tuning tables, and two on-call runbooks. You buy second-source leverage and pay in operational surface area. The deeper hardware and TCO treatment of this choice lives in Chapter 7.3; the lock-in economics of CUDA vs ROCm vs XLA vs Neuron in Chapter 7.9.

The dual-vendor fork: when the ROCm tax is worth paying

Default to a single-vendor golden stack unless you have a specific reason not to — uniformity is the cheapest goodput you will ever buy, and a second vendor doubles your node-stack surface area. The ROCm tax is worth paying when (a) you operate at enough scale that 15-30% hardware savings dwarfs the incremental SRE/ML-engineering headcount; (b) your workload mix is inference-weighted, where the realized-MFU gap is smallest and RCCL/vLLM maturity is highest; or (c) you need second-source leverage badly enough that the option value alone justifies the cost. It is rarely worth paying for a small, training-heavy, custom-kernel-heavy fleet, where the gap is widest and the per-GPU engineering cost is highest. The wrong move is to go dual-vendor for the discount and discover you have funded a permanent second on-call rotation, a second firmware-validation matrix, and a second NCCL/RCCL tuning project that never quite reaches parity.
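Condition (a) is a break-even calculation, and it helps to see its shape. This is a rough sketch under stated assumptions: it counts only hardware saving against incremental staffing, ignores the realized-MFU gap and option value, and every input (headcount, loaded cost, period) is a hypothetical to be replaced with your own figures.

```python
def dual_vendor_net_benefit(displaced_capex: float, saving_fraction: float,
                            extra_engineers: int, loaded_cost_per_engineer: float,
                            years: float) -> float:
    """Hardware saving minus the incremental staffing cost over the period."""
    hardware_saving = displaced_capex * saving_fraction
    rocm_tax = extra_engineers * loaded_cost_per_engineer * years
    return hardware_saving - rocm_tax


def break_even_capex(saving_fraction: float, extra_engineers: int,
                     loaded_cost_per_engineer: float, years: float) -> float:
    """Spend on the second vendor at which the saving exactly covers the staffing cost."""
    return extra_engineers * loaded_cost_per_engineer * years / saving_fraction
```

With a 15% saving and a hypothetical ten extra engineers at 400,000 per year over three years, the staffing cost is 12 million and the break-even spend is 80 million; below that, the discount does not cover the second on-call rotation.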

NCCL/RCCL: where paper FLOPS become realized bus bandwidth

The collective library is the most performance-critical and most tuning-sensitive component in the node stack, because at training scale above a few hundred GPUs the fabric, not the GPU, sets job completion time, and NCCL/RCCL is what drives the fabric. Its job is to execute the collectives that define distributed training (all-reduce, all-gather, reduce-scatter, all-to-all) at the highest possible bus bandwidth (busbw), the effective throughput a collective achieves, normalized so that it can be compared directly against the link's theoretical ceiling. The acceptance bar operators gate handoff on is concrete: NCCL all_reduce_perf should reach roughly 92% of theoretical fabric bandwidth, and hold there as the test scales from two nodes to the full cluster. That is about 370 GB/s busbw against a 400 GB/s per-node ceiling (for example, 8 × 400 Gb/s NDR ports). Falling short of that is not a benign inefficiency; it is a tax amortized across the entire run, because every GPU waits on every collective.
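The gate is simple arithmetic. The sketch below uses the bus-bandwidth convention from the nccl-tests documentation for all-reduce (algorithm bandwidth scaled by 2(n-1)/n) and assumes the 400G figure means a 400 GB/s per-node ceiling, i.e. 8 ports × 400 Gb/s ÷ 8 bits per byte; the function names are illustrative.

```python
def allreduce_busbw(bytes_per_rank: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce, following the nccl-tests convention."""
    algbw = bytes_per_rank / seconds / 1e9          # algorithm bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks      # all-reduce correction factor


def passes_gate(busbw_gbs: float, theoretical_gbs: float, gate: float = 0.92) -> bool:
    """True when measured busbw meets the acceptance fraction of the ceiling."""
    return busbw_gbs >= gate * theoretical_gbs


# 8 NICs x 400 Gb/s = 3200 Gb/s = 400 GB/s per node; 92% of that is 368 GB/s.
PER_NODE_CEILING_GBS = 8 * 400 / 8
```

The correction factor is what makes busbw independent of rank count, which is why the same threshold can be applied from two nodes up to the full cluster.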

NCCL auto-tunes well out of the box, and for many fleets the right answer is to ship defaults and not touch the knobs. But the last 10-20% of busbw on a specific topology is recovered only by tuning the library to the fabric you actually built: algorithm selection (ring vs tree vs the newer adaptive trees, CollNet/PAT), protocol selection (Simple/LL/LL128), rail awareness so traffic stays on the rail-optimized topology, channel and buffer sizing, and — where the fabric supports it — SHARP in-network reduction. NCCL 2.27+ composes NVLink-SHARP and IB-SHARP, cutting the GPU SM count consumed by a reduction from ~16 to ≤6 and halving the data on the wire; the 2.28+ line adds symmetric-memory kernels and Multimem multicast over NVLink-SHARP within an NVL72 domain. The in-network-compute mechanics live in Chapter 8.6, and the topology/oversubscription decisions that NCCL tuning must match in Chapter 8.5. The choice is sharp: ship defaults and accept ~80-90% of achievable busbw, or invest in per-topology tuning and recover the rest. Either way the tuning tables become part of the pinned golden stack, because a tuning skew is a silent bandwidth skew across nodes.
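Per-topology tuning usually starts as a sweep over algorithm and protocol. The sketch below only builds the environments and picks a winner from measurements you supply; `NCCL_ALGO` and `NCCL_PROTO` are real NCCL environment variables, while the helper names and any busbw numbers fed to `best_profile` are hypothetical. In practice the best pair varies with message size, so a real sweep would be repeated per size bucket.

```python
from itertools import product

ALGOS = ["Ring", "Tree"]               # a subset of the values NCCL_ALGO accepts
PROTOS = ["Simple", "LL", "LL128"]     # the values NCCL_PROTO accepts


def sweep_envs(base_env: dict[str, str]) -> list[dict[str, str]]:
    """Build one environment per algorithm/protocol pair to benchmark."""
    return [{**base_env, "NCCL_ALGO": algo, "NCCL_PROTO": proto}
            for algo, proto in product(ALGOS, PROTOS)]


def best_profile(results: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Pick the (algo, proto) pair with the highest measured busbw in GB/s."""
    return max(results, key=results.get)
```

Whatever the sweep selects belongs in the pinned stack alongside the library version, so that every node launches with the same settings.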

Deep dive: why a NCCL hang is a version-skew detective story (and how to design it out)

When a synchronous job stops making progress, the symptom is almost never a stack trace pointing at the culprit — it is a watchdog timeout and a cluster of GPUs all blocked in a collective. The diagnosis is a detective story, and the usual suspects are all in the node stack. NCCL version skew across nodes is the classic: two builds negotiate a slightly different protocol and one peer waits forever. Firmware skew on a NIC or NVSwitch can change link behavior so one rail silently underperforms or flaps. A single straggler node — a GPU throttling on a thermal or power excursion, a degraded optical link, a partially-failed HBM stack — drags the whole collective to its speed because the all-reduce cannot complete until the slowest participant arrives. And silent data corruption can poison a reduction without ever raising an error.

The design response is to make skew impossible rather than to debug it after the fact. Three controls do most of the work. First, collective flight-recorder and trace tooling (for example, the NCCL flight recorder in PyTorch) records the in-flight collective state so that when a hang occurs you can identify which rank stalled and on which operation, rather than guessing, turning a multi-hour bisection into a targeted ejection. Second, an admission gate in the scheduler refuses to place a job on any node whose golden-stack hash (driver + CUDA + NCCL + firmware revisions) does not match the cluster's pinned reference, so a mis-imaged node can never join a collective in the first place. Third, continuous active health checks (DCGM diagnostics plus a short NCCL all-reduce on idle GPUs) catch a drifted or degrading node before it is allocated to a run. The XID/SXID error taxonomy and DCGM mechanics that feed these gates are the subject of Chapter 10.6; the autonomous ejection-and-replace loop that acts on them is Chapter 10.7.
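The first control reduces to a simple question once the trace is in hand: which ranks never started the collective everyone else is waiting in? A minimal sketch follows, assuming a hypothetical dump that maps each rank to the sequence number of the last collective it started; real recorder output is richer and tool-specific.

```python
def stalled_ranks(last_started: dict[int, int]) -> list[int]:
    """Ranks that have not yet started the collective the most advanced rank is in."""
    front = max(last_started.values())
    return sorted(rank for rank, seq in last_started.items() if seq < front)
```

An empty result means every rank entered the same collective, which points away from a missing participant and toward the transport itself (a degraded link or a protocol mismatch).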

CUDA 13.3

current CUDA Toolkit (released May 2026); paired with R580 LTSB data-center driver

mid-2026 · NVIDIA CUDA Toolkit Release Notes; Data Center Driver docs

R580

current NVIDIA data-center LTS driver branch; ~3-yr lifecycle, EOL ~Aug 2028

mid-2026 · NVIDIA Data Center Drivers; AI Enterprise lifecycle policy

ROCm 7.x

AMD stack with RCCL NCCL-API parity; MI350X/MI355X support (7.0 Sep 2025)

2025-2026 · AMD ROCm 7.0 release notes & compatibility matrix

NCCL 2.28+

NVLink-SHARP Multimem multicast + symmetric-memory kernels within an NVL72 domain

2026 · NVIDIA NCCL release notes; GitHub releases

~16 → ≤6 SMs

GPU SMs consumed by a reduction after composing NVLink-SHARP + IB-SHARP in NCCL 2.27

2025 · NVIDIA Developer (NCCL 2.27); SHARP in-network computing

~92%

NCCL all_reduce busbw vs theoretical (acceptance gate); ≈370 GB/s against a 400 GB/s ceiling on 400G NDR

2025 · NVIDIA DGX BasePOD NCCL validation; OCI/Together AI

15-30%

lower hardware cost for AMD vs NVIDIA — the prize that funds the ROCm tax

2026 · domain-research keyNumbers; SemiAnalysis AMD vs NVIDIA

every ~3 hr

failure cadence of a 16k-GPU cluster (Llama 3: 419 unplanned/54 days) the stack must absorb

2024 · Meta Llama 3 405B disclosure

The golden stack: fleet-wide version pinning as a discipline

Everything above converges on one operating model: the golden stack — a single, named, version-pinned tuple of driver + CUDA/ROCm + cuDNN/MIOpen + NCCL/RCCL + container toolkit + firmware revisions, validated together and promoted as one atomic artifact across the entire fleet. The golden stack is the fleet's answer to the compatibility matrix: instead of reasoning about whether this CUDA works with that driver on those nodes, you validate one combination, hash it, and refuse to run anything else. Its three properties are non-negotiable. It is uniform — byte-identical on every node, enforced by an admission gate, because synchronous training cannot tolerate skew. It is versioned and attested — every node can prove which stack it booted, so drift is detectable and a non-conforming node is automatically drained. And it is promoted as a unit through stages (lab → canary node pool → ring rollout → fleet), never component-by-component in production, because a component upgrade that is fine in isolation can be a busbw or hang regression in combination.
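"Validate one combination, hash it, and promote it as a unit" can be sketched in a few lines. The manifest keys, the stage names as strings, and both helpers are illustrative assumptions; the only technique being shown is hashing a canonical encoding so that two nodes with the same versions always produce the same digest.

```python
import hashlib
import json

STAGES = ["lab", "canary", "ring", "fleet"]   # promotion order described above


def stack_hash(manifest: dict[str, str]) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding of the manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def promote(current_stage: str) -> str:
    """Advance the whole manifest by exactly one stage; stages cannot be skipped."""
    index = STAGES.index(current_stage)       # raises ValueError on an unknown stage
    if index == len(STAGES) - 1:
        raise ValueError("manifest is already fleet-wide")
    return STAGES[index + 1]
```

Because the digest covers every component at once, changing a single version yields a different hash, which is what forces a component upgrade to go back through the stages as a new stack.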

The cost of the golden-stack discipline is agility: you cannot let a single team try a newer NCCL on a shared training cluster without re-pinning and re-validating the whole fleet, and that friction is real. The cost of not having it is worse and recurs forever: version skew hangs, silent bandwidth regressions, irreproducible runs, and a fleet where 'works on my node' is a daily occurrence. The right posture for a production training cluster is unambiguous — one golden stack, pinned, attested, promoted in rings, with per-job overrides confined to isolated dev/experiment pools that never share a synchronous collective with the production fleet.

Firmware: the slow-moving layer that decides reliability

Firmware is the part of the node stack operators most often defer — it is tedious, it requires reboots, and it rarely breaks loudly. That deferral is exactly the trap. The GB200 ramp made the point at scale: integration reliability hinged on firmware, with NVLink copper-backplane issues and NVL36×2 cross-rack signal-integrity problems fixed partly through firmware revisions. A node carrying stale NIC, NVSwitch, or VBIOS firmware is a node that contributes gray failures — link flaps, lane dropouts, intermittent throttling — that do not crash the node but do drag synchronous collectives and accumulate into the every-few-hours interruption rate that destroys multi-week run economics. Firmware is the slow-moving layer that quietly decides reliability.

The modern firmware estate is managed out-of-band and as code. The OCP GPU Firmware Update specification standardizes the mechanics — Redfish for the management API, PLDM-over-MCTP for the update transport, and secure out-of-band update so firmware can be staged and applied without a host agent. At fleet scale, firmware updates follow the same ring discipline as the golden stack: rolling/canary rollout, drift detection against a pinned firmware baseline, a dependency matrix that sequences firmware against driver/CUDA so you never strand a node between incompatible layers, and rollback when a canary regresses. Two constraints make this harder than ordinary patching: every update is a scheduling event (a node must drain a job before it can reboot), and firmware is a high-value attack surface — so updates must be signed and attested against a hardware root of trust (Caliptra/DICE, measured boot), the subject of Chapter 11.4. The day-2 firmware-estate operations, change-management, and refresh mechanics are deepened in Chapter 14.9.
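Drift detection and dependency sequencing are both small graph-and-dictionary problems once the inventory has been collected. The sketch below assumes the inventory has already been read out-of-band into a plain dictionary (how it is fetched is out of scope here); the component names, the dependency edges, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter


def firmware_drift(inventory: dict[str, str], baseline: dict[str, str]) -> dict[str, tuple]:
    """Map component -> (found, expected) for every component off the pinned baseline."""
    drift = {}
    for component, expected in baseline.items():
        found = inventory.get(component)      # None means the component did not report
        if found != expected:
            drift[component] = (found, expected)
    return drift


def update_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order updates so each layer is applied only after the layers it depends on."""
    return list(TopologicalSorter(depends_on).static_order())
```

`TopologicalSorter` raises `CycleError` if the dependency matrix is circular, which is a useful failure: a cycle means no safe update sequence exists and the matrix itself needs fixing.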

Node-stack version-pinning posture by workload

Workload	Pinning posture	Injection model	Firmware cadence	Why
Synchronous pre-training	Strict — one golden stack, admission-gated	Golden image (host-baked driver)	Scheduled, ring rollout, drift-detected	One skew hangs the whole job; uniformity = goodput
Post-training / RL	Strict on trainer; pinned-but-versioned on rollout pool	Hybrid (baked driver + Operator plugins)	Scheduled, with trainer prioritized	Async coupling tolerates more, but trainer is still synchronous
Online inference	Pinned per service, rolled independently	GPU Operator (containerized)	Rolling/canary, low-disruption windows	Loosely coupled; rolling upgrades beat fleet freezes
Batch inference	Loosely pinned; tolerant of mixed versions	GPU Operator	Opportunistic, off-peak	Embarrassingly parallel; no shared collective to skew
Dev / experiment	Per-job overrides allowed, isolated pool	Operator or per-job container	Lagging-but-safe baseline	Velocity matters; must never share a production collective

How tightly to pin, and where per-job flexibility is tolerable. The pattern: the more synchronous and large the job, the more uniformity dominates. 'Dev/experiment' pools are deliberately isolated from production collectives.

The container-runtime layer and what it injects

One layer is easy to overlook because it is invisible when it works: the container runtime that actually exposes GPUs to a workload. On NVIDIA, the NVIDIA Container Toolkit hooks the OCI runtime to inject the GPU devices (/dev/nvidia*), the driver user-space libraries, and the right capabilities into the container at start — so the containerized framework links against the host's pinned driver rather than a stale library baked into the image. The industry is migrating this device-exposure model to the vendor-neutral Container Device Interface (CDI), which both NVIDIA and AMD implement, decoupling device injection from any single runtime. The consequence for the golden stack is subtle but important: even in a containerized fleet, the driver is a host property and the container inherits it, which is why a host-baked driver plus Operator-managed user-space is a coherent posture rather than a contradiction. Kubernetes is also moving the resource-request model itself — from counting whole GPUs toward Dynamic Resource Allocation (DRA), which lets a job claim attributes (MIG geometry, NVLink topology, memory) rather than a count — which pushes more of the node-stack's capabilities (partitioning, topology) into the scheduling layer treated in Chapter 10.2 and Chapter 10.3.
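To make the CDI idea concrete, here is a heavily simplified sketch of the kind of document a CDI-aware runtime consumes: a named device kind, a list of devices, and the container edits each one implies. The field names follow the CDI specification as far as this sketch goes, but the version string and the helper are assumptions, and real specs generated by vendor tooling also carry library mounts and hooks that are omitted here.

```python
import json


def minimal_cdi_spec(gpu_indices: list[int]) -> dict:
    """Build a stripped-down CDI-style spec exposing one device node per GPU."""
    return {
        "cdiVersion": "0.5.0",                # assumed version string
        "kind": "nvidia.com/gpu",
        "devices": [
            {
                "name": str(index),
                "containerEdits": {
                    "deviceNodes": [{"path": f"/dev/nvidia{index}"}],
                },
            }
            for index in gpu_indices
        ],
        # Edits applied to every container that requests any device of this kind.
        "containerEdits": {"deviceNodes": [{"path": "/dev/nvidiactl"}]},
    }


if __name__ == "__main__":
    print(json.dumps(minimal_cdi_spec([0, 1]), indent=2))
```

The point of the structure is that nothing in it names a container runtime: any runtime that understands CDI can apply the same edits, which is what decouples device injection from a single vendor hook.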

Deep dive: a sane upgrade procedure for the golden stack (and why component-by-component fails)

The instinct when a new NCCL promises better busbw, or a new driver enables a hardware feature, is to upgrade that one component on the fleet. On a synchronous training cluster that instinct is wrong, and the failure mode is predictable: the component is fine in a unit test, ships to production, and interacts badly with the pinned driver or firmware on some nodes, producing a hang or a silent busbw regression that is now mixed into live runs and hard to attribute. The discipline that avoids this treats the stack as atomic.

A defensible procedure has four stages. (1) Lab validation: assemble the candidate tuple (driver + CUDA + NCCL + cuDNN + firmware), run the full acceptance suite — DCGM diagnostics, all_reduce_perf to the ~92% busbw gate across representative node counts, a short reference training run for MFU, and an SDC screen. (2) Canary pool: promote to a small isolated node pool carrying real-but-non-critical work, and watch goodput, XID rates, and busbw for a soak period. (3) Ring rollout: promote in waves (e.g. 5% → 25% → 100%), each wave gated on health telemetry, with the golden-stack hash updated and the admission gate enforcing the new reference so no node runs a mixed stack within a single job. (4) Rollback path: every stage is reversible — golden image means rollback is a re-image to the previous pinned tuple, which is why host-baked determinism pays off precisely at upgrade time. The thing you never do is let the production fleet run two golden stacks at once on the same synchronous job. The acceptance-suite tooling that gates stages (1) and (2) is detailed in the commissioning and observability chapters; the autonomous drain/eject machinery that enforces the gate at runtime is Chapter 10.7.
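Stage (3) and its rollback path can be sketched as a loop over cumulative waves with a health gate after each one. The `healthy` callback stands in for whatever telemetry check the fleet uses and is purely hypothetical, as is the choice to roll back every upgraded node on the first failed gate.

```python
from typing import Callable


def ring_rollout(nodes: list[str], waves: list[float],
                 healthy: Callable[[list[str]], bool]) -> tuple[list[str], bool]:
    """Upgrade nodes in cumulative waves, stopping at the first failed health gate.

    Returns (nodes left on the new stack, whether the rollout completed).
    """
    upgraded: list[str] = []
    for fraction in waves:
        target = max(1, round(len(nodes) * fraction))   # cumulative node count
        upgraded.extend(nodes[len(upgraded):target])
        if not healthy(upgraded):
            return [], False    # roll every upgraded node back to the previous tuple
    return upgraded, True
```

Called with `waves=[0.05, 0.25, 1.0]`, this mirrors the 5% → 25% → 100% example; a production version would also update the admission gate's reference hash per wave so that no single job spans both stacks.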

Anti-patterns

The node-stack failures that recur are all variations on one mistake — treating a coupled organism as a set of independent knobs:

Letting NCCL drift across nodes. The textbook self-inflicted goodput wound: a node re-imaged with a slightly different NCCL joins a collective and hangs the run, or worse, completes at degraded busbw for weeks. Prevent it with an admission gate on the golden-stack hash, not with a wiki page asking people to be careful.
Upgrading one component on a live training fleet. A 'safe' point-upgrade to a driver or NCCL that passes in isolation but regresses in combination, now mixed into production runs. Promote the stack atomically through rings, never component-by-component.
Deferring firmware until something breaks. Stale NIC/NVSwitch/VBIOS firmware contributes gray failures — link flaps, throttling, lane dropouts — that never crash a node but drag every synchronous collective and inflate the interruption rate. Firmware is a scheduled, attested, ring-rolled estate, not an emergency activity.
Going dual-vendor for the discount without budgeting the surface area. A second vendor doubles golden stacks, firmware matrices, and on-call rotations. The 15-30% hardware saving is real, but so is the operational cost — go dual-vendor for leverage and scale, not for a line-item discount on a small fleet. → Chapter 7.3.

The node stack is the floor the rest of the software domain stands on. The scheduling plane that places jobs onto these nodes is Chapter 10.1; topology- and rack-scale-aware placement that the stack's NVLink/IMEX capabilities feed is Chapter 10.2; partitioning (MIG/MPS) that the driver exposes is Chapter 10.3; provisioning and golden-image production is Chapter 10.5; the DCGM/XID telemetry that feeds the admission and health gates is Chapter 10.6; and the autonomous drain/eject/recover loop is Chapter 10.7. The fabric this stack drives is engineered in Chapter 8.5 and Chapter 8.6; the CUDA-vs-ROCm hardware and lock-in economics in Chapter 7.3 and Chapter 7.9; the firmware root-of-trust and attestation in Chapter 11.4; and the day-2 firmware estate, refresh, and change-management in Chapter 14.9.