Chapter 10.5

Provisioning, Bring-Up & Infrastructure as Code

Provisioning is where a rack of stranded silicon becomes a revenue-earning node — and the choice between treating physical hardware as artisanal pets or as programmable, declaratively-described cattle decides how many GPU-hours you burn between 'powered on' and 'production', every refresh, forever.

DENSITY-RAMP · GOODPUT

What you'll decide here

Whether bare-metal lifecycle is driven imperatively (a runbook a human follows) or declaratively as Infrastructure-as-Code (a Redfish/Ironic/MAAS pipeline that converges a rack from PDU-on to cluster-joined) — because the second is the only one that survives a 100k-GPU fleet and a generational refresh.
Golden, immutable OS images versus live configuration management (Ansible/Puppet) on a mutable base — the fork that sets your reproducibility, your drift surface, and whether two nodes in a synchronous job are ever bit-for-bit identical.
What burn-in and acceptance gates a node must pass before it is allowed to carry a paying job — the duration, the thresholds, and the pass/fail bar — because a marginal node admitted to a synchronous cluster does not fail cleanly, it strands the whole job.
Build-versus-buy on the cluster-management platform: an integrated vendor stack (NVIDIA Base Command Manager / Mission Control) versus an open, assembled pipeline (MAAS/Ironic + Terraform + Ansible + Slurm) — trading time-to-first-job against lock-in and fleet flexibility.
The day-2 lifecycle model: how a node is drained, reimaged, re-validated, and returned to the pool — because provisioning is not a one-time event but a continuous loop that 'renews' node-level MTBF against a fleet that fails every few hours.

By the time a chapter like this one is relevant, the hard capital decisions are made: the silicon is bought (Part 4), the racks are integrated and leak-tested at the factory (Part 6), the fabric is cabled (Part 8), and the node software stack — driver, CUDA/ROCm, NCCL, firmware — is specified (Chapter 10.4). What remains is the unglamorous, schedule-critical seam that converts all of that into a node a scheduler can place a job on: provisioning and bring-up. It is unglamorous because no one buys a data center to admire its PXE pipeline. It is schedule-critical because every day a rack sits powered-but-unprovisioned is a day of depreciation with no offsetting revenue — and on a fleet earning roughly $10–12B per GW per year, the time from rack-on-the-floor to first-paid-job is money measured in millions per week.

This chapter is about one decision repeated at every layer: do you treat physical hardware as pets or as cattle? The pet model — a human follows a runbook, SSHes into a BMC, mounts an ISO, installs an OS, runs a few checks, and blesses the node — works at ten nodes and collapses at ten thousand. The cattle model — every node is described declaratively, provisioned by a pipeline, validated by an automated gate, and reimaged on demand — is the only one that survives fleet scale and the relentless generational refresh that defines the 2026 era. We trace the bare-metal lifecycle (BMC/Redfish to cluster-join), the golden-image-versus-config-management fork, the burn-in and acceptance gates that decide what a node must prove before it carries load, the build-versus-buy choice on the management platform, and the day-2 loop that makes provisioning continuous rather than one-shot.

The bare-metal lifecycle: from PDU-on to cluster-joined

A bare-metal node has a deterministic lifecycle, and the discipline is to make every transition in it machine-driven rather than human-driven. The canonical sequence — as implemented by MAAS, OpenStack Ironic, and every neocloud's home-grown equivalent — is: enroll → power-control → network-boot → image → configure → validate → join → (eventually) decommission. Walk it once, because each step is an out-of-band (OOB) or in-band operation with a specific protocol, and the protocol choice is itself a decision with consequences.

Enroll and power-control happen out-of-band, through the BMC. The baseboard management controller is a small always-on computer beside the host, reachable on a separate management network even when the host is powered off. Historically you spoke to it with IPMI — a 1998-era protocol with a long CVE history and weak auth. The modern path is Redfish, a RESTful, JSON, schema-defined DMTF standard that does the same job (power on/off/cycle, set one-time PXE boot, read sensors, mount virtual media) but is scriptable, securable, and increasingly the only thing a 2026 platform targets. The fork here is real: a fleet still leaning on IPMI inherits a fragile, hard-to-secure management plane that becomes a security liability (the BMC is a root-of-trust target — see Chapter 11.4); a fleet standardized on Redfish gets a uniform, auditable OOB API across Dell iDRAC, HPE iLO, Lenovo XCC, and Supermicro. Redfish is also how power-smoothing and capping controls reach GB300-class racks at run time, so the management plane you build for provisioning is the same one you later use for grid-friendliness (Chapter 4.5).
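The out-of-band step above can be sketched as two Redfish calls: a PATCH that sets a one-time PXE boot override, then a POST to the reset action. The sketch below only builds the requests (method, path, JSON body) and sends nothing. The property names `BootSourceOverrideEnabled`, `BootSourceOverrideTarget`, and `ResetType`, and the `ComputerSystem.Reset` action, come from the DMTF Redfish schema; the `RedfishCall` type, the function names, and the system ID `"1"` are illustrative assumptions. System IDs differ by vendor, so a real client would discover them from the `/redfish/v1/Systems` collection and would also handle authentication and TLS.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RedfishCall:
    """One HTTP request against a BMC's Redfish service (illustrative type)."""
    method: str
    path: str
    body: dict


# Subset of the ResetType values defined by the Redfish schema.
RESET_TYPES = {"On", "ForceOff", "GracefulShutdown",
               "GracefulRestart", "ForceRestart", "PowerCycle"}


def set_one_time_pxe(system_id: str) -> RedfishCall:
    # Boot from the network on the next boot only, then revert to the normal order.
    return RedfishCall(
        "PATCH",
        f"/redfish/v1/Systems/{system_id}",
        {"Boot": {"BootSourceOverrideEnabled": "Once",
                  "BootSourceOverrideTarget": "Pxe"}},
    )


def reset(system_id: str, reset_type: str) -> RedfishCall:
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported ResetType: {reset_type}")
    return RedfishCall(
        "POST",
        f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset",
        {"ResetType": reset_type},
    )


def pxe_boot_plan(system_id: str, powered_on: bool) -> List[RedfishCall]:
    # A powered-off host needs "On"; a running host needs a restart instead.
    return [set_one_time_pxe(system_id),
            reset(system_id, "ForceRestart" if powered_on else "On")]
```

Keeping request construction separate from transport is a deliberate choice in this sketch: the plan can be reviewed, logged, and tested without touching a BMC.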

Network-boot and image happen in-band, over the provisioning fabric. Once the BMC sets a one-time PXE boot and powers the node on, it network-boots into an ephemeral environment (a small in-memory Linux), which then writes the real OS image to local disk — MAAS uses curtin, Ironic uses its deploy ramdisk — and hands off to cloud-init for first-boot configuration. Configure applies the node's role: storage layout, network (the all-important fabric NIC and RDMA setup), and the pinned node-software stack. Validate runs the acceptance gates (below). Only after the gates pass does the node join the scheduler's pool and become eligible to carry work. The entire choreography — MAAS describes it as Allocate → PXE boot → ephemeral environment → curtin → cloud-init → reboot → deployed — must be expressible as code, parameterized by an inventory, and idempotent, so that re-running it on a flaky node is safe and produces an identical result.
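A minimal sketch of what "expressible as code and idempotent" can mean for this lifecycle: a driver that records completed stages on the node record, skips them on a re-run, and stops at the first failure. The stage names follow the sequence given above; the `converge` function, the `completed` field, and the `run_stage` callback are hypothetical names, not the API of MAAS or Ironic.

```python
from typing import Callable, Dict, List

# Lifecycle order from the text (decommission is left out of this sketch).
STAGES: List[str] = [
    "enroll", "power-control", "network-boot",
    "image", "configure", "validate", "join",
]


def converge(node: Dict, run_stage: Callable[[Dict, str], bool]) -> List[str]:
    """Run the remaining stages in order and return the stages executed.

    run_stage(node, stage) performs one stage and returns True on success.
    """
    completed = node.setdefault("completed", [])
    executed: List[str] = []
    for stage in STAGES:
        if stage in completed:
            continue  # already done: a re-run does not repeat it
        if not run_stage(node, stage):
            break     # stop here; later stages depend on this one
        completed.append(stage)
        executed.append(stage)
    return executed
```

Re-running `converge` on a node that failed validation resumes at `validate` rather than re-enrolling it, and running it on a fully joined node does nothing. A production pipeline would likely restart from `image` after a hardware repair; that policy is left out here.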

The master fork: imperative runbook vs declarative Infrastructure-as-Code

If you take one decision from this chapter, take this one. Imperative bring-up is a sequence of human actions (or a script that assumes a known starting state): do this, then this, then this. It is fast to write for a handful of nodes and impossible to reason about at scale — you can never be sure two nodes provisioned weeks apart are identical, and recovery means a human re-deriving the current state. Declarative IaC describes the desired end state — Terraform/Ironic/MAAS for the physical layer, Ansible/cloud-init for in-OS config — and lets a converger drive any node from any state to that state, idempotently. The consequence chain is the whole chapter: declarative provisioning is what makes a node reproducible (same input, same node), auditable (the spec is the source of truth, version-controlled), and fast to recycle (drain, wipe, re-converge in minutes, not a half-day of manual care). For a synchronous training cluster, reproducibility is not a nicety — two nodes that differ in driver, NCCL, or firmware version will hang the collective or silently degrade bisection bandwidth (Chapter 10.4). Declarative bring-up is the only model under which 'every node is identical' is an enforceable invariant rather than a hope.
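The difference between the two models is easiest to see in a toy converger. The sketch below assumes node state can be represented as a flat dictionary; the keys and version strings are placeholders, and `plan` and `apply_changes` are hypothetical helpers rather than the interface of Terraform or Ansible. The point it illustrates is that the same spec drives nodes from different starting states to the same end state, and that a second run finds nothing to do.

```python
from typing import Dict, List, Tuple


def plan(desired: Dict[str, str], observed: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (key, value) changes needed to reach the desired state."""
    return [(k, desired[k]) for k in sorted(desired)
            if observed.get(k) != desired[k]]


def apply_changes(observed: Dict[str, str],
                  changes: List[Tuple[str, str]]) -> Dict[str, str]:
    """Return a new state with the planned changes applied."""
    new_state = dict(observed)
    new_state.update(changes)
    return new_state


# Placeholder versions, for illustration only.
spec = {"driver": "driver-A", "nccl": "nccl-B", "firmware": "fw-C"}
fresh_node: Dict[str, str] = {}
stale_node = {"driver": "driver-old", "nccl": "nccl-B", "firmware": "fw-C"}

converged_fresh = apply_changes(fresh_node, plan(spec, fresh_node))
converged_stale = apply_changes(stale_node, plan(spec, stale_node))
```

Note what this converger does not do: it leaves alone any key the spec never mentions, which is how undeclared state survives on a mutable node.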

Golden images vs configuration management

Inside the declarative camp there is a second fork that practitioners argue about endlessly, and both answers are defensible depending on what you optimize for: immutable golden images versus live configuration management.

The golden-image model bakes a complete, tested OS — kernel, pinned GPU driver, CUDA/ROCm, NCCL, container runtime, firmware-update tooling, health agents — into a single immutable artifact, versioned and signed. Provisioning a node means writing that exact image and rebooting; there is no per-node configuration step that can drift. The payoff is the strongest possible reproducibility and the smallest drift surface: a node is either running golden-image v2026.06.2 or it is not, and 'not' is a clean, detectable state. The cost is that every change — a driver bump, a security patch, an agent update — requires building, testing, and rolling a new image across the fleet, which makes small changes heavyweight and pushes you toward batched releases.

The configuration-management model (Ansible, Puppet, Salt) starts from a thinner base image and converges the node to a desired state by running roles/manifests in place. It is lighter for incremental change — patch one package, re-run the play — and more flexible for heterogeneous fleets. The cost is drift: a long-lived mutable node accumulates state no manifest captures (a hand-fixed config during an incident, a half-applied update, a package installed to debug something at 3 a.m.), and over months no two nodes are truly identical. For a fleet whose headline failure mode is 'a marginal node silently degrades a synchronous job', drift is not a hygiene issue — it is a goodput tax.

The 2026 consensus for AI fleets leans toward immutable golden images for the OS and node-software stack, with thin config management only for genuinely per-node state (hostname, fabric addressing, role labels). The reasoning is the same one that governs the node-software stack: a synchronous training cluster wants every node bit-for-bit identical, and the cheapest way to guarantee that is to make the node a disposable instantiation of a versioned artifact rather than a long-lived entity you keep patching. The decision below shows the tradeoff explicitly.

Golden image vs configuration management vs hybrid

Dimension	Immutable golden image	Live config management	Hybrid (default)
Reproducibility	Strongest — node is a versioned artifact	Weak over time — converges but drifts	Strong — stack is immutable, only per-node state varies
Drift surface	Near-zero (node is rebuilt, not patched)	Large — mutable nodes accumulate untracked state	Small — confined to declared per-node config
Cost of a small change	High — rebuild + test + roll a new image	Low — patch a package, re-run the play	Medium — image for the stack, play for the rest
Rollback	Trivial — re-deploy the previous image	Hard — un-applying state is rarely clean	Trivial for the stack; play-level for the rest
Fit for synchronous training	Excellent — guarantees node-identical fleet	Risky — version skew hangs collectives	Excellent — same guarantee, less rebuild churn
Best when	Large homogeneous fleet, frequent refresh	Heterogeneous fleet, many small bespoke changes	AI fleet that wants identity + incremental agility

The practical fork for OS + node-software provisioning on an AI fleet. The 2026 default for synchronous-training clusters is the hybrid: immutable golden image for the stack, thin config-management for per-node state.
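One way to picture the hybrid is as a remediation rule with two tiers. If anything in the immutable stack differs from the golden manifest, the node is reimaged; if only declared per-node state differs, the config play is re-run. The sketch below is illustrative: the field names, the `PER_NODE_KEYS` set, and the driver string are assumptions, and only the image version string is taken from the example above.

```python
from typing import Dict, List, Tuple

# Per-node state that is allowed to vary between nodes (from the text).
PER_NODE_KEYS = ("hostname", "fabric_ip", "role")


def remediation(observed: Dict[str, str],
                golden: Dict[str, str],
                declared: Dict[str, str]) -> Tuple[str, List[str]]:
    """Decide how to repair a node and report which keys are out of line."""
    stack_skew = sorted(k for k, v in golden.items() if observed.get(k) != v)
    if stack_skew:
        # The stack is never patched in place: rebuild from the artifact.
        return ("reimage", stack_skew)
    config_skew = sorted(k for k in PER_NODE_KEYS
                         if observed.get(k) != declared.get(k))
    if config_skew:
        return ("rerun-play", config_skew)
    return ("ok", [])


golden = {"image_version": "v2026.06.2", "driver": "driver-A"}  # driver is a placeholder
declared = {"hostname": "node-017", "fabric_ip": "10.0.0.17", "role": "train"}
healthy = dict(golden, **declared)
```

The ordering matters: stack skew is checked first because a reimage also resets per-node state, so there is no point replaying config on a node that is about to be rebuilt.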

Burn-in and acceptance gates: what a node must prove

Provisioning a node is not the same as trusting it. The single most expensive provisioning mistake on an AI fleet is admitting a marginal node — one that boots fine and passes a smoke test but fails under sustained thermal and bandwidth load — into a synchronous job. Because the whole job moves at the speed of its slowest straggler and restarts from a checkpoint when any node fails, one bad node does not degrade gracefully; it strands hundreds or thousands of healthy GPUs (Chapter 10.7). The defense is a burn-in and acceptance gate: a battery of stress and benchmark tests, with explicit pass/fail thresholds, that a node must clear before it is allowed into the schedulable pool. This is the provisioning-side counterpart to the factory burn-in done at integration (Chapter 7.14) — and it must be repeated, because nodes that passed at the factory routinely fail again after transport, install, and first thermal cycle in the actual hall.

The mature reference is a phased, subsystem-by-subsystem validation that builds up from the single GPU to the full cluster, codified by practitioners like Together AI and standardized in spirit by SemiAnalysis's ClusterMAX rating. The phases run in order — preparation/config, GPU validation, NVLink/NVSwitch, network fabric, storage, end-to-end model build, then continuous observability — so that a failure is localized to the cheapest layer that can explain it rather than discovered as a mysterious slowdown in a full training run. Each phase has named tools and, critically, named thresholds.

Phased acceptance gate: subsystem → tool → pass bar

Phase	What it proves	Primary tools	Representative pass bar
GPU validation	Each GPU is present, healthy, thermally stable under sustained load	DCGM Diagnostics, gpu-burn	No XID/SXID errors; no thermal throttle; ECC clean over a multi-hour soak
NVLink / NVSwitch	Intra-node GPU-to-GPU bandwidth at spec	nvbandwidth, NCCL tests	Measured NVLink bandwidth near spec; no degraded links
Network fabric (RDMA)	Inter-node collective bandwidth at spec across the back-end fabric	NCCL all_reduce, iperf3	all_reduce busbw ~92% of the fabric's theoretical maximum (~370 GB/s on a 400 GB/s fabric)
Storage	I/O meets the checkpoint/data-load budget	FIO	Read/write throughput and IOPS meet the storage SLA for the workload
End-to-end model build	The whole stack delivers expected MFU on a reference job	Reference training run	Achieved MFU/goodput within tolerance of the validated reference
Burn-in soak	No early-life failures under days of continuous load	DCGM + gpu-burn + NCCL loops	Clean run over a multi-day soak (commonly 72–168 hr) before production admission

Practitioner-standard validation sequence (Together AI; NVIDIA DGX BasePOD; SemiAnalysis ClusterMAX). Thresholds are 2026 reference points; the discipline is to fix the bar in writing before the cluster ships.

Two numbers anchor the gate. The fabric phase has a real, citable threshold: a healthy node should hit roughly 92% of the theoretical fabric maximum on NCCL all_reduce — about 370 GB/s on a 400 GB/s fabric (Together AI). A node that comes in materially below that has a marginal optic, a mis-seated cable, or a routing problem, and admitting it pollutes the bisection bandwidth of every collective it later participates in. The burn-in phase has a duration debate rather than a single number: practitioner ranges cluster around 72–168 hours of continuous soak for new clusters (Introl; neocloud operator reports), and the open question — how much soak is economically optimal before the marginal-node catch rate stops justifying the GPU-hours spent soaking — remains genuinely unsettled in public data. The decision you must make explicitly is the pass bar: set it too low and marginal nodes leak into production and strand jobs; set it too high and you burn revenue GPU-hours soaking hardware that was already fine.
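The phase ordering and the fabric bar can be written down as a small gate evaluator. The sketch below assumes measurements have already been collected into a dictionary by whatever tooling runs DCGM, the NCCL tests, and FIO. The measurement keys and the storage check are hypothetical, and only the 92% fraction comes from the figures above. Phases are evaluated from cheapest to most expensive, and the first failing phase is reported so the fault is attributed to the lowest layer that can explain it.

```python
from typing import Callable, Dict, List, Optional, Tuple

FABRIC_FRACTION = 0.92  # all_reduce bar quoted in the text


def gpu_ok(m: Dict) -> bool:
    return (m["xid_errors"] == 0 and not m["thermal_throttle"]
            and m["ecc_uncorrectable"] == 0)


def nvlink_ok(m: Dict) -> bool:
    return m["degraded_links"] == 0


def fabric_ok(m: Dict) -> bool:
    return m["allreduce_busbw_gbs"] >= FABRIC_FRACTION * m["fabric_max_gbs"]


def storage_ok(m: Dict) -> bool:
    return m["fio_read_gbs"] >= m["storage_sla_read_gbs"]


# Ordered from single GPU outward, as in the phased sequence.
PHASES: List[Tuple[str, Callable[[Dict], bool]]] = [
    ("gpu-validation", gpu_ok),
    ("nvlink", nvlink_ok),
    ("fabric", fabric_ok),
    ("storage", storage_ok),
]


def first_failure(m: Dict) -> Optional[str]:
    """Return the name of the first failing phase, or None if all pass."""
    for name, check in PHASES:
        if not check(m):
            return name
    return None


healthy = {
    "xid_errors": 0, "thermal_throttle": False, "ecc_uncorrectable": 0,
    "degraded_links": 0,
    "allreduce_busbw_gbs": 370.0, "fabric_max_gbs": 400.0,
    "fio_read_gbs": 12.0, "storage_sla_read_gbs": 10.0,  # hypothetical SLA
}
```

With a 400 GB/s maximum the bar works out to 368 GB/s, so the 370 GB/s reference figure passes with little margin. That is a reminder to fix in writing whether the bar is "at least 92%" or "approximately 92%" before the cluster ships.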

A node that boots is not a node that works

The common failure is to equate 'provisioned' with 'production-ready'. A node can PXE-boot cleanly, write its golden image, join the scheduler, and still carry a GPU that throttles at sustained load, an optic whose bit-error rate climbs under heat, or an HBM stack that develops uncorrectable errors after a few hours. None of these show up in a thirty-second smoke test; all of them surface during a paying job, where the cost of discovery is a checkpoint restart across the whole cluster. The non-negotiable rule: no node enters the schedulable pool until it has cleared the full acceptance gate, including a multi-day burn-in soak, under load that mimics production thermals and collectives. The GPU-hours spent soaking are cheap; the GPU-hours a marginal node strands in a synchronous job are not. This is the provisioning-layer expression of the goodput discipline that governs the whole fleet (Chapter 10.7).

~92%

of fabric line rate is the NCCL all_reduce acceptance bar (~370 GB/s on a 400 GB/s fabric)

2025 · Together AI, Practitioner's Guide to Testing Large GPU Clusters

72–168 hr

typical burn-in soak before a new cluster is admitted to production

2025 · Introl validation frameworks; neocloud operator reports

~3 hr

mean time between failures for a 16,000-GPU cluster — why provisioning is a continuous day-2 loop, not a one-time event

2024 · Meta Llama 3 (16,384 H100); ~80,000-hr per-GPU MTBF

~7 days

MTBF per 512 GPUs at a top-tier H100 operator; new clusters fail far more during 3–4 week burn-in

2025 · SemiAnalysis (100k H100 clusters)

~90 sec

automated node replacement on a best-in-class fleet — the day-2 lifecycle target

2026 · SemiAnalysis AI Neocloud Playbook / ClusterMAX

<2 days

to provision 128 GPUs to a customer at a top-rated neocloud — the bring-up-as-competitive-lever benchmark

2026 · SemiAnalysis ClusterMAX 2.0

>90%

goodput (effective-training-time) achievable despite ~3-hr cluster MTBF, given automated validation + recovery

2025 · Google Cloud goodput; NVIDIA Mission Control

$10–12B

revenue per GW per year — the depreciation clock that makes time-from-rack-to-first-job a million-dollar-per-week metric (contested — single-source)

2026 · Domain synthesis; SemiAnalysis

Build vs buy the cluster-management platform

Everything above can be assembled from open components or bought as an integrated stack, and this is the provisioning-layer instance of the build-versus-buy fork that recurs throughout the platform (Chapter 10.1). The open, assembled path wires together a bare-metal provisioner (MAAS or OpenStack Ironic), an IaC layer (Terraform driving Redfish/Ironic, plus Ansible/cloud-init), a scheduler (Slurm or Kubernetes), and your own health-check and observability glue. The integrated vendor path buys a turnkey fleet-management product — NVIDIA Base Command Manager for provisioning and cluster management, NVIDIA Mission Control for autonomous validation and hardware recovery on DGX/GB200-class systems — that ships the bring-up, health-check, and recovery workflows as a supported product.

The tradeoff is the familiar one. The integrated path compresses time-to-first-job — the workflows are pre-built, validated against the reference architecture, and supported — at the cost of lock-in to a vendor's hardware and operational model, and a price premium. The open path preserves fleet flexibility (heterogeneous vendors, a second-source GPU like AMD MI300X/MI350X under ROCm, bespoke validation gates) and avoids per-node licensing, at the cost of carrying the integration and reliability engineering yourself — which is real headcount, not a weekend project. The honest decision rule: a single-vendor, homogeneous, NVIDIA-reference fleet that values speed-to-revenue over flexibility leans buy; a multi-vendor neocloud differentiating on bring-up speed and tenant flexibility, or anyone hedging GPU lock-in, leans build. Many large operators do both — buy the integrated stack for the homogeneous reference pods, build the open pipeline for the heterogeneous and experimental capacity.

Deep dive: MAAS/Ironic + Terraform/Redfish — bare-metal-as-code, and why it is an economic lever

The phrase that captures the 2026 shift is bare-metal-as-code: treating a physical rack the way cloud treats a VM — a programmable, reservable, declaratively-described resource — rather than an artifact a technician hand-builds. The reference open stack is MAAS or OpenStack Ironic at the physical layer, exposing each machine through its BMC (Redfish preferred, IPMI legacy), driven by Terraform so that 'allocate, image, and configure this rack' is a version-controlled plan, with Ansible/cloud-init handling in-OS convergence. MAAS 3.7 even provisions NVIDIA BlueField DPUs directly through their BMC, extending the same model to the smart-NIC layer.

Why this is an economic lever and not just an ops convenience: the metric that matters is time-from-rack-to-revenue, and on a fleet earning ~$10–12B/GW/year, compressing that interval is directly monetizable. A declarative pipeline turns a multi-day, technician-bound bring-up into a parameterized run that can image and validate racks in parallel, unattended, overnight. It also makes the day-2 loop cheap: when a node fails (and at ~3-hour cluster MTBF, nodes fail constantly), the same pipeline drains, wipes, re-images, re-validates, and re-admits it without a human re-deriving its state — the mechanism behind best-in-class fleets' ~90-second automated node replacement. The capital case is straightforward: the engineering cost of building the pipeline is fixed and one-time; the GPU-hours it saves recur on every node, every refresh, for the life of the fleet. → continuous-recovery framing in Chapter 10.7; the productization of fast provisioning as an SLA in Chapter 10.9.
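The "millions per week" claim is simple proportional arithmetic, sketched below. The revenue rate is the chapter's own $10B to $12B per GW per year figure, which the chapter flags as contested and single-source. The 10 MW hall size is a hypothetical example, and the function assumes the idle capacity would otherwise have earned the full rate, so it gives an upper bound on foregone revenue rather than a forecast.

```python
def idle_revenue_usd(capacity_mw: float, idle_days: float,
                     usd_per_gw_year: float) -> float:
    """Foregone revenue for capacity that is powered but not yet schedulable.

    Scales the per-GW annual rate linearly by capacity and by time.
    """
    return usd_per_gw_year * (capacity_mw / 1000.0) * (idle_days / 365.0)


# Hypothetical 10 MW hall idle for one week, at both ends of the quoted range.
low = idle_revenue_usd(10, 7, 10e9)
high = idle_revenue_usd(10, 7, 12e9)
print(f"${low / 1e6:.1f}M to ${high / 1e6:.1f}M per week")
```

On these assumptions, a week of delay on a 10 MW hall comes to roughly two million dollars, which is the scale at which pipeline engineering time pays for itself quickly.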

Day-2 lifecycle: provisioning as a continuous loop

The mistake that organizations make after a successful initial deployment is to treat provisioning as done. It is not. On a fleet where a 16,000-GPU cluster fails roughly every three hours (Chapter 10.7) and where a generational hardware refresh arrives on a 12–18 month cadence, provisioning is a continuous loop, and the same declarative pipeline that did day-0 bring-up is what runs it. The day-2 cycle is: detect a failing or suspect node → cordon and drain its jobs → wipe and re-image from the current golden image → re-run the acceptance gate → re-admit to the pool (or route to break-fix). Done well, this loop also serves a subtler purpose: renewing node-level MTBF. A node that has run continuously for months accumulates wear and drift; periodically cycling nodes through re-validation and reimaging — a pre-allocation burn-in before a node is handed to a new tenant or job — catches degradation early and resets the node to a known-good baseline, which is why validation-first, proactive fleet management has become standard practice rather than reactive break-fix.
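The day-2 cycle described above reduces to a short control loop. In the sketch below the `drain`, `reimage`, and `gate` callables stand in for whatever the scheduler, the provisioner, and the acceptance tooling actually expose. They are hypothetical hooks, as are the `max_attempts` policy and the return labels.

```python
from typing import Callable


def recycle(node: str, image: str,
            drain: Callable[[str], None],
            reimage: Callable[[str, str], None],
            gate: Callable[[str], bool],
            max_attempts: int = 2) -> str:
    """Drain a suspect node, rebuild it from the golden image, and re-validate.

    Returns "re-admitted" if the node clears the gate, else "break-fix".
    """
    drain(node)  # cordon and move jobs off before touching the node
    for _ in range(max_attempts):
        reimage(node, image)   # wipe and write the current golden image
        if gate(node):         # same acceptance gate used at day-0
            return "re-admitted"
    return "break-fix"         # repeated failure: route to hardware repair
```

Using the same gate at day-2 as at day-0 is the design choice that matters here: a recycled node re-enters the pool on exactly the terms a new node does.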

The day-2 loop is also where the density-ramp thread reappears. A refresh is not a like-for-like swap: a hall provisioning GB200 NVL72 today must re-image and re-validate against a new node-software stack, new firmware, new fabric parameters, and a heavier power/thermal envelope when the next generation lands. A declarative pipeline absorbs this as a new image version and a new acceptance threshold; an imperative, pet-based process treats every refresh as a fresh manual project, and the difference compounds across the asset's life. The provisioning architecture you choose at day-0 is, in effect, a bet on how cheaply you can absorb every refresh after it. → refresh execution economics in Chapter 14.9.

Deep dive: the security seam — provisioning is also where supply-chain and firmware trust is established

Bring-up is not only an availability problem; it is the moment the fleet's hardware root of trust is established or forfeited. The OOB management plane you build for provisioning — the BMCs, the Redfish endpoints, the provisioning network — is simultaneously a high-value attack surface: a compromised BMC sits below the OS, persists across reimaging, and can subvert measured boot. This is why the move from IPMI to Redfish is a security decision as much as an operational one, and why the provisioning pipeline must enforce signed golden images, verified firmware (PLDM-over-MCTP, OCP firmware-update spec), and platform attestation at join time rather than trusting that a node is what it claims to be.

The discipline: a node should not enter the schedulable pool until it has (1) booted a signed image, (2) attested its firmware and boot measurements against expected values, and (3) passed the performance acceptance gate. Folding attestation into the same join gate that already runs the burn-in tests is the cheap, correct place to do it — the node is already isolated and under test, so adding a trust check costs nothing extra operationally. Skipping it means your fast, automated provisioning pipeline is also a fast, automated way to admit a tampered or counterfeit node. The canonical treatment of the firmware/BMC root of trust lives in Chapter 11.4; supply-chain provenance and HBOM/RIM in Chapter 11.3.
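A sketch of the three-part join gate follows. It only combines results: verifying the image signature and the attestation quote is assumed to have happened upstream, with real cryptographic tooling. The measurement names (such as `pcr0`) and the reason strings are illustrative. Digests are compared with `hmac.compare_digest` from the Python standard library, which performs a constant-time comparison.

```python
import hmac
from typing import Dict, List, Tuple


def may_join(image_signature_valid: bool,
             measured: Dict[str, str],
             expected: Dict[str, str],
             perf_gate_passed: bool) -> Tuple[bool, List[str]]:
    """Return (admit, reasons). The node is admitted only if every check passes."""
    reasons: List[str] = []
    if not image_signature_valid:
        reasons.append("image-signature-invalid")
    # The set of measurements must match exactly, and each digest must be equal.
    if measured.keys() != expected.keys() or not all(
        hmac.compare_digest(measured[k], expected[k]) for k in expected
    ):
        reasons.append("attestation-mismatch")
    if not perf_gate_passed:
        reasons.append("performance-gate-failed")
    return (not reasons, reasons)


# Hypothetical hex digests for illustration.
expected = {"pcr0": "ab12", "bmc_fw": "cd34"}
```

Collecting all the reasons instead of stopping at the first one is a small choice that helps triage: a node that fails both attestation and performance is a different problem from one that is merely slow.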

Anti-patterns

The same provisioning failures recur, and each one traces back to treating hardware as pets or skipping a gate:

Imperative bring-up that never converges to identical nodes. A runbook-driven fleet where nodes provisioned weeks apart differ in driver, NCCL, or firmware version. The cost is invisible until a synchronous job hangs on a collective or silently runs at degraded bandwidth, and the debugging is slow because the difference is buried in version skew no one tracked (Chapter 10.4).
Skipping or shortcutting burn-in to hit a deadline. Admitting nodes that passed a smoke test but never soaked under sustained load. The marginal nodes leak into production and strand synchronous jobs at a cost far exceeding the GPU-hours the shortcut saved.
Mutable long-lived nodes that drift. A config-management fleet where nodes are patched in place for years until no two are identical and incident-time hand-fixes are never reconciled. The fix — reimaging from a golden artifact — is the thing the architecture made expensive, so it never happens.
Provisioning treated as one-time. A pipeline built for day-0 and abandoned, so the day-2 loop (drain, reimage, re-validate, re-admit) is manual — which means at ~3-hour cluster MTBF the fleet is perpetually behind on recovering failed capacity, and goodput bleeds out one un-recycled node at a time.
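The last anti-pattern can be put in numbers with a deliberately simple model. Assume every failure costs a fixed overhead (work lost since the last checkpoint, plus detection, node replacement, and restart) and that goodput is one minus that overhead divided by the cluster MTBF. This model is an assumption made for illustration, not the chapter's or any vendor's definition of goodput. Under it, the ~3-hour MTBF and >90% goodput figures quoted earlier imply a budget per failure.

```python
def overhead_budget_minutes(mtbf_hours: float, goodput_target: float) -> float:
    """Maximum overhead per failure, in minutes, that still meets the target.

    Simplified model: goodput = 1 - overhead / MTBF.
    """
    if mtbf_hours <= 0 or not 0 < goodput_target < 1:
        raise ValueError("need mtbf_hours > 0 and 0 < goodput_target < 1")
    return (1.0 - goodput_target) * mtbf_hours * 60.0


budget = overhead_budget_minutes(mtbf_hours=3.0, goodput_target=0.90)
print(f"about {budget:.0f} minutes per failure")
```

On these assumptions the whole recovery, including lost work, has to fit in about 18 minutes. A manual drain, reimage, and re-validate cycle does not fit in that window, which is why the loop has to be automated.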

Provisioning sits between the node-software stack it deploys (driver, CUDA/ROCm, NCCL, firmware, the golden stack), covered in Chapter 10.4, and the scheduler it joins nodes into, covered in Chapter 10.1 and Chapter 10.2. The acceptance gates here are the deploy-time counterpart of factory burn-in in Chapter 7.14; the continuous day-2 recovery loop is engineered in Chapter 10.7 and observed in Chapter 10.6. Fast provisioning becomes a sellable SLA in Chapter 10.9; the OOB/Redfish management plane is a security root-of-trust surface in Chapter 11.4, with supply-chain provenance in Chapter 11.3; and the refresh economics that make a cheap day-2 loop pay off live in Chapter 14.9.