Chapter 10.5
Provisioning, Bring-Up & Infrastructure as Code
Provisioning is where a rack of stranded silicon becomes a revenue-earning node. The choice between treating physical hardware as artisanal pets or as programmable, declaratively described cattle decides how many GPU-hours you burn between 'powered on' and 'production' on every refresh, for the life of the fleet.
What you'll decide here
- Whether bare-metal lifecycle is driven imperatively (a runbook a human follows) or declaratively as Infrastructure-as-Code (a Redfish/Ironic/MAAS pipeline that converges a rack from PDU-on to cluster-joined) — because the second is the only one that survives a 100k-GPU fleet and a generational refresh.
- Golden, immutable OS images versus live configuration management (Ansible/Puppet) on a mutable base — the fork that sets your reproducibility, your drift surface, and whether two nodes in a synchronous job are ever bit-for-bit identical.
- What burn-in and acceptance gates a node must pass before it is allowed to carry a paying job — the duration, the thresholds, and the pass/fail bar — because a marginal node admitted to a synchronous cluster does not fail cleanly, it strands the whole job.
- Build-versus-buy on the cluster-management platform: an integrated vendor stack (NVIDIA Base Command Manager / Mission Control) versus an open, assembled pipeline (MAAS/Ironic + Terraform + Ansible + Slurm) — trading time-to-first-job against lock-in and fleet flexibility.
- The day-2 lifecycle model: how a node is drained, reimaged, re-validated, and returned to the pool — because provisioning is not a one-time event but a continuous loop that 'renews' node-level MTBF against a fleet that fails every few hours.
By the time this chapter becomes relevant, the hard capital decisions are made: the silicon is bought (Part 4), the racks are integrated and leak-tested at the factory (Part 6), the fabric is cabled (Part 8), and the node software stack (driver, CUDA/ROCm, NCCL, firmware) is specified (Chapter 10.4). What remains is the unglamorous, schedule-critical seam that converts all of that into a node a scheduler can place a job on: provisioning and bring-up. It is unglamorous because no one buys a data center to admire its PXE pipeline. It is schedule-critical because every day a rack sits powered but unprovisioned is a day of depreciation with no offsetting revenue. On a fleet earning roughly $10–12B per GW per year, the time from rack-on-the-floor to first paid job is worth millions of dollars per week.
This chapter is about one decision repeated at every layer: do you treat physical hardware as pets or as cattle? The pet model — a human follows a runbook, SSHes into a BMC, mounts an ISO, installs an OS, runs a few checks, and blesses the node — works at ten nodes and collapses at ten thousand. The cattle model — every node is described declaratively, provisioned by a pipeline, validated by an automated gate, and reimaged on demand — is the only one that survives fleet scale and the relentless generational refresh that defines the 2026 era. We trace the bare-metal lifecycle (BMC/Redfish to cluster-join), the golden-image-versus-config-management fork, the burn-in and acceptance gates that decide what a node must prove before it carries load, the build-versus-buy choice on the management platform, and the day-2 loop that makes provisioning continuous rather than one-shot.
The bare-metal lifecycle: from PDU-on to cluster-joined
A bare-metal node has a deterministic lifecycle, and the discipline is to make every transition in it machine-driven rather than human-driven. The canonical sequence — as implemented by MAAS, OpenStack Ironic, and every neocloud's home-grown equivalent — is: enroll → power-control → network-boot → image → configure → validate → join → (eventually) decommission. Walk it once, because each step is an out-of-band (OOB) or in-band operation with a specific protocol, and the protocol choice is itself a decision with consequences.
Enroll and power-control happen out-of-band, through the BMC. The baseboard management controller is a small always-on computer beside the host, reachable on a separate management network even when the host is powered off. Historically you spoke to it with IPMI — a 1998-era protocol with a long CVE history and weak auth. The modern path is Redfish, a RESTful, JSON, schema-defined DMTF standard that does the same job (power on/off/cycle, set one-time PXE boot, read sensors, mount virtual media) but is scriptable, securable, and increasingly the only thing a 2026 platform targets. The fork here is real: a fleet still leaning on IPMI inherits a fragile, hard-to-secure management plane that becomes a security liability (the BMC is a root-of-trust target — see Chapter 11.4); a fleet standardized on Redfish gets a uniform, auditable OOB API across Dell iDRAC, HPE iLO, Lenovo XCC, and Supermicro. Redfish is also how power-smoothing and capping controls reach GB300-class racks at run time, so the management plane you build for provisioning is the same one you later use for grid-friendliness (Chapter 4.5).
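The two OOB operations named above (set a one-time PXE boot, then power-cycle) map onto two standard Redfish calls: a PATCH of the `Boot` override on the ComputerSystem resource and a POST to its `ComputerSystem.Reset` action. The sketch below only builds the requests, using the Python standard library. The BMC hostname and system ID are hypothetical placeholders, and authentication is deliberately left out; a real pipeline would add a session token or credentials and verify TLS.

```python
import json
import urllib.request

# Hypothetical values for illustration; real ones come from the fleet inventory.
BMC_URL = "https://bmc-rack12-node03.example.internal"
SYSTEM_ID = "1"  # vendor-specific; discoverable via GET /redfish/v1/Systems


def pxe_once_request(bmc: str, system_id: str) -> urllib.request.Request:
    """Build the PATCH that sets a one-time PXE boot override."""
    body = {"Boot": {"BootSourceOverrideEnabled": "Once",
                     "BootSourceOverrideTarget": "Pxe"}}
    return urllib.request.Request(
        url=f"{bmc}/redfish/v1/Systems/{system_id}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def reset_request(bmc: str, system_id: str,
                  reset_type: str = "ForceRestart") -> urllib.request.Request:
    """Build the POST that invokes the ComputerSystem.Reset action."""
    body = {"ResetType": reset_type}  # e.g. "On", "ForceOff", "ForceRestart"
    return urllib.request.Request(
        url=f"{bmc}/redfish/v1/Systems/{system_id}"
            "/Actions/ComputerSystem.Reset",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is left to the caller, e.g. urllib.request.urlopen(req), once
# authentication headers and a TLS context have been attached.
```

Which `ResetType` values a given BMC accepts varies by vendor; the ComputerSystem resource advertises the allowed values, so a careful pipeline reads them before choosing one.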
Network-boot and image happen in-band, over the provisioning fabric. Once the BMC sets a one-time PXE boot and powers the node on, it network-boots into an ephemeral environment (a small in-memory Linux), which then writes the real OS image to local disk — MAAS uses curtin, Ironic uses its deploy ramdisk — and hands off to cloud-init for first-boot configuration. Configure applies the node's role: storage layout, network (the all-important fabric NIC and RDMA setup), and the pinned node-software stack. Validate runs the acceptance gates (below). Only after the gates pass does the node join the scheduler's pool and become eligible to carry work. The entire choreography — MAAS describes it as Allocate → PXE boot → ephemeral environment → curtin → cloud-init → reboot → deployed — must be expressible as code, parameterized by an inventory, and idempotent, so that re-running it on a flaky node is safe and produces an identical result.
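One way to picture "expressible as code and idempotent" is a small converge loop that walks the lifecycle stages in order, records what has completed, and skips those stages on a re-run. This is a minimal sketch, not how MAAS or Ironic implement it: the stage names come from the lifecycle above, while the `Node` class and the per-stage handlers are hypothetical stand-ins for real provisioner calls.

```python
from dataclasses import dataclass, field

# Stage names follow the lifecycle in the text (decommission omitted).
STAGES = ["enroll", "power-control", "network-boot", "image",
          "configure", "validate", "join"]


@dataclass
class Node:
    name: str
    completed: list = field(default_factory=list)  # stages already done


def converge(node: Node, handlers: dict) -> list:
    """Run each pending stage in order and stop at the first failure.

    Returns the stages executed by this call. Calling it again on a
    fully converged node runs nothing, which is what makes a retry safe.
    """
    ran = []
    for stage in STAGES:
        if stage in node.completed:
            continue                      # already done, skip on re-run
        if not handlers[stage](node):     # handler returns True on success
            break                         # leave later stages pending
        node.completed.append(stage)
        ran.append(stage)
    return ran
```

A production pipeline would derive "already done" from the node's observed state rather than from a recorded list, so that a node that silently regressed is caught; the recorded list keeps the sketch short.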
Golden images vs configuration management
Inside the declarative camp there is a second fork that practitioners argue about endlessly, and both answers are defensible depending on what you optimize for: immutable golden images versus live configuration management.
The golden-image model bakes a complete, tested OS — kernel, pinned GPU driver, CUDA/ROCm, NCCL, container runtime, firmware-update tooling, health agents — into a single immutable artifact, versioned and signed. Provisioning a node means writing that exact image and rebooting; there is no per-node configuration step that can drift. The payoff is the strongest possible reproducibility and the smallest drift surface: a node is either running golden-image v2026.06.2 or it is not, and 'not' is a clean, detectable state. The cost is that every change — a driver bump, a security patch, an agent update — requires building, testing, and rolling a new image across the fleet, which makes small changes heavyweight and pushes you toward batched releases.
The configuration-management model (Ansible, Puppet, Salt) starts from a thinner base image and converges the node to a desired state by running roles/manifests in place. It is lighter for incremental change — patch one package, re-run the play — and more flexible for heterogeneous fleets. The cost is drift: a long-lived mutable node accumulates state no manifest captures (a hand-fixed config during an incident, a half-applied update, a package installed to debug something at 3 a.m.), and over months no two nodes are truly identical. For a fleet whose headline failure mode is 'a marginal node silently degrades a synchronous job', drift is not a hygiene issue — it is a goodput tax.
The 2026 consensus for AI fleets leans toward immutable golden images for the OS and node-software stack, with thin config management only for genuinely per-node state (hostname, fabric addressing, role labels). The reasoning is the same one that governs the node-software stack: a synchronous training cluster wants every node bit-for-bit identical, and the cheapest way to guarantee that is to make the node a disposable instantiation of a versioned artifact rather than a long-lived entity you keep patching. The decision below shows the tradeoff explicitly.
| Dimension | Immutable golden image | Live config management | Hybrid (default) |
|---|---|---|---|
| Reproducibility | Strongest — node is a versioned artifact | Weak over time — converges but drifts | Strong — stack is immutable, only per-node state varies |
| Drift surface | Near-zero (node is rebuilt, not patched) | Large — mutable nodes accumulate untracked state | Small — confined to declared per-node config |
| Cost of a small change | High — rebuild + test + roll a new image | Low — patch a package, re-run the play | Medium — image for the stack, play for the rest |
| Rollback | Trivial — re-deploy the previous image | Hard — un-applying state is rarely clean | Trivial for the stack; play-level for the rest |
| Fit for synchronous training | Excellent — guarantees node-identical fleet | Risky — version skew hangs collectives | Excellent — same guarantee, less rebuild churn |
| Best when | Large homogeneous fleet, frequent refresh | Heterogeneous fleet, many small bespoke changes | AI fleet that wants identity + incremental agility |
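The claim that a node "is either running the golden image or it is not" can be made concrete as a manifest comparison. The sketch below assumes each node can report the versions of its stack components as a flat dictionary; the component names and version strings other than the image tag from the text are hypothetical placeholders.

```python
# Manifest for the current golden image. Only the image tag comes from the
# text; the other entries are placeholder values for illustration.
GOLDEN = {
    "image": "v2026.06.2",
    "gpu_driver": "drv-1.0",
    "nccl": "nccl-1.0",
}


def drift(golden: dict, reported: dict) -> dict:
    """Return {component: (expected, actual)} for every difference.

    Components missing on the node show actual=None; components present
    on the node but absent from the manifest show expected=None.
    """
    keys = golden.keys() | reported.keys()
    return {
        k: (golden.get(k), reported.get(k))
        for k in sorted(keys)
        if golden.get(k) != reported.get(k)
    }


# Example: a node with a bumped driver and an untracked debug package.
node_report = {
    "image": "v2026.06.2",
    "gpu_driver": "drv-1.1",
    "nccl": "nccl-1.0",
    "debug_pkg": "0.3",
}
print(drift(GOLDEN, node_report))
```

An empty result is the clean state; anything else is grounds to reimage rather than patch in place, which is the behavior the immutable and hybrid columns rely on.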
Burn-in and acceptance gates: what a node must prove
Provisioning a node is not the same as trusting it. The single most expensive provisioning mistake on an AI fleet is admitting a marginal node — one that boots fine and passes a smoke test but fails under sustained thermal and bandwidth load — into a synchronous job. Because the whole job moves at the speed of its slowest straggler and restarts from a checkpoint when any node fails, one bad node does not degrade gracefully; it strands hundreds or thousands of healthy GPUs (Chapter 10.7). The defense is a burn-in and acceptance gate: a battery of stress and benchmark tests, with explicit pass/fail thresholds, that a node must clear before it is allowed into the schedulable pool. This is the provisioning-side counterpart to the factory burn-in done at integration (Chapter 7.14) — and it must be repeated, because nodes that passed at the factory routinely fail again after transport, install, and first thermal cycle in the actual hall.
The mature reference is a phased, subsystem-by-subsystem validation that builds up from the single GPU to the full cluster, codified by practitioners like Together AI and standardized in spirit by SemiAnalysis's ClusterMAX rating. The phases run in order — preparation/config, GPU validation, NVLink/NVSwitch, network fabric, storage, end-to-end model build, then continuous observability — so that a failure is localized to the cheapest layer that can explain it rather than discovered as a mysterious slowdown in a full training run. Each phase has named tools and, critically, named thresholds.
| Phase | What it proves | Primary tools | Representative pass bar |
|---|---|---|---|
| GPU validation | Each GPU is present, healthy, thermally stable under sustained load | DCGM Diagnostics, gpu-burn | No XID/SXID errors; no thermal throttle; ECC clean over a multi-hour soak |
| NVLink / NVSwitch | Intra-node GPU-to-GPU bandwidth at spec | nvbandwidth, NCCL tests | Measured NVLink bandwidth near spec; no degraded links |
| Network fabric (RDMA) | Inter-node collective bandwidth at spec across the back-end fabric | NCCL all_reduce, iperf3 | all_reduce busbw ~92% of the theoretical fabric maximum (~370 GB/s on a 400 GB/s fabric) |
| Storage | I/O meets the checkpoint/data-load budget | FIO | Read/write throughput and IOPS meet the storage SLA for the workload |
| End-to-end model build | The whole stack delivers expected MFU on a reference job | Reference training run | Achieved MFU/goodput within tolerance of the validated reference |
| Burn-in soak | No early-life failures under days of continuous load | DCGM + gpu-burn + NCCL loops | Clean run over a multi-day soak (commonly 72–168 hr) before production admission |
Two numbers anchor the gate. The fabric phase has a real, citable threshold: a healthy node should hit roughly 92% of the theoretical fabric maximum on NCCL all_reduce — about 370 GB/s on a 400 GB/s fabric (Together AI). A node that comes in materially below that has a marginal optic, a mis-seated cable, or a routing problem, and admitting it pollutes the bisection bandwidth of every collective it later participates in. The burn-in phase has a duration debate rather than a single number: practitioner ranges cluster around 72–168 hours of continuous soak for new clusters (Introl; neocloud operator reports), and the open question — how much soak is economically optimal before the marginal-node catch rate stops justifying the GPU-hours spent soaking — remains genuinely unsettled in public data. The decision you must make explicitly is the pass bar: set it too low and marginal nodes leak into production and strand jobs; set it too high and you burn revenue GPU-hours soaking hardware that was already fine.
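The pass bar can be written down as code so that it is reviewed and versioned like everything else in the pipeline. The sketch below encodes the two anchors from the text: the 92% fabric threshold and a minimum clean-soak duration (set here to 72 hours, the low end of the quoted range, purely as an example). The bus-bandwidth formula is the one the nccl-tests project documents for all_reduce; the result field names are hypothetical.

```python
FABRIC_MAX_GBS = 400.0      # theoretical fabric maximum used in the text
MIN_FABRIC_FRACTION = 0.92  # pass bar quoted in the text
MIN_SOAK_HOURS = 72         # example choice from the 72-168 hr range


def allreduce_busbw(size_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in GB/s for all_reduce: algbw * 2 * (n - 1) / n."""
    algbw = size_bytes / seconds / 1e9
    return algbw * 2 * (ranks - 1) / ranks


def fabric_gate(busbw_gbs: float) -> bool:
    """True when measured busbw reaches the required share of the maximum."""
    return busbw_gbs >= MIN_FABRIC_FRACTION * FABRIC_MAX_GBS


def evaluate(results: dict) -> list:
    """Return a list of human-readable failures; empty means admit."""
    failures = []
    if results["xid_errors"] > 0:
        failures.append("gpu: XID errors during soak")
    if not fabric_gate(results["allreduce_busbw_gbs"]):
        failures.append("fabric: all_reduce busbw below pass bar")
    if results["soak_hours_clean"] < MIN_SOAK_HOURS:
        failures.append("soak: clean run shorter than required")
    return failures
```

Keeping the thresholds as named constants makes the pass-bar decision explicit: raising `MIN_SOAK_HOURS` or `MIN_FABRIC_FRACTION` is a reviewed change with a visible GPU-hour cost, not a tweak buried in a test script.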
Build vs buy the cluster-management platform
Everything above can be assembled from open components or bought as an integrated stack, and this is the provisioning-layer instance of the build-versus-buy fork that recurs throughout the platform (Chapter 10.1). The open, assembled path wires together a bare-metal provisioner (MAAS or OpenStack Ironic), an IaC layer (Terraform driving Redfish/Ironic, plus Ansible/cloud-init), a scheduler (Slurm or Kubernetes), and your own health-check and observability glue. The integrated vendor path buys a turnkey fleet-management product — NVIDIA Base Command Manager for provisioning and cluster management, NVIDIA Mission Control for autonomous validation and hardware recovery on DGX/GB200-class systems — that ships the bring-up, health-check, and recovery workflows as a supported product.
The tradeoff is the familiar one. The integrated path compresses time-to-first-job — the workflows are pre-built, validated against the reference architecture, and supported — at the cost of lock-in to a vendor's hardware and operational model, and a price premium. The open path preserves fleet flexibility (heterogeneous vendors, a second-source GPU like AMD MI300X/MI350X under ROCm, bespoke validation gates) and avoids per-node licensing, at the cost of carrying the integration and reliability engineering yourself — which is real headcount, not a weekend project. The honest decision rule: a single-vendor, homogeneous, NVIDIA-reference fleet that values speed-to-revenue over flexibility leans buy; a multi-vendor neocloud differentiating on bring-up speed and tenant flexibility, or anyone hedging GPU lock-in, leans build. Many large operators do both — buy the integrated stack for the homogeneous reference pods, build the open pipeline for the heterogeneous and experimental capacity.
Deep dive: MAAS/Ironic + Terraform/Redfish — bare-metal-as-code, and why it is an economic lever
The phrase that captures the 2026 shift is bare-metal-as-code: treating a physical rack the way cloud treats a VM, as a programmable, reservable, declaratively described resource rather than an artifact a technician hand-builds. The reference open stack is MAAS or OpenStack Ironic at the physical layer, exposing each machine through its BMC (Redfish preferred, IPMI legacy), driven by Terraform so that 'allocate, image, and configure this rack' is a version-controlled plan, with Ansible/cloud-init handling in-OS convergence. MAAS 3.7 even provisions NVIDIA BlueField DPUs directly through their BMC, extending the same model to the smart-NIC layer.
Why this is an economic lever and not just an ops convenience: the metric that matters is time-from-rack-to-revenue, and on a fleet earning ~$10–12B/GW/year, compressing that interval is directly monetizable. A declarative pipeline turns a multi-day, technician-bound bring-up into a parameterized run that can image and validate racks in parallel, unattended, overnight. It also makes the day-2 loop cheap: when a node fails (and at ~3-hour cluster MTBF, nodes fail constantly), the same pipeline drains, wipes, re-images, re-validates, and re-admits it without a human re-deriving its state — the mechanism behind best-in-class fleets' ~90-second automated node replacement. The capital case is straightforward: the engineering cost of building the pipeline is fixed and one-time; the GPU-hours it saves recur on every node, every refresh, for the life of the fleet. → continuous-recovery framing in Chapter 10.7; the productization of fast provisioning as an SLA in Chapter 10.9.
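The time-from-rack-to-revenue argument reduces to simple arithmetic on the revenue figure quoted above. The sketch below treats the full revenue rate as foregone while capacity sits idle, which is an upper-bound simplification; the 20 MW tranche and the 7-day delay are hypothetical inputs chosen only to show the scale.

```python
# USD per GW per year, the range quoted in the text.
ANNUAL_REVENUE_PER_GW = (10e9, 12e9)


def foregone_revenue(annual_per_gw: float, capacity_mw: float,
                     idle_days: float) -> float:
    """Revenue not earned while capacity_mw sits powered but unprovisioned."""
    return annual_per_gw * (capacity_mw / 1000.0) * (idle_days / 365.0)


# Hypothetical scenario: a 20 MW tranche delayed by one week.
for rate in ANNUAL_REVENUE_PER_GW:
    cost = foregone_revenue(rate, capacity_mw=20, idle_days=7)
    print(f"${rate / 1e9:.0f}B/GW/yr: ${cost / 1e6:.1f}M foregone")
```

Under these assumptions a single 20 MW tranche idle for a week forgoes roughly $3.8M to $4.6M, which is the sense in which bring-up time is measured in millions per week.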
Day-2 lifecycle: provisioning as a continuous loop
The mistake that organizations make after a successful initial deployment is to treat provisioning as done. It is not. On a fleet where a 16,000-GPU cluster fails roughly every three hours (Chapter 10.7) and where a generational hardware refresh arrives on a 12–18 month cadence, provisioning is a continuous loop, and the same declarative pipeline that did day-0 bring-up is what runs it. The day-2 cycle is: detect a failing or suspect node → cordon and drain its jobs → wipe and re-image from the current golden image → re-run the acceptance gate → re-admit to the pool (or route to break-fix). Done well, this loop also serves a subtler purpose: renewing node-level MTBF. A node that has run continuously for months accumulates wear and drift; periodically cycling nodes through re-validation and reimaging — a pre-allocation burn-in before a node is handed to a new tenant or job — catches degradation early and resets the node to a known-good baseline, which is why validation-first, proactive fleet management has become standard practice rather than reactive break-fix.
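The day-2 cycle described above can be sketched as one function that drains a suspect node, reimages it, re-runs the gate, and either re-admits it or routes it to break-fix. The callables passed in (`drain`, `reimage`, `gate`) and the retry limit are hypothetical; in practice they would wrap the scheduler, the provisioner, and the acceptance suite.

```python
def recycle(node: str, drain, reimage, gate, max_attempts: int = 2) -> str:
    """Run one pass of the day-2 loop for a suspect node.

    Returns "re-admitted" if the node passes the gate after a reimage,
    otherwise "break-fix" once the attempts are exhausted.
    """
    drain(node)                    # cordon the node and drain its jobs
    for _ in range(max_attempts):
        reimage(node)              # wipe and write the current golden image
        if gate(node):             # re-run the acceptance gate
            return "re-admitted"
    return "break-fix"             # repeated failure points to hardware


# Usage with stub callables that just record what happened.
log = []
outcome = recycle(
    "node-0042",
    drain=lambda n: log.append(("drain", n)),
    reimage=lambda n: log.append(("reimage", n)),
    gate=lambda n: True,
)
print(outcome, log)
```

Bounding the reimage attempts matters: a node that fails the gate twice on a fresh image is unlikely to be a software problem, and looping on it only burns pipeline capacity.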
The day-2 loop is also where the density-ramp thread reappears. A refresh is not a like-for-like swap: a hall provisioning GB200 NVL72 today must re-image and re-validate against a new node-software stack, new firmware, new fabric parameters, and a heavier power/thermal envelope when the next generation lands. A declarative pipeline absorbs this as a new image version and a new acceptance threshold; an imperative, pet-based process treats every refresh as a fresh manual project, and the difference compounds across the asset's life. The provisioning architecture you choose at day-0 is, in effect, a bet on how cheaply you can absorb every refresh after it. → refresh execution economics in Chapter 14.9.
Deep dive: the security seam — provisioning is also where supply-chain and firmware trust is established
Bring-up is not only an availability problem; it is the moment the fleet's hardware root of trust is established or forfeited. The OOB management plane you build for provisioning — the BMCs, the Redfish endpoints, the provisioning network — is simultaneously a high-value attack surface: a compromised BMC sits below the OS, persists across reimaging, and can subvert measured boot. This is why the move from IPMI to Redfish is a security decision as much as an operational one, and why the provisioning pipeline must enforce signed golden images, verified firmware (PLDM-over-MCTP, OCP firmware-update spec), and platform attestation at join time rather than trusting that a node is what it claims to be.
The discipline: a node should not enter the schedulable pool until it has (1) booted a signed image, (2) attested its firmware and boot measurements against expected values, and (3) passed the performance acceptance gate. Folding attestation into the same join gate that already runs the burn-in tests is the cheap, correct place to do it — the node is already isolated and under test, so adding a trust check costs nothing extra operationally. Skipping it means your fast, automated provisioning pipeline is also a fast, automated way to admit a tampered or counterfeit node. The canonical treatment of the firmware/BMC root of trust lives in Chapter 11.4; supply-chain provenance and HBOM/RIM in Chapter 11.3.
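The three-part join condition can be expressed as a single predicate evaluated at the gate. The sketch below illustrates only the logic: it compares reported measurement digests against expected values and combines that with the signature and performance results. Real attestation relies on hardware-backed evidence and signature verification, which are out of scope here; the component names and inputs are hypothetical.

```python
import hashlib
import hmac


def measurement(blob: bytes) -> str:
    """SHA-256 digest of a firmware or boot component, as a hex string."""
    return hashlib.sha256(blob).hexdigest()


def attested(reported: dict, expected: dict) -> bool:
    """True only if every expected component is present and matches."""
    return all(
        hmac.compare_digest(reported.get(name, ""), digest)
        for name, digest in expected.items()
    )


def may_join(image_signature_ok: bool, reported: dict, expected: dict,
             perf_gate_passed: bool) -> bool:
    """All three conditions from the text must hold before pool admission."""
    return (image_signature_ok
            and attested(reported, expected)
            and perf_gate_passed)


# Hypothetical expected values for two components.
EXPECTED = {
    "bmc_firmware": measurement(b"bmc-fw-build-A"),
    "bootloader": measurement(b"bootloader-build-A"),
}
```

Evaluating the trust checks in the same predicate as the performance result means there is no code path where a fast node skips attestation, which is the failure the paragraph above warns about.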
Anti-patterns
The same provisioning failures recur, and each one traces back to treating hardware as pets or skipping a gate:
- Imperative bring-up that never converges to identical nodes. A runbook-driven fleet where nodes provisioned weeks apart differ in driver, NCCL, or firmware version. The cost is invisible until a synchronous job hangs on a collective or silently runs at degraded bandwidth, and the debugging is slow because the difference is buried in version skew no one tracked (Chapter 10.4).
- Skipping or shortcutting burn-in to hit a deadline. Admitting nodes that passed a smoke test but never soaked under sustained load. The marginal nodes leak into production and strand synchronous jobs at a cost far exceeding the GPU-hours the shortcut saved.
- Mutable long-lived nodes that drift. A config-management fleet where nodes are patched in place for years until no two are identical and incident-time hand-fixes are never reconciled. The fix — reimaging from a golden artifact — is the thing the architecture made expensive, so it never happens.
- Provisioning treated as one-time. A pipeline built for day-0 and abandoned, so the day-2 loop (drain, reimage, re-validate, re-admit) is manual — which means at ~3-hour cluster MTBF the fleet is perpetually behind on recovering failed capacity, and goodput bleeds out one un-recycled node at a time.