Chapter 13.2

Documentation, Scripts & Acceptance Test Plans

A commissioning script is a contract written in numbers: every test either has a pre-agreed quantitative pass/fail gate and a witnessed signature, or it is theatre — and on an AI factory the only test that proves the building works is one a facility load bank physically cannot run.


What you'll decide here

Whether each script carries a hard, pre-agreed numerical acceptance gate with a named witness and a redline-on-fail rule — or a soft 'engineer's judgement' clause that turns every disputed result into a change order.
Where the facility acceptance boundary stops and the cluster acceptance boundary starts — and who owns the seam where a resistive load bank's heat is rejected but no real GPU transient has ever been seen.
What instrumentation and data-acquisition basis the pass/fail gates are read against — calibrated reference instruments and a captured time-series, or the building's own BMS sensors marking their own homework.
How deficiencies are classified (A/B/C severity) and which classes are go-live blockers versus warranty-list items — because the punch list, not the script binder, is what actually gates handover.
Whether you capture a signed baseline 'fingerprint' of every subsystem at acceptance, so day-2 drift has a reference — or accept that the first time you characterise the plant is the day it misbehaves.

Chapter 13.1 set the governance frame — the L1–L5 ladder, the two parallel facility and IT tracks, and the governing documents (OPR/BOD/SOO) that everything traces back to. This chapter is the layer where governance becomes execution: the actual documents the commissioning agent writes, the scripts the field runs, and the gates that decide whether a system passes. It is the least glamorous chapter in Part 13 and the one that most reliably separates a building that goes live on schedule from one that slips a quarter while two firms argue about whether a 14-second generator transfer was a pass.

The question that recurs at every level is the same: did you write a quantitative, pre-agreed acceptance gate, or did you leave it soft? A script that says 'UPS shall transfer to battery without dropping the load' is unfalsifiable — what is 'the load' under no real load, and what voltage sag counts as a 'drop'? A script that says 'on loss of utility, bus voltage shall not sag below 90% nominal for more than 10 ms, verified against a calibrated power-quality analyser logging at ≥ 10 kHz, witnessed by Owner and CxA' is a contract. The difference is not pedantry. It is the difference between a deficiency you can force the contractor to fix on their dime and a 'disagreement' that becomes a change order on yours. Commissioning is broadly cited at 1–3% of total project cost; the rework and downtime it prevents is multiples of that — but only if the gates are hard enough to enforce. → Chapter 13.1.

Anatomy of a commissioning script

A commissioning script (variously a test procedure, ATP step, or functional-performance test) is not a checklist. A checklist asks 'did you do X?'; a script asks 'when you do X, does the measured result fall inside the gate?' Every well-formed script has the same eight fields, and the absence of any one of them is where disputes are born:

Unique ID and traceability — back to a specific OPR/BOD requirement and a SOO step, so a passing script proves a design intent, not just an action.
Pre-conditions — the exact system state, isolations, and safety lockouts that must hold before the step runs (the field's single most common shortcut, and the one that injures people).
Procedure — the numbered actions, written so a competent technician who has never seen the plant can execute them identically.
Expected result with a quantitative gate — a number, a tolerance band, and a unit. 'Within spec' is not a gate; 'bus voltage ≥ 90% nominal, with any excursion below that lasting ≤ 10 ms' is.
Instrumentation and DAQ basis — which calibrated instrument reads the result, its accuracy class, and its in-date calibration certificate.
Actual result — the recorded measurement, with a timestamp and the captured waveform/trend reference, not a tick.
Pass / Fail / Deferred determination — decided against the gate, not against an opinion in the room.
Witness signatures — CxA, Owner's representative, and the responsible contractor, each signing that they observed the result, not that they trust it.
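The eight fields can be sketched as a record that reports which of its fields are still blank at sign-off. This is a minimal illustration only: the class name, field names, and example values are hypothetical and are not taken from any Cx platform's schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CxScript:
    """One commissioning script; field names are illustrative assumptions."""
    script_id: str        # unique ID plus OPR/BOD/SOO traceability
    preconditions: str    # system state, isolations, safety lockouts
    procedure: str        # numbered actions
    gate: str             # number, tolerance band, unit
    instrumentation: str  # calibrated instrument and certificate reference
    actual_result: str    # recorded measurement with timestamp
    determination: str    # "pass", "fail", or "deferred"
    witnesses: str        # CxA, Owner's representative, contractor


def missing_fields(script: CxScript) -> List[str]:
    """Return the names of fields left blank, in declaration order."""
    return [f.name for f in fields(script) if not getattr(script, f.name).strip()]


# Hypothetical draft with two of the eight fields left empty.
draft = CxScript(
    script_id="EL-UPS-014 (traces to OPR 4.2, SOO step 7)",
    preconditions="100% load via load banks; all modules in service",
    procedure="1. Open utility breaker. 2. Observe through re-transfer.",
    gate="",
    instrumentation="PQ analyser, cal cert on file",
    actual_result="min 96.1% nominal, waveform ref W-0231",
    determination="pass",
    witnesses="   ",
)
print(missing_fields(draft))  # ['gate', 'witnesses']
```

A script with a recorded determination but no gate, as in the draft above, is exactly the case the chapter warns about: a result decided against nothing.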

The non-negotiable property is that the gate is agreed and frozen before the test is run. The instant you negotiate the acceptance threshold while looking at a failing result, you have lost the leverage commissioning exists to give you. Freeze the gates at script-review (a formal owner sign-off of the procedures, weeks before energization); after that they are a contract, not a draft.

The hard-gate vs soft-gate fork

If you take one decision from this chapter, take this one. A hard gate is a number, a tolerance, an instrument, and a witness, frozen at script review. A soft gate is any clause containing 'satisfactory', 'as judged by', 'approximately', or 'to the engineer's reasonable satisfaction'. Every soft gate is a deferred argument that resurfaces at the worst possible moment — during energization, with a live-load go-date and a room full of people who each remember the verbal agreement differently. The downstream cost is not abstract: soft gates convert deficiencies (contractor pays) into disputes (owner pays, schedule slips). Audit every script before energization and convert each soft gate to a number or strike the step. The one defensible soft clause is a documented engineering-judgement escalation path for genuinely novel results — but it must name who decides and on what data, not leave it to the room.
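The pre-energization audit lends itself to a first-pass text scan. The sketch below flags gate text containing the soft phrases named above; the function name and the example script IDs are hypothetical, and a phrase match is only a prompt for human review, not a verdict.

```python
from typing import Dict, List

# Phrases the chapter names as soft-gate markers.
SOFT_PHRASES = [
    "satisfactory",
    "as judged by",
    "approximately",
    "to the engineer's reasonable satisfaction",
]


def find_soft_gates(scripts: Dict[str, str]) -> Dict[str, List[str]]:
    """Map script ID to the soft phrases found in its gate text."""
    hits: Dict[str, List[str]] = {}
    for script_id, gate_text in scripts.items():
        # Case-insensitive match; normalise curly apostrophes first.
        lowered = gate_text.lower().replace("\u2019", "'")
        found = [p for p in SOFT_PHRASES if p in lowered]
        if found:
            hits[script_id] = found
    return hits


# Hypothetical gate clauses for two scripts.
gates = {
    "EL-UPS-014": "UPS transfer shall be satisfactory as judged by the engineer.",
    "EL-UPS-015": "Bus voltage shall not sag below 90% nominal for more than 10 ms.",
}
print(find_soft_gates(gates))
# {'EL-UPS-014': ['satisfactory', 'as judged by']}
```

A clean scan does not make a gate hard; a clause can avoid every listed phrase and still lack a number, an instrument, or a witness.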

Facility ATP vs cluster ATP: two acceptance boundaries, one seam

An AI factory has two acceptance regimes that meet at a seam, and most program risk lives in that seam. The facility ATP (the L3/L4 mechanical/electrical track) proves the building: switchgear, generators, UPS/BESS, chillers, CDUs, pumps, BMS, under load-bank load, culminating in L5 integrated systems testing. The cluster ATP / SAT (the IT track) proves the machine: node burn-in, fabric BER and bandwidth, NCCL collectives, storage throughput, scheduler behaviour, and a reference workload — accepted against goodput, not against load-bank kilowatts. They run on different schedules, against different standards, witnessed by different parties, and they are not interchangeable. → facility electrical in Chapter 13.3, cooling in Chapter 13.5, IST in Chapter 13.6; cluster burn-in in Chapter 13.8, fabric in Chapter 13.7, benchmarking in Chapter 13.9.

The seam exists because of a physics gap that is canonical to AI commissioning: the facility ATP exercises the building with a load the real workload does not resemble. A resistive load bank draws a flat, unity-power-factor, thermally-steady load that rejects its heat to air. A synchronous training job draws a spiky, microsecond-scale, multi-megawatt-swinging load that rejects its heat into cold plates and a liquid loop. The facility ATP can prove the power chain holds a steady 100 MW and the cooling plant rejects it; it cannot prove the UPS/BESS rides through a real GPU power transient, and it cannot push realistic transient heat-flux through a CDU's worst-case branch — because the load bank's heat never enters the liquid loop at all. That gap is the reason the SAT and a proxy training run are not optional 'IT validation' tacked on at the end; they are the only tests that exercise the realistic load. The dynamic-load realism problem is the canonical subject of Chapter 13.6; the cooling load-realism limit is engineered in Chapter 13.5.

Facility ATP vs cluster ATP/SAT — what each can and cannot prove

Dimension	Facility ATP (L3–L5)	Cluster ATP / SAT (IT track)
Object under test	Power, cooling, BMS — the building	Nodes, fabric, storage, scheduler — the machine
Applied load	Resistive/reactive/AI-emulating load banks	Real GPUs running burn-in + reference workload
Heat path exercised	Rejected to air (load banks); liquid loop only partly	Real heat into cold plates and the full liquid loop
Primary acceptance metric	kW held, °C delta-T, transfer ms, leak-free hold	BER, busbw (GB/s), goodput %, SDC count, FIO IOPS
Governing standards	ASHRAE Guideline 0 / DC Cx guideline, Uptime, BICSI 002	Vendor RA, ClusterMAX-class criteria, NCCL/MLPerf
Can prove	Plant holds steady design load; redundancy topology	Hardware health, fabric integrity, real-workload goodput
Cannot prove	Real GPU power/thermal transients; CDU worst-case branch under real flux	Facility ride-through under utility loss (needs the plant)

The two acceptance regimes of an AI factory. The bottom 'Cannot prove' row is why the seam between them is where program risk concentrates.

Instrumentation and data acquisition: who reads the gate

A pass/fail gate is only as trustworthy as the instrument that reads it, and the recurring sin is letting the building's own BMS grade its own homework. The facility's permanent sensors are installed for control and trending, not for metrology: a BMS temperature point may carry ±1–2 °C uncertainty and a multi-second poll interval, which is useless for accepting a delta-T gate of ±1 °C or a transfer gate of ±10 ms. Acceptance reads against calibrated reference instruments — power-quality analysers, thermal imagers, ultrasonic and Coriolis flow meters, calibrated PT/RTD references, micro-ohmmeters — each with an in-date NIST-traceable (or national-lab-traceable) calibration certificate attached to the script. A result without a certificate behind the instrument is not data; it is an anecdote.
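Whether an instrument is fit to read a given gate can be checked mechanically before the test. The sketch below assumes a 4:1 ratio between gate tolerance and instrument uncertainty, a common calibration rule of thumb that this chapter does not itself specify; the function name, the ±0.1 °C reference figure, and the dates are hypothetical.

```python
from datetime import date


def instrument_can_read_gate(
    instrument_uncertainty: float,
    gate_tolerance: float,
    cal_expiry: date,
    test_date: date,
    min_ratio: float = 4.0,  # assumed rule of thumb, not from the source
) -> bool:
    """True if the calibration is in date and the instrument uncertainty is
    at most gate_tolerance / min_ratio (both in the same units)."""
    in_date = cal_expiry >= test_date
    accurate_enough = instrument_uncertainty * min_ratio <= gate_tolerance
    return in_date and accurate_enough


test_day = date(2025, 6, 1)

# BMS point at +/-1.5 C against a +/-1 C delta-T gate: not usable.
bms_ok = instrument_can_read_gate(1.5, 1.0, date(2026, 1, 1), test_day)

# Hypothetical reference RTD at +/-0.1 C, certificate in date: usable.
rtd_ok = instrument_can_read_gate(0.1, 1.0, date(2026, 1, 1), test_day)

# Same RTD with an expired certificate: not usable.
expired_ok = instrument_can_read_gate(0.1, 1.0, date(2025, 5, 1), test_day)

print(bms_ok, rtd_ok, expired_ok)  # False True False
```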

Two DAQ decisions distinguish a serious program. First, sample rate must out-resolve the phenomenon: a UPS transfer or a generator pickup is a sub-100 ms event, so logging at hundreds of Hz to tens of kHz is required to even see the sag you are accepting against — a 1 Hz BMS trend will report 'no anomaly' through a transient that breached spec. Second, capture the full time-series, not the summary statistic: store the waveform and the trend, not just 'min 89.2%'. The captured series is what lets you adjudicate a disputed result after the fact, and it doubles as the baseline fingerprint discussed below. On AI factories this matters more than on legacy IT halls precisely because the loads are transient: the interesting failures live in the milliseconds, and a DAQ basis that cannot see milliseconds cannot accept against them. → fabric timing acceptance (PTP/IEEE-1588) as its own metrology problem in Chapter 8.7.
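The sample-rate point can be shown with a synthetic trace. The sketch below invents a 30 ms sag to 85% of nominal (values chosen only for illustration) and compares what a 1 Hz trend and a 10 kHz logger would each record as the minimum.

```python
def bus_voltage_pct(t: float) -> float:
    """Synthetic RMS bus voltage: 100% nominal with a 30 ms sag to 85%."""
    return 85.0 if 0.400 <= t < 0.430 else 100.0


def min_observed(sample_rate_hz: int, duration_s: float = 2.0) -> float:
    """Minimum voltage seen by a logger sampling at a fixed rate from t = 0."""
    n = int(duration_s * sample_rate_hz)
    return min(bus_voltage_pct(i / sample_rate_hz) for i in range(n))


bms_min = min_observed(1)       # 1 Hz trend samples at t = 0 s and t = 1 s
pq_min = min_observed(10_000)   # 10 kHz analyser

print(bms_min)  # 100.0, the sag falls between samples
print(pq_min)   # 85.0, the sag is captured
```

The 1 Hz logger reports a clean run through an event that breached a 90% gate, which is the 'no anomaly' failure mode described above.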

1–3%: commissioning as share of total project cost; prevents multiples in rework/downtime. Source: Industry Cx cost guidance (TrueLook / practitioner), 2025.
12–18 mo: lead time operators now lock in commissioning agents ahead of energization. Source: iRecruit / DC construction-trend reporting, 2025.
1e-12: default fabric BER acceptance threshold per port (InfiniBand ibdiagnet). Source: NVIDIA/Mellanox ibdiagnet manual, 2025.
72–168 hr: GPU node burn-in/soak duration gated before cluster acceptance. Source: Together AI / Introl validation guides, 2025.
~90% / ~96%: goodput acceptance bar, industry-avg vs best-in-class effective training time. Source: SemiAnalysis ClusterMAX / CoreWeave, 2025.
20–25 °C: CDU coolant inlet acceptance band; deviation can throttle GPUs up to ~50%. Source: NVIDIA OCP / Introl (GB200 NVL72), 2025.
99.982% / 99.995%: Tier III vs Tier IV availability the redundancy-topology scripts must demonstrate. Source: Uptime Institute Tier classification, 2025.
~115 / ~17 kW: NVL72 heat split (liquid vs air), the load a facility load bank cannot reproduce in the loop. Source: NVIDIA OCP / Introl, 2025.
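The Tier III and Tier IV availability figures quoted above are easier to reason about as allowed downtime. A short worked conversion, assuming a 365-day year (8,760 hours):

```python
HOURS_PER_YEAR = 8760  # 365-day year, an assumption for the arithmetic


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


tier3_hours = annual_downtime_hours(99.982)
tier4_hours = annual_downtime_hours(99.995)

print(f"Tier III: {tier3_hours:.2f} h/yr")          # about 1.58 h/yr
print(f"Tier IV:  {tier4_hours * 60:.0f} min/yr")   # about 26 min/yr
```

These budgets are small enough that a single failed transfer event can consume a year's allowance, which is why the redundancy-topology scripts carry hard gates.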

Deficiency and punch-list management: the document that actually gates go-live

The binder of passed scripts is the visible deliverable; the deficiency log is the one that decides whether you go live. A serious program treats every failed or partially-passed step as a tracked deficiency with an owner, a root cause, a corrective action, a re-test reference, and a severity classification — and the severity classification is the lever. A flat punch list where a mislabelled valve sits at the same priority as a failed UPS transfer guarantees that go-live becomes a negotiation about which items 'really' matter, conducted under schedule pressure. Pre-agree the severity tiers and which tiers block.

Class A (blocker) — a life-safety defect or a failure of a core redundancy/ride-through claim. Go-live cannot proceed until closed and re-tested. Example: failed automatic transfer to generator; an EPO that does not trip; a leak-detection interlock that does not isolate.
Class B (conditional) — a real deficiency that does not defeat the design basis. Go-live may proceed on a documented, dated corrective-action plan with an owner. Example: a single redundant pump trending warm; a BMS alarm mis-mapped but functional.
Class C (warranty/punch) — cosmetic or documentation-only items that roll to the warranty list. Example: missing label; as-built drawing not yet redlined.

The consequence of getting the tiering wrong cuts both ways. Tier too loosely and you carry a Class-A ride-through gap into live operation, where the first real utility loss finds it. Tier too strictly and you hold a live-block go-date hostage to a paint scratch. The discipline is to fix the tiering rules and the blocker list in writing at the same time you freeze the gates — before anyone has a result to argue about. Open Class-A and Class-B counts, trended to zero, are the real go-live gate, and they feed directly into the Operational Readiness review and the handover package. → handover and the Operational Readiness gate in Chapter 13.10.
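The tiering rules above reduce to a small decision function once they are written down. The sketch below is one possible encoding, with a hypothetical record shape and example log; it follows the class definitions as stated (Class A blocks until closed, Class B blocks unless it carries a documented plan, Class C never blocks).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deficiency:
    ref: str
    severity: str                      # "A", "B", or "C"
    closed: bool = False               # corrected and re-tested
    has_corrective_plan: bool = False  # documented, dated, with an owner


def go_live_blockers(log: List[Deficiency]) -> List[str]:
    """Return the refs that block go-live under the A/B/C rules."""
    blockers = []
    for d in log:
        if d.closed:
            continue
        if d.severity == "A":
            blockers.append(d.ref)  # must be closed and re-tested
        elif d.severity == "B" and not d.has_corrective_plan:
            blockers.append(d.ref)  # needs a dated plan with an owner
        # Class C rolls to the warranty list and never blocks.
    return blockers


# Hypothetical deficiency log.
log = [
    Deficiency("D-001", "A"),                            # failed generator transfer
    Deficiency("D-002", "B", has_corrective_plan=True),  # pump trending warm
    Deficiency("D-003", "B"),                            # BMS alarm mis-mapped, no plan
    Deficiency("D-004", "C"),                            # missing label
    Deficiency("D-005", "A", closed=True),               # EPO fixed and re-tested
]
print(go_live_blockers(log))  # ['D-001', 'D-003']
```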

Re-test scope creep: the deficiency that quietly invalidates a pass

When a Class-A deficiency is corrected, the seductive shortcut is to re-run only the one failed step. But a fix that touched the firmware on a UPS, the setpoints on a CDU, or the logic on the BMS may have invalidated every previously passed script that depended on that subsystem's behaviour. Define the re-test blast radius as part of the deficiency record: which previously-passed scripts must be re-witnessed because the corrective action changed a shared component, a control sequence, or a firmware version. Skipping this is how a building accumulates a binder full of green checkmarks that no longer describe the system as it actually exists — and it is the same shared-component, common-cause logic that the reliability model treats explicitly in Chapter 12.5.
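If each script records the components it depends on, the blast radius is a set intersection. The sketch below assumes such a mapping exists; the script IDs and component names are hypothetical.

```python
from typing import Dict, Set


def retest_blast_radius(
    script_components: Dict[str, Set[str]],
    changed_components: Set[str],
) -> Set[str]:
    """Scripts that depend on any component the corrective action changed."""
    return {
        script_id
        for script_id, components in script_components.items()
        if components & changed_components
    }


# Hypothetical dependency record for three previously passed scripts.
dependencies = {
    "EL-UPS-014": {"UPS-A firmware", "ATS-1 logic"},
    "EL-GEN-003": {"GEN-1 controller", "ATS-1 logic"},
    "ME-CDU-021": {"CDU-3 setpoints"},
}

to_rewitness = retest_blast_radius(dependencies, {"ATS-1 logic"})
print(sorted(to_rewitness))  # ['EL-GEN-003', 'EL-UPS-014']
```

The mapping is only as good as its upkeep: a script that omits a shared component from its record will silently fall outside the radius.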

Baseline 'fingerprint' capture: acceptance as the birth of day-2

The most valuable artifact commissioning produces is one that has no pass/fail gate at all: the baseline fingerprint. At the moment a system is accepted, it is in its known-good state — clean filters, balanced flows, calibrated sensors, fresh firmware, characterised transients. Capture that state quantitatively and you have given day-2 operations a reference against which all future drift is measured. Skip it and the first time anyone characterises the plant is the day it misbehaves, with nothing to compare against.

A useful fingerprint is multi-domain and time-stamped: the captured transient waveforms from every transfer test; the as-accepted pump/fan curves and flow balance; per-rack and per-branch coolant flow and delta-T at known load; thermal images of every switchgear connection and busbar joint; PUE/WUE at the commissioned load point; per-node power-draw signatures and HBM/ECC baselines from burn-in; per-port BER and per-link bandwidth from the fabric; and NCCL busbw and goodput from the reference run. These are not paperwork — they are the seed data for the operational digital twin and the day-2 reliability program. Anomaly detection, predictive maintenance, and lemon-node ejection all need a 'normal' to deviate from, and acceptance is the only time you ever observe a guaranteed-normal system. → the operational twin and telemetry handoff in Chapter 14.2; the goodput baseline carried into operations in Chapter 14.1. Note the fingerprint is distinct from the design-validation digital twin of Chapter 2.7 — that one predicts behaviour pre-build; this one records measured reality at acceptance.
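Once a fingerprint exists, day-2 drift becomes a comparison against it. The sketch below uses a single percentage tolerance for every metric, which is a simplification; the metric names and values are hypothetical.

```python
from typing import Dict


def drift_report(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance_pct: float,
) -> Dict[str, float]:
    """Percent change from baseline for metrics that moved beyond tolerance."""
    report: Dict[str, float] = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # no current reading, or percent change undefined
        change = (current[name] - base) / abs(base) * 100.0
        if abs(change) > tolerance_pct:
            report[name] = round(change, 1)
    return report


# Hypothetical as-accepted values for one coolant branch, and a later reading.
baseline = {"branch_flow_lpm": 120.0, "branch_delta_t_c": 10.0}
today = {"branch_flow_lpm": 108.0, "branch_delta_t_c": 10.2}

print(drift_report(baseline, today, tolerance_pct=5.0))
# {'branch_flow_lpm': -10.0}
```

In practice each metric would carry its own tolerance, and the stored waveforms and thermal images need richer comparison than a scalar delta.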

Deep dive: writing a pass/fail gate that survives the room — a worked UPS-transfer example

Consider the single most-disputed facility script: loss-of-utility ride-through. The soft version — 'on utility loss, UPS shall support the load without interruption' — fails the moment the result is anything but obviously clean, because every term is undefined. Here is the same step as a hard gate, field-ready:

Pre-conditions: facility at 100% commissioned load via load banks; all redundancy modules in service; PQ analyser installed at the critical bus, calibration cert #__ in date, logging at ≥ 10 kHz; CxA and Owner present.
Procedure: open the utility breaker to simulate loss; observe through generator pickup and re-transfer.
Gate: critical-bus RMS voltage shall not deviate beyond ±5% nominal at any point; no zero-crossing dropout; generator shall accept load within the SOO-specified window (e.g. ≤ 10 s to stable, ≤ 100 ms initial sag); frequency within ±0.5 Hz; captured waveform attached.
Determination: pass only if every gate holds on the recorded series; a single out-of-band sample is a fail, not a 'close enough'.
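The determination rule, pass only if every sample holds, can be applied to the recorded series directly. The sketch below checks only the voltage and frequency bands from the gate; the 60 Hz nominal, the tuple layout, and the sample values are assumptions for illustration, and the dropout and generator-window checks are left out.

```python
from typing import List, Tuple

# (time_s, rms_voltage_pct_of_nominal, frequency_hz)
Sample = Tuple[float, float, float]


def evaluate_transfer(
    series: List[Sample],
    nominal_hz: float = 60.0,  # assumed; use 50.0 where applicable
    v_band_pct: float = 5.0,
    f_band_hz: float = 0.5,
) -> Tuple[bool, List[Sample]]:
    """Pass only if every sample sits inside both bands.

    Returns (passed, out_of_band_samples). An empty series fails,
    because a result with no captured data cannot be accepted.
    """
    if not series:
        return (False, [])
    violations = [
        s for s in series
        if abs(s[1] - 100.0) > v_band_pct or abs(s[2] - nominal_hz) > f_band_hz
    ]
    return (len(violations) == 0, violations)


# Hypothetical excerpt of a captured series with one out-of-band sample.
captured = [
    (0.0000, 100.1, 60.00),
    (0.0001, 97.3, 59.98),
    (0.0002, 94.6, 59.95),  # 5.4% below nominal: outside the band
    (0.0003, 98.8, 59.97),
]
passed, bad = evaluate_transfer(captured)
print(passed, bad)  # False [(0.0002, 94.6, 59.95)]
```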

Why this matters for AI factories specifically: the load bank makes this a steady, well-behaved load, so passing it proves the plant handles the easy case. The hard case — a real multi-MW GPU power swing during the same transfer — is exactly what the load bank cannot present, which is why this script's gate must be read alongside the dynamic-load realism analysis of Chapter 13.6 and the electrical transient physics of Chapter 13.3. A green checkmark here is necessary, not sufficient; the script binder must say so explicitly so no one reads facility acceptance as workload acceptance.

Deep dive: digital Cx platforms — what they fix and what they cannot

Paper-and-PDF commissioning is collapsing under the document volume of a multi-hundred-MW AI campus, and digital Cx platforms (CxPlanner, Bluerithm, ProjectSight and peers) are now standard on hyperscale builds. What they genuinely fix: a single source of truth for thousands of scripts; templated, reusable test procedures that enforce consistency across identical blocks; real-time deficiency tracking with severity, owner, and re-test linkage; automated rollup of pass/fail status to a live program dashboard; mobile field execution with photo/waveform attachment at the point of test; and auto-generated turnover packages. On a campus where the same NVL72-block script runs hundreds of times, templating alone removes a class of transcription error that paper guarantees.

What they cannot fix, and must not be mistaken for: a digital platform makes a soft gate just as fast to sign off as a hard one. The tool enforces process completeness, not measurement rigour — it will happily collect a thousand signed scripts whose gates are 'satisfactory'. The platform is a force multiplier on whatever discipline you bring to the script content; bring soft gates and you have merely digitised the dispute. The decision is therefore upstream of the tool: freeze hard gates and a severity taxonomy first, then let the platform scale their execution. AI-assisted script generation (now appearing in these platforms) sharpens the warning — a generated procedure can read fluently and still ship an unfalsifiable gate, so the human review that converts every gate to a number remains the load-bearing step.

Sequencing: how scripts interlock across the program

Scripts are not independent; they form a dependency graph, and a passing downstream script is only valid if its upstream prerequisites passed first. Electrical acceptance (L3/L4) must clear before integrated systems testing can apply real building load; cooling acceptance and the secondary-loop flush must clear before any GPU draws power into a cold plate; fabric BER and bandwidth must clear before NCCL collectives mean anything; node burn-in must clear before a reference training run is interpretable. The program is therefore a sequenced set of gates, each unlocking the next, with two deliberately overlapping seams that 13.1 flagged: mechanical-Cx ↔ GPU burn-in (the cold plates need real heat the load bank cannot give) and facility-IST ↔ first-real-workload (the only true dynamic-load emulator is a proxy training run). Treat those overlaps as a single coordinated gate with shared acceptance criteria, not as a clean hand-off, or each side will accept to its own boundary and the seam will go untested. → the staged power/load ramp that walks these gates live in Chapter 13.10; the design-basis redundancy definitions the topology scripts validate against in Chapter 0.5.
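The dependency graph described above can be written down and ordered with the standard library. The sketch below encodes only the four prerequisite relationships stated in the text, under hypothetical gate names, using `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# Each gate maps to the gates that must pass before it may run.
prerequisites: Dict[str, Set[str]] = {
    "integrated_systems_test": {"electrical_acceptance"},
    "gpu_power_on": {"cooling_acceptance", "secondary_loop_flush"},
    "nccl_collectives": {"fabric_ber_bandwidth"},
    "reference_training_run": {"node_burn_in"},
}

# One valid execution order: every prerequisite precedes its dependant.
order = list(TopologicalSorter(prerequisites).static_order())


def may_run(gate: str, passed: Set[str]) -> bool:
    """True if all of a gate's prerequisites are in the passed set."""
    return prerequisites.get(gate, set()) <= passed


print(may_run("gpu_power_on", {"cooling_acceptance"}))                          # False
print(may_run("gpu_power_on", {"cooling_acceptance", "secondary_loop_flush"}))  # True
```

A real program graph has many more edges, and the two overlapping seams would appear as gates that share acceptance criteria rather than as a simple ordering.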

This chapter is the documentation-and-gates layer beneath the governance frame of Chapter 13.1. The scripts it describes are executed by domain in the chapters that follow: electrical power acceptance in Chapter 13.3, cooling/CDU acceptance in Chapter 13.5, Level-5 IST and the dynamic-load realism gap in Chapter 13.6, network fabric in Chapter 13.7, GPU burn-in in Chapter 13.8, and cluster-scale benchmarking/goodput acceptance in Chapter 13.9. The fingerprint feeds the day-2 telemetry and twin of Chapter 14.2 and the goodput program of Chapter 14.1; the redundancy-topology gates trace to the design basis in Chapter 0.5 and the availability model in Chapter 12.5; go-live and handover land in Chapter 13.10.