Chapter 13.2
Documentation, Scripts & Acceptance Test Plans
A commissioning script is a contract written in numbers: every test either has a pre-agreed quantitative pass/fail gate and a witnessed signature, or it is theatre — and on an AI factory the only test that proves the building works is one a facility load bank physically cannot run.
What you'll decide here
- Whether each script carries a hard, pre-agreed numerical acceptance gate with a named witness and a redline-on-fail rule — or a soft 'engineer's judgement' clause that turns every disputed result into a change order.
- Where the facility acceptance boundary stops and the cluster acceptance boundary starts — and who owns the seam where a resistive load bank's heat is rejected but no real GPU transient has ever been seen.
- What instrumentation and data-acquisition basis the pass/fail gates are read against — calibrated reference instruments and a captured time-series, or the building's own BMS sensors marking their own homework.
- How deficiencies are classified (A/B/C severity) and which classes are go-live blockers versus warranty-list items — because the punch list, not the script binder, is what actually gates handover.
- Whether you capture a signed baseline 'fingerprint' of every subsystem at acceptance, so day-2 drift has a reference — or accept that the first time you characterise the plant is the day it misbehaves.
Chapter 13.1 set the governance frame — the L1–L5 ladder, the two parallel facility and IT tracks, and the governing documents (OPR/BOD/SOO) that everything traces back to. This chapter is the layer where governance becomes execution: the actual documents the commissioning agent writes, the scripts the field runs, and the gates that decide whether a system passes. It is the least glamorous chapter in Part 13 and the one that most reliably separates a building that goes live on schedule from one that slips a quarter while two firms argue about whether a 14-second generator transfer was a pass.
The question that recurs at every level is the same: did you write a quantitative, pre-agreed acceptance gate, or did you leave it soft? A script that says 'UPS shall transfer to battery without dropping the load' is unfalsifiable — what is 'the load' under no real load, and what voltage sag counts as a 'drop'? A script that says 'on loss of utility, bus voltage shall not sag below 90% nominal for more than 10 ms, verified against a calibrated power-quality analyser logging at ≥ 10 kHz, witnessed by Owner and CxA' is a contract. The difference is not pedantry. It is the difference between a deficiency you can force the contractor to fix on their dime and a 'disagreement' that becomes a change order on yours. Commissioning is broadly cited at 1–3% of total project cost; the rework and downtime it prevents are worth multiples of that, but only if the gates are hard enough to enforce. → Chapter 13.1.
Anatomy of a commissioning script
A commissioning script (variously a test procedure, ATP step, or functional-performance test) is not a checklist. A checklist asks 'did you do X?'; a script asks 'when you do X, does the measured result fall inside the gate?' Every well-formed script has the same eight fields, and the absence of any one of them is where disputes are born:
- Unique ID and traceability — back to a specific OPR/BOD requirement and a SOO step, so a passing script proves a design intent, not just an action.
- Pre-conditions — the exact system state, isolations, and safety lockouts that must hold before the step runs (the field's single most common shortcut, and the one that injures people).
- Procedure — the numbered actions, written so a competent technician who has never seen the plant can execute them identically.
- Expected result with a quantitative gate — a number, a tolerance band, and a unit. 'Within spec' is not a gate; 'sag duration ≤ 10 ms, with no excursion beyond ±5% of nominal' is.
- Instrumentation and DAQ basis — which calibrated instrument reads the result, its accuracy class, and its in-date calibration certificate.
- Actual result — the recorded measurement, with a timestamp and the captured waveform/trend reference, not a tick.
- Pass / Fail / Deferred determination — decided against the gate, not against an opinion in the room.
- Witness signatures — CxA, Owner's representative, and the responsible contractor, each signing that they observed the result, not that they trust it.
The non-negotiable property is that the gate is agreed and frozen before the test is run. The instant you negotiate the acceptance threshold while looking at a failing result, you have lost the leverage commissioning exists to give you. Freeze the gates at script-review (a formal owner sign-off of the procedures, weeks before energization); after that they are a contract, not a draft.
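The eight fields and the freeze rule can be sketched as a small record type. This is a minimal illustration, not a schema from any Cx platform: the class names `Gate` and `Script`, the field names, and the choice to refuse a determination on an unfrozen gate are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Gate:
    """A quantitative acceptance gate: a bounded number plus a unit."""
    quantity: str
    unit: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_quantitative(self) -> bool:
        # A gate needs a unit and at least one numeric bound.
        return bool(self.unit) and (self.lower is not None or self.upper is not None)

    def holds(self, measured: float) -> bool:
        if self.lower is not None and measured < self.lower:
            return False
        if self.upper is not None and measured > self.upper:
            return False
        return True


@dataclass
class Script:
    script_id: str                      # unique ID
    traces_to: List[str]                # OPR/BOD requirement and SOO step
    preconditions: List[str]
    procedure: List[str]
    gate: Gate                          # expected result
    instrument: str                     # instrument plus calibration cert reference
    actual: Optional[float] = None      # recorded measurement
    witnesses: List[str] = field(default_factory=list)
    frozen: bool = False                # set at script review, before the test runs

    def determine(self) -> str:
        """Pass / Fail / Deferred, decided against the gate only."""
        if not self.frozen or not self.gate.is_quantitative():
            raise ValueError("gate must be quantitative and frozen before determination")
        if self.actual is None:
            return "DEFERRED"
        return "PASS" if self.gate.holds(self.actual) else "FAIL"
```

A script whose gate has no unit or no bound raises instead of returning a verdict, which mirrors the point that a soft gate cannot produce a defensible pass.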
Facility ATP vs cluster ATP: two acceptance boundaries, one seam
An AI factory has two acceptance regimes that meet at a seam, and most program risk lives in that seam. The facility ATP (the L3/L4 mechanical/electrical track) proves the building: switchgear, generators, UPS/BESS, chillers, CDUs, pumps, BMS, under load-bank load, culminating in L5 integrated systems testing. The cluster ATP / SAT (the IT track) proves the machine: node burn-in, fabric BER and bandwidth, NCCL collectives, storage throughput, scheduler behaviour, and a reference workload — accepted against goodput, not against load-bank kilowatts. They run on different schedules, against different standards, witnessed by different parties, and they are not interchangeable. → facility electrical in Chapter 13.3, cooling in Chapter 13.5, IST in Chapter 13.6; cluster burn-in in Chapter 13.8, fabric in Chapter 13.7, benchmarking in Chapter 13.9.
The seam exists because of a physics gap that is canonical to AI commissioning: the facility ATP exercises the building with a load the real workload does not resemble. A resistive load bank draws a flat, unity-power-factor, thermally-steady load that rejects its heat to air. A synchronous training job draws a spiky, microsecond-scale, multi-megawatt-swinging load that rejects its heat into cold plates and a liquid loop. The facility ATP can prove the power chain holds a steady 100 MW and the cooling plant rejects it; it cannot prove the UPS/BESS rides through a real GPU power transient, and it cannot push realistic transient heat-flux through a CDU's worst-case branch — because the load bank's heat never enters the liquid loop at all. That gap is the reason the SAT and a proxy training run are not optional 'IT validation' tacked on at the end; they are the only tests that exercise the realistic load. The dynamic-load realism problem is the canonical subject of Chapter 13.6; the cooling load-realism limit is engineered in Chapter 13.5.
| Dimension | Facility ATP (L3–L5) | Cluster ATP / SAT (IT track) |
|---|---|---|
| Object under test | Power, cooling, BMS — the building | Nodes, fabric, storage, scheduler — the machine |
| Applied load | Resistive/reactive/AI-emulating load banks | Real GPUs running burn-in + reference workload |
| Heat path exercised | Rejected to air (load banks); liquid loop only partly | Real heat into cold plates and the full liquid loop |
| Primary acceptance metric | kW held, °C delta-T, transfer ms, leak-free hold | BER, busbw (GB/s), goodput %, SDC count, FIO IOPS |
| Governing standards | ASHRAE Guideline 0 / DC Cx guideline, Uptime, BICSI 002 | Vendor RA, ClusterMAX-class criteria, NCCL/MLPerf |
| Can prove | Plant holds steady design load; redundancy topology | Hardware health, fabric integrity, real-workload goodput |
| Cannot prove | Real GPU power/thermal transients; CDU worst-case branch under real flux | Facility ride-through under utility loss (needs the plant) |
Instrumentation and data acquisition: who reads the gate
A pass/fail gate is only as trustworthy as the instrument that reads it, and the recurring sin is letting the building's own BMS grade its own homework. The facility's permanent sensors are installed for control and trending, not for metrology: a BMS temperature point may carry ±1–2 °C uncertainty and a multi-second poll interval, which is useless for accepting a delta-T gate of ±1 °C or a transfer gate of ±10 ms. Acceptance reads against calibrated reference instruments — power-quality analysers, thermal imagers, ultrasonic and Coriolis flow meters, calibrated PT/RTD references, micro-ohmmeters — each with an in-date NIST-traceable (or national-lab-traceable) calibration certificate attached to the script. A result without a certificate behind the instrument is not data; it is an anecdote.
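One way to make the "who reads the gate" check mechanical is to compare the instrument's uncertainty with the gate tolerance before the test runs. The sketch below is illustrative only: the function name, its parameters, and the 4:1 tolerance-to-uncertainty ratio are assumptions (4:1 is a commonly quoted metrology rule of thumb, not a figure from this chapter), and the project's own metrology basis should set the real threshold.

```python
from datetime import date


def instrument_adequate(gate_tolerance, instrument_uncertainty,
                        cal_expiry, test_date, min_ratio=4.0):
    """Return (ok, reasons) for reading a gate with a given instrument.

    gate_tolerance and instrument_uncertainty are half-widths in the same
    unit (e.g. degrees C). min_ratio is an assumed rule of thumb.
    """
    reasons = []
    if cal_expiry < test_date:
        reasons.append("calibration certificate expired")
    if instrument_uncertainty <= 0:
        reasons.append("instrument uncertainty must be a positive number")
    elif gate_tolerance / instrument_uncertainty < min_ratio:
        reasons.append("instrument uncertainty too large for this gate")
    return (not reasons, reasons)


# Hypothetical values: a BMS point at +/-1.5 C against a +/-1 C delta-T gate,
# and a calibrated reference RTD at +/-0.1 C against the same gate.
bms_ok, bms_reasons = instrument_adequate(1.0, 1.5, date(2026, 1, 1), date(2025, 6, 1))
rtd_ok, rtd_reasons = instrument_adequate(1.0, 0.1, date(2026, 1, 1), date(2025, 6, 1))
```

With these assumed numbers the BMS point is rejected and the reference RTD is accepted, which is the same conclusion the paragraph reaches in words.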
Two DAQ decisions distinguish a serious program. First, sample rate must out-resolve the phenomenon: the voltage sag during a UPS transfer or a generator pickup is a sub-100 ms event, so logging at hundreds of Hz to tens of kHz is required to even see the sag you are accepting against; a 1 Hz BMS trend will report 'no anomaly' through a transient that breached spec. Second, capture the full time-series, not the summary statistic: store the waveform and the trend, not just 'min 89.2%'. The captured series is what lets you adjudicate a disputed result after the fact, and it doubles as the baseline fingerprint discussed below. On AI factories this matters more than on legacy IT halls precisely because the loads are transient: the interesting failures live in the milliseconds, and a DAQ basis that cannot see milliseconds cannot accept against them. → fabric timing acceptance (PTP/IEEE-1588) as its own metrology problem in Chapter 8.7.
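The sample-rate point can be shown with a toy signal. The sketch below assumes a hypothetical bus that sits at 1.0 per-unit and sags to 0.85 per-unit for 10 ms; the timing, depth, and function names are invented for illustration and are not measurements from any plant.

```python
def bus_voltage_pu(t, sag_start=0.2503, sag_len=0.010, sag_depth=0.85):
    """Hypothetical per-unit RMS bus voltage with one 10 ms sag."""
    if sag_start <= t < sag_start + sag_len:
        return sag_depth
    return 1.0


def sampled_min(rate_hz, duration_s=2.0):
    """Minimum voltage seen by a logger sampling at rate_hz for duration_s."""
    n = int(duration_s * rate_hz)
    return min(bus_voltage_pu(i / rate_hz) for i in range(n))


bms_view = sampled_min(1)        # 1 Hz trend: samples at t = 0 s and t = 1 s
pq_view = sampled_min(10_000)    # 10 kHz logger: 100 samples inside the sag
```

Under these assumed numbers the 1 Hz trend reports a minimum of 1.0 (no anomaly) while the 10 kHz log reports 0.85, so the same event passes on one DAQ basis and fails on the other.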
Deficiency and punch-list management: the document that actually gates go-live
The binder of passed scripts is the visible deliverable; the deficiency log is the one that decides whether you go live. A serious program treats every failed or partially-passed step as a tracked deficiency with an owner, a root cause, a corrective action, a re-test reference, and a severity classification — and the severity classification is the lever. A flat punch list where a mislabelled valve sits at the same priority as a failed UPS transfer guarantees that go-live becomes a negotiation about which items 'really' matter, conducted under schedule pressure. Pre-agree the severity tiers and which tiers block.
- Class A (blocker) — a life-safety defect or a failure of a core redundancy/ride-through claim. Go-live cannot proceed until closed and re-tested. Example: failed automatic transfer to generator; an EPO that does not trip; a leak-detection interlock that does not isolate.
- Class B (conditional) — a real deficiency that does not defeat the design basis. Go-live may proceed on a documented, dated corrective-action plan with an owner. Example: a single redundant pump trending warm; a BMS alarm mis-mapped but functional.
- Class C (warranty/punch) — cosmetic or documentation-only items that roll to the warranty list. Example: missing label; as-built drawing not yet redlined.
The consequence of getting the tiering wrong cuts both ways. Tier too loosely and you carry a Class-A ride-through gap into live operation, where the first real utility loss finds it. Tier too strictly and you hold a block's go-live date hostage to a paint scratch. The discipline is to fix the tiering rules and the blocker list in writing at the same time you freeze the gates — before anyone has a result to argue about. Open Class-A items at zero, and open Class-B items trended toward zero with each one on a dated, owned corrective-action plan, are the real go-live gate, and they feed directly into the Operational Readiness review and the handover package. → handover and the Operational Readiness gate in Chapter 13.10.
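The tiering rules above reduce to a short go-live check over the deficiency log. This is a sketch under assumptions: the `Deficiency` record, its field names, and the example entries are hypothetical, and a real program would take the blocking rules from its own written taxonomy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Deficiency:
    ref: str
    severity: str                           # "A", "B", or "C"
    closed: bool = False
    retested: bool = False
    owner: Optional[str] = None
    action_plan_due: Optional[str] = None   # agreed corrective-action date


def go_live_blockers(log: List[Deficiency]) -> List[Tuple[str, str]]:
    """Return (ref, reason) for every item that blocks go-live."""
    blockers = []
    for d in log:
        if d.severity == "A" and not (d.closed and d.retested):
            blockers.append((d.ref, "Class A must be closed and re-tested"))
        elif d.severity == "B" and not d.closed and not (d.owner and d.action_plan_due):
            blockers.append((d.ref, "Class B needs an owner and a dated plan"))
        # Class C items roll to the warranty list and never block.
    return blockers


log = [
    Deficiency("D-001", "A", closed=True, retested=True),
    Deficiency("D-002", "B", owner="Mechanical contractor", action_plan_due="2025-03-01"),
    Deficiency("D-003", "C"),
    Deficiency("D-004", "A", closed=True, retested=False),
]
blockers = go_live_blockers(log)
```

Here only `D-004` blocks: it is a Class A item marked closed but never re-tested, which is exactly the case a flat punch list tends to hide.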
Baseline 'fingerprint' capture: acceptance as the birth of day-2
The most valuable artifact commissioning produces is one that has no pass/fail gate at all: the baseline fingerprint. At the moment a system is accepted, it is in its known-good state — clean filters, balanced flows, calibrated sensors, fresh firmware, characterised transients. Capture that state quantitatively and you have given day-2 operations a reference against which all future drift is measured. Skip it and the first time anyone characterises the plant is the day it misbehaves, with nothing to compare against.
A useful fingerprint is multi-domain and time-stamped: the captured transient waveforms from every transfer test; the as-accepted pump/fan curves and flow balance; per-rack and per-branch coolant flow and delta-T at known load; thermal images of every switchgear connection and busbar joint; PUE/WUE at the commissioned load point; per-node power-draw signatures and HBM/ECC baselines from burn-in; per-port BER and per-link bandwidth from the fabric; and NCCL busbw and goodput from the reference run. These are not paperwork — they are the seed data for the operational digital twin and the day-2 reliability program. Anomaly detection, predictive maintenance, and lemon-node ejection all need a 'normal' to deviate from, and acceptance is the only time you ever observe a guaranteed-normal system. → the operational twin and telemetry handoff in Chapter 14.2; the goodput baseline carried into operations in Chapter 14.1. Note the fingerprint is distinct from the design-validation digital twin of Chapter 2.7 — that one predicts behaviour pre-build; this one records measured reality at acceptance.
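As a minimal sketch of how a fingerprint serves day-2, the function below compares current readings with the as-accepted baseline and reports only the metrics that have drifted beyond a per-metric tolerance. The metric names, values, and tolerances are hypothetical; a real program would choose them per subsystem.

```python
def drift_report(baseline, current, tolerance_pct):
    """Return {metric: percent change} for metrics drifting beyond tolerance.

    Metrics with no current reading, a zero baseline, or no stated
    tolerance are skipped rather than guessed at.
    """
    report = {}
    for key, base in baseline.items():
        if key not in current or key not in tolerance_pct or base == 0:
            continue
        pct = 100.0 * (current[key] - base) / abs(base)
        if abs(pct) > tolerance_pct[key]:
            report[key] = round(pct, 2)
    return report


# Hypothetical as-accepted values for one CDU branch, and a later reading.
baseline = {"branch_flow_lpm": 120.0, "delta_t_c": 10.0}
current = {"branch_flow_lpm": 108.0, "delta_t_c": 10.2}
tolerance = {"branch_flow_lpm": 5.0, "delta_t_c": 5.0}
drift = drift_report(baseline, current, tolerance)
```

With these numbers the branch flow is flagged at -10% while the 2% delta-T change stays inside tolerance. Without the accepted baseline there would be no reference to compute either figure against.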
Deep dive: writing a pass/fail gate that survives the room — a worked UPS-transfer example
Consider the single most-disputed facility script: loss-of-utility ride-through. The soft version — 'on utility loss, UPS shall support the load without interruption' — fails the moment the result is anything but obviously clean, because every term is undefined. Here is the same step as a hard gate, field-ready:
- Pre-conditions: facility at 100% commissioned load via load banks; all redundancy modules in service; PQ analyser installed at the critical bus, calibration cert #__ in date, logging at ≥ 10 kHz; CxA and Owner present.
- Procedure: open the utility breaker to simulate loss; observe through generator pickup and re-transfer.
- Gate: critical-bus RMS voltage shall not deviate beyond ±5% nominal at any point; no zero-crossing dropout; generator shall accept load within the SOO-specified window (e.g. ≤ 10 s to stable, ≤ 100 ms initial sag); frequency within ±0.5 Hz; captured waveform attached.
- Determination: pass only if every gate holds on the recorded series; a single out-of-band sample is a fail, not a 'close enough'.
Why this matters for AI factories specifically: the load bank makes this a steady, well-behaved load, so passing it proves the plant handles the easy case. The hard case — a real multi-MW GPU power swing during the same transfer — is exactly what the load bank cannot present, which is why this script's gate must be read alongside the dynamic-load realism analysis of Chapter 13.6 and the electrical transient physics of Chapter 13.3. A green checkmark here is necessary, not sufficient; the script binder must say so explicitly so no one reads facility acceptance as workload acceptance.
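The determination rule of the worked example, where a single out-of-band sample is a fail, can be applied to the recorded series mechanically. The sketch below covers only the ±5% voltage band and the ±0.5 Hz frequency band; the zero-crossing and generator-window checks are left out. The function name, the tuple layout of a sample, and the 480 V / 60 Hz nominal values in the usage lines are assumptions for illustration.

```python
def evaluate_transfer(samples, v_nominal, f_nominal, v_band=0.05, f_band_hz=0.5):
    """Judge a recorded transfer against the voltage and frequency gates.

    samples: list of (t_seconds, v_rms, freq_hz) from the PQ analyser log.
    Returns (verdict, violations); any single out-of-band sample fails.
    """
    if not samples:
        raise ValueError("no recorded series: there is nothing to accept against")
    violations = []
    for t, v_rms, freq in samples:
        if abs(v_rms - v_nominal) > v_band * v_nominal:
            violations.append((t, "voltage", v_rms))
        if abs(freq - f_nominal) > f_band_hz:
            violations.append((t, "frequency", freq))
    return ("PASS" if not violations else "FAIL", violations)


# Hypothetical excerpt of a log on a 480 V, 60 Hz bus.
clean = [(0.000, 480.0, 60.0), (0.010, 462.0, 59.8), (0.020, 478.0, 60.1)]
one_bad = clean + [(0.030, 455.0, 60.0)]   # 455 V is more than 5% below 480 V
verdict_clean, _ = evaluate_transfer(clean, 480.0, 60.0)
verdict_bad, bad_points = evaluate_transfer(one_bad, 480.0, 60.0)
```

Returning the offending timestamps alongside the verdict keeps the argument in the room about specific samples on the captured waveform rather than about whether the result was 'close enough'.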
Deep dive: digital Cx platforms — what they fix and what they cannot
Paper-and-PDF commissioning is collapsing under the document volume of a multi-hundred-MW AI campus, and digital Cx platforms (CxPlanner, Bluerithm, ProjectSight and peers) are now standard on hyperscale builds. What they genuinely fix: a single source of truth for thousands of scripts; templated, reusable test procedures that enforce consistency across identical blocks; real-time deficiency tracking with severity, owner, and re-test linkage; automated rollup of pass/fail status to a live program dashboard; mobile field execution with photo/waveform attachment at the point of test; and auto-generated turnover packages. On a campus where the same NVL72-block script runs hundreds of times, templating alone removes a class of transcription error that paper guarantees.
What they cannot fix, and must not be mistaken for: a digital platform makes a soft gate just as fast to sign off as a hard one. The tool enforces process completeness, not measurement rigour — it will happily collect a thousand signed scripts whose gates are 'satisfactory'. The platform is a force multiplier on whatever discipline you bring to the script content; bring soft gates and you have merely digitised the dispute. The decision is therefore upstream of the tool: freeze hard gates and a severity taxonomy first, then let the platform scale their execution. AI-assisted script generation (now appearing in these platforms) sharpens the warning — a generated procedure can read fluently and still ship an unfalsifiable gate, so the human review that converts every gate to a number remains the load-bearing step.
Sequencing: how scripts interlock across the program
Scripts are not independent; they form a dependency graph, and a passing downstream script is only valid if its upstream prerequisites passed first. Electrical acceptance (L3/L4) must clear before integrated systems testing can apply real building load; cooling acceptance and the secondary-loop flush must clear before any GPU draws power into a cold plate; fabric BER and bandwidth must clear before NCCL collectives mean anything; node burn-in must clear before a reference training run is interpretable. The program is therefore a sequenced set of gates, each unlocking the next, with two deliberately overlapping seams that Chapter 13.1 flagged: mechanical-Cx ↔ GPU burn-in (the cold plates need real heat the load bank cannot give) and facility-IST ↔ first-real-workload (the only true dynamic-load emulator is a proxy training run). Treat those overlaps as a single coordinated gate with shared acceptance criteria, not as a clean hand-off, or each side will accept to its own boundary and the seam will go untested. → the staged power/load ramp that walks these gates live in Chapter 13.10; the design-basis redundancy definitions the topology scripts validate against in Chapter 0.5.
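The dependency rule can be sketched with the standard-library `graphlib` module (Python 3.9+). The gate names below are shorthand invented for this example, and the graph holds only the four prerequisite relations named in the paragraph above; a real program graph would be far larger.

```python
from graphlib import TopologicalSorter

# Each gate maps to the upstream gates that must pass first.
PREREQS = {
    "integrated_systems_test": {"electrical_acceptance"},
    "gpu_power_on": {"cooling_acceptance", "secondary_loop_flush"},
    "nccl_collectives": {"fabric_ber_bandwidth"},
    "reference_training_run": {"node_burn_in"},
}


def valid_passes(prereqs, passed):
    """Return the passed gates whose prerequisites are themselves valid."""
    valid = set()
    # static_order() yields every gate after all of its prerequisites.
    for gate in TopologicalSorter(prereqs).static_order():
        if gate in passed and all(p in valid for p in prereqs.get(gate, ())):
            valid.add(gate)
    return valid


# Hypothetical status: NCCL was run and "passed" before the fabric cleared.
passed = {"electrical_acceptance", "integrated_systems_test", "nccl_collectives"}
valid = valid_passes(PREREQS, passed)
```

In this example the NCCL result is dropped from the valid set because fabric BER and bandwidth never cleared, while the integrated systems test stands because electrical acceptance did.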