Chapter 0.3
Vocabulary, Mental Models & the Metric Stack
If you cannot name the unit you are counting and the metric you are optimizing, you will misprice the build — so before any engineering, fix a shared vocabulary, three mental models, and one metric stack that everyone in the project reads the same way.
What you'll decide here
- Which power number anchors every contract and capacity claim — critical IT power, total facility power, or grid draw — because conflating them mis-sizes the interconnect, the switchgear, and the lease by 15-50%.
- Which capacity unit you scope, budget, and order against — the rack, the NVLink/scale-up domain, the scalable unit (SU), the pod, or the gigawatt campus — and the blast radius each implies.
- Which single number the facility is being optimized for: facility efficiency (PUE/WUE), useful work (MFU/goodput), or unit economics ($/GPU-hr, $/M-tokens) — they pull in different directions.
- How you will write and read availability claims — the nines, their downtime budgets, and serial-vs-parallel composition — so a vendor's '99.99%' cannot quietly mean four different things.
- Which of the three network tiers (scale-up, scale-out, scale-across) a given bandwidth, latency, or oversubscription claim refers to — because the same word means radically different physics at each tier.
This guide is a sequence of decisions and their consequences. But a decision you cannot state precisely is a decision you cannot make. The recurring failure mode in AI-data-center projects is not bad engineering judgment — it is two parties using the same word to mean different quantities, discovering the gap only after a contract is signed or a slab is poured. A developer quotes "100 MW" and means grid draw; the tenant hears critical IT power and sizes a cluster a third too large. A neocloud advertises "99.99% uptime" and means a single node's hardware availability; the customer assumed it described the training job's goodput. A vendor cites "1.8 TB/s" of bandwidth without saying it is the scale-up domain, and a network architect budgets the scale-out fabric against it. None of these are lies. They are vocabulary collisions, and each one costs real money downstream.
So this chapter does something narrow and load-bearing: it fixes the shared language. We pin down the power vocabulary (the three power numbers that are routinely conflated), the capacity units (from the rack up to the gigawatt campus), the metric stack (efficiency, useful-work, and unit-economics families and how they trade off), the availability algebra (the nines, their downtime budgets, and how components compose), and the three-tier network hierarchy (scale-up vs scale-out vs scale-across). This is an orientation chapter, not the canonical home for any of these — each metric's rigorous definition, measurement plan, and gotchas live in their discipline chapters, and we point you there. The job here is to make sure that when Part 3 says "facility power" or Part 8 says "oversubscription" or Part 12 says "goodput," you already read the word the same way the author wrote it.
Power vocabulary: the three numbers everyone confuses
Start with power, because power is the binding constraint of the 2026 era and the most frequently mis-stated number in the building. There are three distinct power figures along the chain, and a project that does not keep them separate will mis-size something expensive.
Critical IT power is the power delivered to the compute itself — the racks, after the last PDU, the number the workload actually consumes. Total facility power adds everything required to keep that IT alive: cooling plant, pumps, fans, lighting, losses in the UPS and transformers, the whole non-IT overhead. The ratio between them is PUE — total facility power divided by critical IT power. Grid draw (or utility demand) is what the meter sees and what the interconnection agreement is written against; it tracks total facility power but is what you contract for, in MW, with the utility. A 100 MW critical-IT design at PUE 1.2 needs ~120 MW of facility power and an interconnection sized above that for headroom and inrush. Quote the wrong one of these in a lease or a power-purchase term sheet and the error propagates into switchgear ratings, transformer orders, and the size of the queue slot you reserve.
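The chain from critical IT power to grid draw is simple arithmetic, and writing it down makes it harder to quote the wrong number. The sketch below is illustrative only: the function names are hypothetical, and the 10% headroom in the usage lines is an assumed placeholder, not a recommendation from this guide.

```python
def facility_power_mw(critical_it_mw: float, pue: float) -> float:
    """Total facility power implied by a critical IT load at a given PUE."""
    if pue < 1.0:
        raise ValueError("PUE is total/IT power, so it cannot be below 1.0")
    return critical_it_mw * pue


def interconnect_request_mw(critical_it_mw: float, pue: float,
                            headroom_fraction: float) -> float:
    """Grid capacity to request: facility power plus an assumed headroom margin."""
    return facility_power_mw(critical_it_mw, pue) * (1.0 + headroom_fraction)


def critical_it_from_grid_mw(grid_mw: float, pue: float,
                             headroom_fraction: float = 0.0) -> float:
    """Inverse direction: the critical IT load a given grid allocation can carry."""
    if pue < 1.0:
        raise ValueError("PUE is total/IT power, so it cannot be below 1.0")
    return grid_mw / (pue * (1.0 + headroom_fraction))


if __name__ == "__main__":
    # The worked example from the text: 100 MW critical IT at PUE 1.2.
    print(facility_power_mw(100.0, 1.2))
    # Assumed 10% headroom, purely for illustration.
    print(interconnect_request_mw(100.0, 1.2, headroom_fraction=0.10))
    # A "100 MW" quote that turns out to be grid draw supports less IT load.
    print(critical_it_from_grid_mw(100.0, 1.2))
```

The inverse function is the one that catches the collision described earlier: a "100 MW" figure read as grid draw at PUE 1.2 supports roughly 83 MW of critical IT, not 100 MW.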
Two further distinctions trip people up. MW vs MVA vs kVA: real power (MW) is what does work and what you pay for in energy; apparent power (MVA/kVA) is the vector sum of real and reactive power, and it is what transformers, switchgear, and generators are rated against. At a power factor of 0.9, a 100 MW load is ~111 MVA of apparent power, so equipment whose MVA rating merely equals the load's MW figure is undersized: the rating has to be the MW load divided by the power factor, ~11% higher here. AI loads, with their large rectifier front ends and sharp synchronous transients, make power factor and the MW/MVA gap a live design concern, not a textbook footnote. → Chapter 4.1. And TDP vs EDPp: a GPU's thermal design power (TDP) is the per-chip steady-state envelope (~700 W for H100, ~1.0-1.2 kW for a Blackwell GPU), while EDPp (electrical design point, peak) is the short-lived electrical peak above that envelope. The figure that actually sizes your power chain is the per-rack and per-cluster draw under real workloads, including transients that can swing tens of percent on synchronous all-reduce boundaries. Design to nameplate TDP summed naively and you will both over-provision average capacity and under-provision for the transient. → Chapter 6.6.
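The MW/MVA gap described above follows directly from the power triangle. A minimal sketch, assuming a steady sinusoidal load characterized by a single power factor (real AI loads with harmonics and transients are messier); the function names are hypothetical.

```python
import math


def apparent_power_mva(real_mw: float, power_factor: float) -> float:
    """Apparent power S = P / PF, the figure equipment is rated against."""
    if not 0.0 < power_factor <= 1.0:
        raise ValueError("power factor must be in (0, 1]")
    return real_mw / power_factor


def reactive_power_mvar(real_mw: float, power_factor: float) -> float:
    """Reactive power Q from the power triangle: S^2 = P^2 + Q^2."""
    s = apparent_power_mva(real_mw, power_factor)
    # max() guards against a tiny negative value from floating-point rounding.
    return math.sqrt(max(0.0, s * s - real_mw * real_mw))


if __name__ == "__main__":
    # The example from the text: a 100 MW load at a power factor of 0.9.
    print(round(apparent_power_mva(100.0, 0.9), 1))   # about 111 MVA
    print(round(reactive_power_mvar(100.0, 0.9), 1))  # reactive component, MVAr
```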
Capacity units: from the rack to the gigawatt campus
The second piece of shared vocabulary is the unit of capacity you scope, budget, order, and reason about failures in. AI facilities are built and discussed at five nested scales, and choosing the wrong granularity for a given decision is itself a mistake — you procure in scalable units, you reason about blast radius in fault domains, and the two do not always align.
The rack is the atomic power-and-cooling unit: a GB200 NVL72 rack draws ~120-132 kW, weighs ~1.36 t wet, and is the thing the slab, the busway, and the cooling manifold are sized against. The NVLink / scale-up domain is the set of accelerators that share a single coherent memory fabric and behave, for the workload, like one large GPU — 8 in an HGX node, 72 in an NVL72, heading to 576 in the announced Rubin Ultra Kyber generation. This is the most important unit for the software, because tensor- and expert-parallelism must fit inside it to run at full bandwidth. The scalable unit (SU) is the repeatable building block of the cluster — a fixed bundle of racks, leaf switches, cooling, and power that the design replicates N times to scale; it is the unit you actually procure and commission, and the unit a reference architecture (DGX SuperPOD and its peers) is specified in. The pod is a deployment block of multiple SUs sharing a spine layer and often a fault domain. The gigawatt campus is the 2026 frontier unit of strategic conversation — multi-building sites in the 1 GW-plus class that are now the headline of hyperscaler announcements and the scale at which power, water, and grid impact become regional questions. → reference architectures in Chapter 0.4; the SU as a budgeting artifact in Chapter 1.7; fault domains in Chapter 12.1.
| Unit | Scale (2026 reference) | Bound by | Reason about it for |
|---|---|---|---|
| Rack | 1 rack; ~120-132 kW (NVL72), heading to ~600 kW (Kyber, 2027) | Slab loading, busway, cooling manifold, one CDU's reach | Power & cooling density; floor loading |
| NVLink / scale-up domain | 8 (HGX) - 72 (NVL72) - 576 (Kyber) GPUs, one coherent memory fabric | Copper/NVLink reach; switch radix | Whether tensor-/expert-parallelism fits at full bandwidth |
| Scalable unit (SU) | Repeatable bundle: racks + leaf switches + power + cooling | The reference architecture's replication block | Procurement, commissioning, the capacity ramp |
| Pod | Multiple SUs sharing a spine layer | Spine radix; a shared fault domain | Scale-out topology and blast radius |
| Gigawatt campus | Multi-building site, 1 GW-plus critical load | Grid interconnection; regional water/power | Siting strategy, grid impact, financing |
The metric stack at a glance
There is no single "efficiency" number for an AI factory, and pretending there is leads to vanity metrics. There are three families, each answering a different question, and a well-run facility reports from all three because optimizing one in isolation degrades the others. The canonical definitions, measurement plans, and gotchas for every metric below live in Chapter 15.1 — here we only orient you to what each family measures and where they conflict.
Facility-efficiency metrics ask: how much of the power and water going in is overhead rather than compute? PUE (total facility power / critical IT power) is the incumbent and remains stuck at an industry-weighted ~1.54 for the sixth straight year (Uptime Institute Global Survey, 2025), though best-in-class liquid-cooled designs reach 1.05-1.15. WUE (litres of water per kWh of IT) ranges from an industry ~1.8-1.9 L/kWh down to ~0 for closed-loop designs. ERF/ERE credit reused heat, REF credits renewable energy, and CUE measures carbon per unit of IT energy. The trap: PUE and WUE trade against each other — evaporative cooling buys a better PUE by spending water (worse WUE), so a facility chasing a headline PUE can quietly become a water glutton. → Chapter 15.4.
Useful-work metrics ask the question PUE cannot: of the energy that did reach the chips, how much produced actual progress? MFU (model FLOPs utilization) and MBU (model bandwidth utilization) measure how close a workload runs to the hardware's theoretical compute and memory-bandwidth ceilings. Goodput — the fraction of wall-clock time spent on useful, non-wasted computation — is the metric that actually governs training economics; industry-average goodput sits near ~90%, best-in-class near ~96% (SemiAnalysis ClusterMAX; CoreWeave). ETTR (effective training time ratio) and its cousin, the time lost to failures and recovery, sit underneath goodput. A facility can post a beautiful 1.08 PUE and still waste a quarter of its energy on a fabric that starves the all-reduce or a checkpoint cadence that loses hours per failure — PUE would never show it. This is the GOODPUT thread of the guide, and it is why facility efficiency is necessary but never sufficient. → Chapter 12.2.
Unit-economics metrics ask: what does the work cost? Tokens-per-joule (and its inverse, joules-per-token) is the rising efficiency metric for inference, because it captures the whole stack — chip, fabric, cooling, software — in one number tied to the product. $/GPU-hr (~$0.74 self-operated TCO at scale; ~$1.03-3.50+ rented) is the supply-side cost; $/M-tokens (~$1.90 self-hosted for a 70B model, market average ~$2.50) is the demand-side price. These are the numbers the business actually lives on, and they integrate everything above: a facility with great PUE and poor goodput, or great goodput and a bad power contract, shows up here as a bad $/M-tokens. → Chapter 1.8.
| Family | Question it answers | Headline metrics | 2026 reference band | Optimizing this alone breaks |
|---|---|---|---|---|
| Facility efficiency | How much input is overhead, not compute? | PUE, WUE, ERF/ERE, REF, CUE | PUE ~1.54 avg / 1.05-1.15 best; WUE ~0-1.9 L/kWh | WUE (evaporative cooling) and capex (chasing 1.0x) |
| Useful work | Of energy that reached the chips, how much did real work? | MFU, MBU, goodput, ETTR | Goodput ~90% avg / ~96% best; MFU workload-dependent | Cost (over-building fabric/redundancy for marginal goodput) |
| Unit economics | What does the work cost or sell for? | tokens/joule, $/GPU-hr, $/M-tokens | ~$0.74-3.50 /GPU-hr; ~$1.90-2.50 /M-tokens | Quality/SLO (cutting cost by degrading latency or accuracy) |
Availability algebra: the nines and how they compose
Availability is the most abused number in infrastructure marketing, because the algebra behind it is rarely shown. Get the vocabulary straight here and you can read any "number of nines" claim for what it actually promises.
Availability is A = MTBF / (MTBF + MTTR) — mean time between failures over the sum of MTBF and mean time to repair. The headline form is "the nines": 99.9% (three nines) is ~8.8 hours of downtime per year; 99.99% (four nines) is ~52 minutes; 99.999% (five nines) is ~5.3 minutes. Uptime Tier III maps to ~99.982% (~1.6 hr/yr) and Tier IV to ~99.995% (~26 min/yr), though Uptime Institute itself no longer formally endorses specific percentages. The two levers are independent: you raise availability either by failing less often (higher MTBF) or by recovering faster (lower MTTR) — and in AI clusters, where hardware fails constantly, MTTR is usually the cheaper lever, which is why fast checkpoint-restart and lemon-node detection buy more goodput than chasing component reliability.
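The downtime budgets quoted above can be reproduced in a few lines, which is a useful habit when a claim arrives as a bare percentage. A small sketch, assuming an 8,760-hour (non-leap) year; the function names and the MTBF/MTTR values in the usage lines are hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Annual downtime budget implied by an availability fraction."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - avail) * HOURS_PER_YEAR * 60.0


if __name__ == "__main__":
    for label, a in [("three nines", 0.999), ("four nines", 0.9999),
                     ("five nines", 0.99999), ("Tier III", 0.99982),
                     ("Tier IV", 0.99995)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} min/yr")

    # The two levers, with made-up numbers: halving MTTR buys almost exactly
    # what doubling MTBF does.
    print(availability(1000.0, 1.0))   # baseline
    print(availability(1000.0, 0.5))   # recover twice as fast
    print(availability(2000.0, 1.0))   # fail half as often
```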
The part that matters for design is composition. Components in series (each one a single point of failure for the path) multiply: a path of ten 99.9% components is 0.999^10 = ~99.0%, not 99.9% — series composition always makes the whole worse than any part. Components in parallel (redundant, where one surviving path suffices) multiply their unavailabilities: two parallel 99% paths give 1 - (0.01 x 0.01) = 99.99%. This is the entire mathematical case for N+1 and 2N redundancy, and it is also the trap: redundancy only helps if the parallel paths are genuinely independent. A shared controller, a shared cooling loop, or a common-mode failure collapses two "parallel" paths back into series, and the four nines you paid for silently become three. → the redundancy vocabulary (N, N+1, 2N, block- and distributed-redundant, catcher topologies) is defined in Chapter 0.5; the quantitative RBD/Markov/Monte-Carlo machinery is built out in Chapter 12.5.
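The series and parallel rules above, and the common-mode trap, can be checked numerically. This is a sketch under the same independence assumption the algebra makes; the 99.9% shared controller in the last example is a hypothetical value chosen for illustration, and the function names are not from any library.

```python
from math import prod
from typing import Iterable


def series(availabilities: Iterable[float]) -> float:
    """Every component must work: availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities: Iterable[float]) -> float:
    """One surviving path suffices: unavailabilities multiply.

    Valid only if the paths fail independently.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Ten 99.9% components in series: about 99.0%.
    print(series([0.999] * 10))

    # Two independent 99% paths in parallel: 99.99%.
    redundant = parallel([0.99, 0.99])
    print(redundant)

    # Same two paths behind one shared controller (assumed 99.9%):
    # the shared element sits in series and caps the result below 99.9%.
    print(series([0.999, redundant]))
```

The last line is the point of the exercise: the redundant pair is only as good as whatever both paths depend on.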
The three-tier network hierarchy
The last piece of shared vocabulary is the network, because the same words — bandwidth, latency, oversubscription — mean different physics at each of three tiers, and conflating them is how a fabric gets mis-budgeted by an order of magnitude.
Scale-up is the fabric inside a single coherent domain (NVLink and its peers), connecting the GPUs that act as one large accelerator. It is the fastest, shortest-reach, highest-cost-per-bit tier: ~1.8 TB/s per GPU on NVLink5 (a bidirectional figure), ~130 TB/s aggregate inside an NVL72 rack, mostly copper because the reach is sub-metre. Tensor- and expert-parallelism live here. Scale-out is the back-end fabric between domains (InfiniBand or Ethernet/RoCE), connecting racks and pods into a cluster: ~400 Gb/s per NIC today, roughly 18x less per-GPU bandwidth than scale-up when both are counted per direction (~900 GB/s against ~50 GB/s), which is exactly why the software tries to keep tight parallelism inside the scale-up domain and spend the scale-out fabric only on data- and pipeline-parallel traffic. This is the tier where oversubscription is a real choice: training demands 1:1 non-blocking, inference tolerates 2:1-3:1 (cutting back-end cost ~31%). Scale-across is the newest tier: connecting multiple buildings or campuses for distributed training when no single site has the power, over long-haul DCI spanning tens to hundreds of kilometres, where latency and the speed of light, not bandwidth, become the binding constraint. A claim about "oversubscription" or "latency" is meaningless until you say which tier it describes. → scale-up in Chapter 8.2; scale-out topology and oversubscription in Chapter 8.5; scale-across multi-campus fabric in Chapter 8.8.
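Because scale-up figures are usually quoted in bytes per second and scale-out figures in bits per second, the comparison is easy to get wrong by a factor of 8 or of 2. The sketch below normalizes the numbers in this section; it assumes the 1.8 TB/s NVLink5 figure is bidirectional and the 400 Gb/s NIC figure is per direction, and it assumes one NIC per GPU. The 32-down/16-up leaf switch in the usage lines is a hypothetical port split.

```python
def tbytes_to_gbits(tbytes_per_s: float) -> float:
    """Convert TB/s to Gb/s (decimal units: 1 TB = 1000 GB, 1 byte = 8 bits)."""
    return tbytes_per_s * 1000.0 * 8.0


def rack_aggregate_tbytes(gpus: int, tbytes_per_gpu: float) -> float:
    """Aggregate scale-up bandwidth across all GPUs in one domain."""
    return gpus * tbytes_per_gpu


def scale_up_vs_scale_out(nvlink_tbytes_per_gpu: float, nic_gbits: float,
                          per_direction: bool) -> float:
    """Ratio of per-GPU scale-up to per-GPU scale-out bandwidth.

    per_direction=True halves the (assumed bidirectional) NVLink figure so it
    is compared like-for-like with a per-direction NIC rate.
    """
    nvlink_gbits = tbytes_to_gbits(nvlink_tbytes_per_gpu)
    if per_direction:
        nvlink_gbits /= 2.0
    return nvlink_gbits / nic_gbits


def oversubscription(downlink_gbits: float, uplink_gbits: float) -> float:
    """Leaf oversubscription: host-facing bandwidth over spine-facing bandwidth."""
    return downlink_gbits / uplink_gbits


if __name__ == "__main__":
    print(rack_aggregate_tbytes(72, 1.8))                       # NVL72 aggregate, TB/s
    print(scale_up_vs_scale_out(1.8, 400.0, per_direction=True))
    print(scale_up_vs_scale_out(1.8, 400.0, per_direction=False))
    # Hypothetical leaf: 32 x 400G down, 16 x 400G up.
    print(oversubscription(32 * 400.0, 16 * 400.0))
```

Compared per direction the ratio is about 18x; taken against the bidirectional headline it is about 36x, which is why the qualifier matters.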
Deep dive: why the metric stack must be read together, never alone
The temptation is to pick one headline number and manage to it. Every era of this industry has a favourite: enterprise IT chased PUE, neoclouds chase $/GPU-hr, training labs chase MFU. Each, optimized alone, breaks something one tier away.
Optimize PUE alone and you reach for evaporative cooling — which buys a 1.1 PUE by spending water, wrecking WUE in a water-stressed region, and does nothing for the goodput being lost to a starved fabric. Optimize goodput alone and you over-build: a 1:1 non-blocking fabric and 2N power for a workload that checkpoints anyway, buying the last 2% of effective time at a cost that would have returned more as additional GPUs. Optimize $/GPU-hr alone and you cut the redundancy, the health-checking, and the spare capacity that goodput depends on — your supply cost looks great while your delivered cost ($/M-tokens, which integrates goodput) quietly rises. The three families are a system: PUE bounds how much energy reaches the chips, useful-work metrics bound how much of that energy does real work, and unit economics integrate both against price. A defensible operating plan reports one number from each family and watches the conflicts between them. The honest measurement plan that avoids vanity numbers is built in Chapter 15.1; the economics that integrate them in Chapter 1.8.
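One way to see the three families as a system is to chain them in a few lines. The sketch below is a deliberate simplification, not the measurement plan of Chapter 15.1: it assumes power draw is roughly constant over wall-clock time, so that goodput (a time fraction) can stand in for the fraction of chip energy doing useful work, and it ignores MFU entirely. The function names are hypothetical; the input values are the reference figures quoted in this chapter.

```python
def useful_energy_fraction(pue: float, goodput: float) -> float:
    """Share of facility energy that ends up in useful computation.

    1/PUE is the share that reaches the chips; goodput is the share of that
    which does real work (under the constant-power assumption above).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return goodput / pue


def cost_per_useful_gpu_hour(cost_per_gpu_hour: float, goodput: float) -> float:
    """Supply cost restated per hour of useful work rather than per clock hour."""
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return cost_per_gpu_hour / goodput


if __name__ == "__main__":
    # A "beautiful" 1.08 PUE with a quarter of chip energy wasted...
    print(useful_energy_fraction(1.08, 0.75))
    # ...versus a mediocre 1.30 PUE with best-in-class goodput.
    print(useful_energy_fraction(1.30, 0.96))
    # The same $0.74/GPU-hr at average vs best-in-class goodput.
    print(cost_per_useful_gpu_hour(0.74, 0.90))
    print(cost_per_useful_gpu_hour(0.74, 0.96))
```

Under these assumptions the facility with the worse PUE and better goodput delivers more useful work per unit of energy, which is the conflict a single headline number hides.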
Deep dive: reading a vendor spec sheet without getting fooled
Vendor and developer claims are usually under-specified rather than dishonest, and the reader supplies the missing assumption, often wrongly. A short checklist, drawn from the vocabulary above, catches most of it.
- A capacity in MW: ask which power (critical IT, total facility, or contracted grid draw) and at what PUE. The three differ by 15-50%.
- An equipment rating: ask MW or MVA. Switchgear, transformers, and generators are rated in MVA; at a 0.9 power factor a 100 MW load is ~111 MVA, and a unit sized to the MW figure is undersized.
- A bandwidth in TB/s or Gb/s: ask which network tier (scale-up, scale-out, scale-across) and whether it is per-GPU, per-NIC, or rack-aggregate. Per-GPU scale-up and per-NIC scale-out alone differ by ~18x, and a rack-aggregate figure is larger again.
- An availability in nines: ask what it is the availability of — a single node, the facility power path, or the delivered job — and whether the redundant paths are truly independent or share a common-mode failure.
- An efficiency number: ask which family. A great PUE says nothing about goodput; a great $/GPU-hr says nothing about delivered $/M-tokens.
- A density in kW/rack: ask TDP-summed nameplate or measured workload draw with transients — and at what generation, since the figure is on a steep ramp.
Every one of these is a vocabulary collision waiting to become a procurement error. The discipline is to never let a single number stand without the qualifier that pins its meaning. → numbers provenance and vintage discipline in Chapter 0.2.
How the rest of the guide uses this vocabulary
Nothing here is meant to be the last word — each term has a discipline chapter where it is defined rigorously, measured honestly, and pushed to its edge cases. This chapter is the dictionary you keep open while reading the rest. When Part 3 reorders the siting hierarchy around power, it assumes you read "grid draw" and "MVA" correctly. When Part 5 treats the air-cooling cliff, it assumes "kW/rack" means measured workload draw. When Part 8 chooses an oversubscription ratio, it assumes you know which network tier is on the table. When Part 12 argues that AI factories optimize goodput rather than availability, it assumes you can do the availability algebra well enough to see why that is a real distinction. The three threads of the guide — POWER-BOUND, GOODPUT, and DENSITY-RAMP — are each, at bottom, a discipline about reading one of these metrics honestly across a four-year build against a moving target.