Numbers Provenance Register

Every date-stamped figure in the guide — 1,420 entries, sourced and flagged where contested.
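
A minimal sketch for working with the register as data, assuming it is exported as tab-separated text with the header row Metric, Value, As of, Where, Source. The sample rows are abbreviated copies of register entries. `load_register` and `is_contested` are hypothetical helper names, and matching on the word "contested" in the Metric text is an assumption about how the flags are written, not a documented schema.

```python
import csv
import io

# Hypothetical excerpt: header plus three abbreviated register rows (tab-separated).
SAMPLE = (
    "Metric\tValue\tAs of\tWhere\tSource\n"
    "practical air-cooling ceiling per rack\t~41 kW\t2025\t0.1\tASHRAE TC 9.9\n"
    "GPU economic vs book life, flagged CONTESTED\t2-3 yr vs 5-6 yr\t2026\t0.2\tCNBC / SemiAnalysis synthesis\n"
    "self-operated TCO at 2048-GPU scale (contested, single-source)\t~$0.74/GPU-hr\t2025-2026\t0.3\tSemiAnalysis\n"
)


def load_register(text):
    """Parse tab-separated register text into a list of dicts keyed by header."""
    # QUOTE_NONE keeps quote characters inside titles as literal text.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def is_contested(row):
    """Assumed convention: a row is flagged when its Metric text mentions 'contested'."""
    return "contested" in row["Metric"].lower()


rows = load_register(SAMPLE)
contested = [row for row in rows if is_contested(row)]

for row in contested:
    # Print the flagged value with the guide section it appears in.
    print(row["Where"], row["Value"], "|", row["Source"])
```

Filtering on the flag first makes it easy to treat contested figures as ranges rather than point estimates when they feed a downstream calculation.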

Metric	Value	As of	Where	Source
rack power across the inflection: legacy → GB200 NVL72 (~132 kW) → Rubin Ultra Kyber (~600 kW, 2027 roadmap)	~10–15 kW → 120–600 kW	2026	0.1	SemiAnalysis / NVIDIA roadmap
practical air-cooling ceiling per rack — the discontinuity that forces liquid and rewrites the building	~41 kW	2025	0.1	ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80–90% of draw at large operators	~2/3	2026	0.1	Deloitte TMT Predictions 2026; McKinsey
US large-load grid interconnection lead time end-to-end; up to ~10 yr in the worst queues — the binding constraint	~3–7+ yr	2025	0.1	ERCOT / PJM filings synthesis
HV/substation power transformer lead time (standard); up to ~60 months in constrained markets — often the schedule's long pole	~128 wk	2025	0.1	Wood Mackenzie / pv magazine
global data center capex in 2026 (~21% CAGR through 2029; GPUs ~1/3 of capex)	approaching ~$1T	2026	0.1	Dell'Oro Group
cumulative global data center capex by 2030 (~$5.2T AI-capable) — the scale that makes mis-coordination catastrophic	~$6.7T	2025	0.1	McKinsey, 'The cost of compute'
end-to-end electrical-chain efficiency, 800VDC/DC chain vs legacy AC (utility-to-VRM) — a system gain only co-design captures	>92% vs ~61–87.5%	2025	0.1	SemiAnalysis, Datacenter Anatomy Pt 1
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023) — a fast-moving figure read as direction, not a fixed level	~2/3	2026	0.2	Deloitte TMT Predictions 2026
global data center capex 2026, approaching — volatile market figure; analyst estimates differ by capex-scope definition	~$1T	2026	0.2	Dell'Oro Group
per GB200 NVL72 rack (shipping, ~115 kW liquid + ~17 kW air) — a semi-durable hardware spec you can design against	~132 kW	2025	0.2	NVIDIA OCP / Introl
per Rubin Ultra Kyber-class rack — marked roadmap/announced, not shipping; do not budget as a level	~600 kW	2027 (announced)	0.2	SemiAnalysis / NVIDIA roadmap
practical air-cooling ceiling per rack — a durable physics number, safe to treat as a hard constraint	~41 kW	2025	0.2	ASHRAE TC 9.9 / SemiAnalysis
large-load grid interconnection lead time — volatile and region-dependent; up to ~10 yr in worst queues	3–7+ yr	2025	0.2	ERCOT / PJM filings synthesis
GPU economic vs book life — flagged CONTESTED; run irreversible decisions across the range, not a point estimate	2–3 yr vs 5–6 yr	2026	0.2	CNBC / SemiAnalysis synthesis
best-in-class vs industry-average training goodput — a GOODPUT-thread target, vendor-marketed upper bound	~96% vs ~90%	2025	0.2	SemiAnalysis ClusterMAX / CoreWeave
industry-weighted average PUE, flat for a 6th year; best-in-class liquid 1.05-1.15	~1.54	2025	0.3	Uptime Institute Global Data Center Survey 2025
WUE range: industry avg ~1.8-1.9; best-in-class 0.3-0.7; closed-loop ~0	~0-1.9 L/kWh	2025	0.3	Vertiv / NREL synthesis; Microsoft FY2025 fleet ~0.30
goodput (effective training time): industry average vs best-in-class	~90% / ~96%	2025	0.3	SemiAnalysis ClusterMAX 2.0 / CoreWeave
scale-up (NVLink) domain size: HGX node, NVL72 rack, announced Rubin Ultra Kyber	8 / 72 / 576	2026	0.3	NVIDIA NVLink / Rubin platform roadmap
scale-up (NVLink5/GPU) vs scale-out (per-NIC) bandwidth — roughly 18x apart	~1.8 TB/s vs ~400 Gb/s	2025	0.3	NVIDIA / SemiAnalysis
self-operated TCO at 2048-GPU scale, 90% util; ~$1.03-3.50 rented (contested — single-source)	~$0.74/GPU-hr	2025-2026	0.3	SemiAnalysis H100 cost / rental analyses
inference cost per million tokens: self-hosted 70B worked example vs market average	~$1.90-2.50	2025	0.3	Introl / SemiAnalysis synthesis
Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min downtime/yr)	99.982% / 99.995%	2025	0.3	Uptime Institute (figures Uptime no longer formally endorses)
Uptime: concurrent maintainability vs fault tolerance; legacy ~99.982% (~1.6 h/yr) vs ~99.995% (~26 min/yr), now Uptime-disavowed	Tier III / IV	2025	0.4	Uptime Institute Tier Standard
TIA-942-C resilience scale; full-facility telecom + M&E standard, May 2024 (C) revision	Rated 1–4	2024	0.4	ANSI/TIA-942-C
EN 50600 / ISO/IEC 22237 Availability Classes (+ Protection Classes); basis of the EU DC sustainability scheme	Class 1–4	2024	0.4	CEN / ISO/IEC JTC 1
ASHRAE TC 9.9 air classes and liquid W-classes (5th ed. + 2024 liquid-cooling resiliency addendum)	A1–A4 / W17–W45	2024	0.4	ASHRAE TC 9.9 Thermal Guidelines
OCP Diablo 400 (Mt. Diablo) sidecar-power spec; ±400/800 VDC, ~100 kW to ~1 MW racks	v0.5.2	May 2025	0.4	OCP (Google/Meta/Microsoft)
FedRAMP 20x Key Security Indicators replacing 325+ NIST 800-53 controls; Phase 3 opens to all Q3 2026	56–61 KSIs	2026	0.4	FedRAMP PMO (RFC-0006)
ISO/IEC 42001 (first AI management-system standard) from publication to operationalized certification bodies	2023 → 2026	2026	0.4	ISO/IEC; ANAB/BSI accreditation
industry-weighted PUE (flat YoY) — the ISO/IEC 30134-2 KPI that lands in leases and disclosures	~1.54	2025	0.4	Uptime Institute Global DC Survey 2025
published Tier III (~1.6 hr/yr down) vs Tier IV (~26 min/yr) availability — Uptime no longer endorses the specific %	99.982% / 99.995%	2025	0.5	Uptime Institute Tier Standard
Tier IV capital premium over Tier III for the fault-tolerance step; total build often 2-3x in practice	20-40%	2026	0.5	Uptime Institute; INGENIOUS.BUILD; market data
of impactful data-center outages root-caused to power (most often UPS); IT/networking ~23%	45%	2025	0.5	Uptime Institute Annual Outage Analysis
of recent major outages cost over $100k / over $1M respectively	~57% / ~20%	2025	0.5	Uptime Institute Global Survey
of human-error outages caused by staff not following procedures (up 10 pts YoY) — process, not topology	58%	2025	0.5	Uptime Institute Annual Outage Analysis
best-in-class H100 cluster failure rate; one failure restarts a synchronous job from checkpoint	~1 failure / 512 GPUs / week	2025	0.5	SemiAnalysis (100k H100 clusters)
training goodput: industry average vs best-in-class; reliability overhead 6-21% of TCO	~90% / ~96%	2025	0.5	SemiAnalysis ClusterMAX / CoreWeave
data-center load tripped on a single 230 kV fault, triggering a rare NERC Level 3 alert — a grid-scale blast radius	~1,500 MW	2026	0.5	NERC / Utility Dive
per GB200 NVL72 rack (≈132 kW typical: ~115 kW liquid + ~17 kW air)	120–140 kW	2025	1.1	NVIDIA GB200 NVL72 / HPE & Supermicro datasheets
per Rubin Ultra Kyber NVL576 rack on 800 VDC	~600 kW	H2 2027 (announced)	1.1	NVIDIA GTC (Jensen Huang); DCD, Tom's Hardware
practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 200+ kW	~41 kW	2025	1.1	ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80–90% of draw at large operators	~2/3	2026	1.1	Deloitte TMT Predictions 2026; McKinsey
active generation + storage in US interconnection queues (end-2024; ~twice US installed capacity); large-load waits 4–7 yr in top hubs	~2,290 GW	end-2024	1.1	LBNL, Queued Up 2025 Edition
all-in cost per 8-GPU H100 server (excl. storage); ~$31k/GPU/yr enterprise all-in	$283–318k	2025	1.1	SemiAnalysis AI Neocloud Playbook
TCO at 2048-GPU scale, 90% utilization; ~$1.03 small clusters; cloud H100 ~$1.49 (contested — single-source)	~$0.74/GPU-hr	2025	1.1	SemiAnalysis H100 cost/rental analyses
accelerated economic life vs 5–6 yr book life; used GPUs retain ~20–40% residual after 3 yr	2–3 yr	2025	1.1	Goldman Sachs; CNBC/secondary-market analyses
per dense training rack (GB200 NVL72 ~120–132 kW; GB300 ~142 kW)	120–142 kW	2025	1.2	NVIDIA OCP / SemiAnalysis / Introl
per Rubin Ultra Kyber NVL576 rack on 800 VDC (announced roadmap)	~600 kW	2027 (announced)	1.2	NVIDIA GTC; SemiAnalysis 800 VDC
practical air-cooling ceiling/rack; RDHx ~50–100 kW; DLC 200+ kW	~41 kW	2025	1.2	ASHRAE TC 9.9 / SemiAnalysis
GB200 NVL72 DLC inlet & flow; deviation throttles GPUs up to ~50%	20–25 °C / ~80 L/min	2025	1.2	NVIDIA OCP / Introl
training back-end fabric non-blocking; 2:1 'optimized' cuts back-end cost ~31% (contested — single-source)	1:1 vs 2:1	2025	1.2	SemiAnalysis AI Neocloud Playbook
NVLink5 per-GPU BW (1.8 TB/s) vs ~400G scale-out NIC — keep collectives in scale-up	~18x	2025	1.2	NVIDIA / SemiAnalysis
unplanned interruptions on 16,384 H100s (~1 / 3 hr); 78% hardware-caused	419 / 54 days	2024	1.2	Meta Llama 3 paper (Table 5)
best-in-class mature H100 cluster MTBF; one failure restarts a synchronous job	~7 days / 512 GPUs	2025	1.2	SemiAnalysis 100k-H100 clusters
training goodput: industry average / best-in-class; reliability overhead 6–21% of TCO	~90% / ~96%	2025	1.2	SemiAnalysis ClusterMAX / CoreWeave
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80-90% of draw at large operators	~2/3	2026	1.3	Deloitte TMT Predictions 2026
AI inference capacity to 2030 (~35% CAGR) vs training 23.1 → 62.2 GW (~22%)	20.9 → 93.3 GW	2026	1.3	McKinsey, 'The next big shifts in AI workloads'
market for inference-optimized chips in 2026; most inference stays in data centers, not at the edge	>$50B	2026	1.3	Deloitte TMT Predictions 2026
power-oversubscription headroom: inference (uncorrelated per-request peaks) vs training (synchronous peaks)	~21% vs ~3%	2026	1.3	Uptime Institute Journal; arXiv power-profile studies
inference fabric oversubscription (vs 1:1 non-blocking for training); 2:1 cuts back-end cost ~31% (contested — single-source)	2:1-3:1	2025	1.3	SemiAnalysis AI Neocloud Playbook; Juniper
HBM3E per Ironwood TPU v7 (inference-era ASIC); 9,216-chip pods, 42.5 FP8 ExaFLOPS, 4,614 FP8 TFLOPS/chip	192 GiB / 7.4 TB/s	2025	1.3	Google Cloud; SemiAnalysis
self-hosted vs market-avg inference cost per million tokens; ~10x/yr token-price deflation (LLMflation)	~$1.90 vs ~$2.50/M tok	2025	1.3	Introl / NVIDIA synthesis; a16z
inference uptime target (99.995%) vs training's checkpoint-tolerant N/N+1 posture	Tier IV ~26 min/yr	2025	1.3	Uptime Institute (Tier classes)
of wall-clock spent on rollout generation in agentic/reasoning RL post-training	~80%	2026	1.4	2025–2026 RL-systems papers (ROLL Flash, ROLLART) & Introl RLHF infra report
of compute consumed by rollouts at 16K-token generation length (RLVR long-CoT)	~70%	2025	1.4	RLVR / long-CoT RL-systems analyses (arXiv)
tokens per RL trajectory for reasoning/agentic tasks — the rollout that dominates cost	10K–100K+	2026	1.4	domain-research keyNumbers; reasoning-model RL reports
wall-clock speedup of variance-controlled async RL vs synchronous at equal accuracy (~42h vs ~105h)	2.5x	2026	1.4	Stable Asynchrony / VCPO (arXiv 2602.17616)
just to hold weights for a 70B PPO-RLHF stack (actor + reference + reward + critic), pre-optimizer	8–16 GPUs	2025	1.4	Introl RLHF infrastructure report
QLoRA fine-tune on a single 48 GB GPU; memory cut from >780 GB to <48 GB without quality loss	65B on 48 GB	2023	1.4	QLoRA (Dettmers et al., arXiv 2305.14314)
share of parameters trained by a LoRA adapter vs full fine-tune (model-dependent)	~0.1%	2026	1.4	LoRA (Hu et al.) / 2026 PEFT practitioner guides
GPU:CPU norm rebalancing toward more CPU per node as agentic RL adds rollout/tool/env load	from 8:1	2026	1.4	domain-research (System Composition); SemiAnalysis
one-way fiber latency from distance alone (~5 ms per 1,000 km); ~1.64 ms RT per 100 mi before any processing	~0.82 ms / 100 mi	2025	1.5	M2 Optics fiber-latency analysis (≈2/3 c in glass)
MEC round-trip at the access edge; under ~50 ms from a regional 5G URLLC breakout	sub-10 ms	2025	1.5	ETSI ISG MEC; arXiv 2504.03708 (telco-LLM latency)
perceptibility thresholds: hard real-time / interactive (AR-VR, agentic) / 'instant' conversational	~30 / 50 / 100 ms	2026	1.5	Spheron hybrid edge guide; AR/VR latency literature
edge data center market, 2026 to 2033, ~14.9% CAGR; AI/ML inference the fastest-growing segment	~$40B → ~$106B	2026	1.5	Grand View Research; Coherent Market Insights
micro data centers' share of the edge market (global 2025) / of US edge by 2026	~35% / ~54%	2026	1.5	Grand View Research; Coherent Market Insights (US)
inference share of AI compute in 2026 (½ in 2025); the growth pool the edge competes for	~2/3	2026	1.5	Deloitte TMT Predictions 2026
edge-site deploy time and install-time reduction under zero-touch provisioning (Vapor IO; ZTP fleet tooling)	~1 hr / 90%+	2026	1.5	Vapor IO; Scale Computing / VMware VCF Edge
practical power envelope per edge micro-site (vs ~132 kW for a centralized NVL72 rack)	a few kW – ~30 kW	2026	1.5	research/domain-research.json; practitioner ranges
time-to-power: greenfield self-build vs wholesale colo (live 50k+ GPU cluster) vs neocloud	24–36 mo / 6–12 mo / days–weeks	2026	1.6	SemiAnalysis; JLL 2026 Outlook; Introl
brownfield retrofit cost: cooling-only vs full AI retrofit; ~2/3 of pre-2015 DCs unsuitable for frontier density	$2–3M / $5–10M per MW	2025	1.6	Introl / Tetra Tech / Schneider synthesis
global wholesale colo average 2025 (record); ~$120 Atlanta to ~$450 Singapore; ~1% vacancy	~$217/kW-month	2025	1.6	JLL / CBRE synthesis
self-build TCO at 2,048-GPU scale, 90% utilization (~$1.03 small clusters) vs neocloud median ~$2.3–3.5/hr (contested — single-source)	~$0.74/GPU-hr	2025	1.6	SemiAnalysis cost / H100 rental analyses
neocloud GPU rental vs hyperscaler pricing (8-GPU node ~$34/hr neocloud vs ~$98/hr hyperscaler)	40–85% below	2026	1.6	SemiAnalysis H100 Index / AM Compute
rise in the 1-year H100 rental contract index, Oct 2025 to Mar 2026, as capacity tightened; on-demand largely sold out	~+40%	2026	1.6	SemiAnalysis H100 Rental Index
breakeven utilization for a debt-financed cluster; swings -$330k to +$340k/mo (55% vs 85%) on a 1,024-GPU H100 build (contested — single-source)	~70%	2025	1.6	AM Compute / McKinsey
US large-load grid interconnection lead time end-to-end; up to ~10 yr in worst queues — the gate behind self-build	~3–7+ yr	2026	1.6	LBNL Queued Up; ERCOT / PJM filings
practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 100–200 kW+	~41 kW	2025	1.7	ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
per GB200 NVL72 rack (~115 kW liquid + ~17 kW air); GB300 ~142 kW; Rubin Ultra Kyber ~600 kW	120–132 kW	2026	1.7	NVIDIA OCP / SemiAnalysis roadmap
GB200 NVL72 DLC inlet & flow; deviation can throttle GPUs up to ~50%	20–25 °C / ~80 L/min	2025	1.7	NVIDIA OCP / Introl
training non-blocking vs inference oversubscribed; 2:1 cuts back-end cost ~31% (contested — single-source); Meta ran 7:1 on 24k H100	1:1 vs 2:1–3:1	2025	1.7	SemiAnalysis AI Neocloud Playbook / Meta
GPU:CPU ratio shifting from training-era norm toward agentic-inference host demand	~8:1 → 4–8:1	2026	1.7	TrendForce Insights; Introl
full AI liquid retrofit cost crossing the cooling cliff; still strands capacity	~$5–10M/MW	2026	1.7	Introl / Vera Rubin deployment analysis
~1.6 hr/yr vs ~26 min/yr downtime; Tier IV ~20–40% capital premium	Tier III 99.982% / Tier IV 99.995%	2025	1.7	Uptime Institute
goodput (effective training time): industry avg vs best-in-class; reliability overhead 6–21% of TCO	~90% / ~96%	2025	1.7	SemiAnalysis ClusterMAX / CoreWeave
1 GW AI data center: total-program capex (core stack ~$27.9/W plus land, build-out, financing) and all-in annual TCO (~$8.5M/MW-yr)	~$38B / ~$8.5B/yr	2026	1.8	Epoch AI, AI datacenter cost breakdown
1 GW annual TCO at 3-yr / 5-yr / 7-yr IT useful life — the dominant lever	$12B / $8.5B / $7B	2025	1.8	Epoch AI / AM Compute synthesis
self-operated TCO at 2048-GPU scale, 90% util; ~$1.03 small clusters (contested — single-source)	~$0.74/GPU-hr	2025	1.8	SemiAnalysis, GPU cluster cost
breakeven utilization (debt-financed); 1,024-GPU cluster swings -$330k to +$340k/mo (contested — single-source)	~70%	2025	1.8	AM Compute / McKinsey
LLMflation: inference cost decline at fixed quality (Epoch: ~50x/yr median)	~10x/yr	2024-2026	1.8	a16z; Epoch AI
AI-app gross margin vs 70-90% for mature SaaS	~41% to ~52%	2026	1.8	ICONIQ State of AI 2026; Bessemer
wholesale colo global avg 2025; BTS/CTL ~$150-220/kW-mo over 15 yr	~$217/kW-mo	2025	1.8	JLL / CBRE synthesis
estimated understated AI D&A 2026-2028 (CONTESTED); industry AI D&A ~$400B/yr	~$176B	2026	1.8	Burry / secondary analyses; filings
AI/HPC scheduler share: Slurm / Kubernetes / in-house (rule of thumb)	~70% / ~20% / ~10%	2026	10.1	HPCwire, 'Slurm vs Kubernetes in the Age of AI'; ClusterMAX
GPU-cloud customers using K8s for inference vs Slurm for training	~90% K8s / ~50% Slurm	2025	10.1	SemiAnalysis ClusterMAX
Kubernetes Dynamic Resource Allocation graduated to stable (Sept 2025)	GA in 1.34	2025	10.1	Kubernetes blog, 'v1.34: DRA has graduated to GA'
reported GPU utilization, device plugins vs DRA (better packing/sharing)	45–60% → 70–85%	2026	10.1	Red Hat / vendor DRA analyses
goodput (effective-training-time): industry avg vs best-in-class	~90% / ~96%	2025	10.1	SemiAnalysis ClusterMAX / CoreWeave
failure interval for a 16k-GPU cluster at ~80,000-hr per-GPU MTBF	~every 3 hr	2025	10.1	Meta Llama 3 / domain reliability math
demonstrated scale of Slurm-on-Kubernetes (Slinky) side-by-side workloads	8,000+ GPUs	2025	10.1	NVIDIA Developer, Slinky / slurm-bridge blog
NVIDIA open-sourced KAI Scheduler (gang, fair-share, DRA, topology)	Apache-2.0, Apr 2025	2025	10.1	NVIDIA / KAI-Scheduler GitHub
Bartz v. Anthropic settlement — largest US copyright payout; ~$3,000 per work across ~500,000 works; pirated copies ordered destroyed	$1.5B	2025	10.10	Bartz v. Anthropic (N.D. Cal.); Authors Guild; Fortune
Italian Garante fine on OpenAI for training ChatGPT without adequate legal basis + transparency failures; plus a 6-month awareness campaign	€15M	Dec 2024	10.10	Garante per la protezione dei dati personali
EU AI Act GPAI obligations apply to new models; training-content summary template (Commission) mandatory	Aug 2, 2025	2025	10.10	European Commission; EU AI Act
Deadline for pre-existing (placed before Aug 2025) GPAI models to publish their training-content summary	Aug 2, 2027	2025	10.10	European Commission; Mayer Brown analysis
ChatGPT conversation logs OpenAI was ordered to produce in NYT v. OpenAI discovery — output logs deemed relevant to fair-use defense	20M logs	2025	10.10	NYT v. OpenAI (S.D.N.Y.); Bloomberg Law
EDPB legitimate-interest assessment for AI training (interest, necessity, balancing); high bar to claim a trained model is anonymous	3-step test	Dec 2024	10.10	EDPB Opinion 28/2024
UK High Court: AI model weights are not an infringing 'copy' under the CDPA (statistical parameters, not stored images)	weights ≠ copy	Nov 2025	10.10	Getty Images v. Stability AI (UK High Court)
EU DSM TDM exception is opt-out by default — crawlers must honor machine-readable reservations (robots.txt / TDM Reservation Protocol)	opt-out	2025	10.10	EU DSM Directive 2019/790, Art. 4; AI Act Code of Practice
TPOT roughly matching human reading speed (~20-25 tok/s); common interactive decode target	~40-50 ms	2025	10.11	Practitioner consensus; NVIDIA / vLLM serving guides
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80-90% of draw at large operators	2/3 (~67%)	2026	10.11	Deloitte TMT Predictions 2026; McKinsey
market-average self-hosted inference cost, fell ~$10→~$2.50 in a year; worked example ~$1.90/M (8xH100, Llama-70B FP16)	~$2.50/M tok	2025	10.11	Introl / NVIDIA synthesis (via provenance)
goodput gain from hybrid aggregation/disaggregation over SOTA when both TTFT and TPOT bind	up to ~77%	2025	10.11	TaiChi (arXiv 2508.01989); see also FlowKV/HexGen-2
inference back-end fabric oversubscription (training is 1:1 non-blocking); 2:1 cuts back-end cost ~31% (contested — single-source)	2:1-3:1	2025	10.11	SemiAnalysis AI Neocloud Playbook
best-in-class vs industry-average goodput (training framing); reliability overhead 6-21% of TCO	~96% / ~90%	2025	10.11	SemiAnalysis ClusterMAX 2.0 / CoreWeave
runtime-reconfigurable disaggregation: x prefill workers feeding y decode workers, re-balanced live	xPyD	2026	10.11	NVIDIA Dynamo / TensorRT-LLM disaggregated serving docs
GB200 NVL72 coherent NVLink domain — the rack-scale block the scheduler treats as atomic	72 GPUs / 130 TB/s	2025	10.2	NVIDIA GB200 NVL72 / NVLink product page
NVLink 5 per-GPU bidirectional bandwidth (scale-up); ~3.6 TB/s on Rubin (roadmap)	~1.8 TB/s	2026	10.2	NVIDIA NVLink
scale-up (NVLink) vs scale-out (~400G NIC) per-GPU bandwidth — the cliff the scheduler defends	~5–18x	2025	10.2	NVIDIA / SemiAnalysis
Slurm block size for one NVL72 NVLink domain in topology.yaml (topology/block plugin)	18 nodes	2025	10.2	NVIDIA Developer — Slurm block scheduling on GB200 NVL72
max IMEX channels (ComputeDomains) per node in Kubernetes DRA — strands partial-node GPUs	1 / node	2025	10.2	NVIDIA Developer — MNNVL on Kubernetes
minimum Kubernetes with DRA APIs enabled for ComputeDomains; GPU Operator 25.3+	K8s 1.32+	2025	10.2	NVIDIA / AWS EKS GB200 guidance
share of Llama-3 training job interruptions traced to network/config issues — the cost of getting topology wrong	~10.7%	2024	10.2	Meta (via Introl topology analysis)
training goodput, industry average vs best-in-class — topology-aware placement is a lever on the gap	~90% → ~96%	2025	10.2	SemiAnalysis ClusterMAX / CoreWeave
MIG instances per GPU (B200/GB200): 2×~93GB, 4×~46GB, or 7×~23GB profiles	up to 7	2025	10.3	NVIDIA MIG User Guide (r580); MIG supported-profiles docs
HBM per Blackwell GPU available to partition across tenants (B200/GB200 class)	180–192 GB	2026	10.3	NVIDIA Blackwell datasheets; provenance.js HBM trajectory
NVIDIAScape (CVE-2025-23266) Container Toolkit escape — container-to-host on shared GPU nodes	CVSS 9.0	2025	10.3	Wiz Research; NVIDIA security bulletin
CVE-2025-23290 — first acknowledged cross-VM GPU-metric leak via vGPU Manager (co-tenant side channel)	CVSS 2.5	2025	10.3	NVIDIA security bulletin; Tenable
Slurm vs Kubernetes share of AI clusters — the two quota/fairness enforcement planes operators must master	~70% / ~20%	2026	10.3	HPCwire, 'Slurm vs Kubernetes in the Age of AI'
inference share of AI compute in 2026 — the workload class that most rewards fractional/MIG sharing	~2/3	2026	10.3	Deloitte TMT Predictions 2026; McKinsey
accelerated GPU economic life — the depreciation clock that makes reclaiming idle silicon urgent	2–3 yr	2025	10.3	Goldman Sachs; secondary-market analyses
current CUDA Toolkit (released May 2026); paired with R580 LTSB data-center driver	CUDA 13.3	mid-2026	10.4	NVIDIA CUDA Toolkit Release Notes; Data Center Driver docs
current NVIDIA data-center LTS driver branch; ~3-yr lifecycle, EOL ~Aug 2028	R580	mid-2026	10.4	NVIDIA Data Center Drivers; AI Enterprise lifecycle policy
AMD stack with RCCL NCCL-API parity; MI350X/MI355X support (7.0 Sep 2025)	ROCm 7.x	2025-2026	10.4	AMD ROCm 7.0 release notes & compatibility matrix
NVLink-SHARP Multimem multicast + symmetric-memory kernels within an NVL72 domain	NCCL 2.28+	2026	10.4	NVIDIA NCCL release notes; GitHub releases
GPU SMs consumed by a reduction after composing NVLink-SHARP + IB-SHARP in NCCL 2.27	~16 → ≤6 SMs	2025	10.4	NVIDIA Developer (NCCL 2.27); SHARP in-network computing
NCCL all_reduce busbw vs theoretical (acceptance gate); ≈370 GB/s against a 400 GB/s fabric (400G NDR)	~92%	2025	10.4	NVIDIA DGX BasePOD NCCL validation; OCI/Together AI
lower hardware cost for AMD vs NVIDIA — the prize that funds the ROCm tax	15-30%	2026	10.4	domain-research keyNumbers; SemiAnalysis AMD vs NVIDIA
failure cadence of a 16k-GPU cluster (Llama 3: 419 unplanned/54 days) the stack must absorb	every ~3 hr	2024	10.4	Meta Llama 3 405B disclosure
of fabric line rate is the NCCL all_reduce acceptance bar (~370 GB/s on a 400 GB/s fabric)	~92%	2025	10.5	Together AI — Practitioner's Guide to Testing Large GPU Clusters
typical burn-in soak before a new cluster is admitted to production	72–168 hr	2025	10.5	Introl validation frameworks; neocloud operator reports
mean time between failures for a 16,000-GPU cluster — why provisioning is a continuous day-2 loop, not a one-time event	~3 hr	2024	10.5	Meta Llama 3 (16,384 H100); ~80,000-hr per-GPU MTBF
MTBF per 512 GPUs at a top-tier H100 operator; new clusters fail far more during 3–4 week burn-in	~7 days	2025	10.5	SemiAnalysis (100k H100 clusters)
automated node replacement on a best-in-class fleet — the day-2 lifecycle target	~90 sec	2026	10.5	SemiAnalysis AI Neocloud Playbook / ClusterMAX
to provision 128 GPUs to a customer at a top-rated neocloud — the bring-up-as-competitive-lever benchmark	<2 days	2026	10.5	SemiAnalysis ClusterMAX 2.0
goodput (effective-training-time) achievable despite ~3-hr cluster MTBF, given automated validation + recovery	>90%	2025	10.5	Google Cloud goodput; NVIDIA Mission Control
revenue per GW per year — the depreciation clock that makes time-from-rack-to-first-job a million-dollar-per-week metric (contested — single-source)	$10–12B	2026	10.5	Domain synthesis; SemiAnalysis
unplanned interruptions on 16,384 H100s (~1 every 3 hr); 78% hardware, 58.7% GPU-related	419 / 54 days	2024	10.6	Meta (Llama 3 paper) / Tom's Hardware
MTBF per 512 GPUs at a best-in-class mature H100 operator (new clusters fail far more)	~7 days	2025	10.6	SemiAnalysis (100k H100 clusters)
machines harboring an SDC-prone defect; SDC expected every 1-2 weeks in large training	~1 in 1,000	2025	10.6	Meta Engineering; OCP SDC-in-AI whitepaper
SDC test seeds per month across Meta's fleet (Fleetscanner + Ripple)	~2.5 billion	2025	10.6	Meta Engineering (How Meta keeps AI hardware reliable)
industry-average vs best-in-class goodput (effective training time)	~90% / ~96%	2025	10.6	SemiAnalysis ClusterMAX / CoreWeave
large-LLM-job failure rate (~37% hardware-attributed; ~73% recoverable via restart)	~43.4%	2024	10.6	Alibaba (Unicron) via SemiAnalysis
reliability/recovery overhead — the cost the observability loop exists to shrink	6-21% of TCO	2025	10.6	SemiAnalysis ClusterMAX 2.0
MTTR achievable with multi-tier checkpointing vs 15-30 min naive restart	<2 min	2025	10.6	Google Cloud (multi-tier checkpointing)
mean-time-to-failure of a 1,024-GPU job vs 47.7 days for an 8-GPU job — the single-point-of-failure penalty of scale	7.9 hr	2025	10.7	Meta, Revisiting Reliability in Large-Scale ML Clusters (arXiv 2410.21680)
failures per thousand node-days on Meta's RSC-1 cluster (11 months, ~80%+ utilization)	6.50 / 1000	2025	10.7	Meta, Revisiting Reliability (arXiv 2410.21680)
unplanned interruptions on 16,384 H100s during Llama 3 405B (~1 every 3 hr); 78% hardware, 58.7% GPU-related	419 / 54 days	2024	10.7	Meta (Llama 3 paper) / Tom's Hardware
checkpoint-and-restart overhead required to hold ETTR ~0.9 on a 100,000-GPU run at RSC-2-like failure rates	~2 min	2025	10.7	Meta, Revisiting Reliability (arXiv 2410.21680)
best-in-class MTBF per 512 GPUs on a mature 100k-H100 cluster (burn-in 3–4 weeks first)	~7 days	2025	10.7	SemiAnalysis, 100k H100 Clusters
large-LLM-job failure rate; ~37% hardware-attributed; ~73% recoverable via restart	43.4%	2024	10.7	Alibaba Unicron production study
goodput (effective training time): industry average / best-in-class; reliability overhead 6–21% of TCO	~90% / ~96%	2025	10.7	SemiAnalysis ClusterMAX / CoreWeave
training restart latency, storage-only vs multi-tier/in-memory checkpointing	15–30 min → <2 min	2025	10.7	Google Cloud multi-tier checkpointing
typical model-FLOPS-utilization (MFU) for large LLM training; best-in-class >50% on Hopper	~30–50%	2025	10.8	SemiAnalysis; provenance.js (domain economics)
BF16 MFU gain on GB200 NVL72 from software/kernel maturation over ~12 months (≈57% throughput from software alone)	34% → 54%	2025	10.8	SemiAnalysis (H100 vs GB200 NVL72 training benchmarks)
BF16 MFU achieved pre-training Llama 3 on 16k H100s (frontier-scale reference point)	~41%	2024	10.8	Meta, The Llama 3 Herd of Models
training-state footprint with Adam (4 weight + 4 grad + 8–12 optimizer); the number the framework must shard across the fleet	~16–18 B/param	2025	10.8	Standard mixed-precision Adam accounting; DeepSpeed/ZeRO docs
rule-of-thumb checkpoint size on disk (weights + optimizer state); sets async-drain bandwidth need	~14 B/param	2025	10.8	VAST Data (checkpoint bandwidth analysis)
training goodput: industry average vs best-in-class effective-training-time fraction	~90% / ~96%	2025	10.8	SemiAnalysis ClusterMAX / CoreWeave; provenance.js
large-LLM-job failure rate in a production fleet (~37% hardware-attributed; ~73% restart-recoverable) — why elastic orchestration matters	~43.4%	2024	10.8	Alibaba Unicron; provenance.js
MTBF per 512 GPUs at a top-tier operator; one failure restarts a synchronous job from its last checkpoint	~7 days	2025	10.8	SemiAnalysis (100k H100 clusters); provenance.js
goodput (effective training time): industry average vs best-in-class marketed; reliability overhead 6-21% of TCO	~90% / ~96%	2025	10.9	SemiAnalysis ClusterMAX / CoreWeave
ClusterMAX 2.0 GPU-cloud rating: Security, Lifecycle, Orchestration, Storage, Networking, Reliability, Monitoring, Pricing, Partnerships, Availability — Platinum to UnderPerform	10 dimensions / 5 tiers	2025	10.9	SemiAnalysis ClusterMAX 2.0
breakeven utilization for a debt-financed fleet; the cliff the on-demand/take-or-pay mix transfers or carries (contested — single-source)	~70%	2025	10.9	AM Compute / McKinsey
serverless GPU time-to-first-token (H100): warm-pool vs scale-from-zero; snapshots claim ~10x cold-start gains	8-15s warm / 30-90s cold	2026	10.9	RunPod / Modal serverless comparisons
spot/preemptible discount vs on-demand; the price of transferring interruption risk to the customer	~60-80% off	2026	10.9	Spheron / GCP GPU pricing synthesis
H100 on-demand ladder: spot floor to Azure managed; neocloud median ~$2.29-3.50 (the value-stack premium, monetized)	~$1.03–$12.29/GPU-hr	2026	10.9	SemiAnalysis H100 Index / AM Compute
Tier III vs Tier IV facility availability (~1.6 hr vs ~26 min/yr) — the easy SLA, distinct from goodput	99.982% / 99.995%	2025	10.9	Uptime Institute
RAND Weights Security Levels and adversary operational-capacity tiers; 38 distinct attack vectors enumerated	SL1–SL5 / OC1–OC5	2024	11.1	RAND, Securing AI Model Weights (RRA2849-1)
attack vectors infeasible for OC1–OC3 but feasible for OC4–OC5 — why nation-state defense is categorically harder	8 of 38	2024	11.1	RAND RRA2849-1
assessed posture of frontier labs vs OC4–OC5 adversaries that want the weights — the central gap	~SL2–SL3	2026	11.1	RAND RRA2849-1; IFP / IST SL5 Task Force
allocation-constrained silicon per GB200 NVL72 rack — theft/destruction economics differ from generic cloud hardware	~$3M+	2025	11.1	Guide domain research; OEM rack pricing
RAND theft-window benchmark: a Security Level is defined by thwarting weight theft within roughly this horizon	<2 months	2024	11.1	RAND RRA2849-1
IRGC drones struck 3 AWS facilities in UAE/Bahrain — first deliberate state targeting of commercial data centers in wartime	1 Mar 2026	2026	11.1	CNBC; The Conversation
projected data-center physical-security spend by 2030 (roughly doubling) as kinetic/drone threats enter the planning case	~$4B	2030 (proj.)	11.1	Guide domain research; industry security forecasts
concentric physical model (perimeter → facility → data hall → cage/rack) with escalating MFA at each boundary	4 zones	2026	11.1	NIST / DCK physical-security guidance
of organizations run BMS affected by known-exploited vulnerabilities (KEVs); data centers the worst case	75%	2025	11.10	Claroty Team82, State of CPS Security 2025: BMS Exposures
of organizations exposed to KEVs that are ransomware-linked AND insecurely internet-connected	51%	2025	11.10	Claroty Team82, State of CPS Security 2025
heat lost in Lviv via FrostyGoop — Modbus firmware downgrade, no zero-day	600 buildings / ~2 days	2024	11.10	Dragos / CISA / SANS
data-center load lost instantaneously on a single 230-kV fault (the weaponizable swing)	~1,500 MW	2024	11.10	NERC Level 3 Alert / Utility Dive
load shed in a single Virginia event — the synchronized-load-step primitive, unweaponized	1.5 GW / 82 s	2024	11.10	NERC / Utility Dive
time from CDU flow-loss to GB200 NVL72 throttle/over-temp trip (no chilled-water inertia)	seconds to tens of seconds	2025	11.10	NVIDIA OCP DLC spec / Introl
IEC 62443 security levels by attacker capability; destructive primitives are SL3-SL4	SL1-SL4	2025	11.10	ISA/IEC 62443
Modbus TCP — unauthenticated by design; the protocol FrostyGoop and most BMS/CDU controllers speak	Port 502	2024	11.10	Dragos / SANS ICS
FedRAMP 20x Key Security Indicators (Low / Moderate baseline) — automated, measurable outcomes replacing control-narrative essays	56 / 61	2026	11.11	FedRAMP PMO RFC-0006
RFC-0024 deadline: machine-readable (OSCAL) packages mandatory for all FedRAMP providers	Sep 2026	2026	11.11	FedRAMP PMO RFC-0024
CMMC Level 2 third-party certification becomes mandatory for CUI-handling DoD contracts (Phase 2)	10 Nov 2026	2025-2026	11.11	DoD 48 CFR final rule
EU AI Act full enforcement powers activate; most high-risk obligations apply; fines up to 7% global turnover	2 Aug 2026	2026	11.11	European Commission
EU / North American enterprise AI-vendor RFPs asking for ISO 42001 certification or implementation	~40% / ~25%	2026	11.11	Industry RFP analyses
OCP S.A.F.E. accredited Security Review Providers (Atredis, IOActive, NCC Group) for firmware-security conformance audits	3 SRPs	2025-2026	11.11	Open Compute Project
single-event load loss that pushed NERC to treat large AI loads as grid actors subject to CIP-adjacent scrutiny	~1,500 MW	2026	11.11	NERC Level 3 Alert / Utility Dive
ISO 27001 / 42001 certification validity with annual surveillance audits; SOC 2 Type II re-issued every 6–12 mo	3 yr	2026	11.11	ISO; AICPA
mean time to identify + contain a breach in 2025 (lowest in 9 years) — the dwell window your retention must outlast	241 days	2025	11.12	IBM Cost of a Data Breach 2025
average breach cost: global down 9% to $4.44M; US at an all-time high of $10.22M	$4.44M / $10.22M	2025	11.12	IBM Cost of a Data Breach 2025
NIST IR guidance restructured onto CSF 2.0 (Govern/Identify/Protect/Detect/Respond/Recover); first revision since 2012	SP 800-61r3	Apr 2025	11.12	NIST SP 800-61 Rev. 3
GPU registers hidden by the BAR0 decoupler in confidential mode (vs ~7.94% normal) — the forensic opacity the SOC works around	~99.78%	2025	11.12	NVIDIA WP-12554 / arXiv 2507.02770
certificates in the device-identity chain / structured measurement records (NRAS/RIM) that become the forensic record on confidential systems	5 / 64	2026	11.12	NVIDIA Secure AI whitepaper (domain synthesis)
documented multi-tenant GPU escape (cross-VM disclosure) and cross-tenant DoS — the isolation-breach playbook's design case	CVE-2025-23290 / -23285	2025	11.12	NVIDIA security bulletins (domain research)
IRGC drone strikes on AWS facilities (UAE/Bahrain): aerial/kinetic attack now an IR design case, not tail risk	1 Mar 2026	2026	11.12	Domain research / open reporting
projected data-center physical-security spend by 2030 (~2x), reflecting the converged cyber-physical posture	~$4B	2026	11.12	Security-domain research synthesis
AWS facilities directly hit by drones (UAE) + 1 blast-damaged (Bahrain), Mar 2026 — first confirmed combat strike on US-run hyperscale DC	2 + blast	Mar 2026	11.2	DefenseScoop / DCK / MWI (West Point)
first US statute letting certified state/local/tribal law enforcement deploy counter-UAS (after DOJ training); private operators still cannot legally defeat a drone	FY2026 NDAA	2026	11.2	FY2026 NDAA; CRS; Route Fifty
data center security market 2026, growing to ~$90B by 2034 (~17% CAGR); biometrics the fastest-growing sub-segment	~$25.7B	2026	11.2	Fortune Business Insights; market.us
data center access-control market 2025, to ~$2.53B by 2030 (~10% CAGR)	~$1.55B	2025	11.2	MarketsandMarkets
load lost on a single substation fault; 1.5 GW dropped in 82 s (VA, 2024) — the prize a saboteur targets outside the fence	~1,500 MW	2024	11.2	NERC Level 3 Alert / Utility Dive
grid interconnection lead time for a large load — making the utility tie an irreplaceable single-point-of-failure if attacked	~3–7+ yr	2025	11.2	ERCOT / PJM filings synthesis
typical standoff range a rural power-first campus gets free; urban inference sites often have near-zero	50–150 m	2026	11.2	Practitioner / CPTED siting guidance
cost of a commercial FPV/one-way drone — the asymmetry against a multi-hundred-million-dollar facility	~$300–500/round	2026	11.2	MWI (West Point) / open-source defense reporting
suspect counterfeit-part submissions logged in 2025 (down from 1,055 in 2024, partly a one-off batch); active components ~36% of reports	748	2025	11.3	ERAI 2025 Annual Counterfeit Report
of suspect counterfeit parts that PASSED electrical test — would evade detection if electrical test were the only screen	~24%	2025	11.3	ERAI 2025 report
NIST SP 1800-34 'Validating the Integrity of Computing Devices' finalized — the platform-certificate / provenance reference architecture	Dec 2022	2022	11.3	NIST / NCCoE SP 1800-34
NIST SP 800-88 Rev 2 released — media sanitization modernized for encrypted/virtual/cloud media (Clear / Purge / Destroy)	Sept 2025	2025	11.3	NIST SP 800-88 Rev 2
IEEE 2883-2022: no overwrite-based method meets the Purge threshold for SSD/NVMe — only verified cryptographic erase or physical destruction qualifies	Purge = CE or destroy	2025	11.3	IEEE 2883-2022 / NIST 800-88 r2
of used drives resold on the secondary market found to contain residual recoverable data (PII, financial, IP) — the data-remanence base rate	42%	2019	11.3	Blancco Technology Group study
approximate silicon value concentrated in a single GB200 NVL72 rack (1.36 t) — the asset-value density driving target priority	$3–4M	2025	11.3	NVIDIA / SemiAnalysis (derived)
OCP S.A.F.E. project cadence; AMI the first approved independent firmware vendor SRP — the centralized, inheritable firmware-audit framework	1st Thu/mo	2025	11.3	Open Compute Project S.A.F.E.
CVE-2024-54085 AMI MegaRAC BMC auth-bypass via Redfish; added to CISA KEV 25 Jun 2025; OEM'd across 12+ server vendors	CVSS 10.0	2025	11.4	Eclypsium / CISA KEV / The Hacker News
internet-exposed MegaRAC SP-X Redfish instances found, each potentially exploitable for remote takeover/bricking	1,000+	2025	11.4	Eclypsium (Shodan scan)
Caliptra open silicon RoT co-developed by Microsoft, Google, AMD, NVIDIA; committed in their first-party/server silicon	4 contributors	2025	11.4	OCP / CHIPS Alliance / Microsoft Azure
post-quantum signatures + KEM in Caliptra 2.x via open-source Adams Bridge accelerator (CNSA 2.0 path), side-channel hardened	ML-DSA + ML-KEM	2025	11.4	Microsoft Azure / CHIPS Alliance
irreplaceable, allocation-constrained silicon per GB200 NVL72 rack a management-plane implant can brick or wiretap	$3M+	2025	11.4	RAND / domain research synthesis
NIST Platform Firmware Resiliency (protect/detect/recover); with SP 1800-34 and IR 8320 the standards backbone for firmware integrity	800-193	2024	11.4	NIST / NCCoE
OCP module decoupling BMC + RoT + TPM from the motherboard; 2.1 open reference designs appeared in 2025	DC-SCM 2.0	2025	11.4	OCP / Cloudflare Project Argus / Antmicro
BMC runs on standby power and boots before the host; a rooted BMC is an OS-invisible, persistent foothold under the CPU	always-on	2025	11.4	Eclypsium / OCP DC-SCM
of GPU HBM placed inside the encrypted, integrity-protected Compute Protected Region (CPR)	~90%	2025	11.5	arXiv 2507.02770 (GPU CC Demystified); NVIDIA WP-12554
of GPU memory-mapped registers hidden by the BAR0 decoupler in CC mode (vs ~8% in normal mode)	~99.78%	2025	11.5	arXiv 2507.02770
device-identity chain length and structured measurement records validated against NRAS + RIM goldens	5-cert / 64 records	2025	11.5	arXiv 2507.02770; NVIDIA attestation docs
per-channel session keys derived from one SPDM-negotiated master secret (RPC / DMA / fault / workload)	44+ keys	2025	11.5	arXiv 2507.02770
training / inference advantage HGX B200 retains over H200 with confidential computing fully enabled	~2x / ~2.5x	2025	11.5	NVIDIA Secure AI WP-12554; Corvex/Spheron benchmarks
Blackwell CC overhead on large matrix ops (encrypted HBM + TEE-I/O over NVLink); Hopper far heavier on small/PCIe transfers	under ~3%	2025	11.5	NVIDIA; independent Hopper CC benchmark (arXiv 2409.03992)
Hopper-class confidential-computing scope; multi-GPU TEE-I/O across NVLink is Blackwell-and-later	single-GPU	2025	11.5	NVIDIA Secure AI with Blackwell and Hopper GPUs (WP-12554)
year AMD SEV-SNP + Intel TDX + NVIDIA GPU CC reached broad cloud GA as a paired confidential-AI stack	2025	2025-2026	11.5	NVIDIA / cloud-provider CC GA announcements
CVSS of NVIDIAScape (CVE-2025-23266) — three-line container escape to host root in NVIDIA Container Toolkit	9.0	Jul 2025	11.6	Wiz Research; NVIDIA Security Bulletin
NVIDIA Container Toolkit versions vulnerable to NVIDIAScape (GPU Operator ≤25.3.0)	≤1.17.7	Jul 2025	11.6	Wiz; NVIDIA
first publicly acknowledged cross-VM co-tenant information disclosure via the vGPU Manager	CVE-2025-23290	Jul 2025	11.6	NVIDIA Security Bulletins
max MIG instances per GPU — the only hardware-enforced fractional partition (dedicated SMs, L2 slice, memory controllers, HBM slice)	7	2025	11.6	NVIDIA Multi-Instance GPU
LLM-response data recoverable per query via LeftoverLocals (CVE-2023-4969) from un-scrubbed GPU local memory	≈181 MB	2024	11.6	Trail of Bits
memory and fault isolation guarantees provided by time-slicing / MPS between tenants	0	2025	11.6	Introl; NVIDIA MPS docs
ClusterMAX 2.0 operator-maturity rubric grades tenant/fabric isolation, health-checks, and goodput as first-class	10-dimension	2025	11.6	SemiAnalysis ClusterMAX 2.0
share of data-center traffic that is east-west (interior); approaches 100% on a training back-end fabric	76-80%	2024-2026	11.7	Akamai / Gigamon
average eCrime breakout time (initial access to first lateral movement) in 2025, down from 48 min in 2024; fastest 27 s	29 min	2025	11.7	CrowdStrike 2026 Global Threat Report
BlueField-4 DPU throughput; 64 Arm cores, ~6x BlueField-3 compute; zero-trust east-west enforcement at line rate	800 Gb/s	2026 (Vera Rubin platform)	11.7	NVIDIA / HPCwire
NIST Zero Trust Architecture — 'never trust, always verify'; no trust from network location	SP 800-207	Aug 2020 (current)	11.7	NIST
training back-end fabric design; sub-2 us latency — why inline L7 inspection is a goodput tax there	1:1 non-blocking	2025	11.7	SemiAnalysis / NVIDIA
industry-avg vs best-in-class goodput; inline enforcement on collectives erodes exactly this metric	~90% / ~96%	2025	11.7	SemiAnalysis ClusterMAX / CoreWeave
configuration boundaries (subnet-manager / adapter enforced) — segmentation, not cryptographic isolation	VLAN/PKey	2025	11.7	NVIDIA InfiniBand / SemiAnalysis ClusterMAX
egress posture for the weights enclave: allow-listed proxy + blocked/alerted bulk transfers — the anti-exfil linchpin	default-deny	2025	11.7	RAND RRA2849-1 (weight-security egress controls)
RAND Weights Security Levels (SL1-5), attacker operational-capacity tiers (OC1-5), and catalogued attack vectors	5 levels / 5 tiers / 38 vectors	2024	11.8	RAND RRA2849-1 (Securing AI Model Weights)
where RAND assesses most frontier labs currently sit — stops opportunistic actors and basic insiders, not OC4-OC5 nation-states	~SL2	2024-2026	11.8	RAND RRA2849-1
SL5 Task Force target for nation-state-resistant frontier AI infrastructure; SL5 standard = 43 controls / 10 families (NIST SP 800-53 overlay)	2028/2029	2025-2026	11.8	SL5 Task Force / Institute for Security & Technology
to exfiltrate a ~1,000 GB model even under an 800 GB/day egress cap (why fixed-rate limits are necessary but not sufficient)	~1.25 days	2025	11.8	LessWrong/Alignment Forum egress-limit analyses
token output of a single production inference server — the channel that cannot be rate-capped without breaking the service	~1 TB/day	2025	11.8	Inference-verification exfiltration research
preliminary feasible weight-compression floor in a theft context — shrinks the payload an attacker must move, undercutting fixed egress caps	~1 bit/param	2026	11.8	arXiv 'Aggressive Compression Enables LLM Weight Theft'
GPU HBM inside the encrypted Compute Protected Region; memory-mapped registers hidden by the BAR0 decoupler in CC mode	~90% / ~99.78%	2025	11.8	arXiv 2507.02770; NVIDIA WP-12554
checkpoint size for a 175B to 1T-param model at ~14 bytes/param incl. optimizer state — the at-rest bulk the crypto must wrap	2.3-13.8 TB	2025	11.8	NVIDIA storage guidance; checkpoint-sizing rules of thumb
where consensus assesses frontier labs sit; insider threat is the dominant gap blocking SL4-5, which need human-layer controls not more crypto	~SL2	2024-2025	11.9	RAND RRA2849-1 (Securing AI Model Weights); IST SL5 Task Force
RAND theft benchmark: a Security Level is defined by stopping an adversary attempting weight theft inside this window	<2 months	2024	11.9	RAND RRA2849-1
distinct attack vectors in RAND's model; insider threat spans most of them rather than being one isolated path	38 vectors	2024	11.9	RAND RRA2849-1 (5 SL, 5 OC tiers, 38 vectors)
average annual cost of insider risk per organization (largest Ponemon insider study to date)	$17.4M	2025	11.9	Ponemon / DTEX 2025 Cost of Insider Risks
share of insider incidents that are negligent vs malicious; credential theft ~20% but costliest at ~$779,797/event	~55% / ~25%	2025	11.9	Ponemon 2025 Cost of Insider Risks
average time to detect and contain an insider incident (down from 86 in 2023); far longer than a checkpoint copy takes	81 days	2025	11.9	Ponemon 2025 Cost of Insider Risks
of breaches involve the human element; convenience (60%) now leads deliberate-misuse motive ahead of financial gain (33%)	~60%	2025	11.9	Verizon 2025 DBIR (12,195 breaches)
frontier pattern: time-limited, peer-approved, business-justified grants to weight infrastructure (multi-party authorization)	no standing access	2025	11.9	Anthropic Frontier Model Security; OpenAI frontier-risk
legacy Tier III / Tier IV availability (~1.6 hr vs ~26 min/yr down) — figures Uptime no longer endorses	99.982% / 99.995%	2025	12.1	Uptime Institute Tier Standard
MEP construction-cost swing of 2N over N+1; 2N strands ~50% of capacity idle	+30–50%	2025	12.1	SemiAnalysis Datacenter Anatomy; STACK Infrastructure
Tier IV capital premium over Tier III — for ~70 extra minutes/yr of facility uptime	~20–40%	2025	12.1	Uptime Institute / practitioner data
share of impactful outages caused by power (most often UPS) — the leading cause, 4th year of falling overall frequency	45%	2025	12.1	Uptime Institute Annual Outage Analysis 2025
of human-error outages caused by staff not following procedures (up from 48%); ~40% of orgs hit a major human-error outage in 3 yr	58%	2025	12.1	Uptime Institute Annual Outage Analysis 2025
Llama 3 405B total training interruptions on 16,384 H100s (47 planned + 419 unplanned; ~1 every 3 hr; 78% of unplanned hardware-caused), yet >90% effective training time	466 / 54 days	2024	12.1	Meta (Llama 3 paper)
best-in-class H100 cluster MTBF per 512 GPUs — the job is its own availability risk, not the building	~7 days	2025	12.1	SemiAnalysis (100k H100 clusters)
rack BBU (OCP ORv3, 5+1 redundant) switchover — backup energy migrating down to the rack/silicon	<5 ms	2025	12.1	OCP ORv3 / Open Rack BBU specs
unplanned interruptions on 16,384 H100s (~1 every 3 hr); 78% hardware, 58.7% GPU/HBM — all at 100% facility availability	419 / 54 days	2024	12.2	Meta (Llama 3 405B paper) / Tom's Hardware
goodput (effective training time): industry average vs best-in-class; reliability overhead 6–21% of TCO	~90% / ~96%	2025	12.2	SemiAnalysis ClusterMAX / CoreWeave
best-in-class MTBF per 512 GPUs on mature H100 clusters; far worse during 3–4 week burn-in	~7 days	2025	12.2	SemiAnalysis (100k H100 clusters)
Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min/yr); Tier IV ~20–40% capital premium	99.982% / 99.995%	2025	12.2	Uptime Institute (% figures Uptime-disavowed)
training MTTR cut by multi-tier checkpointing — a goodput gain no facility tier delivers	15–30 min → <2 min	2025	12.2	Google Cloud (multi-tier checkpointing)
data-center load lost on a single 230 kV fault (1.5 GW in 82 s, VA); triggered NERC's rare Level 3 alert	~1,500 MW	2026	12.2	NERC Level 3 Alert / Utility Dive
per-GPU capacitor energy storage, GB300 → Vera Rubin (~6x); ~30% peak-grid-demand reduction demonstrated	65 → ~400 J/GPU	2026	12.2	NVIDIA / SemiAnalysis
large-LLM job failure rate (Alibaba Unicron); ~37% hardware-attributed, ~73% restart-recoverable	~43.4%	2024	12.2	Alibaba (Unicron) via SemiAnalysis
practitioner RTO / RPO target for production interactive inference	~15 min / ~5 min	2025	12.3	Introl, Disaster Recovery for AI Infrastructure
training RPO floor — set by checkpoint interval, not by replication; RTO bounded by GPU re-acquire + resume	2-4 hr	2025	12.3	Introl DR analysis; checkpoint practice
infrastructure cost of active-active (carrying a second live fleet); hot/warm standby ~60% cheaper; pilot light ~20% of full redundancy	~2x	2025	12.3	Introl DR analysis; cloud DR-pattern taxonomy
training throughput (goodput) penalty of forcing a zero-RPO posture vs setting RPO = checkpoint interval	~15-20%	2025	12.3	Introl DR analysis
duration of the AWS US-EAST-1 outage (Oct 19-20, 2025) — a single-region control-plane/DNS dependency cascading estate-wide	~15 hr	2025	12.3	AWS post-event summary; InfoQ; ThousandEyes
availability achievable for inference spanning multiple active regions (e.g. Uber's 3-region inference posture)	99.99%	2025	12.3	Introl / Uber engineering synthesis
continuous replication bandwidth (~200 Gbps) to hold a 1-hour RPO on ~100 TB of training state across regions	~$50k/mo	2025	12.3	Introl DR analysis
large-load grid interconnection lead time — why failover capacity must be energized in advance, not acquired on the day	3-7+ yr	2025	12.3	ERCOT/PJM filings synthesis (provenance register)
training goodput: industry average vs best-in-class marketed (CoreWeave); the gap the contract prices	90% / ~96%	2025	12.4	SemiAnalysis ClusterMAX 2.0 / CoreWeave
GPU-cloud SLA baseline: node uptime / rack uptime, with penalties (ClusterMAX baseline)	99.9% / 99%	2025	12.4	SemiAnalysis ClusterMAX
hyperscaler compute SLA: multi-AZ region-level vs single-instance Monthly Uptime	99.99% / 99.5%	2026	12.4	Amazon EC2 / Compute SLA
reference service-credit ladder rungs (% of monthly bill) as uptime falls through bands	~10% / 25% / 100%	2026	12.4	Amazon EC2 / Compute SLA
Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min downtime/yr); Uptime now disavows the %	99.982% / 99.995%	2025	12.4	Uptime Institute Tier Standard
best-in-class H100 MTBF per 512 GPUs — the failure environment any cluster SLA is written against	~7 days	2025	12.4	SemiAnalysis (100k H100 clusters)
Llama 3 405B interruption rate (16,384 H100s, 54 days): 466 interruptions in total, 419 of them unplanned; 78% of the unplanned were hardware-caused	~1 / 3 hr	2024	12.4	Meta Llama 3 Herd of Models
reliability overhead as a share of cluster TCO — the cost of closing the goodput gap	6–21%	2025	12.4	SemiAnalysis ClusterMAX
failures per 1,000 node-days, Meta RSC-1 vs RSC-2 — the empirical λ that drives any cluster goodput model	6.50 vs 2.34	2024	12.5	Meta, Revisiting Reliability in Large-Scale ML Clusters (arXiv 2410.21680)
projected mean time between failures for a 16,384-GPU vs 131,072-GPU synchronous job	1.8 hr → 14 min	2024	12.5	Meta (arXiv 2410.21680); SemiAnalysis
modeled ETTR (goodput) for a 16k-GPU run moving from 60-min to 5-min checkpoint interval	0.70 → 0.93	2024	12.5	Meta, Revisiting Reliability (arXiv 2410.21680)
512+ GPU job failure rate after lemon-node ejection — a sensitivity result the model must reproduce	14% → 4%	2024	12.5	Meta, Revisiting Reliability (arXiv 2410.21680)
IEC 61508 beta-factor range for common-cause failure; ~10% the default if no diversity measures applied	0.5%–10%	2025	12.5	IEC 61508-6 Annex D; exida
annualized GPU failure rate feeding the per-node λ in fleet roll-up models	~9% AFR	2026	12.5	domain synthesis / Chapter 14.3 fleet data
Uptime Tier III / Tier IV availability targets (~1.6 hr vs ~26 min/yr) — the facility-model benchmark	99.982% / 99.995%	2025	12.5	Uptime Institute (Tier classes; % figures Uptime-disavowed)
industry-average vs best-in-class training goodput — the validation band any goodput model must land in	~90% / ~96%	2025	12.5	SemiAnalysis ClusterMAX / CoreWeave
the commissioning ladder: FAT → SAT → pre-functional → functional → Integrated Systems Test (IST)	L1–L5	2025	13.1	Construct & Commission; BMP MEP; CxPlanner
concurrent maintainability (any path serviceable, no load impact) vs fault tolerance (survive any single unplanned fault)	Tier III vs IV	2025	13.1	Uptime Institute Tier Standard
Tier III (~1.6 hr/yr) vs Tier IV (~26 min/yr) availability; ~20–40% capital premium for IV	99.982% / 99.995%	2025	13.1	Uptime Institute (% figures Uptime-disavowed)
ASHRAE commissioning-process / Basis-of-Design / data-center-specific Cx guidelines; Std 202 formalizes the Cx-Process	Gd 0 / 1.1 / 1.6	2025	13.1	ASHRAE; ACHR News
commissioning as a share of construction cost; CxAs now locked in 12–18 months ahead of energization	0.5–2%	2025	13.1	CxPlanner; iRecruit / industry practice
lost-revenue cost of delaying commissioning a 60 MW facility — the schedule pressure that tempts truncating L5	~$14.2M/mo	2025	13.1	Mastt / industry build-cost analyses
unplanned interruptions on a 16,384-GPU Llama 3 run (~1 every 3 hr); the day-2 reality a thin Cx program hands forward	419 / 54 days	2024	13.1	Meta (Llama 3 paper) / Tom's Hardware
ANSI/BICSI 002-2024 — the most comprehensive lifecycle design+implementation standard; 2024 ed. expanded liquid/immersion	~575 pp	2024	13.1	BICSI
of serious data-center outages involve human error — most trace to missing or unfollowed procedures (the case for the handover package)	~70-80%	2025	13.10	Uptime Institute Global Data Center Survey / Outage Analysis
revenue per GW of AI capacity per year — the clock that pressures teams to override the readiness gate (contested — single-source)	~$10-12B	2025	13.10	SemiAnalysis (onsite gas economics)
data-center load dropped in 82 s (VA, 2024); ~1,500 MW lost on a single fault — the swing go-live first exposes	~1.5 GW	2026	13.10	NERC Level 3 Alert / Utility Dive
NERC Level 2 Recommendation on large loads (commissioning + ramp coordination); Project 2026-02 Computational Loads under way	Sept 2025	2026	13.10	NERC Large Loads Action Plan / Utility Dive
industry-average vs best-in-class goodput — the acceptance floor the full-load stage must clear	~90% / ~96%	2025	13.10	SemiAnalysis ClusterMAX / CoreWeave
Tier III vs Tier IV availability — the redundancy that must hold at every point on the ramp, not just at the end	99.982% / 99.995%	2025	13.10	Uptime Institute Tier Classification
per GB200/GB300 NVL72 rack — the heat flux and power transient the cooling/smoothing stack must absorb at full load	120-142 kW	2026	13.10	SemiAnalysis / NVIDIA roadmap
MTBF per 512 GPUs at a mature operator — the failure cadence operations inherits the instant handover completes	~7 days	2025	13.10	SemiAnalysis (100k H100 clusters)
commissioning as share of total project cost; prevents multiples in rework/downtime	1–3%	2025	13.2	Industry Cx cost guidance (TrueLook / practitioner)
lead time operators now lock in commissioning agents ahead of energization	12–18 mo	2025	13.2	iRecruit / DC construction-trend reporting
default fabric BER acceptance threshold per port (InfiniBand ibdiagnet)	1e-12	2025	13.2	NVIDIA/Mellanox ibdiagnet manual
GPU node burn-in/soak duration gated before cluster acceptance	72–168 hr	2025	13.2	Together AI / Introl validation guides
goodput acceptance bar: industry-avg vs best-in-class effective training time	~90% / ~96%	2025	13.2	SemiAnalysis ClusterMAX / CoreWeave
CDU coolant inlet acceptance band; deviation can throttle GPUs up to ~50%	20–25 °C	2025	13.2	NVIDIA OCP / Introl (GB200 NVL72)
Tier III vs Tier IV availability the redundancy-topology scripts must demonstrate	99.982% / 99.995%	2025	13.2	Uptime Institute Tier classification
NVL72 heat split (liquid vs air) — the load a facility load bank cannot reproduce in the loop	~115 / ~17 kW	2025	13.2	NVIDIA OCP / Introl
ANSI/NETA Acceptance Testing Specifications — the current as-installed bar for switchgear, breakers, relays and primary injection	ATS-2025	2025	13.3	ANSI/NETA ATS-2025; NETA World Journal
data-center load lost on a single 230 kV fault — the synchronized ride-through failure NETA/Cx must now design against	~1,500 MW	2026	13.3	NERC Level 3 Alert / Utility Dive
peak grid-demand reduction from GB300 NVL72 power-shelf energy storage (capacitor smoothing) — an L4 acceptance criterion now, not a spec sheet curiosity	up to 30%	2025	13.3	NVIDIA Developer Blog; ServeTheHome
rack-level electrolytic-capacitance energy storage in GB300 NVL72 power shelves (≈half the PSU volume)	65 J/GPU	2025	13.3	NVIDIA Developer Blog / LITEON
Vera Rubin NVL72 rack-level storage — ~6x GB300 — with closed-loop state-of-charge control for fast transient smoothing	400 J/GPU	2026 (roadmap)	13.3	NVIDIA Vera Rubin POD blog
power-oversubscription headroom: training vs inference — the swing magnitude electrical acceptance must absorb	3% vs 21%	2025	13.3	Uptime Institute Journal
Rubin Ultra Kyber rack on 800 VDC — the density ramp the irreversible power substrate must accept	~600 kW	2027 (announced)	13.3	SemiAnalysis / NVIDIA roadmap; The Next Platform
lagging power factor a reactive load bank loads the chain to — proving generator/UPS at kVA rating, not just kW	0.8 PF	2025	13.3	Aggreko / CxPlanner commissioning practice
behind-the-meter gas announced by 2026 (~7 GW under construction) — the scale of the islanding problem	~82 GW	2026	13.4	Cleanview / SemiAnalysis
LM2500XPRESS aeroderivative unit rating and start time; black-start-capable, grid-independent	35 MW / 5 min	2025	13.4	GE Vernova / Crusoe (29-unit order)
aeroderivative gas-turbine lead time (refurb under 12 mo); the speed-to-power constraint behind islanding	18–36 mo+	2025	13.4	Data Center Frontier / Grid Capacity Intelligence
Vera Rubin rack-level energy storage for power smoothing (~6x prior gen); cuts peak current ~25%	~400 J/GPU	2025	13.4	NVIDIA developer blog
data-center load lost on a single 230 kV fault; 1.5 GW dropped in 82 s (VA, 2024) — triggered NERC Level 3 alert	~1,500 MW	2026	13.4	NERC Level 3 Alert / Utility Dive
microgrid-controller specification (2017) and conformance-test method (2018) — the Cx acceptance basis	IEEE 2030.7 / 2030.8	2017–2018	13.4	IEEE Standards
best-in-class cluster MTBF; a single power transient that drops a synchronous job restarts from checkpoint	~7 days / 512 GPUs	2025	13.4	SemiAnalysis (100k H100 clusters)
GB200 NVL72 coolant inlet spec; deviation can throttle GPUs up to ~50%	20–25 °C	2025	13.5	NVIDIA OCP / Introl
DLC flow per GB200 NVL72 rack (~1.2–2.0 L/min per kW design rule)	~80 L/min	2025	13.5	Dober / NVIDIA OCP
NVL72 CDU/row-level cooling capacity (per-rack heat is ~132 kW: ~115 kW liquid + ~17 kW air)	~2.4 MW	2025	13.5	NVIDIA OCP / Introl
secondary-loop conductivity limit the loop is flushed down to before coolant charge (DI water ≥0.5 MΩ·cm)	≤5 µS/cm	2026	13.5	Liquid-cooling commissioning practice (XD Thermal / Introl synthesis)
rated working pressure for hydrostatic acceptance hold (ASME B31.x / EN 13480 basis)	1.5×	2025	13.5	Liquid-cooling commissioning practice; ASME B31
install + commissioning per GB200 NVL72 system; load staged 25→50→75→100%	2–3 weeks	2026	13.5	Introl GB200 NVL72 deployment
single-phase direct-to-chip share of the liquid-cooling market (the loop you are commissioning)	~55%	2026	13.5	DCD / IDTechEx
best-in-class training goodput the loop must protect; a cooling trip is lost goodput	~96%	2025	13.5	SemiAnalysis ClusterMAX / CoreWeave
GB300 NVL72 in-shelf energy storage for power smoothing; ~30% peak-grid reduction on Megatron training	65 J/GPU	2025	13.6	NVIDIA Developer (GB300 steady power)
Vera Rubin power-smoothing reservoir target; facility BESS roles for transient/ride-through/DR	~400 J/GPU	2025	13.6	NVIDIA (production-ready BESS for AI factories)
single-event large-load loss on a 230 kV fault; 1.5 GW dropped in 82 s (VA, 2024) — the ride-through problem IST must prove against	~1,500 MW	2026	13.6	NERC Level 3 Alert / Utility Dive
GB200/GB300 NVL72 coolant inlet window; deviation throttles GPUs up to ~50% — the thermal ride-through envelope	20-25 °C	2025	13.6	NVIDIA OCP / Introl
power-oversubscription headroom training vs inference — why transient behavior differs by workload IST cannot run	3% vs 21%	2025	13.6	Uptime Institute Journal
single-phase direct-to-chip share of liquid-cooling market — the loop IST load banks cannot exercise at real heat flux	~55%	2026	13.6	DCD / IDTechEx
typical IST planning horizon before a full-facility Level 5 campaign	weeks-to-months	2025	13.6	Construct & Commission (L5 IST guide)
post-FEC BER pass floor for AI fabric links (tightening toward 1e-13 at the highest lane rates)	~1e-12	2025	13.7	IEEE 802.3 / IBTA link specifications; practitioner acceptance plans
PAM4 SerDes per-lane rate driving 800G/1.6T links — FEC-mandatory, BER-screening-critical	100-200 Gb/s	2025	13.7	SemiAnalysis (AI networks); provenance.js optics ladder
minimum link-flap soak under line-rate load at operating temperature before a link is accepted	≥ 24 h	2025	13.7	Practitioner fabric-commissioning practice; Keysight test methodology
InfiniBand point-to-point latency; tuned RoCEv2 ~1.5-2.5 us — the acceptance band for ib_*_lat	~1-2 us	2025	13.7	SemiAnalysis / NVIDIA; provenance.js IB-vs-RoCE
PTP accuracy held across a Spectrum switch; ConnectX-class NIC timestamping under ~4 ns variance	~10 ns	2025	13.7	NVIDIA Technical Blog, Spectrum switch time-sync
fleet PTP offset-from-master target the time-sync gate must demonstrate, under load and across every node	sub-us	2024	13.7	Engineering at Meta (SPTP); IEEE 1588 practice
effective throughput a well-tuned AI Ethernet fabric (Spectrum-X) sustains — the congestion-gate target	~95%	2025	13.7	NVIDIA (Spectrum-X xAI Colossus)
NVLink aggregate per GB200 NVL72 rack the scale-up gate verifies whole (1.8 TB/s/GPU, NVLink 5)	~130 TB/s	2025	13.7	NVIDIA; provenance.js NVLink
GPU node burn-in / soak window (3-day minimum to 7-day strict acceptance)	72–168 hr	2025	13.8	Together AI seven-phase guide; Introl validation frameworks; ClusterMAX 2.0
bring-up burn-in period before a new cluster's failure rate decays toward the mature baseline	3–4 weeks	2025	13.8	SemiAnalysis (100k H100 clusters)
mature best-in-class H100 MTBF; freshly-racked clusters fail far more often	~7 days / 512 GPUs	2025	13.8	SemiAnalysis (100k H100 clusters)
unplanned interruptions on 16,384 H100s during Llama 3 405B — ~1 every 3 hr	419 in 54 days	2024	13.8	Meta (Llama 3 paper) / Tom's Hardware
Llama 3 interruptions attributed to faulty GPU and to HBM3 — together >½ of hardware faults	~30% + ~17%	2024	13.8	Meta (Llama 3 paper) / DataCenterDynamics
machines affected by silent data corruption (SDC) at fleet scale	~1 in 1,000	2025	13.8	Meta Engineering (How Meta keeps its AI hardware reliable)
expected SDC events during a large-scale training run (Meta; Google reports similar for Gemini)	every 1–2 weeks	2025–2026	13.8	Meta Engineering; IEEE / arXiv SDC studies
DCGM -r 4 (deep, incl. memtest + EUD) runtime per node, GPU-count dependent	~1.5 hr	2026	13.8	NVIDIA DCGM Diagnostics documentation
in-domain all-reduce busbw on GB200 NVL72 (vs 900 GB/s/GPU NVLink5 ceiling); scale-out gate set as % of this	870-928 GB/s	2025	13.9	NCCL tests on GB200 NVL72 (Crusoe / Nebius / NVIDIA tuning guide)
checkpoint state to size the write path; keep checkpoint stall <10% of step time	~14 bytes/param	2025	13.9	VAST Data checkpoint survey (85k+ checkpoints)
failure cadence in a 100k-accelerator cluster at full utilization — why checkpoint bandwidth is an acceptance gate	~every 30 min	2025	13.9	MLCommons MLPerf Storage v2.0
aggregate storage bandwidth per ~1,024 GPUs (write ≥ ½ read design rule)	250-400 GB/s	2025	13.9	NVIDIA DGX SuperPOD reference architecture
industry-average vs best-in-class measured training goodput — the number the SLA is set against	~90% / ~96%	2025	13.9	SemiAnalysis ClusterMAX 2.0 / CoreWeave
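
Several rows above pair a headline value with a derived figure: availability with downtime per year, interruption count with cadence, payload with exfiltration time, MTBF with job size. The sketch below is one way to sanity-check that arithmetic. It is illustrative only. The function names are hypothetical, the inputs are copied from the rows, and the MTBF scaling assumes failures grow linearly with GPU count, which is a simplification rather than the cited papers' full model.

```python
"""Illustrative cross-checks for derived figures in the register."""

MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def hours_between_events(events: int, days: float) -> float:
    """Mean spacing of events over an observation window."""
    return days * 24 / events


def scaled_mtbf_hours(base_mtbf_hr: float, base_gpus: int, target_gpus: int) -> float:
    """MTBF of a larger synchronous job.

    Assumption: failure rate is proportional to GPU count, so MTBF
    shrinks by the ratio base_gpus / target_gpus.
    """
    return base_mtbf_hr * base_gpus / target_gpus


def days_to_exfiltrate(payload_gb: float, cap_gb_per_day: float) -> float:
    """Days needed to move a payload under a fixed daily egress cap."""
    return payload_gb / cap_gb_per_day


def replication_gbps(state_tb: float, rpo_hours: float) -> float:
    """Sustained bandwidth to ship the full state once per RPO window.

    Assumption: the whole state is re-sent each window (no deltas),
    using decimal units (1 TB = 1e12 bytes).
    """
    bits = state_tb * 1e12 * 8
    return bits / (rpo_hours * 3600) / 1e9


if __name__ == "__main__":
    tier3 = downtime_minutes_per_year(99.982)
    tier4 = downtime_minutes_per_year(99.995)
    print(f"Tier III downtime: {tier3 / 60:.2f} hr/yr")
    print(f"Tier IV downtime:  {tier4:.1f} min/yr")
    print(f"Tier IV saves:     {tier3 - tier4:.0f} min/yr")
    print(f"466 interruptions / 54 days: one per {hours_between_events(466, 54):.2f} hr")
    print(f"419 interruptions / 54 days: one per {hours_between_events(419, 54):.2f} hr")
    mtbf_min = scaled_mtbf_hours(1.8, 16_384, 131_072) * 60
    print(f"131,072-GPU job MTBF: {mtbf_min:.1f} min")
    print(f"1,000 GB at 800 GB/day: {days_to_exfiltrate(1000, 800):.2f} days")
    print(f"100 TB per 1-hour RPO: {replication_gbps(100, 1):.0f} Gbps")
    print(f"NVL72 NVLink aggregate: {72 * 1.8:.1f} TB/s")
```

Under these assumptions the results land close to the register's rounded entries: about 1.6 hr versus 26 min of annual downtime for the two tiers, roughly one interruption every 3 hr, about 14 min MTBF at 131,072 GPUs, 1.25 days to move 1,000 GB, and a little over 200 Gbps for the replication row.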