The Definitive Guide to AI Data Centers
Numbers Provenance Register

Every date-stamped figure in the guide — 1,420 entries, sourced and flagged where contested.

1,420 matches · showing 400
Metric | Value | As of | Where | Source
rack power across the inflection: legacy → GB200 NVL72 (~132 kW) → Rubin Ultra Kyber (~600 kW, 2027 roadmap) | ~10–15 kW → 120–600 kW | 2026 | 0.1 | SemiAnalysis / NVIDIA roadmap
practical air-cooling ceiling per rack — the discontinuity that forces liquid and rewrites the building | ~41 kW | 2025 | 0.1 | ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80–90% of draw at large operators | ~2/3 | 2026 | 0.1 | Deloitte TMT Predictions 2026; McKinsey
US large-load grid interconnection lead time end-to-end; up to ~10 yr in the worst queues — the binding constraint | ~3–7+ yr | 2025 | 0.1 | ERCOT / PJM filings synthesis
HV/substation power transformer lead time (standard); up to ~60 months in constrained markets — often the schedule's long pole | ~128 wk | 2025 | 0.1 | Wood Mackenzie / pv magazine
global data center capex in 2026 (~21% CAGR through 2029; GPUs ~1/3 of capex) | approaching ~$1T | 2026 | 0.1 | Dell'Oro Group
cumulative global data center capex by 2030 (~$5.2T AI-capable) — the scale that makes mis-coordination catastrophic | ~$6.7T | 2025 | 0.1 | McKinsey, 'The cost of compute'
end-to-end electrical-chain efficiency, 800VDC/DC chain vs legacy AC (utility-to-VRM) — a system gain only co-design captures | >92% vs ~61–87.5% | 2025 | 0.1 | SemiAnalysis, Datacenter Anatomy Pt 1
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023) — a fast-moving figure read as direction, not a fixed level | ~2/3 | 2026 | 0.2 | Deloitte TMT Predictions 2026
global data center capex 2026, approaching — volatile market figure; analyst estimates differ by capex-scope definition | ~$1T | 2026 | 0.2 | Dell'Oro Group
per GB200 NVL72 rack (shipping, ~115 kW liquid + ~17 kW air) — a semi-durable hardware spec you can design against | ~132 kW | 2025 | 0.2 | NVIDIA OCP / Introl
per Rubin Ultra Kyber-class rack — marked roadmap/announced, not shipping; do not budget as a level | ~600 kW | 2027 (announced) | 0.2 | SemiAnalysis / NVIDIA roadmap
practical air-cooling ceiling per rack — a durable physics number, safe to treat as a hard constraint | ~41 kW | 2025 | 0.2 | ASHRAE TC 9.9 / SemiAnalysis
large-load grid interconnection lead time — volatile and region-dependent; up to ~10 yr in worst queues | 3–7+ yr | 2025 | 0.2 | ERCOT / PJM filings synthesis
GPU economic vs book life — flagged CONTESTED; run irreversible decisions across the range, not a point estimate | 2–3 yr vs 5–6 yr | 2026 | 0.2 | CNBC / SemiAnalysis synthesis
best-in-class vs industry-average training goodput — a GOODPUT-thread target, vendor-marketed upper bound | ~96% vs ~90% | 2025 | 0.2 | SemiAnalysis ClusterMAX / CoreWeave
industry-weighted average PUE, flat for a 6th year; best-in-class liquid 1.05-1.15 | ~1.54 | 2025 | 0.3 | Uptime Institute Global Data Center Survey 2025
WUE range: industry avg ~1.8-1.9; best-in-class 0.3-0.7; closed-loop ~0 | ~0-1.9 L/kWh | 2025 | 0.3 | Vertiv / NREL synthesis; Microsoft FY2025 fleet ~0.30
goodput (effective training time): industry average vs best-in-class | ~90% / ~96% | 2025 | 0.3 | SemiAnalysis ClusterMAX 2.0 / CoreWeave
scale-up (NVLink) domain size: HGX node, NVL72 rack, announced Rubin Ultra Kyber | 8 - 72 - 576 | 2026 | 0.3 | NVIDIA NVLink / Rubin platform roadmap
scale-up (NVLink5/GPU) vs scale-out (per-NIC) bandwidth — roughly 18x apart | ~1.8 TB/s vs ~400 Gb/s | 2025 | 0.3 | NVIDIA / SemiAnalysis
self-operated TCO at 2048-GPU scale, 90% util; ~$1.03-3.50 rented (contested — single-source) | ~$0.74/GPU-hr | 2025-2026 | 0.3 | SemiAnalysis H100 cost / rental analyses
inference cost per million tokens: self-hosted 70B worked example vs market average | ~$1.90-2.50 | 2025 | 0.3 | Introl / SemiAnalysis synthesis
Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min downtime/yr) | 99.982% / 99.995% | 2025 | 0.3 | Uptime Institute (figures Uptime no longer formally endorses)
Uptime: concurrent maintainability vs fault tolerance; legacy ~99.982% (~1.6 h/yr) vs ~99.995% (~26 min/yr), now Uptime-disavowed | Tier III / IV | 2025 | 0.4 | Uptime Institute Tier Standard
TIA-942-C resilience scale; full-facility telecom + M&E standard, May 2024 (C) revision | Rated 1–4 | 2024 | 0.4 | ANSI/TIA-942-C
EN 50600 / ISO/IEC 22237 Availability Classes (+ Protection Classes); basis of the EU DC sustainability scheme | Class 1–4 | 2024 | 0.4 | CEN / ISO/IEC JTC 1
ASHRAE TC 9.9 air classes and liquid W-classes (5th ed. + 2024 liquid-cooling resiliency addendum) | A1–A4 / W17–W45 | 2024 | 0.4 | ASHRAE TC 9.9 Thermal Guidelines
OCP Diablo 400 (Mt. Diablo) sidecar-power spec; ±400/800 VDC, ~100 kW to ~1 MW racks | v0.5.2 | May 2025 | 0.4 | OCP (Google/Meta/Microsoft)
FedRAMP 20x Key Security Indicators replacing 325+ NIST 800-53 controls; Phase 3 opens to all Q3 2026 | 56–61 KSIs | 2026 | 0.4 | FedRAMP PMO (RFC-0006)
ISO/IEC 42001 (first AI management-system standard) from publication to operationalized certification bodies | 2023 → 2026 | 2026 | 0.4 | ISO/IEC; ANAB/BSI accreditation
industry-weighted PUE (flat YoY) — the ISO/IEC 30134-2 KPI that lands in leases and disclosures | ~1.54 | 2025 | 0.4 | Uptime Institute Global DC Survey 2025
published Tier III (~1.6 hr/yr down) vs Tier IV (~26 min/yr) availability — Uptime no longer endorses the specific % | 99.982% / 99.995% | 2025 | 0.5 | Uptime Institute Tier Standard
Tier IV capital premium over Tier III for the fault-tolerance step; total build often 2-3x in practice | 20-40% | 2026 | 0.5 | Uptime Institute; INGENIOUS.BUILD; market data
of impactful data-center outages root-caused to power (most often UPS); IT/networking ~23% | 45% | 2025 | 0.5 | Uptime Institute Annual Outage Analysis
of recent major outages cost over $100k / over $1M respectively | ~57% / ~20% | 2025 | 0.5 | Uptime Institute Global Survey
of human-error outages caused by staff not following procedures (up 10 pts YoY) — process, not topology | 58% | 2025 | 0.5 | Uptime Institute Annual Outage Analysis
best-in-class H100 cluster failure rate; one failure restarts a synchronous job from checkpoint | ~1 failure / 512 GPUs / week | 2025 | 0.5 | SemiAnalysis (100k H100 clusters)
training goodput: industry average vs best-in-class; reliability overhead 6-21% of TCO | ~90% / ~96% | 2025 | 0.5 | SemiAnalysis ClusterMAX / CoreWeave
data-center load tripped on a single 230 kV fault, triggering a rare NERC Level 3 alert — a grid-scale blast radius | ~1,500 MW | 2026 | 0.5 | NERC / Utility Dive
per GB200 NVL72 rack (≈132 kW typical: ~115 kW liquid + ~17 kW air) | 120–140 kW | 2025 | 1.1 | NVIDIA GB200 NVL72 / HPE & Supermicro datasheets
per Rubin Ultra Kyber NVL576 rack on 800 VDC | ~600 kW | H2 2027 (announced) | 1.1 | NVIDIA GTC (Jensen Huang); DCD, Tom's Hardware
practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 200+ kW | ~41 kW | 2025 | 1.1 | ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80–90% of draw at large operators | ~2/3 | 2026 | 1.1 | Deloitte TMT Predictions 2026; McKinsey
active generation + storage in US interconnection queues (end-2024; ~twice US installed capacity); large-load waits 4–7 yr in top hubs | ~2,290 GW | end-2024 | 1.1 | LBNL, Queued Up 2025 Edition
all-in cost per 8-GPU H100 server (excl. storage); ~$31k/GPU/yr enterprise all-in | $283–318k | 2025 | 1.1 | SemiAnalysis AI Neocloud Playbook
TCO at 2048-GPU scale, 90% utilization; ~$1.03 small clusters; cloud H100 ~$1.49 (contested — single-source) | ~$0.74/GPU-hr | 2025 | 1.1 | SemiAnalysis H100 cost/rental analyses
accelerated economic life vs 5–6 yr book life; used GPUs retain ~20–40% residual after 3 yr | 2–3 yr | 2025 | 1.1 | Goldman Sachs; CNBC/secondary-market analyses
per dense training rack (GB200 NVL72 ~120–132 kW; GB300 ~142 kW) | 120–142 kW | 2025 | 1.2 | NVIDIA OCP / SemiAnalysis / Introl
per Rubin Ultra Kyber NVL576 rack on 800 VDC (announced roadmap) | ~600 kW | 2027 (announced) | 1.2 | NVIDIA GTC; SemiAnalysis 800 VDC
practical air-cooling ceiling/rack; RDHx ~50–100 kW; DLC 200+ kW | ~41 kW | 2025 | 1.2 | ASHRAE TC 9.9 / SemiAnalysis
GB200 NVL72 DLC inlet & flow; deviation throttles GPUs up to ~50% | 20–25 °C / ~80 L/min | 2025 | 1.2 | NVIDIA OCP / Introl
training back-end fabric non-blocking; 2:1 'optimized' cuts back-end cost ~31% (contested — single-source) | 1:1 vs 2:1 | 2025 | 1.2 | SemiAnalysis AI Neocloud Playbook
NVLink5 per-GPU BW (1.8 TB/s) vs ~400G scale-out NIC — keep collectives in scale-up | ~18x | 2025 | 1.2 | NVIDIA / SemiAnalysis
unplanned interruptions on 16,384 H100s (~1 / 3 hr); 78% hardware-caused | 419 / 54 days | 2024 | 1.2 | Meta Llama 3 paper (Table 5)
best-in-class mature H100 cluster MTBF; one failure restarts a synchronous job | ~7 days / 512 GPUs | 2025 | 1.2 | SemiAnalysis 100k-H100 clusters
training goodput: industry average / best-in-class; reliability overhead 6–21% of TCO | ~90% / ~96% | 2025 | 1.2 | SemiAnalysis ClusterMAX / CoreWeave
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80-90% of draw at large operators | ~2/3 | 2026 | 1.3 | Deloitte TMT Predictions 2026
AI inference capacity to 2030 (~35% CAGR) vs training 23.1 → 62.2 GW (~22%) | 20.9 → 93.3 GW | 2026 | 1.3 | McKinsey, 'The next big shifts in AI workloads'
market for inference-optimized chips in 2026; most inference stays in data centers, not at the edge | >$50B | 2026 | 1.3 | Deloitte TMT Predictions 2026
power-oversubscription headroom: inference (uncorrelated per-request peaks) vs training (synchronous peaks) | ~21% vs ~3% | 2026 | 1.3 | Uptime Institute Journal; arXiv power-profile studies
inference fabric oversubscription (vs 1:1 non-blocking for training); 2:1 cuts back-end cost ~31% (contested — single-source) | 2:1-3:1 | 2025 | 1.3 | SemiAnalysis AI Neocloud Playbook; Juniper
HBM3E per Ironwood TPU v7 (inference-era ASIC); 9,216-chip pods, 42.5 FP8 ExaFLOPS, 4,614 FP8 TFLOPS/chip | 192 GiB / 7.4 TB/s | 2025 | 1.3 | Google Cloud; SemiAnalysis
self-hosted vs market-avg inference cost per million tokens; ~10x/yr token-price deflation (LLMflation) | ~$1.90 → ~$2.50/M tok | 2025 | 1.3 | Introl / NVIDIA synthesis; a16z
inference uptime target (99.995%) vs training's checkpoint-tolerant N/N+1 posture | Tier IV ~26 min/yr | 2025 | 1.3 | Uptime Institute (Tier classes)
of wall-clock spent on rollout generation in agentic/reasoning RL post-training | ~80% | 2026 | 1.4 | 2025–2026 RL-systems papers (ROLL Flash, ROLLART) & Introl RLHF infra report
of compute consumed by rollouts at 16K-token generation length (RLVR long-CoT) | ~70% | 2025 | 1.4 | RLVR / long-CoT RL-systems analyses (arXiv)
tokens per RL trajectory for reasoning/agentic tasks — the rollout that dominates cost | 10K–100K+ | 2026 | 1.4 | domain-research keyNumbers; reasoning-model RL reports
wall-clock speedup of variance-controlled async RL vs synchronous at equal accuracy (~42h vs ~105h) | 2.5x | 2026 | 1.4 | Stable Asynchrony / VCPO (arXiv 2602.17616)
just to hold weights for a 70B PPO-RLHF stack (actor + reference + reward + critic), pre-optimizer | 8–16 GPUs | 2025 | 1.4 | Introl RLHF infrastructure report
QLoRA fine-tune on a single 48 GB GPU; memory cut from >780 GB to <48 GB without quality loss | 65B on 48 GB | 2023 | 1.4 | QLoRA (Dettmers et al., arXiv 2305.14314)
share of parameters trained by a LoRA adapter vs full fine-tune (model-dependent) | ~0.1% | 2026 | 1.4 | LoRA (Hu et al.) / 2026 PEFT practitioner guides
GPU:CPU norm rebalancing toward more CPU per node as agentic RL adds rollout/tool/env load | from 8:1 | 2026 | 1.4 | domain-research (System Composition); SemiAnalysis
one-way fiber latency from distance alone (~5 ms per 1,000 km); ~1.64 ms RT per 100 mi before any processing | ~0.82 ms / 100 mi | 2025 | 1.5 | M2 Optics fiber-latency analysis (≈2/3 c in glass)
MEC round-trip at the access edge; under ~50 ms from a regional 5G URLLC breakout | sub-10 ms | 2025 | 1.5 | ETSI ISG MEC; arXiv 2504.03708 (telco-LLM latency)
perceptibility thresholds: hard real-time / interactive (AR-VR, agentic) / 'instant' conversational | ~30 / 50 / 100 ms | 2026 | 1.5 | Spheron hybrid edge guide; AR/VR latency literature
edge data center market, 2026 to 2033, ~14.9% CAGR; AI/ML inference the fastest-growing segment | ~$40B → ~$106B | 2026 | 1.5 | Grand View Research; Coherent Market Insights
micro data centers' share of the edge market (global 2025) / of US edge by 2026 | ~35% / ~54% | 2026 | 1.5 | Grand View Research; Coherent Market Insights (US)
inference share of AI compute in 2026 (½ in 2025); the growth pool the edge competes for | ~2/3 | 2026 | 1.5 | Deloitte TMT Predictions 2026
edge-site deploy time and install-time reduction under zero-touch provisioning (Vapor IO; ZTP fleet tooling) | ~1 hr / 90%+ | 2026 | 1.5 | Vapor IO; Scale Computing / VMware VCF Edge
practical power envelope per edge micro-site (vs ~132 kW for a centralized NVL72 rack) | a few kW – ~30 kW | 2026 | 1.5 | research/domain-research.json; practitioner ranges
time-to-power: greenfield self-build vs wholesale colo (live 50k+ GPU cluster) vs neocloud | 24–36 mo / 6–12 mo / days–weeks | 2026 | 1.6 | SemiAnalysis; JLL 2026 Outlook; Introl
brownfield retrofit cost: cooling-only vs full AI retrofit; ~2/3 of pre-2015 DCs unsuitable for frontier density | $2–3M / $5–10M per MW | 2025 | 1.6 | Introl / Tetra Tech / Schneider synthesis
global wholesale colo average 2025 (record); ~$120 Atlanta to ~$450 Singapore; ~1% vacancy | ~$217/kW-month | 2025 | 1.6 | JLL / CBRE synthesis
self-build TCO at 2,048-GPU scale, 90% utilization (~$1.03 small clusters) vs neocloud median ~$2.3–3.5/hr (contested — single-source) | ~$0.74/GPU-hr | 2025 | 1.6 | SemiAnalysis cost / H100 rental analyses
neocloud GPU rental vs hyperscaler pricing (8-GPU node ~$34/hr neocloud vs ~$98/hr hyperscaler) | 40–85% below | 2026 | 1.6 | SemiAnalysis H100 Index / AM Compute
rise in the 1-year H100 rental contract index, Oct 2025 to Mar 2026, as capacity tightened; on-demand largely sold out | ~+40% | 2026 | 1.6 | SemiAnalysis H100 Rental Index
breakeven utilization for a debt-financed cluster; swings -$330k to +$340k/mo (55% vs 85%) on a 1,024-GPU H100 build (contested — single-source) | ~70% | 2025 | 1.6 | AM Compute / McKinsey
US large-load grid interconnection lead time end-to-end; up to ~10 yr in worst queues — the gate behind self-build | ~3–7+ yr | 2026 | 1.6 | LBNL Queued Up; ERCOT / PJM filings
practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 100–200 kW+ | ~41 kW | 2025 | 1.7 | ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
per GB200 NVL72 rack (~115 kW liquid + ~17 kW air); GB300 ~142 kW; Rubin Ultra Kyber ~600 kW | 120–132 kW | 2026 | 1.7 | NVIDIA OCP / SemiAnalysis roadmap
GB200 NVL72 DLC inlet & flow; deviation can throttle GPUs up to ~50% | 20–25 °C / ~80 L/min | 2025 | 1.7 | NVIDIA OCP / Introl
training non-blocking vs inference oversubscribed; 2:1 cuts back-end cost ~31% (contested — single-source); Meta ran 7:1 on 24k H100 | 1:1 vs 2:1–3:1 | 2025 | 1.7 | SemiAnalysis AI Neocloud Playbook / Meta
GPU:CPU ratio shifting from training-era norm toward agentic-inference host demand | ~8:1 → 4–8:1 | 2026 | 1.7 | TrendForce Insights; Introl
full AI liquid retrofit cost crossing the cooling cliff; still strands capacity | ~$5–10M/MW | 2026 | 1.7 | Introl / Vera Rubin deployment analysis
~1.6 hr/yr vs ~26 min/yr downtime; Tier IV ~20–40% capital premium | Tier III 99.982% / Tier IV 99.995% | 2025 | 1.7 | Uptime Institute
goodput (effective training time): industry avg vs best-in-class; reliability overhead 6–21% of TCO | ~90% / ~96% | 2025 | 1.7 | SemiAnalysis ClusterMAX / CoreWeave
1 GW AI data center: total-program capex (core stack ~$27.9/W plus land, build-out, financing) and all-in annual TCO (~$8.5M/MW-yr) | ~$38B / ~$8.5B/yr | 2026 | 1.8 | Epoch AI, AI datacenter cost breakdown
1 GW annual TCO at 3-yr / 5-yr / 7-yr IT useful life — the dominant lever | $12B / $8.5B / $7B | 2025 | 1.8 | Epoch AI / AM Compute synthesis
self-operated TCO at 2048-GPU scale, 90% util; ~$1.03 small clusters (contested — single-source) | ~$0.74/GPU-hr | 2025 | 1.8 | SemiAnalysis, GPU cluster cost
breakeven utilization (debt-financed); 1,024-GPU cluster swings -$330k to +$340k/mo (contested — single-source) | ~70% | 2025 | 1.8 | AM Compute / McKinsey
LLMflation: inference cost decline at fixed quality (Epoch: ~50x/yr median) | ~10x/yr | 2024-2026 | 1.8 | a16z; Epoch AI
AI-app gross margin vs 70-90% for mature SaaS | ~41% to ~52% | 2026 | 1.8 | ICONIQ State of AI 2026; Bessemer
wholesale colo global avg 2025; BTS/CTL ~$150-220/kW-mo over 15 yr | ~$217/kW-mo | 2025 | 1.8 | JLL / CBRE synthesis
estimated understated AI D&A 2026-2028 (CONTESTED); industry AI D&A ~$400B/yr | ~$176B | 2026 | 1.8 | Burry / secondary analyses; filings
AI/HPC scheduler share: Slurm / Kubernetes / in-house (rule of thumb) | ~70% / ~20% / ~10% | 2026 | 10.1 | HPCwire, ‘Slurm vs Kubernetes in the Age of AI’; ClusterMAX
GPU-cloud customers using K8s for inference vs Slurm for training | ~90% K8s / ~50% Slurm | 2025 | 10.1 | SemiAnalysis ClusterMAX
Kubernetes Dynamic Resource Allocation graduated to stable (Sept 2025) | GA in 1.34 | 2025 | 10.1 | Kubernetes blog, ‘v1.34: DRA has graduated to GA’
reported GPU utilization, device plugins vs DRA (better packing/sharing) | 45–60% → 70–85% | 2026 | 10.1 | Red Hat / vendor DRA analyses
goodput (effective-training-time): industry avg vs best-in-class | ~90% / ~96% | 2025 | 10.1 | SemiAnalysis ClusterMAX / CoreWeave
failure interval for a 16k-GPU cluster at ~80,000-hr per-GPU MTBF | ~every 3 hr | 2025 | 10.1 | Meta Llama 3 / domain reliability math
demonstrated scale of Slurm-on-Kubernetes (Slinky) side-by-side workloads | 8,000+ GPUs | 2025 | 10.1 | NVIDIA Developer, Slinky / slurm-bridge blog
NVIDIA open-sourced KAI Scheduler (gang, fair-share, DRA, topology) | Apache-2.0, Apr 2025 | 2025 | 10.1 | NVIDIA / KAI-Scheduler GitHub
Bartz v. Anthropic settlement — largest US copyright payout; ~$3,000 per work across ~500,000 works; pirated copies ordered destroyed | $1.5B | 2025 | 10.10 | Bartz v. Anthropic (N.D. Cal.); Authors Guild; Fortune
Italian Garante fine on OpenAI for training ChatGPT without adequate legal basis + transparency failures; plus a 6-month awareness campaign | €15M | Dec 2024 | 10.10 | Garante per la protezione dei dati personali
EU AI Act GPAI obligations apply to new models; training-content summary template (Commission) mandatory | Aug 2, 2025 | 2025 | 10.10 | European Commission; EU AI Act
Deadline for pre-existing (placed before Aug 2025) GPAI models to publish their training-content summary | Aug 2, 2027 | 2025 | 10.10 | European Commission; Mayer Brown analysis
ChatGPT conversation logs OpenAI was ordered to produce in NYT v. OpenAI discovery — output logs deemed relevant to fair-use defense | 20M logs | 2025 | 10.10 | NYT v. OpenAI (S.D.N.Y.); Bloomberg Law
EDPB legitimate-interest assessment for AI training (interest, necessity, balancing); high bar to claim a trained model is anonymous | 3-step test | Dec 2024 | 10.10 | EDPB Opinion 28/2024
UK High Court: AI model weights are not an infringing 'copy' under the CDPA (statistical parameters, not stored images) | weights ≠ copy | Nov 2025 | 10.10 | Getty Images v. Stability AI (UK High Court)
EU DSM TDM exception is opt-out by default — crawlers must honor machine-readable reservations (robots.txt / TDM Reservation Protocol) | opt-out | 2025 | 10.10 | EU DSM Directive 2019/790, Art. 4; AI Act Code of Practice
TPOT roughly matching human reading speed (~20-25 tok/s); common interactive decode target | ~40-50 ms | 2025 | 10.11 | Practitioner consensus; NVIDIA / vLLM serving guides
inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80-90% of draw at large operators | 2/3 (~66%) | 2026 | 10.11 | Deloitte TMT Predictions 2026; McKinsey
market-average self-hosted inference cost, fell ~$10→~$2.50 in a year; worked example ~$1.90/M (8xH100, Llama-70B FP16) | ~$2.50/M tok | 2025 | 10.11 | Introl / NVIDIA synthesis (via provenance)
goodput gain from hybrid aggregation/disaggregation over SOTA when both TTFT and TPOT bind | up to ~77% | 2025 | 10.11 | TaiChi (arXiv 2508.01989); see also FlowKV/HexGen-2
inference back-end fabric oversubscription (training is 1:1 non-blocking); 2:1 cuts back-end cost ~31% (contested — single-source) | 2:1-3:1 | 2025 | 10.11 | SemiAnalysis AI Neocloud Playbook
best-in-class vs industry-average goodput (training framing); reliability overhead 6-21% of TCO | ~96% / ~90% | 2025 | 10.11 | SemiAnalysis ClusterMAX 2.0 / CoreWeave
runtime-reconfigurable disaggregation: x prefill workers feeding y decode workers, re-balanced live | xPyD | 2026 | 10.11 | NVIDIA Dynamo / TensorRT-LLM disaggregated serving docs
GB200 NVL72 coherent NVLink domain — the rack-scale block the scheduler treats as atomic | 72 GPUs / 130 TB/s | 2025 | 10.2 | NVIDIA GB200 NVL72 / NVLink product page
NVLink 5 per-GPU bidirectional bandwidth (scale-up); ~3.6 TB/s on Rubin (roadmap) | ~1.8 TB/s | 2026 | 10.2 | NVIDIA NVLink
scale-up (NVLink) vs scale-out (~400G NIC) per-GPU bandwidth — the cliff the scheduler defends | ~5–18x | 2025 | 10.2 | NVIDIA / SemiAnalysis
Slurm block size for one NVL72 NVLink domain in topology.yaml (topology/block plugin) | 18 nodes | 2025 | 10.2 | NVIDIA Developer — Slurm block scheduling on GB200 NVL72
max IMEX channels (ComputeDomains) per node in Kubernetes DRA — strands partial-node GPUs | 1 / node | 2025 | 10.2 | NVIDIA Developer — MNNVL on Kubernetes
minimum Kubernetes with DRA APIs enabled for ComputeDomains; GPU Operator 25.3+ | K8s 1.32+ | 2025 | 10.2 | NVIDIA / AWS EKS GB200 guidance
share of Llama-3 training job interruptions traced to network/config issues — the cost of getting topology wrong | ~10.7% | 2024 | 10.2 | Meta (via Introl topology analysis)
training goodput, industry average vs best-in-class — topology-aware placement is a lever on the gap | ~90% → ~96% | 2025 | 10.2 | SemiAnalysis ClusterMAX / CoreWeave
MIG instances per GPU (B200/GB200): 2×~93GB, 4×~46GB, or 7×~23GB profiles | up to 7 | 2025 | 10.3 | NVIDIA MIG User Guide (r580); MIG supported-profiles docs
HBM per Blackwell GPU available to partition across tenants (B200/GB200 class) | 180–192 GB | 2026 | 10.3 | NVIDIA Blackwell datasheets; provenance.js HBM trajectory
NVIDIAScape (CVE-2025-23266) Container Toolkit escape — container-to-host on shared GPU nodes | CVSS 9.0 | 2025 | 10.3 | Wiz Research; NVIDIA security bulletin
CVE-2025-23290 — first acknowledged cross-VM GPU-metric leak via vGPU Manager (co-tenant side channel) | CVSS 2.5 | 2025 | 10.3 | NVIDIA security bulletin; Tenable
Slurm vs Kubernetes share of AI clusters — the two quota/fairness enforcement planes operators must master | ~70% / ~20% | 2026 | 10.3 | HPCwire, 'Slurm vs Kubernetes in the Age of AI'
inference share of AI compute in 2026 — the workload class that most rewards fractional/MIG sharing | ~2/3 | 2026 | 10.3 | Deloitte TMT Predictions 2026; McKinsey
accelerated GPU economic life — the depreciation clock that makes reclaiming idle silicon urgent | 2–3 yr | 2025 | 10.3 | Goldman Sachs; secondary-market analyses
current CUDA Toolkit (released May 2026); paired with R580 LTSB data-center driver | CUDA 13.3 | mid-2026 | 10.4 | NVIDIA CUDA Toolkit Release Notes; Data Center Driver docs
current NVIDIA data-center LTS driver branch; ~3-yr lifecycle, EOL ~Aug 2028 | R580 | mid-2026 | 10.4 | NVIDIA Data Center Drivers; AI Enterprise lifecycle policy
AMD stack with RCCL NCCL-API parity; MI350X/MI355X support (7.0 Sep 2025) | ROCm 7.x | 2025-2026 | 10.4 | AMD ROCm 7.0 release notes & compatibility matrix
NVLink-SHARP Multimem multicast + symmetric-memory kernels within an NVL72 domain | NCCL 2.28+ | 2026 | 10.4 | NVIDIA NCCL release notes; GitHub releases
GPU SMs consumed by a reduction after composing NVLink-SHARP + IB-SHARP in NCCL 2.27 | ~16 → ≤6 SMs | 2025 | 10.4 | NVIDIA Developer (NCCL 2.27); SHARP in-network computing
NCCL all_reduce busbw vs theoretical (acceptance gate); ≈370 GB/s on 400G NDR | ~92% | 2025 | 10.4 | NVIDIA DGX BasePOD NCCL validation; OCI/Together AI
lower hardware cost for AMD vs NVIDIA — the prize that funds the ROCm tax | 15-30% | 2026 | 10.4 | domain-research keyNumbers; SemiAnalysis AMD vs NVIDIA
failure cadence of a 16k-GPU cluster (Llama 3: 419 unplanned/54 days) the stack must absorb | every ~3 hr | 2024 | 10.4 | Meta Llama 3 405B disclosure
of fabric line rate is the NCCL all_reduce acceptance bar (~370 GB/s on a 400 GB/s fabric) | ~92% | 2025 | 10.5 | Together AI — Practitioner's Guide to Testing Large GPU Clusters
typical burn-in soak before a new cluster is admitted to production | 72–168 hr | 2025 | 10.5 | Introl validation frameworks; neocloud operator reports
mean time between failures for a 16,000-GPU cluster — why provisioning is a continuous day-2 loop, not a one-time event | ~3 hr | 2024 | 10.5 | Meta Llama 3 (16,384 H100); ~80,000-hr per-GPU MTBF
MTBF per 512 GPUs at a top-tier H100 operator; new clusters fail far more during 3–4 week burn-in | ~7 days | 2025 | 10.5 | SemiAnalysis (100k H100 clusters)
automated node replacement on a best-in-class fleet — the day-2 lifecycle target | ~90 sec | 2026 | 10.5 | SemiAnalysis AI Neocloud Playbook / ClusterMAX
to provision 128 GPUs to a customer at a top-rated neocloud — the bring-up-as-competitive-lever benchmark | <2 days | 2026 | 10.5 | SemiAnalysis ClusterMAX 2.0
goodput (effective-training-time) achievable despite ~3-hr cluster MTBF, given automated validation + recovery | >90% | 2025 | 10.5 | Google Cloud goodput; NVIDIA Mission Control
revenue per GW per year — the depreciation clock that makes time-from-rack-to-first-job a million-dollar-per-week metric (contested — single-source) | $10–12B | 2026 | 10.5 | Domain synthesis; SemiAnalysis
unplanned interruptions on 16,384 H100s (~1 every 3 hr); 78% hardware, 58.7% GPU-related | 419 / 54 days | 2024 | 10.6 | Meta (Llama 3 paper) / Tom's Hardware
MTBF per 512 GPUs at a best-in-class mature H100 operator (new clusters fail far more) | ~7 days | 2025 | 10.6 | SemiAnalysis (100k H100 clusters)
machines harboring an SDC-prone defect; SDC expected every 1-2 weeks in large training | ~1 in 1,000 | 2025 | 10.6 | Meta Engineering; OCP SDC-in-AI whitepaper
SDC test seeds per month across Meta's fleet (Fleetscanner + Ripple) | ~2.5 billion | 2025 | 10.6 | Meta Engineering (How Meta keeps AI hardware reliable)
industry-average vs best-in-class goodput (effective training time) | ~90% / ~96% | 2025 | 10.6 | SemiAnalysis ClusterMAX / CoreWeave
large-LLM-job failure rate (~37% hardware-attributed; ~73% recoverable via restart) | ~43.4% | 2024 | 10.6 | Alibaba (Unicron) via SemiAnalysis
reliability/recovery overhead — the cost the observability loop exists to shrink | 6-21% of TCO | 2025 | 10.6 | SemiAnalysis ClusterMAX 2.0
MTTR achievable with multi-tier checkpointing vs 15-30 min naive restart | <2 min | 2025 | 10.6 | Google Cloud (multi-tier checkpointing)
mean-time-to-failure of a 1,024-GPU job vs 47.7 days for an 8-GPU job — the single-point-of-failure penalty of scale | 7.9 hr | 2025 | 10.7 | Meta, Revisiting Reliability in Large-Scale ML Clusters (arXiv 2410.21680)
failures per thousand node-days on Meta's RSC-1 cluster (11 months, ~80%+ utilization) | 6.50 / 1000 | 2025 | 10.7 | Meta, Revisiting Reliability (arXiv 2410.21680)
unplanned interruptions on 16,384 H100s during Llama 3 405B (~1 every 3 hr); 78% hardware, 58.7% GPU-related | 419 / 54 days | 2024 | 10.7 | Meta (Llama 3 paper) / Tom's Hardware
checkpoint-and-restart overhead required to hold ETTR ~0.9 on a 100,000-GPU run at RSC-2-like failure rates | ~2 min | 2025 | 10.7 | Meta, Revisiting Reliability (arXiv 2410.21680)
best-in-class MTBF per 512 GPUs on a mature 100k-H100 cluster (burn-in 3–4 weeks first) | ~7 days | 2025 | 10.7 | SemiAnalysis, 100k H100 Clusters
large-LLM-job failure rate; ~37% hardware-attributed; ~73% recoverable via restart | 43.4% | 2024 | 10.7 | Alibaba Unicron production study
goodput (effective training time): industry average / best-in-class; reliability overhead 6–21% of TCO | ~90% / ~96% | 2025 | 10.7 | SemiAnalysis ClusterMAX / CoreWeave
training restart latency, storage-only vs multi-tier/in-memory checkpointing | 15–30 min → <2 min | 2025 | 10.7 | Google Cloud multi-tier checkpointing
typical model-FLOPS-utilization (MFU) for large LLM training; best-in-class >50% on Hopper | ~30–50% | 2025 | 10.8 | SemiAnalysis; provenance.js (domain economics)
BF16 MFU gain on GB200 NVL72 from software/kernel maturation over ~12 months (≈57% throughput from software alone) | 34% → 54% | 2025 | 10.8 | SemiAnalysis (H100 vs GB200 NVL72 training benchmarks)
BF16 MFU achieved pre-training Llama 3 on 16k H100s (frontier-scale reference point) | ~41% | 2024 | 10.8 | Meta, The Llama 3 Herd of Models
training-state footprint with Adam (4 weight + 4 grad + 8–12 optimizer); the number the framework must shard across the fleet | ~16–18 B/param | 2025 | 10.8 | Standard mixed-precision Adam accounting; DeepSpeed/ZeRO docs
rule-of-thumb checkpoint size on disk (weights + optimizer state); sets async-drain bandwidth need | ~14 B/param | 2025 | 10.8 | VAST Data (checkpoint bandwidth analysis)
training goodput: industry average vs best-in-class effective-training-time fraction | ~90% / ~96% | 2025 | 10.8 | SemiAnalysis ClusterMAX / CoreWeave; provenance.js
large-LLM-job failure rate in a production fleet (~37% hardware-attributed; ~73% restart-recoverable) — why elastic orchestration matters | ~43.4% | 2024 | 10.8 | Alibaba Unicron; provenance.js
MTBF per 512 GPUs at a top-tier operator; one failure restarts a synchronous job from its last checkpoint | ~7 days | 2025 | 10.8 | SemiAnalysis (100k H100 clusters); provenance.js
goodput (effective training time): industry average vs best-in-class marketed; reliability overhead 6-21% of TCO | ~90% / ~96% | 2025 | 10.9 | SemiAnalysis ClusterMAX / CoreWeave
ClusterMAX 2.0 GPU-cloud rating: Security, Lifecycle, Orchestration, Storage, Networking, Reliability, Monitoring, Pricing, Partnerships, Availability — Platinum to UnderPerform | 10 dimensions / 5 tiers | 2025 | 10.9 | SemiAnalysis ClusterMAX 2.0
breakeven utilization for a debt-financed fleet; the cliff the on-demand/take-or-pay mix transfers or carries (contested — single-source) | ~70% | 2025 | 10.9 | AM Compute / McKinsey
serverless GPU time-to-first-token (H100): warm-pool vs scale-from-zero; snapshots claim ~10x cold-start gains | 8-15s warm / 30-90s cold | 2026 | 10.9 | RunPod / Modal serverless comparisons
spot/preemptible discount vs on-demand; the price of transferring interruption risk to the customer | ~60-80% off | 2026 | 10.9 | Spheron / GCP GPU pricing synthesis
H100 on-demand ladder: spot floor to Azure managed; neocloud median ~$2.29-3.50 (the value-stack premium, monetized) | ~$1.03 - $12.29/GPU-hr | 2026 | 10.9 | SemiAnalysis H100 Index / AM Compute
Tier III vs Tier IV facility availability (~1.6 hr vs ~26 min/yr) — the easy SLA, distinct from goodput | 99.982% / 99.995% | 2025 | 10.9 | Uptime Institute
RAND Weights Security Levels and adversary operational-capacity tiers; 38 distinct attack vectors enumerated | SL1–SL5 / OC1–OC5 | 2024 | 11.1 | RAND, Securing AI Model Weights (RRA2849-1)
attack vectors infeasible for OC1–OC3 but feasible for OC4–OC5 — why nation-state defense is categorically harder | 8 of 38 | 2024 | 11.1 | RAND RRA2849-1
assessed posture of frontier labs vs OC4–OC5 adversaries that want the weights — the central gap | ~SL2–SL3 | 2026 | 11.1 | RAND RRA2849-1; IFP / IST SL5 Task Force
allocation-constrained silicon per GB200 NVL72 rack — theft/destruction economics differ from generic cloud hardware | ~$3M+ | 2025 | 11.1 | Guide domain research; OEM rack pricing
RAND theft-window benchmark: a Security Level is defined by thwarting weight theft within roughly this horizon | <2 months | 2024 | 11.1 | RAND RRA2849-1
IRGC drones struck 3 AWS facilities in UAE/Bahrain — first deliberate state targeting of commercial data centers in wartime | 1 Mar 2026 | 2026 | 11.1 | CNBC; The Conversation
projected data-center physical-security spend by 2030 (roughly doubling) as kinetic/drone threats enter the planning case | ~$4B | 2030 (proj.) | 11.1 | Guide domain research; industry security forecasts
concentric physical model (perimeter → facility → data hall → cage/rack) with escalating MFA at each boundary | 4 zones | 2026 | 11.1 | NIST / DCK physical-security guidance
of organizations run BMS affected by known-exploited vulnerabilities (KEVs); data centers the worst case | 75% | 2025 | 11.10 | Claroty Team82, State of CPS Security 2025: BMS Exposures
of organizations exposed to KEVs that are ransomware-linked AND insecurely internet-connected | 51% | 2025 | 11.10 | Claroty Team82, State of CPS Security 2025
heat lost in Lviv via FrostyGoop — Modbus firmware downgrade, no zero-day | 600 buildings / ~2 days | 2024 | 11.10 | Dragos / CISA / SANS
data-center load lost instantaneously on a single 230-kV fault (the weaponizable swing) | ~1,500 MW | 2024 | 11.10 | NERC Level 3 Alert / Utility Dive
load shed in a single Virginia event — the synchronized-load-step primitive, unweaponized | 1.5 GW / 82 s | 2024 | 11.10 | NERC / Utility Dive
time from CDU flow-loss to GB200 NVL72 throttle/over-temp trip (no chilled-water inertia) | seconds-tens | 2025 | 11.10 | NVIDIA OCP DLC spec / Introl
IEC 62443 security levels by attacker capability; destructive primitives are SL3-SL4 | SL1-SL4 | 2025 | 11.10 | ISA/IEC 62443
Modbus TCP — unauthenticated by design; the protocol FrostyGoop and most BMS/CDU controllers speak | Port 502 | 2024 | 11.10 | Dragos / SANS ICS
FedRAMP 20x Key Security Indicators (Low / Moderate baseline) — automated, measurable outcomes replacing control-narrative essays | 56 / 61 | 2026 | 11.11 | FedRAMP PMO RFC-0006
RFC-0024 deadline: machine-readable (OSCAL) packages mandatory for all FedRAMP providers | Sep 2026 | 2026 | 11.11 | FedRAMP PMO RFC-0024
CMMC Level 2 third-party certification becomes mandatory for CUI-handling DoD contracts (Phase 2) | 10 Nov 2026 | 2025-2026 | 11.11 | DoD 48 CFR final rule
EU AI Act full enforcement powers activate; most high-risk obligations apply; fines up to 7% global turnover | 2 Aug 2026 | 2026 | 11.11 | European Commission
EU / North American enterprise AI-vendor RFPs asking for ISO 42001 certification or implementation | ~40% / ~25% | 2026 | 11.11 | Industry RFP analyses
OCP S.A.F.E. accredited Security Review Providers (Atredis, IOActive, NCC Group) for firmware-security conformance audits | 3 SRPs | 2025-2026 | 11.11 | Open Compute Project
single-event load loss that pushed NERC to treat large AI loads as grid actors subject to CIP-adjacent scrutiny | ~1,500 MW | 2026 | 11.11 | NERC Level 3 Alert / Utility Dive
ISO 27001 / 42001 certification validity with annual surveillance audits; SOC 2 Type II re-issued every 6–12 mo | 3 yr | 2026 | 11.11 | ISO; AICPA
mean time to identify + contain a breach in 2025 (lowest in 9 years) — the dwell window your retention must outlast | 241 days | 2025 | 11.12 | IBM Cost of a Data Breach 2025
average breach cost: global down 9% to $4.44M; US at an all-time high of $10.22M | $4.44M / $10.22M | 2025 | 11.12 | IBM Cost of a Data Breach 2025
NIST IR guidance restructured onto CSF 2.0 (Govern/Identify/Protect/Detect/Respond/Recover); first revision since 2012 | SP 800-61r3 | Apr 2025 | 11.12 | NIST SP 800-61 Rev. 3
GPU registers hidden by the BAR0 decoupler in confidential mode (vs ~7.94% normal) — the forensic opacity the SOC works around | ~99.78% | 2025 | 11.12 | NVIDIA WP-12554 / arXiv 2507.02770
certificate device-identity chain and structured measurement records (NRAS/RIM) that become the forensic record on confidential systems | 5 / 64 | 2026 | 11.12 | NVIDIA Secure AI whitepaper (domain synthesis)
documented multi-tenant GPU escape (cross-VM disclosure) and cross-tenant DoS — the isolation-breach playbook's design case | CVE-2025-23290 / -23285 | 2025 | 11.12 | NVIDIA security bulletins (domain research)
IRGC drone strikes on AWS facilities (UAE/Bahrain) — aerial/kinetic attack now an IR design case, not tail-risk | Mar 1, 2026 | 2026 | 11.12 | Domain research / open reporting
projected data-center physical-security spend by 2030 (~2x), reflecting the converged cyber-physical posture | ~$4B | 2026 | 11.12 | Security-domain research synthesis
AWS facilities directly hit by drones (UAE) + 1 blast-damaged (Bahrain), Mar 2026 — first confirmed combat strike on US-run hyperscale DC | 2 + blast | Mar 2026 | 11.2 | DefenseScoop / DCK / MWI (West Point)
first US statute letting certified state/local/tribal law enforcement deploy counter-UAS (after DOJ training); private operators still cannot legally defeat a drone | FY2026 NDAA | 2026 | 11.2 | FY2026 NDAA; CRS; Route Fifty
data center security market 2026, growing to ~$90B by 2034 (~17% CAGR); biometrics the fastest-growing sub-segment | ~$25.7B | 2026 | 11.2 | Fortune Business Insights; market.us
data center access-control market 2025, to ~$2.53B by 2030 (~10% CAGR) | ~$1.55B | 2025 | 11.2 | MarketsandMarkets
load lost on a single substation fault; 1.5 GW dropped in 82 s (VA, 2024) — the prize a saboteur targets outside the fence | ~1,500 MW | 2024 | 11.2 | NERC Level 3 Alert / Utility Dive
grid interconnection lead time for a large load — making the utility tie an irreplaceable single-point-of-failure if attacked | ~3–7+ yr | 2025 | 11.2 | ERCOT / PJM filings synthesis
typical standoff range a rural power-first campus gets free; urban inference sites often have near-zero | 50–150 m | 2026 | 11.2 | Practitioner / CPTED siting guidance
cost of a commercial FPV/one-way drone — the asymmetry against a multi-hundred-million-dollar facility | ~$300–500/round | 2026 | 11.2 | MWI (West Point) / open-source defense reporting
suspect counterfeit-part submissions logged in 2025 (down from 1,055 in 2024, partly a one-off batch); active components ~36% of reports | 748 | 2025 | 11.3 | ERAI 2025 Annual Counterfeit Report
of suspect counterfeit parts that PASSED electrical test — would evade detection if electrical test were the only screen | ~24% | 2025 | 11.3 | ERAI 2025 report
NIST SP 1800-34 'Validating the Integrity of Computing Devices' finalized — the platform-certificate / provenance reference architecture | Dec 2022 | 2022 | 11.3 | NIST / NCCoE SP 1800-34
NIST SP 800-88 Rev 2 released — media sanitization modernized for encrypted/virtual/cloud media (Clear / Purge / Destroy) | Sept 2025 | 2025 | 11.3 | NIST SP 800-88 Rev 2
IEEE 2883-2022: no overwrite-based method meets the Purge threshold for SSD/NVMe — only verified cryptographic erase or physical destruction qualifies | Purge = CE or destroy | 2025 | 11.3 | IEEE 2883-2022 / NIST 800-88 r2
of used drives resold on the secondary market found to contain residual recoverable data (PII, financial, IP) — the data-remanence base rate | 42% | 2019 | 11.3 | Blancco Technology Group study
approximate silicon value concentrated in a single GB200 NVL72 rack (1.36 t) — the asset-value density driving target priority | $3–4M | 2025 | 11.3 | NVIDIA / SemiAnalysis (derived)
OCP S.A.F.E. project cadence; AMI the first approved independent firmware vendor SRP — the centralized, inheritable firmware-audit framework | 1st Thu/mo | 2025 | 11.3 | Open Compute Project S.A.F.E.
CVE-2024-54085 AMI MegaRAC BMC auth-bypass via Redfish; added to CISA KEV 25 Jun 2025; OEM'd across 12+ server vendors | CVSS 10.0 | 2025 | 11.4 | Eclypsium / CISA KEV / The Hacker News
internet-exposed MegaRAC SP-X Redfish instances found, each potentially exploitable for remote takeover/bricking | 1,000+ | 2025 | 11.4 | Eclypsium (Shodan scan)
Caliptra open silicon RoT co-developed by Microsoft, Google, AMD, NVIDIA; committed in their first-party/server silicon | 4 contributors | 2025 | 11.4 | OCP / CHIPS Alliance / Microsoft Azure
post-quantum signatures + KEM in Caliptra 2.x via open-source Adams Bridge accelerator (CNSA 2.0 path), side-channel hardened | ML-DSA + ML-KEM | 2025 | 11.4 | Microsoft Azure / CHIPS Alliance
irreplaceable, allocation-constrained silicon per GB200 NVL72 rack a management-plane implant can brick or wiretap | $3M+ | 2025 | 11.4 | RAND / domain research synthesis
NIST Platform Firmware Resiliency (protect/detect/recover); with SP 1800-34 and IR 8320 the standards backbone for firmware integrity | 800-193 | 2024 | 11.4 | NIST / NCCoE
OCP module decoupling BMC + RoT + TPM from the motherboard; 2.1 open reference designs appeared in 2025 | DC-SCM 2.0 | 2025 | 11.4 | OCP / Cloudflare Project Argus / Antmicro
BMC runs on standby power and boots before the host; a rooted BMC is an OS-invisible, persistent foothold under the CPU | always-on | 2025 | 11.4 | Eclypsium / OCP DC-SCM
of GPU HBM placed inside the encrypted, integrity-protected Compute Protected Region (CPR) | ~90% | 2025 | 11.5 | arXiv 2507.02770 (GPU CC Demystified); NVIDIA WP-12554
of GPU memory-mapped registers hidden by the BAR0 decoupler in CC mode (vs ~8% in normal mode) | ~99.78% | 2025 | 11.5 | arXiv 2507.02770
device-identity chain length and structured measurement records validated against NRAS + RIM goldens | 5-cert / 64 records | 2025 | 11.5 | arXiv 2507.02770; NVIDIA attestation docs
per-channel session keys derived from one SPDM-negotiated master secret (RPC / DMA / fault / workload) | 44+ keys | 2025 | 11.5 | arXiv 2507.02770
training / inference advantage HGX B200 retains over H200 with confidential computing fully enabled | ~2x / ~2.5x | 2025 | 11.5 | NVIDIA Secure AI WP-12554; Corvex/Spheron benchmarks
Blackwell CC overhead on large matrix ops (encrypted HBM + TEE-I/O over NVLink); Hopper far heavier on small/PCIe transfers | under ~3% | 2025 | 11.5 | NVIDIA; independent Hopper CC benchmark (arXiv 2409.03992)
Hopper-class confidential-computing scope; multi-GPU TEE-I/O across NVLink is Blackwell-and-later | single-GPU | 2025 | 11.5 | NVIDIA Secure AI with Blackwell and Hopper GPUs (WP-12554)
year AMD SEV-SNP + Intel TDX + NVIDIA GPU CC reached broad cloud GA as a paired confidential-AI stack | 2025 | 2025-2026 | 11.5 | NVIDIA / cloud-provider CC GA announcements
CVSS of NVIDIAScape (CVE-2025-23266) — three-line container escape to host root in NVIDIA Container Toolkit | 9.0 | Jul 2025 | 11.6 | Wiz Research; NVIDIA Security Bulletin
NVIDIA Container Toolkit versions vulnerable to NVIDIAScape (GPU Operator ≤25.3.0) | ≤1.17.7 | Jul 2025 | 11.6 | Wiz; NVIDIA
first publicly acknowledged cross-VM co-tenant information disclosure via the vGPU Manager | CVE-2025-23290 | Jul 2025 | 11.6 | NVIDIA Security Bulletins
max MIG instances per GPU — the only hardware-enforced fractional partition (dedicated SMs, L2 slice, memory controllers, HBM slice) | 7 | 2025 | 11.6 | NVIDIA Multi-Instance GPU
LLM-response data recoverable per query via LeftoverLocals (CVE-2023-4969) from un-scrubbed GPU local memory | ≈181 MB | 2024 | 11.6 | Trail of Bits
memory and fault isolation guarantees provided by time-slicing / MPS between tenants | 0 | 2025 | 11.6 | Introl; NVIDIA MPS docs
ClusterMAX 2.0 operator-maturity rubric grades tenant/fabric isolation, health-checks, and goodput as first-class | 10-dimension | 2025 | 11.6 | SemiAnalysis ClusterMAX 2.0
share of data-center traffic that is east-west (interior); approaches 100% on a training back-end fabric | 76-80% | 2024-2026 | 11.7 | Akamai / Gigamon
average eCrime breakout time (initial access to first lateral movement) in 2025, down from 48 min in 2024; fastest 27 s | 29 min | 2025 | 11.7 | CrowdStrike 2026 Global Threat Report
BlueField-4 DPU throughput; 64 Arm cores, ~6x BlueField-3 compute; zero-trust east-west enforcement at line rate | 800 Gb/s | 2026 (Vera Rubin platform) | 11.7 | NVIDIA / HPCwire
NIST Zero Trust Architecture — 'never trust, always verify'; no trust from network location | SP 800-207 | Aug 2020 (current) | 11.7 | NIST
training back-end fabric design; sub-2 µs latency — why inline L7 inspection is a goodput tax there | 1:1 non-blocking | 2025 | 11.7 | SemiAnalysis / NVIDIA
industry-avg vs best-in-class goodput; inline enforcement on collectives erodes exactly this metric | ~90% / ~96% | 2025 | 11.7 | SemiAnalysis ClusterMAX / CoreWeave
configuration boundaries (subnet-manager / adapter enforced) — segmentation, not cryptographic isolation | VLAN/PKey | 2025 | 11.7 | NVIDIA InfiniBand / SemiAnalysis ClusterMAX
egress posture for the weights enclave: allow-listed proxy + blocked/alerted bulk transfers — the anti-exfil linchpin | default-deny | 2025 | 11.7 | RAND RRA2849-1 (weight-security egress controls)
RAND Weights Security Levels (SL1-5), attacker operational-capacity tiers (OC1-5), and catalogued attack vectors | 5 levels / 5 tiers / 38 vectors | 2024 | 11.8 | RAND RRA2849-1 (Securing AI Model Weights)
where RAND assesses most frontier labs currently sit — stops opportunistic actors and basic insiders, not OC4-OC5 nation-states | ~SL2 | 2024-2026 | 11.8 | RAND RRA2849-1
SL5 Task Force target for nation-state-resistant frontier AI infrastructure; SL5 standard = 43 controls / 10 families (NIST SP 800-53 overlay) | 2028/2029 | 2025-2026 | 11.8 | SL5 Task Force / Institute for Security & Technology
to exfiltrate a ~1,000 GB model even under an 800 GB/day egress cap — why fixed-rate limits are necessary but not sufficient | ~1.25 days | 2025 | 11.8 | LessWrong/Alignment Forum egress-limit analyses
token output of a single production inference server — the channel that cannot be rate-capped without breaking the service | ~1 TB/day | 2025 | 11.8 | Inference-verification exfiltration research
preliminary feasible weight-compression floor in a theft context — shrinks the payload an attacker must move, undercutting fixed egress caps | ~1 bit/param | 2026 | 11.8 | arXiv 'Aggressive Compression Enables LLM Weight Theft'
GPU HBM inside the encrypted Compute Protected Region; memory-mapped registers hidden by the BAR0 decoupler in CC mode | ~90% / ~99.78% | 2025 | 11.8 | arXiv 2507.02770; NVIDIA WP-12554
checkpoint size for a 175B to 1T-param model at ~14 bytes/param incl. optimizer state — the at-rest bulk the crypto must wrap | 2.3-13.8 TB | 2025 | 11.8 | NVIDIA storage guidance; checkpoint-sizing rules of thumb
where consensus assesses frontier labs sit; insider threat is the dominant gap blocking SL4-5, which need human-layer controls not more crypto | ~SL2 | 2024-2025 | 11.9 | RAND RRA2849-1 (Securing AI Model Weights); IST SL5 Task Force
RAND theft benchmark: a Security Level is defined by stopping an adversary attempting weight theft inside this window | <2 months | 2024 | 11.9 | RAND RRA2849-1
distinct attack vectors in RAND's model; insider threat spans most of them rather than being one isolated path | 38 vectors | 2024 | 11.9 | RAND RRA2849-1 (5 SL, 5 OC tiers, 38 vectors)
average annual cost of insider risk per organization (largest Ponemon insider study to date) | $17.4M | 2025 | 11.9 | Ponemon / DTEX 2025 Cost of Insider Risks
share of insider incidents that are negligent vs malicious; credential theft ~20% but costliest at ~$779,797/event | ~55% / ~25% | 2025 | 11.9 | Ponemon 2025 Cost of Insider Risks
average time to detect and contain an insider incident (down from 86 in 2023); far longer than a checkpoint copy takes | 81 days | 2025 | 11.9 | Ponemon 2025 Cost of Insider Risks
of breaches involve the human element; convenience (60%) now leads deliberate-misuse motive ahead of financial gain (33%) | ~60% | 2025 | 11.9 | Verizon 2025 DBIR (12,195 breaches)
frontier pattern: time-limited, peer-approved, business-justified grants to weight infrastructure (multi-party authorization) | no standing access | 2025 | 11.9 | Anthropic Frontier Model Security; OpenAI frontier-risk
legacy Tier III / Tier IV availability (~1.6 hr vs ~26 min/yr down) — figures Uptime no longer endorses | 99.982% / 99.995% | 2025 | 12.1 | Uptime Institute Tier Standard
MEP construction-cost swing of 2N over N+1; 2N strands ~50% of capacity idle | +30–50% | 2025 | 12.1 | SemiAnalysis Datacenter Anatomy; STACK Infrastructure
Tier IV capital premium over Tier III — for ~70 extra minutes/yr of facility uptime | ~20–40% | 2025 | 12.1 | Uptime Institute / practitioner data
share of impactful outages caused by power (most often UPS) — the leading cause, 4th year of falling overall frequency | 45% | 2025 | 12.1 | Uptime Institute Annual Outage Analysis 2025
of human-error outages caused by staff not following procedures (up from 48%); ~40% of orgs hit a major human-error outage in 3 yr | 58% | 2025 | 12.1 | Uptime Institute Annual Outage Analysis 2025
Llama 3 405B training interruptions on 16,384 H100s (~1 every 3 hr; 78% hardware) yet >90% effective training time | 466 / 54 days | 2024 | 12.1 | Meta (Llama 3 paper)
best-in-class H100 cluster MTBF per 512 GPUs — the job is its own availability risk, not the building | ~7 days | 2025 | 12.1 | SemiAnalysis (100k H100 clusters)
rack BBU (OCP ORv3, 5+1 redundant) switchover — backup energy migrating down to the rack/silicon | <5 ms | 2025 | 12.1 | OCP ORv3 / Open Rack BBU specs
unplanned interruptions on 16,384 H100s (~1 every 3 hr); 78% hardware, 58.7% GPU/HBM — all at 100% facility availability | 419 / 54 days | 2024 | 12.2 | Meta (Llama 3 405B paper) / Tom's Hardware
goodput (effective training time): industry average vs best-in-class; reliability overhead 6–21% of TCO | ~90% / ~96% | 2025 | 12.2 | SemiAnalysis ClusterMAX / CoreWeave
best-in-class MTBF per 512 GPUs on mature H100 clusters; far worse during 3–4 week burn-in | ~7 days | 2025 | 12.2 | SemiAnalysis (100k H100 clusters)
Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min/yr); Tier IV ~20–40% capital premium | 99.982% / 99.995% | 2025 | 12.2 | Uptime Institute (% figures Uptime-disavowed)
training MTTR cut by multi-tier checkpointing — a goodput gain no facility tier delivers | 15–30 min → <2 min | 2025 | 12.2 | Google Cloud (multi-tier checkpointing)
data-center load lost on a single 230 kV fault (1.5 GW in 82 s, VA); triggered NERC's rare Level 3 alert | ~1,500 MW | 2026 | 12.2 | NERC Level 3 Alert / Utility Dive
per-GPU capacitance, GB300 → Vera Rubin (~6x); ~30% peak-grid-demand reduction demonstrated | 65 → ~400 J/GPU | 2026 | 12.2 | NVIDIA / SemiAnalysis
large-LLM job failure rate (Alibaba Unicron); ~37% hardware-attributed, ~73% restart-recoverable | ~43.4% | 2024 | 12.2 | Alibaba (Unicron) via SemiAnalysis
practitioner RTO / RPO target for production interactive inference | ~15 min / ~5 min | 2025 | 12.3 | Introl, Disaster Recovery for AI Infrastructure
training RPO floor — set by checkpoint interval, not by replication; RTO bounded by GPU re-acquire + resume | 2-4 hr | 2025 | 12.3 | Introl DR analysis; checkpoint practice
infrastructure cost of active-active (carry a second live fleet); hot warm standby ~60% cheaper; pilot light ~20% of full redundancy | ~2x | 2025 | 12.3 | Introl DR analysis; cloud DR-pattern taxonomy
training throughput (goodput) penalty of forcing a zero-RPO posture vs setting RPO = checkpoint interval | ~15-20% | 2025 | 12.3 | Introl DR analysis
duration of the AWS US-EAST-1 outage (Oct 19-20, 2025) — a single-region control-plane/DNS dependency cascading estate-wide | ~15 hr | 2025 | 12.3 | AWS post-event summary; InfoQ; ThousandEyes
availability achievable for inference spanning multiple active regions (e.g. Uber's 3-region inference posture) | 99.99% | 2025 | 12.3 | Introl / Uber engineering synthesis
continuous replication bandwidth (~200 Gbps) to hold a 1-hour RPO on ~100 TB of training state across regions | ~$50k/mo | 2025 | 12.3 | Introl DR analysis
large-load grid interconnection lead time — why failover capacity must be energized in advance, not acquired on the day | 3-7+ yr | 2025 | 12.3 | ERCOT/PJM filings synthesis (provenance register)
training goodput: industry average vs best-in-class marketed (CoreWeave); the gap the contract prices | 90% / ~96% | 2025 | 12.4 | SemiAnalysis ClusterMAX 2.0 / CoreWeave
GPU-cloud SLA baseline: node uptime / rack uptime, with penalties (ClusterMAX baseline) | 99.9% / 99% | 2025 | 12.4 | SemiAnalysis ClusterMAX
hyperscaler compute SLA: multi-AZ region-level vs single-instance Monthly Uptime | 99.99% / 99.5% | 2026 | 12.4 | Amazon EC2 / Compute SLA
reference service-credit ladder rungs (% of monthly bill) as uptime falls through bands | ~10% / 25% / 100% | 2026 | 12.4 | Amazon EC2 / Compute SLA
Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min downtime/yr); Uptime now disavows the % | 99.982% / 99.995% | 2025 | 12.4 | Uptime Institute Tier Standard
best-in-class H100 MTBF per 512 GPUs — the failure environment any cluster SLA is written against | ~7 days | 2025 | 12.4 | SemiAnalysis (100k H100 clusters)
Llama-3 405B interruption rate (16,384 H100, 54 days): 466 interruptions, 78% hardware | ~1 / 3 hr | 2024 | 12.4 | Meta Llama 3 Herd of Models
reliability overhead as a share of cluster TCO — the cost of closing the goodput gap | 6–21% | 2025 | 12.4 | SemiAnalysis ClusterMAX
failures per 1,000 node-days, Meta RSC-1 vs RSC-2 — the empirical λ that drives any cluster goodput model | 6.50 vs 2.34 | 2024 | 12.5 | Meta, Revisiting Reliability in Large-Scale ML Clusters (arXiv 2410.21680)
projected mean time between failures for a 16,384-GPU vs 131,072-GPU synchronous job | 1.8 hr → 14 min | 2024 | 12.5 | Meta (arXiv 2410.21680); SemiAnalysis
modeled ETTR (goodput) for a 16k-GPU run moving from 60-min to 5-min checkpoint interval | 0.70 → 0.93 | 2024 | 12.5 | Meta, Revisiting Reliability (arXiv 2410.21680)
512+ GPU job failure rate after lemon-node ejection — a sensitivity result the model must reproduce | 14% → 4% | 2024 | 12.5 | Meta, Revisiting Reliability (arXiv 2410.21680)
IEC 61508 beta-factor range for common-cause failure; ~10% the default if no diversity measures applied | 0.5%–10% | 2025 | 12.5 | IEC 61508-6 Annex D; exida
annualized GPU failure rate feeding the per-node λ in fleet roll-up models | ~9% AFR | 2026 | 12.5 | domain synthesis / Chapter 14.3 fleet data
Uptime Tier III / Tier IV availability targets (~1.6 hr vs ~26 min/yr) — the facility-model benchmark | 99.982% / 99.995% | 2025 | 12.5 | Uptime Institute (Tier classes; % figures Uptime-disavowed)
industry-average vs best-in-class training goodput — the validation band any goodput model must land in | ~90% / ~96% | 2025 | 12.5 | SemiAnalysis ClusterMAX / CoreWeave
the commissioning ladder: FAT → SAT → pre-functional → functional → Integrated Systems Test (IST) | L1–L5 | 2025 | 13.1 | Construct & Commission; BMP MEP; CxPlanner
concurrent maintainability (any path serviceable, no load impact) vs fault tolerance (survive any single unplanned fault) | Tier III vs IV | 2025 | 13.1 | Uptime Institute Tier Standard
Tier III (~1.6 hr/yr) vs Tier IV (~26 min/yr) availability; ~20–40% capital premium for IV | 99.982% / 99.995% | 2025 | 13.1 | Uptime Institute (% figures Uptime-disavowed)
ASHRAE commissioning-process / Basis-of-Design / data-center-specific Cx guidelines; Std 202 formalizes the Cx-Process | Gd 0 / 1.1 / 1.6 | 2025 | 13.1 | ASHRAE; ACHR News
commissioning as a share of construction cost; CxAs now locked in 12–18 months ahead of energization | 0.5–2% | 2025 | 13.1 | CxPlanner; iRecruit / industry practice
lost-revenue cost of delaying commissioning a 60 MW facility — the schedule pressure that tempts truncating L5 | ~$14.2M/mo | 2025 | 13.1 | Mastt / industry build-cost analyses
unplanned interruptions on a 16,384-GPU Llama 3 run (~1 every 3 hr); the day-2 reality a thin Cx program hands forward | 419 / 54 days | 2024 | 13.1 | Meta (Llama 3 paper) / Tom's Hardware
ANSI/BICSI 002-2024 — the most comprehensive lifecycle design+implementation standard; 2024 ed. expanded liquid/immersion | ~575 pp | 2024 | 13.1 | BICSI
of serious data-center outages involve human error — most trace to missing or unfollowed procedures (the case for the handover package) | ~70-80% | 2025 | 13.10 | Uptime Institute Global Data Center Survey / Outage Analysis
revenue per GW of AI capacity per year — the clock that pressures teams to override the readiness gate (contested — single-source) | ~$10-12B | 2025 | 13.10 | SemiAnalysis (onsite gas economics)
data-center load dropped in 82 s (VA, 2024); ~1,500 MW lost on a single fault — the swing go-live first exposes | ~1.5 GW | 2026 | 13.10 | NERC Level 3 Alert / Utility Dive
NERC Level 2 Recommendation on large loads (commissioning + ramp coordination); Project 2026-02 Computational Loads under way | Sept 2025 | 2026 | 13.10 | NERC Large Loads Action Plan / Utility Dive
industry-average vs best-in-class goodput — the acceptance floor the full-load stage must clear | ~90% / ~96% | 2025 | 13.10 | SemiAnalysis ClusterMAX / CoreWeave
Tier III vs Tier IV availability — the redundancy that must hold at every point on the ramp, not just at the end | 99.982% / 99.995% | 2025 | 13.10 | Uptime Institute Tier Classification
per GB200/GB300 NVL72 rack — the heat flux and power transient the cooling/smoothing stack must absorb at full load | 120-142 kW | 2026 | 13.10 | SemiAnalysis / NVIDIA roadmap
MTBF per 512 GPUs at a mature operator — the failure cadence operations inherits the instant handover completes | ~7 days | 2025 | 13.10 | SemiAnalysis (100k H100 clusters)
commissioning as share of total project cost; prevents multiples in rework/downtime | 1–3% | 2025 | 13.2 | Industry Cx cost guidance (TrueLook / practitioner)
lead time operators now lock in commissioning agents ahead of energization | 12–18 mo | 2025 | 13.2 | iRecruit / DC construction-trend reporting
default fabric BER acceptance threshold per port (InfiniBand ibdiagnet) | 1e-12 | 2025 | 13.2 | NVIDIA/Mellanox ibdiagnet manual
GPU node burn-in/soak duration gated before cluster acceptance | 72–168 hr | 2025 | 13.2 | Together AI / Introl validation guides
goodput acceptance bar: industry-avg vs best-in-class effective training time | ~90% / ~96% | 2025 | 13.2 | SemiAnalysis ClusterMAX / CoreWeave
CDU coolant inlet acceptance band; deviation can throttle GPUs up to ~50% | 20–25 °C | 2025 | 13.2 | NVIDIA OCP / Introl (GB200 NVL72)
Tier III vs Tier IV availability the redundancy-topology scripts must demonstrate | 99.982% / 99.995% | 2025 | 13.2 | Uptime Institute Tier classification
NVL72 heat split (liquid vs air) — the load a facility load bank cannot reproduce in the loop | ~115 / ~17 kW | 2025 | 13.2 | NVIDIA OCP / Introl
ANSI/NETA Acceptance Testing Specifications — the current as-installed bar for switchgear, breakers, relays and primary injection | ATS-2025 | 2025 | 13.3 | ANSI/NETA ATS-2025; NETA World Journal
data-center load lost on a single 230 kV fault — the synchronized ride-through failure NETA/Cx must now design against | ~1,500 MW | 2026 | 13.3 | NERC Level 3 Alert / Utility Dive
peak grid-demand reduction from GB300 NVL72 power-shelf energy storage (capacitor smoothing) — an L4 acceptance criterion now, not a spec sheet curiosity | up to 30% | 2025 | 13.3 | NVIDIA Developer Blog; ServeTheHome
rack-level electrolytic-capacitance energy storage in GB300 NVL72 power shelves (≈half the PSU volume) | 65 J/GPU | 2025 | 13.3 | NVIDIA Developer Blog / LITEON
Vera Rubin NVL72 rack-level storage — ~6x GB300 — with closed-loop state-of-charge control for fast transient smoothing | 400 J/GPU | 2026 (roadmap) | 13.3 | NVIDIA Vera Rubin POD blog
power-oversubscription headroom: training vs inference — the swing magnitude electrical acceptance must absorb | 3% vs 21% | 2025 | 13.3 | Uptime Institute Journal
Rubin Ultra Kyber rack on 800 VDC — the density ramp the irreversible power substrate must accept | ~600 kW | 2027 (announced) | 13.3 | SemiAnalysis / NVIDIA roadmap; The Next Platform
lagging power factor a reactive load bank loads the chain to — proving generator/UPS at kVA rating, not just kW | 0.8 PF | 2025 | 13.3 | Aggreko / CxPlanner commissioning practice
behind-the-meter gas announced by 2026 (~7 GW under construction) — the scale of the islanding problem | ~82 GW | 2026 | 13.4 | Cleanview / SemiAnalysis
LM2500XPRESS aeroderivative unit rating and start time; black-start-capable, grid-independent | 35 MW / 5 min | 2025 | 13.4 | GE Vernova / Crusoe (29-unit order)
aeroderivative gas-turbine lead time (refurb under 12 mo); the speed-to-power constraint behind islanding | 18–36 mo+ | 2025 | 13.4 | Data Center Frontier / Grid Capacity Intelligence
Vera Rubin rack-level energy storage for power smoothing (~6x prior gen); cuts peak current ~25% | ~400 J / GPU | 2025 | 13.4 | NVIDIA developer blog
data-center load lost on a single 230 kV fault; 1.5 GW dropped in 82 s (VA, 2024) — triggered NERC Level 3 alert | ~1,500 MW | 2026 | 13.4 | NERC Level 3 Alert / Utility Dive
microgrid-controller specification (2017) and conformance-test method (2018) — the Cx acceptance basis | IEEE 2030.7 / 2030.8 | 2017–2018 | 13.4 | IEEE Standards
best-in-class cluster MTBF; a single power transient that drops a synchronous job restarts from checkpoint | ~7 days / 512 GPUs | 2025 | 13.4 | SemiAnalysis (100k H100 clusters)
GB200 NVL72 coolant inlet spec; deviation can throttle GPUs up to ~50% | 20–25 °C | 2025 | 13.5 | NVIDIA OCP / Introl
DLC flow per GB200 NVL72 rack (~1.2–2.0 L/min per kW design rule) | ~80 L/min | 2025 | 13.5 | Dober / NVIDIA OCP
NVL72 CDU/row-level cooling capacity (per-rack heat is ~132 kW: ~115 kW liquid + ~17 kW air) | ~2.4 MW | 2025 | 13.5 | NVIDIA OCP / Introl
secondary-loop conductivity floor flushed to before coolant charge (DI ≥0.5 MΩ·cm) | ≤5 µS/cm | 2026 | 13.5 | Liquid-cooling commissioning practice (XD Thermal / Introl synthesis)
rated working pressure for hydrostatic acceptance hold (ASME B31.x / EN 13480 basis) | 1.5× | 2025 | 13.5 | Liquid-cooling commissioning practice; ASME B31
install + commissioning per GB200 NVL72 system; load staged 25→50→75→100% | 2–3 weeks | 2026 | 13.5 | Introl GB200 NVL72 deployment
single-phase direct-to-chip share of the liquid-cooling market (the loop you are commissioning) | ~55% | 2026 | 13.5 | DCD / IDTechEx
best-in-class training goodput the loop must protect; a cooling trip is lost goodput | ~96% | 2025 | 13.5 | SemiAnalysis ClusterMAX / CoreWeave
GB300 NVL72 in-shelf energy storage for power smoothing; ~30% peak-grid reduction on Megatron training | 65 J/GPU | 2025 | 13.6 | NVIDIA Developer (GB300 steady power)
Vera Rubin power-smoothing reservoir target; facility BESS roles for transient/ride-through/DR | ~400 J/GPU | 2025 | 13.6 | NVIDIA (production-ready BESS for AI factories)
single-event large-load loss on a 230 kV fault; 1.5 GW dropped in 82 s (VA, 2024) — the ride-through problem IST must prove against | ~1,500 MW | 2026 | 13.6 | NERC Level 3 Alert / Utility Dive
GB200/GB300 NVL72 coolant inlet window; deviation throttles GPUs up to ~50% — the thermal ride-through envelope | 20-25 °C | 2025 | 13.6 | NVIDIA OCP / Introl
power-oversubscription headroom training vs inference — why transient behavior differs by workload IST cannot run | 3% vs 21% | 2025 | 13.6 | Uptime Institute Journal
single-phase direct-to-chip share of liquid-cooling market — the loop IST load banks cannot exercise at real heat flux | ~55% | 2026 | 13.6 | DCD / IDTechEx
typical IST planning horizon before a full-facility Level 5 campaign | weeks-to-months | 2025 | 13.6 | Construct & Commission (L5 IST guide)
post-FEC BER pass floor for AI fabric links (tightening toward 1e-13 at the highest lane rates) | ~1e-12 | 2025 | 13.7 | IEEE 802.3 / IBTA link specifications; practitioner acceptance plans
PAM4 SerDes per-lane rate driving 800G/1.6T links — FEC-mandatory, BER-screening-critical | 100-200 Gb/s | 2025 | 13.7 | SemiAnalysis (AI networks); provenance.js optics ladder
minimum link-flap soak under line-rate load at operating temperature before a link is accepted | ≥ 24 h | 2025 | 13.7 | Practitioner fabric-commissioning practice; Keysight test methodology
InfiniBand point-to-point latency; tuned RoCEv2 ~1.5-2.5 µs — the acceptance band for ib_*_lat | ~1-2 µs | 2025 | 13.7 | SemiAnalysis / NVIDIA; provenance.js IB-vs-RoCE
PTP accuracy held across a Spectrum switch; ConnectX-class NIC timestamping under ~4 ns variance | ~10 ns | 2025 | 13.7 | NVIDIA Technical Blog, Spectrum switch time-sync
fleet PTP offset-from-master target the time-sync gate must demonstrate, under load and across every node | sub-µs | 2024 | 13.7 | Engineering at Meta (SPTP); IEEE 1588 practice
effective throughput a well-tuned AI Ethernet fabric (Spectrum-X) sustains — the congestion-gate target | ~95% | 2025 | 13.7 | NVIDIA (Spectrum-X xAI Colossus)
NVLink aggregate per GB200 NVL72 rack the scale-up gate verifies whole (1.8 TB/s/GPU, NVLink 5) | ~130 TB/s | 2025 | 13.7 | NVIDIA; provenance.js NVLink
GPU node burn-in / soak window (3-day minimum to 7-day strict acceptance) | 72–168 hr | 2025 | 13.8 | Together AI seven-phase guide; Introl validation frameworks; ClusterMAX 2.0
bring-up burn-in period before a new cluster's failure rate decays toward the mature baseline | 3–4 weeks | 2025 | 13.8 | SemiAnalysis (100k H100 clusters)
mature best-in-class H100 MTBF; freshly-racked clusters fail far more often | ~7 days / 512 GPUs | 2025 | 13.8 | SemiAnalysis (100k H100 clusters)
unplanned interruptions on 16,384 H100s during Llama 3 405B — ~1 every 3 hr | 419 in 54 days | 2024 | 13.8 | Meta (Llama 3 paper) / Tom's Hardware
Llama 3 interruptions attributed to faulty GPU and to HBM3 — together >½ of hardware faults | ~30% + ~17% | 2024 | 13.8 | Meta (Llama 3 paper) / DataCenterDynamics
machines affected by silent data corruption (SDC) at fleet scale | ~1 in 1,000 | 2025 | 13.8 | Meta Engineering (How Meta keeps its AI hardware reliable)
expected SDC events during a large-scale training run (Meta; Google reports similar for Gemini) | every 1–2 weeks | 2025–2026 | 13.8 | Meta Engineering; IEEE / arXiv SDC studies
DCGM -r 4 (deep, incl. memtest + EUD) runtime per node, GPU-count dependent | ~1.5 hr | 2026 | 13.8 | NVIDIA DCGM Diagnostics documentation
in-domain all-reduce busbw on GB200 NVL72 (vs 900 GB/s/GPU NVLink5 ceiling); scale-out gate set as % of this | 870-928 GB/s | 2025 | 13.9 | NCCL tests on GB200 NVL72 (Crusoe / Nebius / NVIDIA tuning guide)
checkpoint state to size the write path; keep checkpoint stall <10% of step time | ~14 bytes/param | 2025 | 13.9 | VAST Data checkpoint survey (85k+ checkpoints)
failure cadence in a 100k-accelerator cluster at full utilization — why checkpoint bandwidth is an acceptance gate | ~every 30 min | 2025 | 13.9 | MLCommons MLPerf Storage v2.0
aggregate storage bandwidth per ~1,024 GPUs (write ≥ ½ read design rule) | 250-400 GB/s | 2025 | 13.9 | NVIDIA DGX SuperPOD reference architecture
industry-average vs best-in-class measured training goodput — the number the SLA is set against | ~90% / ~96% | 2025 | 13.9 | SemiAnalysis ClusterMAX 2.0 / CoreWeave
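
Several entries in the register are derived figures that follow from simple arithmetic, so they can be spot-checked. The Python sketch below recomputes four of them: Tier downtime from the availability percentages, days to move a model under a fixed egress cap, job-level MTBF as GPU count grows, and aggregate NVLink bandwidth per NVL72 rack. The function names are illustrative and do not come from the guide. The MTBF scaling assumes independent failures whose rate grows linearly with GPU count, and the rack figure assumes 72 GPUs per NVL72.

```python
HOURS_PER_YEAR = 8760  # non-leap year; an assumption for the Tier downtime figures


def downtime_hours_per_year(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def days_to_exfiltrate(model_gb: float, egress_cap_gb_per_day: float) -> float:
    """Days needed to move a payload under a fixed daily egress cap."""
    return model_gb / egress_cap_gb_per_day


def scaled_mtbf_hours(base_mtbf_hours: float, base_gpus: int, target_gpus: int) -> float:
    """Job MTBF at a new GPU count.

    Assumption: failures are independent and the failure rate is
    proportional to GPU count, so MTBF scales as 1 / GPUs.
    """
    return base_mtbf_hours * base_gpus / target_gpus


def nvlink_aggregate_tb_s(per_gpu_tb_s: float, gpus_per_rack: int = 72) -> float:
    """Aggregate NVLink bandwidth for one rack (72 GPUs assumed for NVL72)."""
    return per_gpu_tb_s * gpus_per_rack


if __name__ == "__main__":
    # Register: Tier III ~1.6 hr/yr, Tier IV ~26 min/yr
    print(downtime_hours_per_year(99.982), "hr/yr")
    print(downtime_hours_per_year(99.995) * 60, "min/yr")
    # Register: ~1.25 days for ~1,000 GB under an 800 GB/day cap
    print(days_to_exfiltrate(1000, 800), "days")
    # Register: 1.8 hr at 16,384 GPUs -> ~14 min at 131,072 GPUs
    print(scaled_mtbf_hours(1.8, 16_384, 131_072) * 60, "min")
    # Register: ~130 TB/s per rack at 1.8 TB/s per GPU
    print(nvlink_aggregate_tb_s(1.8), "TB/s")
```

A check like this only confirms internal consistency between a row's value and the inputs quoted in its metric text; it says nothing about whether the underlying source figure is right.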
