Chapter 14.1

Operational KPIs, Goodput & the Reliability Economics of AI Factories

An AI factory does not earn money when it is 'up' — it earns money when accelerators are doing useful work on the critical path, so the number that governs day-2 economics is not facility availability but goodput, and every reliability dollar must be justified against the goodput it buys, not the nines it adds.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Which top-line operating metric you actually manage the fleet against — facility availability (the data-center-industry default), or ML goodput / ETTR (the metric the workload's economics obey) — because that single choice re-weights every reliability investment downstream.
What goodput target you contract and design for — and therefore how much you spend on fast checkpointing, hot spares, health-checking, and silent-corruption detection to close the gap between the ~90% industry-average and the ~96% best-in-class effective training time.
How you set the blast-radius policy: how large a synchronous failure domain you tolerate before splitting it (NVL36x2-style), given that one failed GPU in a tightly-coupled job idles the whole job and one bad node can cascade into preemptions across the cluster.
Which reliability spend is goodput-accretive (recovery speed, lemon-node ejection, SDC scanning) versus availability-accretive (2N power, Tier-IV cooling) — and, for a checkpointable training fleet, whether you are buying nines the workload does not value.
Which operations scorecard the board and the customer see — the KPI set, its denominators, and the SLA definitions — because an un-named denominator (uptime of what, measured how) is where day-2 disputes and stranded-asset surprises hide.

Useful output (goodput) peaks below raw throughput — running at maximum utilization destroys the work you are paid for.

Parts 1 through 13 of this guide build the machine. This part operates it — and operating it is where most of the lifetime money is won or lost, because the asset depreciates whether or not it is producing. The reflex inherited from twenty years of enterprise data-center operations is to manage the facility against availability: nines of uptime, hours of downtime per year, Tier classes. That reflex is not wrong so much as it measures the wrong thing. An AI factory is a depreciating compute asset whose return is set by how many accelerator-hours land on the revenue-bearing critical path. A hall can be 99.995% 'available' and still throw away a quarter of its compute to stragglers, restarts, silent corruption, and idle hot spares — and the income statement will not care that the UPS never dropped.

This chapter is the framework for Part 14: it reframes the operating objective from availability to goodput, defines the goodput stack (ETTR, ML goodput, MFU/MBU) at engineering depth, characterizes the AI failure environment that makes day-2 different from traditional IT, sets out the blast-radius problem that turns one component fault into a cluster-wide stall, and assembles the operations scorecard that the rest of Part 14 instruments. The deep treatment of why goodput beats availability as a design target lives in Chapter 12.2; this chapter is the operational counterpart — how you measure it, what it costs to move it, and what the scorecard looks like once the building is live.

The master operating fork: availability vs goodput

The first day-2 decision is which number sits at the top of your operations dashboard, because it silently re-weights every reliability investment below it. Availability asks: what fraction of wallclock time is the facility energized and the IT reachable? It is a facility-centric, binary-per-component view inherited from the Uptime Institute Tier model — Tier III at ~99.982% (about 1.6 hours of downtime per year), Tier IV at ~99.995% (about 26 minutes per year). Goodput asks a different and harder question: what fraction of the accelerator-hours you paid for actually advanced a job on the critical path? The two diverge sharply for AI, and that divergence is what Part 14 instruments.
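The downtime figures attached to the Tier classes are plain arithmetic on the availability fraction. A minimal sketch, assuming a non-leap 8,760-hour year (the function name is illustrative, not from any standard):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 h; leap years are ignored in this sketch


def downtime_hours_per_year(availability: float) -> float:
    """Expected annual downtime for an availability expressed as a fraction in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


for label, availability in [("Tier III", 0.99982), ("Tier IV", 0.99995)]:
    hours = downtime_hours_per_year(availability)
    print(f"{label}: {hours:.2f} h/yr ({hours * 60:.0f} min/yr)")
```

The same function applied to a goodput fraction would give "lost accelerator-hours per year", which is the comparison the rest of the chapter argues for.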

Consider a synchronous pre-training run on 16,384 GPUs. The facility never loses power; by the availability definition it is at five-nines. But a single GPU fails roughly every three hours at that scale (Meta's Llama 3 405B snapshot), and each failure idles all 16,384 accelerators until the job restarts from its last checkpoint. The 'available' facility is hemorrhaging goodput through a mechanism the availability metric cannot see, because the failure domain is the job, not the component. This is why the AI-cluster reliability rethink (Chapter 12.2) argues that for checkpointable training, spending on 2N facility power to chase nines is often capital misallocated — the same dollars buy far more goodput as faster checkpointing, hot spares, and recovery automation. The choice determines whether your next reliability dollar goes into a redundant feeder or into a flash-checkpoint tier.

Pick the denominator before you write the SLA

Pick the denominator you manage against before you write the SLA or size the redundancy. For a training-shaped fleet, manage against ML goodput / ETTR: the workload already tolerates restart, so availability nines past N+1 buy little, and the marginal reliability dollar belongs in recovery speed (checkpoint frequency, restart time, lemon-node ejection, SDC scanning). For an inference-shaped fleet, availability and goodput converge — an outage is lost revenue and a breached latency SLO in real time — so 2N / Tier-IV-class power plus N+1 cooling on standby is justified, and goodput is measured as served-token throughput against the SLO rather than effective training time. Naming the wrong denominator does not just mis-report performance; it routes your entire reliability budget to the wrong subsystem for two-to-five years. → quantitative reframing in Chapter 12.2; redundancy economics in Chapter 12.5.

The goodput stack: ETTR, ML goodput, MFU/MBU

'Goodput' is not one number; it is a stack of nested ratios, each measuring a different kind of waste, and conflating them is the most common source of dishonest operating reporting. Read from the outside in, each layer multiplies the one below it, and the product is what the dollars actually buy.

ETTR (Effective Training Time Ratio) is the outermost layer and the cleanest measure of cluster reliability: the ratio of productive runtime to the available wallclock time of a job run, ranging from 0 to 1, accounting for queueing delay, restart overhead, and re-computation of lost progress (Meta, Revisiting Reliability in Large-Scale ML Research Clusters, 2024–25). ETTR is model-agnostic — it does not care what fraction of a GPU's FLOPs a kernel extracts — which is exactly why it is the right SLA target between an operator and a training customer. Meta's largest jobs (>1024 GPUs) sustained average ETTR above 0.9 with one-hour checkpoint intervals on shared, congested clusters; pushing a hypothetical 100k-GPU run to 0.9 ETTR requires driving checkpoint and restart overhead down to roughly two minutes each.
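As a rough illustration of the definition, ETTR can be computed from a job run's time buckets. This is a sketch only: the bucket names and the example hours are hypothetical, and a production accounting would derive them from scheduler and job telemetry.

```python
from dataclasses import dataclass


@dataclass
class JobRunTimes:
    """Wallclock of one job run split into buckets, in hours (field names are hypothetical)."""
    productive_h: float  # time that advanced the job beyond its previous progress
    queue_h: float       # waiting for resources, including after a failure
    restart_h: float     # re-initialization and checkpoint load
    recompute_h: float   # redoing steps lost since the last checkpoint

    def ettr(self) -> float:
        """Productive runtime divided by total wallclock of the run."""
        wallclock = self.productive_h + self.queue_h + self.restart_h + self.recompute_h
        if wallclock <= 0:
            raise ValueError("job run has no recorded wallclock time")
        return self.productive_h / wallclock


run = JobRunTimes(productive_h=90.0, queue_h=4.0, restart_h=2.0, recompute_h=4.0)
print(f"ETTR = {run.ettr():.2f}")
```

Note that nothing about the model or its kernels enters the calculation, which is what makes the ratio model-agnostic.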

ML goodput (Google's formulation) is the aggregate productive work completed per unit time across the cluster, with all non-productive time accounted as 'badput': program startup, data-loading stalls, checkpoint writes that do not overlap compute, failed steps, wasted progress since the last checkpoint, and scheduling gaps. Where ETTR is per-job, ML goodput is the fleet-level rollup, and it is the number the operations scorecard reports. The industry average sits near 90%; best-in-class operators market ~96% (SemiAnalysis ClusterMAX / CoreWeave). That six-point gap is not rounding: at a 1 GW factory it is the difference between roughly 100 MW and 40 MW of accelerator power doing nothing useful.

MFU (Model FLOPs Utilization) and MBU (Model Bandwidth Utilization) are the innermost layer: they measure efficiency while a job is running, independent of failures. MFU is the fraction of the hardware's peak FLOPs the model actually realizes (compute-bound training; commonly 30–50%, >50% considered strong on Hopper). MBU is the analogous ratio for memory bandwidth (the binding constraint on autoregressive inference decode). A job can have a 0.95 ETTR and a 35% MFU (perfectly reliable, yet leaving roughly two-thirds of the peak math unused), which is why the three metrics must be reported together. Realized output = ETTR × (MFU or MBU) × peak, and an operator who quotes only one of the three is hiding the other two. → metric definitions in Chapter 0.3.
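The product rule is easy to check numerically. A minimal sketch, with peak normalized to 1.0 so the result reads as a fraction of the hardware's theoretical output (the function name is illustrative):

```python
def realized_fraction_of_peak(ettr: float, utilization: float) -> float:
    """Fraction of peak actually delivered: ETTR times MFU (training) or MBU (decode)."""
    for name, value in (("ettr", ettr), ("utilization", utilization)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return ettr * utilization


# The chapter's example: a highly reliable job that is inefficient on the math.
fraction = realized_fraction_of_peak(ettr=0.95, utilization=0.35)
print(f"Realized output is {fraction:.1%} of peak")
```

Quoting only the 0.95 or only the 35% would each misstate what the hardware delivered; the product is the figure that matters.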

The goodput stack — what each metric measures and what moves it

Metric	Question it answers	Failure domain	Typical range (2026)	Primary lever to improve it
ETTR	What fraction of a job's wallclock was productive?	The job (per-run)	0.9 large jobs; >0.9 best-in-class	Faster recovery: checkpoint cadence + restart time
ML goodput	What fraction of fleet GPU-time was useful work?	The cluster (rollup)	~90% avg; ~96% best-in-class	Cut badput: data stalls, async checkpoint, scheduling
MFU	How much of peak FLOPs does a running job realize?	The kernel/parallelism plan	30–50%; >50% strong (Hopper)	Parallelism strategy, kernel/comms overlap, fabric
MBU	How much of peak memory bandwidth does decode realize?	The kernel (inference)	Workload-dependent; decode-bound	Batching, KV-cache layout, quantization, scale-up size
Availability	What fraction of wallclock was the facility 'up'?	The component (binary)	99.982% (T-III) – 99.995% (T-IV)	Redundancy topology, concurrent maintainability

Nested ratios; realized useful output is approximately the product across the stack. Figures are 2026-current; see the key numbers below for sources and vintages.

The table is a hierarchy of denominators, and the operational sin is comparing two facilities on different rows. A neocloud quoting '99.99% uptime' is on the bottom row; a customer who actually loses 20% of their training run to stragglers and restarts is living on the top row. The contractual maturity of 2026 is that goodput, not uptime, is becoming the acceptance and SLA basis — operators gate production handoff on sustained NCCL/collective performance and low error rates over multi-day soaks, not power-on uptime, because each failed GPU can cost 15–20% performance and a mid-run failure destroys training economics. The industry is converging, slowly, on a contractual definition of 'goodput acceptance' (sustained collective bandwidth, an error budget, a soak window) that buyers can hold integrators to. → acceptance and IST in Chapter 13.6; goodput-vs-availability theory in Chapter 12.2.

The failure environment: why day-2 AI is not day-2 IT

Traditional enterprise IT operates a fleet of loosely-coupled, independently-failing servers: one box dies, a load balancer routes around it, and the blast radius is one request. The AI factory inverts every assumption behind that model, and the day-2 reliability program has to be rebuilt from the inverted premises.

The components fail far more often. Meta's research clusters log 6.50 failures per thousand node-days (RSC-1) and 2.34 per thousand node-days (RSC-2), roughly 5×10⁻³ failures per GPU node-day. That sounds small until you multiply by scale: an 8-GPU job has a mean time to failure of ~47.7 days, but the observed cluster MTTF falls roughly in inverse proportion to GPU count: to single-digit hours around 1,024 GPUs (~7.9 hr), to a couple of hours at 16,384 GPUs (~1.8 hr, Meta-observed and inflated by software-induced failures), and to minutes at the ~131,072-GPU scale. The intermediate figures are empirical, not a strict arithmetic chain, so they do not divide cleanly from the single-node base rate. The takeaway holds regardless: failure rate rises with GPU count, so the largest jobs spend a meaningful fraction of their life recovering rather than computing. This is the arithmetic that makes recovery speed, not component MTBF, the dominant goodput lever at frontier scale. → fleet failure-rate data in Chapter 14.3; operational recovery in Chapter 14.4.
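The scaling can be sketched with an idealized model in which every node fails independently at a constant rate, so a job's MTTF is the reciprocal of the summed node failure rates. The 8-GPUs-per-node default is an assumption of this sketch, and, as noted above, the empirical small-job figures will not match it exactly.

```python
def job_mttf_hours(n_gpus: int, failures_per_1000_node_days: float,
                   gpus_per_node: int = 8) -> float:
    """Idealized job MTTF: independent nodes, constant per-node failure rate."""
    if n_gpus <= 0 or gpus_per_node <= 0 or failures_per_1000_node_days <= 0:
        raise ValueError("all inputs must be positive")
    nodes = n_gpus / gpus_per_node
    failures_per_day = nodes * failures_per_1000_node_days / 1000.0
    return 24.0 / failures_per_day


RSC1_RATE = 6.50  # failures per 1,000 node-days, the RSC-1 figure quoted in the text
for gpus in (1_024, 16_384, 131_072):
    print(f"{gpus:>7} GPUs: MTTF ~ {job_mttf_hours(gpus, RSC1_RATE):.2f} h")
```

Under these assumptions the RSC-1 rate gives about 1.8 hours at 16,384 GPUs and roughly a quarter of an hour at 131,072, in line with the large-scale figures above; the model overestimates MTTF at 1,024 GPUs relative to the ~7.9 hr quoted, which is the non-clean division the text warns about.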

The failures are not all loud. The AI fleet has three failure classes, and the dangerous one is invisible. Hard failures (a GPU falls off the bus, a link drops) announce themselves and trigger a restart. Transient failures (a correctable ECC storm, a thermal throttle) degrade goodput without stopping the job. Silent data corruption (SDC) is the third and worst: a marginal device computes a wrong result with no error flag, quietly poisoning gradients or activations. Meta finds roughly 1 in 1,000 machines affected by SDC, and for large training runs an SDC event is expected every one to two weeks; Google estimates an SDC event every week or two during Gemini training. Soft-error rates have worsened with process shrinks — from roughly one failure per year at 65 nm to one per ~1.5 hours at 16 nm — so this is structurally getting harder, not easier. Detection is now a standing fleet program (Meta's Fleetscanner and Ripple run ~2.5 billion test seeds per month). → SDC mechanisms and detection in Chapter 14.3.

Silent corruption is the failure mode that breaks the availability worldview

An SDC event does not register on any availability or uptime metric (the machine is 'up', the job is 'running', the dashboard is green), yet it can corrupt a multi-week training run that only reveals itself as a mysterious loss spike days later, forcing a rollback to a checkpoint taken before the corruption began. The cost is not the failed step; it is every step of progress between the corruption and its detection, sometimes days of compute across tens of thousands of GPUs. The most expensive failures the building produces are precisely the ones availability cannot see. A day-2 program that lacks a standing SDC-scanning and checkpoint-validation discipline is structurally blind to its largest goodput risk. → detection programs in Chapter 14.3; checkpoint validation in Chapter 14.4.

The blast-radius problem

The defining structural feature of the AI failure environment is that the failure domain is not the failed component. In a tightly-coupled synchronous job, one GPU stalling on an all-reduce stalls the entire collective, and the whole job moves at the speed of its slowest straggler. The scale-up fabric amplifies this further: a failed NVSwitch tray degrades bandwidth for all 72 GPUs in an NVL72 domain, and a tensor-parallel group of 64 GPUs in which each GPU is unavailable 0.1% of the time drops to roughly 94% effective availability for the group. The blast radius of a single fault is the size of the coupling domain you chose at design time, so domain sizing is a reliability decision as much as a performance one.
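The ~94% group figure follows if each GPU is treated as independently unavailable 0.1% of the time and the group only makes progress when every member is healthy. A minimal sketch under exactly that assumption (real failures are correlated, so this is a first-order estimate):

```python
def group_availability(n_gpus: int, per_gpu_unavailability: float) -> float:
    """Probability that all n_gpus are healthy at once, assuming independent failures."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    if not 0.0 <= per_gpu_unavailability <= 1.0:
        raise ValueError("per_gpu_unavailability must be a fraction between 0 and 1")
    return (1.0 - per_gpu_unavailability) ** n_gpus


for size in (8, 36, 64, 72):
    print(f"{size:>2}-GPU domain: {group_availability(size, 0.001):.3f}")
```

Because the exponent is the domain size, every doubling of the coupling domain roughly doubles the share of time lost to a single member's fault, which is the quantitative form of "blast radius equals domain size".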

The cascade goes wider than the job. Meta found that 16% of total failure-related goodput loss came from secondary preemptions — small jobs getting evicted to free resources for a large job's restart. One large-job failure ripples into idle time across unrelated workloads sharing the cluster. This is the day-2 reason operators agonize over the NVL72-vs-NVL36x2 fork (one 120 kW domain vs two 66 kW domains): the larger domain lifts the tensor-/expert-parallel ceiling and the achievable MFU, but it doubles the blast radius of a single tray or cold-plate fault. There is no free choice here — you are trading peak efficiency against failure containment, and the right answer depends on whether you manage against MFU or against goodput. → scale-up domain sizing in Chapter 8.5; lemon-node ejection in Chapter 14.4.

Blast-radius policy — the domain-sizing fork

Policy	Failure domain	Goodput on a single fault	Peak MFU ceiling	Best fit
One large domain (NVL72)	72 GPUs / one rack	Whole 72-GPU domain idles until recovery	Highest (largest TP/EP degree)	Frontier dense / wide-MoE training; manage on MFU
Split domains (NVL36x2)	36 GPUs / half-rack	Half the domain contained; other half progresses	Lower (TP/EP capped at 36)	Goodput-managed fleets; reliability-sensitive runs
Elastic / redundant	Re-routed around the fault	Degraded throughput, no full stall	Variable (nonuniform parallelism)	Largest runs where any full stall is unaffordable

The same physical hardware, two containment policies. Figures are 2026-current reference points for NVL72-class GB200/GB300 hardware.

~90% / ~96%

ML goodput (effective training time): industry average vs best-in-class

2025 · SemiAnalysis ClusterMAX / CoreWeave

6.14 / 10.53 / 20.91%

goodput loss: gold-tier neocloud vs hyperscaler vs silver-tier provider

2026 · SemiAnalysis ClusterMAX 2.0/2.1

6–21%

reliability overhead as a share of cluster TCO

2025 · SemiAnalysis; domain synthesis

~1.8 hr

mean time to failure for a 16,384-GPU synchronous job (~7.9 hr at 1,024 GPUs)

2024 · Meta, Revisiting Reliability in Large-Scale ML Research Clusters

419 / 54 days

unplanned interruptions on 16,384 H100s training Llama 3 405B (~1 every 3 hr); 78% hardware-caused

2024 · Meta (Llama 3 paper) / Tom's Hardware

~1 in 1,000

machines affected by silent data corruption; SDC event every 1–2 weeks on large training runs

2025 · Meta Engineering; Google (Gemini)

1.2–1.7%

of fleet flagged as 'lemon' nodes; ejection cut large-job failure rate 14% to 4% (+30% completion)

2024 · Meta, Revisiting Reliability

99.982% / 99.995%

availability: Uptime Tier III (~1.6 hr/yr down) vs Tier IV (~26 min/yr)

2025 · Uptime Institute

The reliability economics: what a goodput point is worth

Day-2 reliability is not a cost to minimize; it is an investment to optimize against a measurable return, and the return is denominated in goodput. Reframe the spend with the canonical economic anchor: a 1 GW AI factory carries roughly $8.5B/yr in all-in TCO (Epoch AI), and at the short, frontier-economic depreciation life the figure runs higher still. If you are throwing away the difference between 96% and 90% goodput, you are wasting roughly six percentage points of an $8.5B/yr asset — on the order of $500M/yr of compute producing nothing. Against that denominator, the reliability program — fast checkpointing, hot spares, health-checking, SDC scanning, lemon-node ejection — is one of the highest-return investments in the entire building, and the industry treats reliability overhead as a deliberate 6–21% slice of TCO precisely because the alternative is more expensive.
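The value of a goodput gap is a one-line calculation once the TCO anchor is fixed. A minimal sketch, assuming TCO is spread evenly over paid accelerator time (a simplification; the function name is illustrative):

```python
def annual_cost_of_goodput_gap(annual_tco_usd: float, achieved: float, target: float) -> float:
    """Annual TCO spent on compute that would be useful at `target` goodput but is lost at `achieved`."""
    if not (0.0 <= achieved <= 1.0 and 0.0 <= target <= 1.0):
        raise ValueError("goodput values must be fractions between 0 and 1")
    return annual_tco_usd * max(0.0, target - achieved)


TCO_1GW_USD = 8.5e9  # the ~$8.5B/yr anchor quoted in the text
gap_usd = annual_cost_of_goodput_gap(TCO_1GW_USD, achieved=0.90, target=0.96)
print(f"Cost of the 90% -> 96% gap: ${gap_usd / 1e6:,.0f}M per year")
print(f"Value of one goodput point: ${TCO_1GW_USD * 0.01 / 1e6:,.0f}M per year")
```

Pricing a single goodput point this way gives a budget ceiling against which any individual reliability investment can be compared.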

The economics also explain the hidden-tax structure that ClusterMAX exposed: holding GPU sticker price constant, a gold-tier neocloud's total cost runs lower than a silver-tier provider's by 5–15% on large training workloads, and a hyperscaler's 36-month total swelled to 1.10x a gold-tier neocloud's — a 10% hidden tax — entirely through goodput loss (10.53% vs 6.14%) that never appears on the rate card. The cheapest GPU-hour on paper can cost the most once delivered, because reliability is priced in goodput, not in the quoted $/GPU-hr. For fault-tolerant workloads (single-node inference), the gap collapses toward zero — which is exactly why the metric fork at the top of this chapter is not optional: the value of a goodput point is workload-dependent, and so is the reliability spend that is rational to chase it. → unit economics in Chapter 1.8.

Deep dive: deriving ETTR and why two-minute recovery is the frontier target

ETTR makes the recovery-speed economics legible. Decompose a long run into productive time Tₙ, and lost time from each interruption: detection latency, restart/re-init overhead R, and re-computation of progress lost since the last checkpoint (on average half the checkpoint interval, τ/2, under the Young/Daly model). With failures arriving at rate λ proportional to GPU count, the expected lost time per failure is roughly R + τ/2, and ETTR ≈ Tₙ / (Tₙ + λ·(R + τ/2)·wallclock). Two levers move it: shrink τ (checkpoint more often, which costs write bandwidth and badput unless the write overlaps compute) and shrink R (faster restart, which costs hot spares and orchestration). The optimal checkpoint interval is the Young/Daly balance between checkpoint-write overhead and expected re-computation — canonical math in Chapter 9.4; here we care about its operational consequence.
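For orientation, Young's first-order approximation of that balance puts the checkpoint interval at the square root of twice the checkpoint-write cost times the MTTF. The sketch below assumes a blocking write and uses a hypothetical 60-second write cost; Daly's refinement adds higher-order terms that are omitted here.

```python
import math


def young_checkpoint_interval_s(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTTF)."""
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Hypothetical inputs: a 60 s blocking checkpoint write, and the ~1.8 h MTTF
# quoted in this chapter for a 16,384-GPU job.
interval_s = young_checkpoint_interval_s(checkpoint_cost_s=60.0, mttf_s=1.8 * 3600)
print(f"Checkpoint roughly every {interval_s / 60:.1f} min")
```

The square-root form shows why cheaper checkpoints matter twice: they cut the direct write overhead and also pull the optimal interval down, which shrinks the τ/2 recomputation term.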

The consequence is stark at scale. Meta's largest jobs hold ETTR above 0.9 with hour-long checkpoint intervals because λ is still manageable. But push to a hypothetical 100k-GPU run and λ rises proportionally; holding 0.9 ETTR then demands driving both R and τ down to roughly two minutes each — which is why multi-tier and asynchronous checkpointing (flash-local plus remote, async drain overlapping compute) and elastic restart are no longer optional at frontier scale. The operational target 'recover in two minutes' is not an arbitrary SLA; it falls directly out of the ETTR arithmetic once you fix the failure rate and the goodput floor. → operational checkpoint tuning in Chapter 14.4.
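The arithmetic can be made concrete. Writing the number of failures as λ times wallclock and solving the ETTR expression gives the closed form ETTR ≈ 1 − λ·(R + τ/2). The sketch below uses that form and ignores detection latency, queueing, and checkpoint-write overhead; the 8-GPUs-per-node default and the choice of the RSC-2 failure rate are assumptions of the example.

```python
def expected_ettr(n_gpus: int, failures_per_1000_node_days: float,
                  restart_min: float, checkpoint_interval_min: float,
                  gpus_per_node: int = 8) -> float:
    """First-order ETTR: 1 - lambda * (R + tau / 2), floored at zero."""
    nodes = n_gpus / gpus_per_node
    failures_per_min = nodes * failures_per_1000_node_days / 1000.0 / (24 * 60)
    lost_fraction = failures_per_min * (restart_min + checkpoint_interval_min / 2.0)
    return max(0.0, 1.0 - lost_fraction)


RSC2_RATE = 2.34  # failures per 1,000 node-days, the RSC-2 figure quoted earlier
for interval_min in (60.0, 10.0, 2.0):
    ettr = expected_ettr(100_000, RSC2_RATE, restart_min=2.0,
                         checkpoint_interval_min=interval_min)
    print(f"tau = {interval_min:>4.0f} min, R = 2 min -> ETTR ~ {ettr:.2f}")
```

Under these assumptions an hour-long interval leaves a 100k-GPU run well below half productive, while two-minute values for both terms clear the 0.9 floor, consistent with the target described above.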

The operations scorecard

An operations program is only as honest as its scorecard, and the scorecard is only as honest as its denominators. The day-2 KPI set spans three layers — the workload, the fleet, and the facility — and the discipline is to report all three with named denominators rather than collapsing them into a single flattering 'uptime' figure. The workload layer is goodput and ETTR (and MFU/MBU underneath); the fleet layer is failure rate per node-day, MTTR decomposition, lemon-node rate, and SDC detection rate; the facility layer is availability, PUE/WUE, and power/thermal headroom. Each layer has a different audience: the customer cares about goodput, the reliability engineer cares about the fleet layer, and the facility team and lenders care about the bottom layer.

The recurring day-2 failure is a scorecard that reports the bottom layer (because it is the one traditional DCIM measures well) and is silent on the top layer (because it requires IT/facility telemetry correlation the legacy stack never had to do). A facility can show a perfect availability scorecard while its customers are quietly losing a fifth of their compute to badput the operator never instrumented. Closing that gap — wiring the facility telemetry to the workload telemetry so that a thermal event, a power transient, and a training stall can be correlated to a single root cause — is the central task of the observability chapter that follows. → telemetry and IT/facility correlation in Chapter 14.2.

The day-2 operations scorecard — three layers, named denominators

Layer	Primary KPIs	Denominator	Who owns it	Where it is instrumented
Workload	ML goodput, ETTR, MFU/MBU	Accelerator-hours on the critical path	Reliability eng + customer	Job telemetry, NCCL/collective monitors
Fleet	Failures/node-day, MTTR, lemon-node %, SDC rate	Node-days / GPU-days in service	SRE / fleet reliability	Health-checks, Fleetscanner-class scanners
Facility	Availability, PUE, WUE, power/thermal headroom	Wallclock; total facility energy	Facilities / DCIM	DCIM, BMS, branch-circuit + CDU telemetry

The KPI set Part 14 instruments. The 'who owns it' column is the accountability split that prevents a metric from falling between teams.
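One way to keep denominators from going unnamed is to carry them alongside each value in the scorecard record itself. The following is a sketch only: every field name, the choice of denominators, and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScorecardWindow:
    """Raw counts for one reporting window (all field names are hypothetical)."""
    useful_accel_hours: float   # accelerator-hours that advanced a job
    paid_accel_hours: float     # accelerator-hours provisioned in the window
    failures: int               # job-interrupting node failures
    node_days_in_service: float
    lemon_nodes: int
    nodes_in_fleet: int
    facility_up_hours: float
    wallclock_hours: float


def scorecard(w: ScorecardWindow) -> dict[str, tuple[float, str]]:
    """Return each KPI as (value, denominator label) so the denominator travels with it."""
    return {
        "ml_goodput": (w.useful_accel_hours / w.paid_accel_hours, "paid accelerator-hours"),
        "failures_per_1000_node_days": (1000.0 * w.failures / w.node_days_in_service,
                                        "node-days in service"),
        "lemon_node_share": (w.lemon_nodes / w.nodes_in_fleet, "nodes in fleet"),
        "facility_availability": (w.facility_up_hours / w.wallclock_hours, "wallclock hours"),
    }


window = ScorecardWindow(900_000, 1_000_000, 65, 10_000, 15, 1_000, 719.9, 720.0)
for kpi, (value, denominator) in scorecard(window).items():
    print(f"{kpi}: {value:.4f} (per {denominator})")
```

In this made-up window the facility layer reads as nearly perfect while the workload layer shows a tenth of paid compute lost, which is the reporting gap the table is designed to expose.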

Deep dive: lemon nodes — the highest-ROI day-2 reliability intervention

A 'lemon' node is a machine that passes power-on and basic health checks but fails repeatedly under real workload — a marginal NVLink connector, a cold plate with intermittent flow, a GPU that throttles under sustained load. Because it boots and pings, the availability metric counts it as 'up'; because it fails under load, it silently caps the goodput of every job unlucky enough to land on it. Meta's reliability work identified that just 1.2–1.7% of the fleet were lemon nodes, but ejecting them with >85% detection accuracy cut the large-job failure rate from 14% to 4% and improved large-job completion by ~30%.

The reason this is the highest-ROI intervention in day-2 operations is leverage: a tiny, identifiable fraction of the fleet causes a disproportionate share of goodput loss, and removing it requires no capex — only the telemetry to distinguish a load-failing node from a healthy one, and the orchestration to drain and eject it automatically. The dollars went not into more nines but into finding and removing the nodes that quietly destroy goodput, and the return dwarfed any redundancy upgrade. Automated lemon-node ejection, fault isolation, and remediation are detailed as an operational discipline in Chapter 14.4; the underlying failure taxonomy and detection programs in Chapter 14.3.
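A deliberately simple illustration of the idea: flag nodes that are implicated in repeated failures across several distinct jobs, since a node that fails under many unrelated workloads is more likely to be the common cause than any one job. This is not Meta's detection method; the thresholds, event format, and function name are hypothetical, and a production detector would combine many more signals.

```python
from collections import defaultdict
from typing import Iterable


def flag_lemon_nodes(failure_events: Iterable[tuple[str, str]],
                     min_failures: int = 3,
                     min_distinct_jobs: int = 2) -> list[str]:
    """Return node IDs with at least min_failures failures spread over min_distinct_jobs jobs.

    failure_events is an iterable of (job_id, node_id) pairs, one per job failure
    attributed to that node.
    """
    failure_count: dict[str, int] = defaultdict(int)
    jobs_seen: dict[str, set[str]] = defaultdict(set)
    for job_id, node_id in failure_events:
        failure_count[node_id] += 1
        jobs_seen[node_id].add(job_id)
    return sorted(
        node for node, count in failure_count.items()
        if count >= min_failures and len(jobs_seen[node]) >= min_distinct_jobs
    )


events = [("job-a", "n1"), ("job-b", "n1"), ("job-c", "n1"),   # fails under many jobs
          ("job-a", "n2"), ("job-a", "n2"), ("job-a", "n2"),   # may be the job, not the node
          ("job-d", "n3")]                                     # single incident
print(flag_lemon_nodes(events))
```

Requiring distinct jobs is the design choice that separates a bad node from a bad job: n2 fails as often as n1 here but only ever under one workload, so it is not flagged.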

Anti-patterns

The same day-2 mistakes recur, each one a consequence of reaching for the facility-availability worldview when the workload obeys goodput economics. Three are worth naming:

Buying nines a checkpointable job does not value. Commissioning 2N / Tier-IV power for a synchronous training fleet that already tolerates checkpoint-and-resume. The capital would return far more as goodput — faster checkpointing, hot spares, more GPUs — than as facility availability the workload cannot monetize. → Chapter 12.2.
Reporting availability and calling it goodput. A green uptime dashboard over a fleet quietly losing 15–20% of its compute to stragglers, restarts, and badput. The denominator is wallclock-of-the-facility, not accelerator-hours-on-the-critical-path, and the gap is invisible until a customer measures their own ETTR and disputes the SLA.
No standing SDC program. Treating silent corruption as a rare anomaly rather than a continuous fleet condition (~1 in 1,000 machines, an event every 1–2 weeks at scale). Without continuous scanning and checkpoint validation, the operator is structurally blind to the poisoned runs that end in multi-day rollbacks.

This chapter sets the operating frame for all of Part 14. The theory of why goodput beats availability as a design target is in Chapter 12.2, with the quantitative availability/redundancy modeling in Chapter 12.5. The telemetry and IT/facility correlation that make the scorecard observable are in Chapter 14.2; the failure taxonomy, fleet failure-rate data, and SDC detection programs in Chapter 14.3; operational checkpoint/restart tuning, fault isolation, and lemon-node ejection in Chapter 14.4. The checkpoint-interval math behind ETTR is canonical in Chapter 9.4; the scale-up domain-sizing fork behind blast radius in Chapter 8.5; fleet-wide fault tolerance and autonomous recovery in Chapter 10.7; goodput-based acceptance testing in Chapter 13.6; the unit economics that price a goodput point in Chapter 1.8; and the metric definitions in Chapter 0.3.