Chapter 12.3
Disaster Recovery, Business Continuity & Geographic Failover
Disaster recovery for an AI factory is not a backup policy bolted onto a building. It is a per-workload decision about how much spare capacity you pre-pay for, in which geography, kept how warm, against a token-priced revenue clock. Because GPU capacity is power-bound and cannot be conjured on demand, the spare region is the most expensive idle asset on the balance sheet and the most consequential one to size wrong.
What you'll decide here
- Which RTO/RPO tier each workload class actually requires — interactive inference (minutes) versus training (hours, checkpoint-bounded) versus batch (best-effort) — because under-specifying strands an SLA and over-specifying pre-pays for a second factory you rarely use.
- Active-active across regions versus active-passive with warm/cold standby — the fork that sets whether you carry one fleet, two, or 1.x, and therefore the single largest line in the continuity budget.
- How much failover capacity you reserve and where it physically sits — because in a power-bound market you cannot rent your way out of a regional outage on the day it happens; the spare megawatts must already be energized and the GPUs already racked.
- Which single dependencies — a region's control plane, a DNS anchor, a model-weight store, a key engineer, a single CDU vendor — can take the whole estate down at once, and which of them you are willing to leave un-hedged.
- What you have actually contracted to tenants and customers (the DR commitment in the MSA) versus what your topology can deliver under a real correlated failure — and whether your drill cadence proves the gap is closed.
Most of this Part has been about keeping a single facility running: redundancy topologies (Chapter 12.1), the goodput-versus-availability rethink (Chapter 12.2), the thermal and electrical paths that fail inside the fence. This chapter is about the failure that the fence cannot stop — the one that takes the whole site, the whole region, or the whole control plane at once. A transformer fire, a substation lost to weather, a fiber cut that isolates a campus, a cloud region's metadata anchor going dark, a wildfire-evacuation order, a ransomware event in the orchestration layer. When that happens, the question is no longer "is my UPS healthy?" It is: where does the work go, how fast, and did I pay to have somewhere for it to go?
This chapter works through the continuity decisions geography forces. It defines RTO and RPO per workload class and prices each continuity tier; draws the master fork — active-active versus active-passive across regions — and traces what each does to fleet count and cost; confronts the 2026 reality that failover capacity is power-bound and must be energized in advance, not rented on the day; catalogues the single dependencies (control plane, DNS, weights store, key personnel, single-vendor loops) that turn a local fault into an estate-wide one; maps the contractual DR obligations you owe tenants against what the topology actually delivers; and closes on drills, runbooks, and the FMEA tie-in that prove the plan works before you need it.
RTO, RPO, and why AI workloads split the table
Two numbers govern every continuity decision. RTO (Recovery Time Objective) is how long you can afford to be down: the maximum wall-clock time from failure to service restored. RPO (Recovery Point Objective) is how much work you can afford to lose: the gap back to the last durable state. They are independent: you can want fast recovery (low RTO) while tolerating some lost work (looser RPO), or the reverse. The cost of each tier rises steeply and non-linearly as either target approaches zero, which is why the first act of continuity engineering is to refuse a single estate-wide target and instead tier the workloads.
AI workloads split the RTO/RPO table more sharply than any traditional enterprise estate, because the archetypes have opposite tolerances (the same split that drove the redundancy logic in Chapter 1.2 and Chapter 1.3). Interactive inference is the revenue surface and the tightest tier: a user is waiting, an SLA is running, and a regional loss that takes minutes to absorb is a visible outage. Practitioner targets cluster around a ~15-minute RTO and a ~5-minute RPO for production inference (Introl DR analysis, 2025). Training is the opposite: the job already checkpoints, so its RPO is simply the checkpoint interval (commonly 2-4 hours of replicated state), and its RTO is bounded by how fast you can re-acquire GPUs and resume from the last checkpoint — hours, not minutes, are tolerable. Batch inference is best-effort: it queues, it retries, it can wait for the primary to come back. Spending inference-grade DR on a batch pipeline is the same anti-pattern as commissioning 2N power for a checkpointable training job — buying a tier the workload does not value.
| Workload class | RTO target | RPO target | Failover mechanism | Cost vs single-region |
|---|---|---|---|---|
| Interactive inference | ~15 min | ~5 min | Active-active or hot warm standby; global load-balancer drain | ~1.7-2x (carry a second serving fleet) |
| Training (frontier) | Hours | Checkpoint interval (2-4 hr) | Resume from replicated checkpoint on re-acquired GPUs | +5-20% (cross-region checkpoint replication) |
| Model / weights registry | ~1 hr | Near-zero (versioned, replicated) | Multi-region object replication; immutable versioning | Storage egress + duplicate-store cost |
| Batch inference | Hours (best-effort) | Re-runnable (idempotent) | Re-queue against any region with capacity | Minimal (opportunistic spare) |
| Control / orchestration plane | Minutes | Near-zero | Multi-region quorum; no single-region anchor | Engineering cost > hardware cost |
Read the cost column as the thing you are actually buying. Zero RPO is not free even when it sounds like good hygiene: forcing a training job to a zero-RPO posture (synchronous cross-region state) imposes a ~15-20% throughput penalty on the run (Introl, 2025) — you are paying in goodput for a continuity tier the checkpoint already provides for nearly free. The honest move is to set RPO equal to the checkpoint cadence and spend the saved bandwidth on more GPUs. The checkpoint math that makes this defensible — interval selection, multi-tier checkpointing, sub-2-minute restart — is canonical in Chapter 9.4; here it is simply the input that sets training's RPO floor.
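As a minimal sketch of the tiering idea, the targets in the table above can be encoded as data and a measured recovery checked against them. The names `Tier`, `training_tier`, `meets_tier`, and `goodput_after_sync_penalty` are illustrative, not from any library, and the 240-minute training RTO is an assumed stand-in for "hours".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_min: float  # recovery time objective, minutes
    rpo_min: float  # recovery point objective, minutes


# Interactive-inference targets from the table above.
INFERENCE = Tier("interactive inference", rto_min=15, rpo_min=5)


def training_tier(checkpoint_interval_min: float, rto_min: float = 240) -> Tier:
    """Training RPO is set equal to the checkpoint cadence.

    The 240-minute RTO default is an assumed stand-in for "hours".
    """
    return Tier("training", rto_min=rto_min, rpo_min=checkpoint_interval_min)


def meets_tier(tier: Tier, measured_rto_min: float, measured_rpo_min: float) -> bool:
    """True when a measured recovery stays within both objectives."""
    return measured_rto_min <= tier.rto_min and measured_rpo_min <= tier.rpo_min


def goodput_after_sync_penalty(gpus: int, penalty: float) -> float:
    """GPU-equivalents left after a synchronous-replication throughput penalty."""
    return gpus * (1.0 - penalty)
```

Under these assumptions, `meets_tier(INFERENCE, 12, 4)` holds while a recovery that loses 9 minutes of state does not, and at the quoted 15-20% penalty a 1,000-GPU run would deliver roughly 800 to 850 GPU-equivalents of goodput.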
The spare-region problem: continuity is power-bound
What makes AI-factory DR different from every continuity playbook written before 2024 is that failover capacity can no longer be conjured on demand. In a traditional cloud estate, failover capacity is fungible and on-demand: a region fails, you spin up instances elsewhere, you pay the burst rate, you move on. In a power-bound AI estate, that escape hatch is closed. You cannot rent 200 MW of GPUs in a neighbouring region on the morning your primary burns, because that capacity does not exist as slack — every megawatt is contended, interconnection queues run years (the queue framing from Chapter 3.1), and a high-demand part like a current-generation accelerator is allocation-gated, not catalogue-stocked. The failover capacity must already be energized and the GPUs already racked before the disaster, or it is not failover — it is a wish.
That collapses the comforting cloud-era distinction between "reserve capacity" and "pay for it later." For the inference surface, the spare region is a real second factory, pre-paid, drawing real power, depreciating on the same 2-3 year economic clock as the primary (the depreciation reality of Chapter 1.3 and the economics chapter it points to). The continuity decision is therefore a capital-allocation decision disguised as an availability one: how many megawatts of idle-until-needed capacity will you underwrite, and can you make them earn while they wait?
The most important mitigation is to stop treating the spare as idle. Reverse-arbitrage the standby: run interruptible, RPO-loose work — batch inference, evaluation sweeps, synthetic-data generation, low-priority fine-tunes — on the failover fleet during normal operation, and pre-empt it instantly when the primary fails. This is the continuity analogue of the curtailable-load fast lane in power procurement: the spare region pays part of its own carry by doing displaceable work, and the failover event becomes a scheduler pre-emption rather than a cold start. The design constraint is that the pre-emptible workload must drain fast enough to hit the inference RTO — which is an orchestration property, not a hardware one.
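The drain constraint above can be sketched as a simple time budget: the phases of a warm-standby failover, including the pre-emption drain, must sum to no more than the RTO. The phase names and the durations in the usage note are hypothetical; real values would come from drill measurements.

```python
def failover_fits_rto(rto_s: float, detect_s: float, declare_s: float,
                      preempt_drain_s: float, promote_s: float) -> tuple[bool, float]:
    """Return (fits, slack_s) for a warm-standby failover time budget.

    All inputs are seconds. A negative slack is the amount by which the
    budget overruns the RTO.
    """
    total = detect_s + declare_s + preempt_drain_s + promote_s
    return total <= rto_s, rto_s - total
```

With a 15-minute (900 s) RTO and assumed phases of 60 s detection, 120 s declaration, 300 s pre-emption drain, and 240 s promotion, the budget fits with 180 s of slack; if the drain takes 600 s instead, the same plan overruns by 120 s, which is why drain speed is the property to engineer.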
| Posture | Standby state | Realistic RTO | Steady cost vs single-region | Energized in advance? |
|---|---|---|---|---|
| Active-active | Live, serving, sized to absorb peer load | Seconds (LB drain) | ~2x | Yes — both regions full |
| Hot / warm standby | Running, draining-ready, scaled-down | Minutes | ~1.4x (about 60% less added cost than active-active) | Yes (must be racked + powered) |
| Pilot light | Core/control plane up; GPU pool minimal | ~1-4 hr (scale-up time) | ~0.2x of full redundancy | Partially — depends on slack that may not exist |
| Cold standby | Defined as code; nothing running | Hours-days | Storage + IaC only | No — exposed to capacity-acquisition risk |
| Backup / restore only | Data replicated; no compute reserved | Days+ | Replication storage only | No — not viable for inference SLAs |
The table hides a trap in its lower rows. Pilot-light and cold standby look attractive because their steady cost is low — but their RTO assumes you can acquire the missing capacity when the disaster hits, and in a power-bound market that assumption is exactly the one that fails. A cold-standby plan that depends on renting GPUs from a neocloud during a regional outage is a plan that competes for scarce capacity with every other operator whose primary just failed in the same correlated event (the wildfire, the heatwave-driven grid event, the regional storm). Cold standby is honest only for workloads that can genuinely wait days; for the inference surface, the realistic choices are active-active or hot standby, and both mean pre-energized megawatts.
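The trap can be made explicit with a small filter over the posture table: a posture is viable for a workload only if its realistic RTO fits the target and its capacity is energized in advance. This is a sketch; the numeric RTO upper bounds are illustrative assumptions chosen to match the table's qualitative ranges, and "partially" energized is treated as not guaranteed.

```python
from typing import NamedTuple


class Posture(NamedTuple):
    name: str
    rto_upper_min: float  # assumed upper bound on realistic RTO, minutes
    pre_energized: bool   # capacity racked and powered before the event


# Rows follow the posture table; numeric RTO bounds are illustrative assumptions.
POSTURES = [
    Posture("active-active", 1, True),
    Posture("hot/warm standby", 15, True),
    Posture("pilot light", 240, False),  # "partially" treated as not guaranteed
    Posture("cold standby", 2880, False),
    Posture("backup/restore only", 4320, False),
]


def viable_postures(rto_target_min: float,
                    require_pre_energized: bool = True) -> list[str]:
    """Postures whose RTO fits the target and, optionally, are pre-energized."""
    return [
        p.name
        for p in POSTURES
        if p.rto_upper_min <= rto_target_min
        and (p.pre_energized or not require_pre_energized)
    ]
```

For a 15-minute inference RTO this returns only active-active and hot/warm standby. Relaxing the target to four hours adds pilot light only when `require_pre_energized` is set to `False`, that is, only if you are willing to bet on slack capacity existing on the day.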
Single dependencies: how a local fault becomes an estate-wide one
Geographic failover protects against losing a place. It does nothing against losing a shared dependency — a component that, when it fails, fails everywhere at once and renders your second region useless because it depended on the same thing. The discipline here is the blast-radius lens from Chapter 12.1 applied across regions: enumerate every system that is common to all sites and ask what happens when it is the thing that breaks.
The canonical 2026 warning is the AWS US-EAST-1 outage of October 19-20, 2025: a latent race condition in DynamoDB's DNS-management automation produced an empty DNS record for a regional endpoint, and because so many global services anchor their metadata and control-plane operations in that one region, a ~15-hour event rippled worldwide — taking down services whose own architecture was nominally multi-region but whose control plane was not (AWS post-event summary; InfoQ; ThousandEyes, 2025). The lesson is precise: a multi-region data plane with a single-region control plane is a single-region system. Your inference can be replicated across three regions, but if the scheduler, the service-discovery layer, the secrets store, or the model-registry metadata lives in one place, that place is your true availability ceiling.
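The availability-ceiling claim can be illustrated with a simple series/parallel calculation. This is a sketch under an optimistic assumption that region failures are independent; the function names and the 99.9% figures in the usage note are illustrative, not measurements of any real provider.

```python
def data_plane_availability(region_availability: float, regions: int) -> float:
    """Probability at least one region is up, assuming independent failures."""
    return 1.0 - (1.0 - region_availability) ** regions


def estate_availability(control_plane_availability: float,
                        region_availability: float, regions: int) -> float:
    """A single-region control plane sits in series with the whole data plane."""
    return control_plane_availability * data_plane_availability(
        region_availability, regions
    )
```

With three regions at an assumed 99.9% each, the data plane alone is better than six nines, yet `estate_availability(0.999, 0.999, 3)` stays just under 99.9%: the single-region control plane caps the estate regardless of how many regions serve traffic.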
The dependencies worth enumerating for an AI estate are specific:
- The control / orchestration plane (scheduler, fleet-management, health-checking): anchor it in a single region and you have re-created US-EAST-1.
- The model-weights store: if the failover region cannot serve the current weights because replication lagged, your spare fleet boots into a stale or empty model; weights replication must be versioned, immutable, and ahead of the failover need, not behind it.
- DNS and global load-balancing: the very mechanism you rely on to drain a failed region is itself a global dependency that must not share fate with it.
- Shared firmware and single-vendor loops: a common-cause defect in a CDU controller, a BMC firmware revision, or a single liquid-cooling vendor's pump logic can take every site running that revision down together (the thermal-path reliability concern from Chapter 12.2); this is the beta-factor / common-cause-failure problem that the quantitative model in Chapter 12.5 exists to size.
- People: covered next, because the bus-factor on a novel liquid-cooled estate is more real than most plans admit.
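The enumeration above amounts to a shared-fate audit: list what each region depends on, then look for anything every region has in common. A minimal sketch, assuming dependencies are recorded as string identifiers per region (the region names and identifiers in the usage note are hypothetical):

```python
def shared_fate(deps_by_region: dict[str, set[str]]) -> set[str]:
    """Dependencies that every region relies on.

    Anything returned here fails everywhere at once, so geographic
    failover does not protect against it.
    """
    regions = list(deps_by_region.values())
    return set.intersection(*regions) if regions else set()
```

For example, if a hypothetical inventory records `{"scheduler@region-a", "weights-store@region-a", "cdu-fw-2.1"}` for region A and `{"scheduler@region-a", "weights-store@region-b", "cdu-fw-2.1"}` for region B, the audit returns the scheduler and the CDU firmware revision: a single-region control plane and a common-cause firmware exposure.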
Black-swan, pandemic, and human-continuity
Geographic failover answers "what if the site is gone." Business continuity answers the harder, slower questions: what if the people are gone, the supply chain is gone, or the operating environment changes for months rather than hours. These are the low-probability, high-consequence tails that drills rarely exercise and plans rarely fund — and that the 2020-2022 period taught a generation of operators to take seriously.
Key-personnel continuity is sharper for AI factories than for legacy halls precisely because the technology is new. The number of engineers who can safely intervene on a 130 kW direct-to-chip liquid-cooled rack mid-fault, or who understand a specific cluster's NCCL-level failure signatures, is small — often a handful per site, sometimes one. That bus-factor is a continuity risk equal to any transformer. The mitigations are unglamorous and effective: documented runbooks that a competent on-call engineer can execute cold (not tribal knowledge), cross-training and rotation so no single person is the only path to recovery, vendor field-service contracts with guaranteed response windows for the loops you cannot self-service, and a deliberate refusal to let the on-call roster narrow to one name.
Supply continuity is the slow disaster. AI-factory recovery depends on parts that are themselves allocation-gated: HBM and advanced-packaging supply is sold out generations ahead, CDUs and quick-disconnects are specialized, high-voltage transformers and switchgear carry multi-year lead times. A spare-parts strategy sized for a legacy hall (a few PSUs and fans on a shelf) does not cover a liquid-cooled estate where the failed part may be a 128-week-lead transformer or an allocation-gated accelerator. Continuity here means stocking the critical-spares list deliberately, holding vendor SLAs with teeth, and accepting that some failures are recovered by reconfiguration (shrinking the cluster, re-routing the fabric) rather than replacement.
Pandemic / access-denial continuity generalizes the personnel question: can the site run lights-out for an extended period with no on-site staff, can remote hands and zero-touch provisioning carry the load, and is the runbook executable by people who cannot physically enter the building? The facilities that rode 2020 best were the ones already operating close to lights-out by design.
Deep dive: distributed-small-pools vs concentrated-big-site as a continuity architecture
There is a structural DR choice that sits upstream of warm-versus-cold standby, and it is geographic granularity. The instinct of a power-bound era is to concentrate: one gigawatt-class campus on one giant interconnection, because that is where the cheap firm power and the scale economics live (the siting logic of Chapter 3.1). But concentration maximizes blast radius: a single grid event, a single substation, a single weather footprint can take the entire estate. The continuity alternative is distribution: instead of one 1,000-GPU pool on one ~1.76 MW connection, ten pools of ~100 GPUs across ten regions and ten independent grids, each drawing ~176 kW, comfortably below the threshold that triggers years-long large-load interconnection studies, and each an independent failure domain (distributed-pool framing, Introl, 2025). Failover becomes a matter of routing work away from a pool rather than standing up a cold region.
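The pool arithmetic above can be reproduced in a few lines. This is a sketch assuming equal-size pools and a uniform all-in power draw per GPU; the function names are illustrative.

```python
def per_pool_power_kw(total_gpus: int, total_mw: float, pools: int) -> float:
    """Power drawn by each pool when the fleet is split evenly."""
    kw_per_gpu = total_mw * 1000.0 / total_gpus
    return kw_per_gpu * (total_gpus / pools)


def capacity_lost_fraction(pools: int, failed: int = 1) -> float:
    """Share of fleet capacity lost when `failed` equal-size pools go down."""
    return failed / pools
```

Splitting the 1,000-GPU, ~1.76 MW fleet into ten pools gives about 176 kW per pool, and losing one pool costs a tenth of capacity, against the whole fleet when the single concentrated site fails.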
The trade is real and it is the training-versus-inference fork again. Distribution is natural for inference, which is loosely coupled, latency-served, and benefits from proximity anyway — geo-distribution doubles as both DR and a latency strategy. It is hostile to frontier training, which is one tightly-coupled synchronous job that wants the largest possible single non-blocking domain and pays a convergence and bandwidth penalty for spanning regions (inter-site training is bandwidth-bound; ~1 Pbit/s inter-region targets and coast-to-coast RTT are the limiters). So the continuity architecture follows the archetype: distribute the inference surface for resilience-and-proximity, concentrate the training cluster and protect it with checkpointing and a hot-spare pool rather than a second campus. Most real estates are a hybrid — a concentrated training core plus a distributed inference mesh — and the DR plan must address each with its own posture.
Contractual continuity: what you owe vs what you can deliver
DR is not only an engineering posture; it is a set of promises in a contract, and the gap between the promise and the topology is where operators get hurt. A colocation or capacity provider's master service agreement carries continuity obligations — availability commitments, maintenance-window rules, sometimes explicit DR or geographic-redundancy clauses — and the customer's own SLA to their users sits on top. When a real correlated failure exceeds the topology, the service-credit ladder fires and, worse, the reputational and renewal damage compounds. The discipline is to make the contract describe what the topology can actually deliver under a realistic failure environment, not under the optimistic independence assumption.
Three contractual traps recur. First, promising an availability number the single-region topology cannot reach — committing to four or five nines on a facility whose control plane, weights store, or sole liquid-cooling vendor is a single point of failure; the number is unreachable the moment a common-cause event hits. Second, silent capacity assumptions — a DR clause that implies failover capacity exists without specifying that it is pre-energized and reserved, so that the obligation is technically met on paper and impossible in a power-bound outage. Third, mismatched RTO/RPO tiers — selling every workload the same continuity grade rather than tiering it, which either over-charges the customer for batch-grade work or under-protects their inference surface. The detailed structure of availability-versus-goodput SLAs, the penalty and service-credit ladders, and how goodput shortfalls are measured and attributed is the subject of Chapter 12.4; the productization of these commitments to customers is in Chapter 10.9 and the serving-side SLOs in Chapter 10.11. The single rule that connects them: never contract a continuity grade your drills have not proven.
Drills, runbooks, and the FMEA tie-in
A DR plan that has never been executed is a hypothesis. The only thing that converts it into a capability is regular, adversarial drilling — the deliberate, scheduled failure of a region, a control-plane component, or a dependency, with the recovery measured against its RTO/RPO target and the gaps fixed before the real event. This is chaos-engineering discipline applied to the facility and cluster layer: you do not wait for the wildfire to discover that your weights-replication lag exceeds your inference RPO, or that the failover load-balancer shares fate with the region it is supposed to drain.
The artifacts that make drills repeatable are runbooks — step-by-step recovery procedures, written to be executed cold by a competent on-call engineer who was not in the room when the architecture was designed. A good runbook names the trigger, the decision authority (who declares a failover, since premature failover during a transient is its own outage), the exact mechanical steps, the verification that the failover succeeded, and the fail-back procedure once the primary returns — because fail-back, the step everyone forgets to rehearse, is frequently where the second outage happens. Each runbook should map to a specific failure mode in the consolidated FMEA catalog in Appendix F: every mode the FMEA enumerates as estate-significant should have a corresponding rehearsed recovery, and every drill should exercise a mode the FMEA flagged. The emergency-operating-procedures that tie operations to that same catalog live in the commissioning and operations parts; the documentation and acceptance-test discipline that captures the go-live baseline against which recovery is measured is in Chapter 13.2.
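The runbook-to-FMEA mapping described above can be checked mechanically as a coverage audit. A minimal sketch, assuming failure modes are tracked by string identifiers; the identifiers and the shape of the inputs are hypothetical and not the format of Appendix F.

```python
def coverage_gaps(estate_significant: set[str],
                  runbooks: dict[str, str],
                  drilled: set[str]) -> dict[str, list[str]]:
    """Compare FMEA modes against runbooks and drills.

    estate_significant: mode ids the FMEA flags as estate-significant.
    runbooks: mode id -> runbook name.
    drilled: mode ids exercised by at least one drill.
    """
    return {
        "no_runbook": sorted(estate_significant - set(runbooks)),
        "never_drilled": sorted(estate_significant - drilled),
        "drill_without_fmea_mode": sorted(drilled - estate_significant),
    }
```

An empty result in all three lists is the condition the paragraph asks for: every estate-significant mode has a rehearsed recovery, and every drill exercises a mode the FMEA flagged.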
Deep dive: a minimal regional-failover runbook skeleton for an inference surface
The structure below is the irreducible skeleton of a runbook for the highest-tier case — losing a region that serves interactive inference under a ~15-minute RTO. It is deliberately mechanical, because under real stress the value of a runbook is that it removes judgment from steps that should not require it.
1. Detect & declare. Health-checks and synthetic probes from outside the failing region trip a threshold; an on-call engineer with named authority declares the failover (a human gate prevents flapping on a transient). The clock starts here.
2. Drain. The global load-balancer stops routing new requests to the failed region and shifts traffic to the healthy region(s) already carrying live capacity (active-active) or to the warm standby being promoted. In-flight requests are allowed to complete or are retried idempotently.
3. Promote & scale. If active-passive, pre-empt the displaceable batch work on the standby fleet, confirm the standby is serving current model weights (the version check that catches replication lag), and scale serving capacity to absorb the redirected peak.
4. Verify. Confirm latency and error-rate SLOs are met in the receiving region(s); confirm the control plane and weights store are healthy and not themselves single-region-dependent.
5. Communicate. Fire the tenant/customer notification the contract requires; start the service-credit clock if the SLA breached.
6. Fail back. Only once the primary is independently verified healthy, drain traffic back deliberately and gradually, never all at once, because a cold primary taking full load is a fresh outage.
The drill that rehearses this skeleton is what turns the ~15-minute RTO from a number in a contract into a number you can hit.
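The gradual fail-back in the final step can be sketched as a stepped traffic ramp with a health gate. The ramp shares and the function name are assumptions for illustration; a real ramp would be tuned to how quickly the primary warms up, and the health signal would come from the same SLO checks used in the verify step.

```python
RAMP = (0.05, 0.25, 0.50, 1.00)  # assumed shares of traffic sent to the primary


def next_primary_share(current: float, primary_healthy: bool,
                       ramp: tuple[float, ...] = RAMP) -> float:
    """Advance the primary's traffic share by one ramp step.

    If the primary is unhealthy, return 0.0 so traffic rolls back to the
    receiving region(s) instead of pushing more load onto a cold primary.
    """
    if not primary_healthy:
        return 0.0
    for share in ramp:
        if share > current:
            return share
    return ramp[-1]
```

Starting from 0.0, successive healthy checks step the primary through 5%, 25%, 50%, and 100% of traffic; any unhealthy check at any point returns the share to zero, so a failed fail-back degrades into continued operation on the standby.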