
Chapter 8.8

Scale-Across: Multi-Campus & Cross-Region Fabric (DCI for Distributed Training)

When no single campus can be energized fast enough to hold the run, the fork is no longer how to wire one building but how to split a synchronous job across buildings — and that decision propagates into your optics, your transport layer, your training algorithm, and your failure model all at once.

POWER-BOUND · GOODPUT

What you'll decide here

Whether you go multi-site at all — i.e. whether your run is power-bound on a single campus — and if so, at what radius (intra-metro <10 km, metro 10–80 km, or long-haul 80+ km), because the radius sets the latency floor and therefore which training paradigm is even available to you.
Synchronous scale-across (treat remote campuses as one big fabric, pay the bandwidth and latency tax) versus relaxed-synchrony algorithms (DiLoCo/local-SGD/hierarchical SGD) that cut inter-site traffic by 100–1000x at a convergence-quality cost — the central fork of this chapter.
Which coherent DCI optics and transport layer carry the inter-site traffic: grey 800G over dark fiber for short metro spans, versus 800ZR / OpenZR+ coherent pluggables over DWDM for 80–120+ km, versus full OTN/managed transport — each a different cost, reach, and operational-ownership posture.
Where the fault domains and checkpoint boundaries sit once a job spans regions: a fiber cut, a regional power event, or a site evacuation is now a correlated failure that flat intra-DC checkpointing did not plan for.
Whether the second site is a true training peer (it joins the synchronous/relaxed-sync run) or a checkpoint-and-burst replica (it holds state and absorbs overflow), because that decision sets the inter-site bandwidth you must buy and light.

Two earlier chapters of this Part built the network from the inside out: the scale-up domain that fuses 72–576 accelerators into one coherent memory machine (Chapter 8.4), and the scale-out Clos fabric that stitches scale-up domains into a 100k–1M-GPU cluster inside a single hall (Chapter 8.5). This chapter is the third tier of that hierarchy, scale-across, and it exists for one reason that has nothing to do with networking: you cannot energize a single campus fast enough. The binding constraint of the 2026 era is megawatts at the substation, not accelerators in the supply chain (Chapter 16.1). When a frontier run needs 1–2+ GW and the largest campus you can interconnect in the timeline tops out at a few hundred MW, you are forced to do something the synchronous-training orthodoxy spent a decade saying you must never do: split one job across buildings, then across campuses, then across regions.

This is the defining frontier-lab architecture of 2025–26. Google trained Gemini Ultra across multiple datacenters on TPUv4 pods, combining SuperPods over its intra- and inter-cluster network with latency and bandwidth sufficient to keep the run synchronous. OpenAI/Microsoft and the Stargate program are building gigawatt-class liquid-cooled campuses explicitly to chase the same multi-datacenter training capability. Scale-across is a chain of forced forks: the moment you accept that the power won't fit in one building, you have implicitly committed to a latency floor, a class of optics, a transport layer, a training algorithm, and a new and nastier failure model. Each fork below carries a goodput cost for choosing it wrong.

The latency/bandwidth hierarchy: scale-up >> scale-out >> scale-across

The single most useful mental model for this whole Part is the three-tier hierarchy, and scale-across is the tier where the numbers fall off a cliff. Inside the scale-up domain, NVLink delivers ~1.8 TB/s per GPU (NVLink 5) at roughly 100 ns latency — load/store and collective semantics, one giant GPU. Scale-out drops to ~400–800 Gb/s per NIC at ~1–2 µs over an InfiniBand or Ethernet Clos. Scale-across drops again, by orders of magnitude on both axes at once: per-site uplink bandwidth is whatever DWDM you can light and pay for (tens of Tb/s aggregate is a large buy), and the latency floor is set by physics you cannot engineer around — ~5 µs per kilometer of fiber, one way.

That 5 µs/km figure is the gravity of this chapter. Light moves at ~300 m/µs in vacuum but ~30% slower in silica (refractive index ~1.47), so standard SMF-28 fiber adds ~4.9 µs/km one-way; call it 5. Two campuses 40 fiber-km apart have a ~200 µs one-way floor, ~400 µs round trip, before a single transponder, amplifier, FEC block, or switch hop adds its own delay. Two regions 1,000 fiber-km apart sit at ~5 ms one-way, ~10 ms RTT. Compare that to the ~100 ns of NVLink: depending on radius, scale-across is three to five orders of magnitude slower than scale-up. An all-reduce that completes in microseconds inside the hall takes milliseconds across the WAN, and a synchronous training step that spends a meaningful fraction of its time blocked on that all-reduce sees its MFU collapse. The entire algorithmic apparatus of this chapter (relaxed synchrony, gradient compression, hierarchy) exists to hide or amortize that floor.
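The arithmetic behind the floor is short enough to check directly. The sketch below derives the per-kilometer delay from the refractive index quoted above and prints the one-way and round-trip floors for a few route lengths. The function names are illustrative, and the model covers propagation only: transponders, amplifiers, FEC, and switch hops are deliberately left out.

```python
# Propagation-only latency floor for a fiber route (no equipment delay modelled).
C_VACUUM_M_PER_US = 299.792458  # speed of light in vacuum, metres per microsecond
SILICA_INDEX = 1.47             # approximate refractive index quoted in the text


def one_way_us(fiber_km: float, index: float = SILICA_INDEX) -> float:
    """One-way propagation delay over fiber_km of fiber, in microseconds."""
    metres = fiber_km * 1000.0
    return metres * index / C_VACUUM_M_PER_US


def rtt_us(fiber_km: float, index: float = SILICA_INDEX) -> float:
    """Round-trip propagation delay, in microseconds."""
    return 2.0 * one_way_us(fiber_km, index)


if __name__ == "__main__":
    for km in (1, 40, 1000):
        print(f"{km:>5} km  one-way {one_way_us(km):8.1f} us  RTT {rtt_us(km):8.1f} us")
```

With an index of 1.47 the per-kilometer figure comes out just above 4.9 µs, which is why the chapter rounds to 5. Remember that `fiber_km` is route distance, not straight-line distance.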

Why the fiber-latency floor decides your algorithm before you've chosen one

The radius between your sites is not a networking detail you tune later; it is the upstream variable that determines which training paradigm is physically available. Intra-metro (<10 km, <~100 µs RTT): close enough that a well-engineered fabric can keep a run synchronous; the campus simply spans two buildings. Metro (10–80 km, ~100–800 µs RTT): the grey zone, where synchronous is possible with heavy overlap and topology-aware placement, but the tax is real. Long-haul / cross-region (80+ km, ~1 ms RTT and up): synchronous all-reduce on every step is no longer economic, and you must move to a relaxed-synchrony algorithm or you are simply lighting expensive fiber to watch GPUs idle. Pick the radius, by picking which substations you can energize, and you have already half-decided whether you are running synchronous scale-across or DiLoCo-class local-SGD. → algorithms below.

The first fork: do you go multi-site at all?

Before any optics or algorithm, the honest first question is whether you are actually power-bound. Going multi-site imposes a permanent goodput tax — best case you lose a few percent to inter-site synchronization overhead and operational complexity; worst case a naive synchronous split halves your effective throughput. You take that tax only because the alternative — waiting for a single campus to be energized — costs you more. The math is a race between two clocks: the depreciation clock on accelerators you've already bought (2–3 year economic life; idle silicon is pure loss) versus the interconnection clock on the megawatts you need (4–7 year large-load waits in the densest US hubs). When the power clock is slower than the silicon clock, you go multi-site. That is the whole rationale, and it is a power-bound rationale, not a networking one. → the macro narrative lives in Chapter 16.1; siting and queue mechanics in Chapter 3.2.
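The two-clock race can be written as a one-line comparison. The toy model below is an assumption for illustration, not a costing method from this guide: it treats idle time as a straight-line loss of accelerator economic life and the multi-site penalty as a flat fraction of throughput. The function names and sample inputs are hypothetical.

```python
def waiting_loss(wait_years: float, silicon_life_years: float) -> float:
    """Fraction of accelerator economic life burned while idle, capped at 1.0."""
    return min(wait_years / silicon_life_years, 1.0)


def go_multi_site(wait_years: float, silicon_life_years: float,
                  multi_site_tax: float) -> bool:
    """True when waiting for single-campus power costs more than the goodput tax.

    multi_site_tax is the fraction of throughput lost by splitting the job
    (a few percent in the best case, around one half for a naive split).
    """
    return waiting_loss(wait_years, silicon_life_years) > multi_site_tax


if __name__ == "__main__":
    # Hypothetical inputs: 4-year power wait, 3-year silicon life, 50% worst-case tax.
    print(go_multi_site(4.0, 3.0, 0.50))
    # Hypothetical inputs: power arrives in about a month, 5% tax.
    print(go_multi_site(0.1, 3.0, 0.05))
```

Under these assumptions a 4-year wait against 3-year silicon justifies even the worst-case tax, while a site that is weeks from energization does not justify a 5% one.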

The second-order question, once you've decided to split, is what the second site is for. There are two very different answers and they buy very different amounts of fiber. A training peer site runs part of the synchronous (or relaxed-sync) job and must exchange gradients or pseudo-gradients with the primary on a cadence that the algorithm sets — this is a bandwidth-hungry, latency-sensitive relationship. A checkpoint-and-burst replica holds a copy of model state and absorbs elastic inference or batch overflow — it needs enough bandwidth to ship checkpoints (gigabytes to terabytes on a slow cadence) but is indifferent to per-step latency. Mis-classifying the relationship is a classic over-spend: provisioning peer-grade DWDM for a site that only ever needed checkpoint-grade bandwidth, or the reverse — starving a true training peer and watching the run stall on a link you under-bought.
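A quick sizing pass makes the gap between the two relationships concrete. The sketch below converts a payload and the window it must cross the link in into a sustained rate. The payload sizes and cadences are hypothetical placeholders; substitute your own model size, checkpoint size, and step time.

```python
def required_gbps(payload_bytes: float, window_s: float) -> float:
    """Sustained rate, in Gb/s, needed to move payload_bytes within window_s."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return payload_bytes * 8 / window_s / 1e9


if __name__ == "__main__":
    # Hypothetical replica: ship a 2 TB checkpoint once every 30 minutes.
    replica = required_gbps(2e12, 30 * 60)
    # Hypothetical training peer: exchange a 200 GB gradient payload every 5 s step.
    peer = required_gbps(200e9, 5.0)
    print(f"checkpoint replica: {replica:6.1f} Gb/s")
    print(f"training peer:      {peer:6.1f} Gb/s")
```

With these placeholder numbers the peer needs more than thirty times the replica's bandwidth, and unlike the replica it is also exposed to per-step latency, which this calculation does not capture.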

Scale-across radius → what it forces on optics, transport, and training

Radius class	Distance / RTT floor	DCI optics	Transport layer	Viable training paradigm
Intra-metro (one campus, two halls)	<10 km / <~100 µs RTT	Grey 800G / 1.6T over dedicated dark fiber; DR/FR-class reach	Point-to-point dark fiber; no DWDM needed	Synchronous: the "campus" just spans buildings; near-zero algorithm change
Metro (cross-campus)	10–80 km / ~100–800 µs RTT	800ZR coherent pluggables, single-span amplified DWDM	DWDM line system (self-operated or leased waves)	Synchronous possible with heavy overlap + topology-aware placement; hierarchical SGD safer
Long-haul / regional	80–1000+ km / 1–10+ ms RTT	OpenZR+ / 800G ZR+ multi-span; or OTN-mapped coherent	Managed OTN / leased lambdas / IP-over-DWDM	Relaxed-synchrony only — DiLoCo / local-SGD / async; synchronous is uneconomic
Inter-continental	1000s of km / 10s of ms RTT	Subsea / terrestrial coherent regional transport	Carrier OTN; capacity-constrained, expensive	Aggressive local-SGD + compression (16–3000x traffic cut) or federated; research frontier

The one-way fiber floor is ~5 µs/km, so RTT is ~10 µs per route-km; route distance always exceeds straight-line distance. Radius is the upstream choice (set by which substations you can energize); everything to its right in the table is consequence.
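The table above reads naturally as a lookup keyed on route distance. The sketch below encodes it that way. The class boundaries come from the table; treating them as hard cut-offs, and putting exactly 1,000 km in the inter-continental row, is an assumption of this example. The type and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RadiusClass:
    name: str
    max_km: float   # exclusive upper bound on route distance
    optics: str
    paradigm: str


# Rows mirror the table; hard thresholds are a simplification.
CLASSES = (
    RadiusClass("intra-metro", 10.0,
                "grey 800G/1.6T over dark fiber", "synchronous"),
    RadiusClass("metro", 80.0,
                "800ZR over single-span DWDM",
                "synchronous with heavy overlap, or hierarchical"),
    RadiusClass("long-haul", 1000.0,
                "OpenZR+ / 800G ZR+ or OTN-mapped coherent", "relaxed-synchrony"),
    RadiusClass("inter-continental", float("inf"),
                "carrier coherent transport", "local-SGD plus compression"),
)


def classify(route_km: float) -> RadiusClass:
    """Return the table row that a given route distance falls into."""
    if route_km <= 0:
        raise ValueError("route_km must be positive")
    for row in CLASSES:
        if route_km < row.max_km:
            return row
    return CLASSES[-1]


if __name__ == "__main__":
    for km in (5, 40, 500, 6000):
        row = classify(km)
        print(f"{km:>5} km -> {row.name}: {row.optics}; {row.paradigm}")
```

In practice the boundaries are soft: a 75 km route and an 85 km route face nearly the same physics, so treat the output as a starting point for the forks that follow rather than a verdict.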

Coherent DCI optics: how the inter-site traffic actually moves

Inside the hall, the interconnect debate is copper-versus-optics over single-digit meters (Chapter 8.9). Across sites, copper is irrelevant and the debate is which class of coherent optics carries traffic over kilometers of glass. The fork has three branches, and they trade reach, cost-per-bit, and how much transport plumbing you own.

Grey optics over dark fiber is the intra-metro answer: a standard 800G or 1.6T pluggable (DR/FR-class) lit directly over a dedicated dark-fiber pair you own or lease IRU on, no DWDM line system. Cheapest per link, simplest operationally, but reach- and fiber-count-limited — every wavelength burns a fiber pair, and you run out of pairs (or money) past a campus boundary. Coherent ZR/ZR+ pluggables are the metro-to-regional workhorse and the defining DCI technology of this cycle. The OIF's 800ZR Implementation Agreement (October 2024) defines an 800G coherent line interface for single-span amplified 80–120 km DWDM — a router/switch port speaks DWDM directly, collapsing a whole rack of external transponders into a QSFP-DD/OSFP module. OpenZR+ / 800G ZR+ extends modes from 120 km DCI out beyond 1,000 km of regional transport over existing DWDM. Full OTN / managed transport is the heavyweight branch: a separate optical-transport network with its own protection, OAM, and capacity engineering — the right call when you need carrier-grade resilience, sub-lambda grooming, or you are leasing capacity rather than owning fiber. → the physical-layer primitives (modulation, FEC, link budgets, baud rates) are in Chapter 8.9; the fiber plant and structured cabling in Chapter 8.10.

Deep dive: why 800ZR collapsed the DCI cost structure — and what it doesn't fix

The economic shift that made gigawatt-across-campuses practical is the migration of coherent DSPs into pluggable form factors. A decade ago, interconnecting two sites at hundreds of Gb/s meant a chassis of standalone coherent transponders — a separate, expensive, separately-operated transport box. 800ZR puts a ~118 Gbaud coherent engine (built on ~4 nm CMOS DSPs and 112G-PAM4 host SerDes) inside a QSFP-DD or OSFP module that plugs straight into a switch faceplate. The transport function moves from a dedicated appliance into the router — IP-over-DWDM — eliminating transponder capex, the rack space it consumed, and a layer of operational hand-off. Cignal AI and TrendForce track 800ZR/ZR+ as one of the fastest-ramping optics categories of the cycle: 800G coherent shipments forecast to exceed ~200,000 units and revenue past $1B in 2026, with the broader pluggable-coherent module market on the order of $2B in 2025 heading toward ~$5B by 2029.

What ZR optics emphatically do not fix is the latency floor or the aggregate-bandwidth gap. A coherent pluggable can carry 800G over 120 km, but that span still costs ~600 µs of one-way propagation (~1.2 ms RTT), plus FEC and DSP processing latency on top. And the aggregate inter-site bandwidth you can light, even with dense DWDM packing dozens of 800G waves per fiber pair, is a tiny fraction of the bisection bandwidth available inside either hall's Clos. That asymmetry is the whole reason the algorithm has to change: you cannot brute-force a synchronous all-reduce across a pipe that is 100–1000x thinner than your intra-DC fabric. ZR optics make the pipe cheaper to build; they do not make it fat or fast enough to pretend the WAN isn't there.
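A back-of-envelope model shows how the two limits combine. The sketch below estimates the time to push one payload across a single DWDM fiber pair as serialization time plus the one-way propagation floor. The wave count and payload size are hypothetical, the function names are illustrative, and FEC/DSP latency, protocol overhead, and contention are ignored.

```python
WAVE_GBPS = 800.0       # one 800G coherent wavelength
FIBER_US_PER_KM = 5.0   # one-way propagation floor used throughout the chapter


def pair_capacity_gbps(waves: int) -> float:
    """Aggregate capacity of one fiber pair carrying `waves` 800G wavelengths."""
    return waves * WAVE_GBPS


def wan_transfer_s(payload_bytes: float, waves: int, route_km: float) -> float:
    """Seconds to move payload_bytes one way: serialization plus propagation."""
    serialization = payload_bytes * 8 / (pair_capacity_gbps(waves) * 1e9)
    propagation = route_km * FIBER_US_PER_KM * 1e-6
    return serialization + propagation


if __name__ == "__main__":
    # Hypothetical: 200 GB payload, 32 waves on one pair, 120 km span.
    t = wan_transfer_s(200e9, waves=32, route_km=120)
    print(f"capacity {pair_capacity_gbps(32) / 1000:.1f} Tb/s, transfer {t * 1e3:.1f} ms")
```

For large payloads the serialization term dominates, so more waves help; for small, frequent messages the propagation term dominates, and no amount of lit capacity removes it.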

The central fork: synchronous scale-across vs relaxed-synchrony algorithms

This is the decision the rest of the chapter orbits. Once the job spans a metro or regional radius, you must choose how synchronization happens across the thin WAN pipe, and the choice trades convergence quality and simplicity against inter-site bandwidth and tolerance to latency.

Branch A — synchronous scale-across. Treat the remote campuses as one extended fabric: data-, tensor-, and pipeline-parallel groups span sites, and every step still ends in a global all-reduce that now traverses the WAN. This preserves exact synchronous-SGD convergence (the math is identical to single-site training, so model quality is uncompromised), and it is what Google did for Gemini Ultra. The cost is acute sensitivity to inter-site bandwidth and latency: you must overlap communication with computation aggressively, place the parallelism dimensions so the chattiest collectives (tensor-parallel) stay inside a site and only the more tolerant ones (data-parallel, pipeline) cross the WAN, and accept that bisection bandwidth across sites is your hard throughput ceiling. Get the placement wrong and the run is bandwidth-bound on the WAN, MFU in the teens. This branch is viable mainly intra-metro and at the easier end of metro.
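The placement rule (keep tensor-parallel groups inside a site, let data-parallel groups cross) can be checked mechanically for a given rank layout. The sketch below assumes a rank ordering with tensor-parallel innermost and data-parallel outermost, and sites that own contiguous rank ranges. Both are assumptions of this example, as are the function names; real launchers may order ranks differently.

```python
from itertools import product


def rank_of(dp: int, pp: int, tp: int, pp_size: int, tp_size: int) -> int:
    """Global rank with tensor-parallel innermost and data-parallel outermost."""
    return (dp * pp_size + pp) * tp_size + tp


def groups_crossing_sites(dp_size: int, pp_size: int, tp_size: int,
                          gpus_per_site: int) -> dict:
    """Count, per parallelism dimension, how many groups span more than one site."""

    def spans(ranks) -> bool:
        return len({r // gpus_per_site for r in ranks}) > 1

    tp_groups = [[rank_of(d, p, t, pp_size, tp_size) for t in range(tp_size)]
                 for d, p in product(range(dp_size), range(pp_size))]
    pp_groups = [[rank_of(d, p, t, pp_size, tp_size) for p in range(pp_size)]
                 for d, t in product(range(dp_size), range(tp_size))]
    dp_groups = [[rank_of(d, p, t, pp_size, tp_size) for d in range(dp_size)]
                 for p, t in product(range(pp_size), range(tp_size))]
    return {"tp": sum(spans(g) for g in tp_groups),
            "pp": sum(spans(g) for g in pp_groups),
            "dp": sum(spans(g) for g in dp_groups)}


if __name__ == "__main__":
    # Hypothetical 64-GPU job over two 32-GPU sites: DP=4, PP=2, TP=8.
    print(groups_crossing_sites(dp_size=4, pp_size=2, tp_size=8, gpus_per_site=32))
```

In this layout no tensor- or pipeline-parallel group leaves its site and all 16 data-parallel groups cross the WAN, which is the shape Branch A asks for. A non-zero `tp` count is the signal that the chattiest collective has been placed on the thinnest link.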

Branch B — relaxed-synchrony (local-SGD family). Stop synchronizing every step. Each site runs many local optimizer steps independently, then the sites exchange and average their accumulated updates — "pseudo-gradients" — only periodically. This is the DiLoCo paradigm (Distributed Low-Communication): inner optimizer AdamW running H local steps, outer optimizer Nesterov momentum averaging across workers. The headline result is staggering — DiLoCo on 8 workers matched fully-synchronous optimization on C4 while communicating ~500x less, and OpenDiLoCo reproduced this training across two continents and three countries at 90–95% compute utilization. Streaming DiLoCo (2025) overlaps the periodic communication with compute to cut peak bandwidth further; follow-on work reports up to 16x less traffic than DiLoCo and up to ~3000x less than standard DDP. The cost is no longer bandwidth — it's convergence risk: relaxed synchrony introduces staleness and can perturb final model quality, the inner/outer optimizer and sync-interval are extra hyperparameters that must be tuned, and the failure modes are subtler than synchronous SGD's clean restart-from-checkpoint.
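The inner/outer structure is easiest to see on a toy problem. The sketch below is not the published DiLoCo implementation: it swaps the inner AdamW for plain SGD to stay short, runs on a tiny quadratic objective instead of a language model, and uses hyperparameters picked for the toy. What it keeps is the shape of the loop: H local steps with no communication, then one exchange of averaged pseudo-gradients applied with Nesterov momentum.

```python
def inner_steps(theta, target, h, lr):
    """Run h local SGD steps on f(w) = 0.5 * ||w - target||^2, starting at theta."""
    w = list(theta)
    for _ in range(h):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def local_sgd_outer_nesterov(targets, rounds=50, h=10,
                             inner_lr=0.1, outer_lr=0.7, mu=0.9):
    """Each entry of `targets` stands for one site's data shard.

    Returns the global parameters and the number of cross-site exchanges.
    """
    dim = len(targets[0])
    theta = [0.0] * dim
    velocity = [0.0] * dim
    exchanges = 0
    for _ in range(rounds):
        # Local phase: every site trains independently, nothing crosses the WAN.
        local = [inner_steps(theta, t, h, inner_lr) for t in targets]
        # Sync phase: average pseudo-gradients (global params minus local params).
        delta = [sum(theta[i] - w[i] for w in local) / len(local)
                 for i in range(dim)]
        exchanges += 1
        # Outer Nesterov-momentum step on the averaged pseudo-gradient.
        velocity = [mu * v + d for v, d in zip(velocity, delta)]
        theta = [th - outer_lr * (d + mu * v)
                 for th, d, v in zip(theta, delta, velocity)]
    return theta, exchanges


if __name__ == "__main__":
    theta, exchanges = local_sgd_outer_nesterov([[1.0, 0.0], [3.0, 4.0]])
    print([round(x, 4) for x in theta], "after", exchanges, "exchanges for 500 local steps")
```

The global parameters settle at the mean of the two sites' targets while the sites communicate once per 10 local steps. On a real model the sync interval `h` and the outer learning rate and momentum are exactly the extra hyperparameters the paragraph above warns about.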

Branch C — hierarchical / async hybrids. The pragmatic middle most large multi-campus runs actually land on: synchronous within a site, relaxed across sites. Each campus is one tightly-coupled synchronous fabric; the campuses themselves are coupled with local-SGD-style periodic averaging or bounded-staleness asynchrony. This matches the algorithm to the physics tier-by-tier — pay full synchronous cost only where bandwidth is cheap (intra-DC), and pay the relaxed-sync convergence tax only where bandwidth is expensive (cross-WAN). Gradient/pseudo-gradient compression (quantization, top-k sparsification, low-rank/momentum decoupling) layers on top of any branch to shrink what crosses the WAN further.
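Of the compression options listed, top-k sparsification is the simplest to illustrate. The sketch below keeps the k largest-magnitude entries of a pseudo-gradient for transmission and returns the dropped remainder as a residual. Carrying that residual into the next sync (error feedback) is a common companion technique, shown here as an assumption rather than something this chapter prescribes. The function names are illustrative.

```python
def top_k_sparsify(vec, k):
    """Split vec into (indices, values) of its k largest-magnitude entries,
    plus a residual holding everything that was not selected."""
    if not 0 < k <= len(vec):
        raise ValueError("k must be between 1 and len(vec)")
    by_magnitude = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = sorted(by_magnitude[:k])
    values = [vec[i] for i in keep]
    residual = list(vec)
    for i in keep:
        residual[i] = 0.0
    return keep, values, residual


def densify(indices, values, n):
    """Rebuild a length-n dense vector from the sparse (indices, values) pair."""
    out = [0.0] * n
    for i, v in zip(indices, values):
        out[i] = v
    return out


if __name__ == "__main__":
    pseudo_grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0]
    idx, vals, residual = top_k_sparsify(pseudo_grad, k=2)
    print("sent:", idx, vals)
    print("received:", densify(idx, vals, len(pseudo_grad)))
    print("kept locally for the next sync:", residual)
```

Only the index/value pairs cross the WAN, so traffic scales with k rather than with model size. The price is the approximation error held in the residual, which is one source of the added convergence risk.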

Synchronous scale-across vs relaxed-synchrony: the convergence-vs-bandwidth trade

Approach	Inter-site traffic	Latency tolerance	Convergence quality risk	Best-fit radius
Synchronous scale-across (global all-reduce every step)	Highest — full gradient exchange per step	Low — every step blocks on WAN all-reduce	None — identical to single-site SGD	Intra-metro / easy metro (<~40 km)
Hierarchical (sync in-site, relaxed across sites)	Moderate — periodic cross-site averaging	Moderate — WAN touched on sync interval only	Low–moderate; tunable via sync interval	Metro to regional (10–500+ km)
DiLoCo / local-SGD (periodic pseudo-gradient averaging)	~100–500x less than synchronous DDP	High — H local steps between syncs	Moderate — staleness perturbs final quality	Regional to inter-continental
Local-SGD + compression (Streaming DiLoCo, SparseLoCo)	Up to ~16x less than DiLoCo; ~3000x vs DDP	Very high — compute/comm overlapped	Moderate–higher; more hyperparameters to tune	Inter-continental / poorly-connected

Traffic-reduction figures are reported research results (DiLoCo, OpenDiLoCo, Streaming DiLoCo / SparseLoCo, 2024–2025) and are workload- and config-dependent, not guarantees. "Quality risk" is relative to exact synchronous SGD on the same token budget.
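To first order, the traffic column follows from how often a payload crosses the WAN and how much it is compressed. The sketch below counts bytes over a fixed number of steps under that simplification. It assumes every sync ships the same payload size, which real systems need not do, and the step counts and payload size are placeholders.

```python
def wan_bytes(total_steps: int, payload_bytes: float,
              sync_every: int = 1, compression: float = 1.0) -> float:
    """Bytes crossing the WAN over total_steps, syncing once per sync_every steps."""
    if sync_every < 1 or compression < 1.0:
        raise ValueError("sync_every must be >= 1 and compression >= 1.0")
    syncs = total_steps // sync_every
    return syncs * payload_bytes / compression


if __name__ == "__main__":
    steps, payload = 10_000, 1e9  # hypothetical run length and per-sync payload
    every_step = wan_bytes(steps, payload)
    periodic = wan_bytes(steps, payload, sync_every=500)
    print(f"reduction from syncing every 500 steps: {every_step / periodic:.0f}x")
```

Under this model the reduction is simply the sync interval times the compression ratio, which is why the interval is the first knob to turn and also the first place convergence risk enters.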

~5 µs/km

one-way fiber propagation floor (SMF-28, ~4.9 µs/km); ~10 ms RTT per 1,000 km — the latency physics scale-across cannot engineer around

2025 · M2 Optics / MapYourTech (fiber latency)

~500x

less inter-site communication for DiLoCo (8 workers) vs fully-synchronous SGD, matching convergence on C4

2024 · DiLoCo, Douillard et al. (arXiv:2311.08105)

90–95%

compute utilization sustained by OpenDiLoCo training a model across two continents and three countries

2024 · OpenDiLoCo (Prime Intellect)

16x / ~3000x

further traffic cut vs DiLoCo / vs standard DDP from streaming + compressed local-SGD variants

2025 · Streaming DiLoCo / SparseLoCo (arXiv:2508.15706)

80–120 km

single-span amplified DWDM reach of the 800ZR coherent interface (OIF IA, Oct 2024); ZR+ extends past 1,000 km

2025 · OIF 800ZR IA / Cisco; Open ROADM 8.0

>200k units / >$1B

forecast 2026 shipments / revenue for 800G coherent optics; pluggable-coherent module market ~$2B (2025) → ~$5B (2029)

2026 · Cignal AI; TrendForce; Lightwave

~7 GW

planned Stargate capacity across the Abilene flagship + five new US sites — multi-campus by necessity, not choice

2025 · OpenAI / Data Center Frontier

4–7 yr

large-load grid interconnection wait in the densest US hubs — the power clock that forces multi-site when it runs slower than the 2–3 yr silicon clock

2026 · LBNL Queued Up / utility filings

Cross-site fault domains: a new and correlated failure model

Intra-DC reliability engineering assumes failures are independent: a GPU dies, a NIC flaps, a cable goes bad — uncorrelated events you ride out with hot spares and checkpoint-and-resume (Chapter 12.2). Scale-across breaks that assumption. A fiber cut between campuses, a regional grid event, a utility curtailment, a metro-wide cooling-water issue, or a site evacuation takes out an entire site's worth of accelerators at once — a single correlated failure spanning thousands of GPUs. The DCI link itself becomes a first-class fault domain: if the inter-site fiber is a single un-diverse path, one backhoe ends the run. The reliability posture that worked at flat intra-DC scale is no longer sufficient.

This reshapes three things. Path diversity stops being optional: inter-site fiber must be physically diverse (separate conduits, separate entrances, ideally separate carriers) or the WAN link is a guaranteed correlated-failure single point. Checkpoint placement must become site-aware: a checkpoint that lives only on the campus that just lost power is no checkpoint at all. The relaxed-synchrony algorithms help here almost for free — because each site already holds a recent local model copy between syncs, a site loss degrades to resuming from the last global average rather than a total restart, which is one of the underappreciated operational virtues of the DiLoCo family over naive synchronous scale-across. Blast-radius accounting must treat "site" as the failure unit: the question is no longer "what happens when a node fails" but "what fraction of the run survives when a whole campus drops, and how fast can the remaining sites continue or resync." → the checkpointing math — interval, bandwidth, and the storage/network trade behind multi-region durability — is worked in Chapter 9.4; the goodput-vs-availability reframing in Chapter 12.2.
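Treating the site as the failure unit reduces to two checks you can run against an inventory. The sketch below computes what fraction of accelerators survives a whole-site loss and flags any site whose loss would take every checkpoint copy with it. The site names, GPU counts, and function names are hypothetical.

```python
def surviving_fraction(gpus_by_site: dict, failed_site: str) -> float:
    """Fraction of the run's accelerators still up after one site drops."""
    total = sum(gpus_by_site.values())
    return (total - gpus_by_site[failed_site]) / total


def sites_without_offsite_checkpoint(gpus_by_site: dict, checkpoint_sites) -> list:
    """Sites whose loss would destroy every copy of the latest checkpoint."""
    copies = set(checkpoint_sites)
    # A site is unsafe when no checkpoint copy lives anywhere else.
    return sorted(site for site in gpus_by_site if copies <= {site})


if __name__ == "__main__":
    fleet = {"campus-a": 60_000, "campus-b": 40_000}  # hypothetical inventory
    print(surviving_fraction(fleet, "campus-a"))
    print(sites_without_offsite_checkpoint(fleet, ["campus-a"]))
    print(sites_without_offsite_checkpoint(fleet, ["campus-a", "campus-b"]))
```

An empty result from the second check is the minimum bar for site-aware placement. It says nothing about how stale the off-site copy is, which depends on the checkpoint interval and the inter-site bandwidth budgeted for it.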

The DCI link is a fault domain, and your checkpoint strategy probably doesn't know it

Two failure modes catch multi-campus operators who carried over flat intra-DC habits. First: the un-diverse fiber path. Teams light a beautiful DWDM line system between campuses over a single physical right-of-way, then discover that one fiber cut — backhoe, fire, flood — instantly partitions a synchronous run with no fallback. Inter-site fiber must be route-diverse before it carries a job, not after the first outage. Second: site-local checkpoints. A checkpoint cadence tuned for independent node failures writes to storage co-located with the compute. When an entire site is lost to a grid or cooling event, the only surviving copy of the last hour of training was in the building that just went dark. Cross-region durability means a recent global checkpoint must exist off the site that can fail as a unit — which costs inter-site bandwidth you must budget for alongside the training traffic, not after it.

Putting it together: the scale-across decision sequence

The chapter resolves to an ordered sequence of forks, each one constraining the next, now stretched across the WAN.

Are you power-bound on one campus? If no, stay single-site and skip all of this; the goodput tax of going multi-site is real and unrecoverable. If yes, the silicon clock is beating the power clock and you go across. → Chapter 16.1.
What radius can you energize? Intra-metro, metro, or regional — set by which substations you can interconnect, not by preference. The radius picks your latency floor at ~5 µs/km. → Chapter 3.2.
Which DCI optics and transport? Grey 800G/1.6T over dark fiber intra-metro; 800ZR coherent over DWDM for metro spans up to 80–120 km; OpenZR+/OTN for regional. → Chapter 8.9, Chapter 8.10.
Synchronous or relaxed-synchrony? Synchronous only if the radius lets a per-step all-reduce fit the compute budget; otherwise hierarchical/DiLoCo-class with compression. This is the model-quality-vs-bandwidth fork. → algorithms above.
How do fault domains and checkpoints span sites? Route-diverse fiber, site-aware checkpoint placement, blast-radius measured per site. → Chapter 9.4, Chapter 12.2.

Walk that sequence and the multi-campus run is a chain of well-posed decisions, each with a named downstream cost. Skip the first fork — go multi-site when you weren't actually power-bound — and you pay a permanent goodput tax for no reason. Skip the algorithm fork — run synchronous across a regional radius — and you light expensive fiber to watch both campuses idle on an all-reduce physics won't complete in time.

Scale-across is the third tier above the scale-up domain of Chapter 8.4 and the intra-DC scale-out Clos of Chapter 8.5. The congestion control and collectives that this chapter assumes will break over WAN latency are engineered in Chapter 8.6, and the out-of-band/timing fabric in Chapter 8.7. The coherent-optics and transport primitives (modulation, FEC, link budgets, baud rates) are taxonomized in Chapter 8.9, with the fiber plant and structured cabling in Chapter 8.10. The checkpoint math behind cross-region durability lives in Chapter 9.4; the correlated-failure reliability rethink in Chapter 12.2. The power-bound rationale that forces multi-site in the first place is the macro story of Chapter 16.1, with siting and grid-queue mechanics in Chapter 3.2 and the optics roadmap consolidated in Chapter 16.2.