The Definitive Guide to AI Data Centers

Chapter 8.2

Scale-Up Fabric (Intra-Node / Intra-Rack)

The scale-up domain — the set of accelerators that talk at memory speed over a switched fabric an order of magnitude faster than the back-end network — is the single hardware boundary that sets your tensor- and expert-parallel ceilings, your MoE inference economics, and your largest blast radius; how big you can make it, and over what medium, is now the most contested decision in AI networking.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

  1. How large a coherent scale-up domain you actually need — 8, 72, 144, or 576+ accelerators — which is really a decision about your TP/EP ceiling and your MoE all-to-all, not about a rack count.
  2. Whether you buy that domain as a vertically-integrated stack (NVLink/NVSwitch) or assemble it from an open standard (UALink, Scale-Up Ethernet) — the lock-in-vs-commoditization fork that governs your supplier power for a decade.
  3. Where copper stops and optics start inside the domain — the reach wall that, more than anything, caps single-rack domain size and forces the CPO transition as racks cross ~200 kW.
  4. The blast radius you are willing to own: a bigger NVLink domain lifts MFU and widens expert parallelism, but a switch-tray or link fault now stalls 72–576 GPUs, not 8.
  5. Whether your memory-semantic needs are load/store coherent (scale-up fabric, CXL) or message-passing (scale-out) — mis-classifying this strands bandwidth on the wrong network.

Chapter 8.1 drew the three-network model — scale-up, scale-out, scale-across — and argued that the boundaries between them move with each silicon generation. This chapter lives inside the innermost ring. The scale-up fabric is the set of accelerators that communicate at memory speed: a switched, often load/store-coherent interconnect running roughly five to eighteen times faster per device than the back-end NIC, across which a tensor- or expert-parallel shard can be split without the collective collapsing. In 2026 this is no longer an intra-server concern. The domain has climbed out of the 8-GPU server, filled a 72-GPU rack, and is now reaching across multiple racks toward 576 and beyond.

The scale-up domain size is a decision with a cascade behind it, not a spec you read off a datasheet. Pick it too small and you cap your tensor-parallel degree, force expert parallelism out onto the slow back-end fabric, and watch MoE inference throughput fall off a cliff. Pick it too large and you have signed up for an optical bill, a CPO serviceability problem, and a blast radius measured in hundreds of GPUs. Buy it from one vendor and you inherit their roadmap and their margin; assemble it from a standard and you inherit an integration burden and a maturity gap.

What 'scale-up' actually means

The defining property is per-device bandwidth and semantics, not topology or distance. A scale-up link carries on the order of 1.8–3.6 TB/s per accelerator in 2026, versus roughly 400 Gb/s (~50 GB/s) on a scale-out NIC: a ~5–18x gap that is the entire reason the two networks exist separately. Inside the scale-up domain, accelerators address each other's HBM with load/store or one-sided semantics and run collectives (all-reduce, all-gather, all-to-all) at a latency and bandwidth the back-end fabric cannot touch. The rule of thumb that falls out of this: fit your tensor parallelism and your expert parallelism inside the scale-up domain; let data and pipeline parallelism span the scale-out fabric. Cross that boundary with the wrong collective and you have moved a memory-bandwidth-bound operation onto a network five to eighteen times slower.
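
How large the gap looks depends on the counting convention and on which link and NIC generations are paired. The sketch below is one way to compute it, assuming the scale-up figures are bidirectional aggregates (as the chapter's comparison table notes) and NIC line rates are per direction. The 800 Gb/s NIC is an assumed comparison point that does not appear in the text, and `per_direction_ratio` is a hypothetical helper name.

```python
def per_direction_ratio(scale_up_bidir_gb_s: float, nic_gbit_s: float) -> float:
    """Ratio of one-way scale-up bandwidth to one-way NIC bandwidth."""
    scale_up_one_way_gb_s = scale_up_bidir_gb_s / 2  # bidirectional aggregate -> per direction
    nic_one_way_gb_s = nic_gbit_s / 8                # Gb/s -> GB/s
    return scale_up_one_way_gb_s / nic_one_way_gb_s

# (label, scale-up GB/s bidirectional, NIC Gb/s)
# The 800 Gb/s NIC is an assumption for illustration, not a figure from the text.
cases = [
    ("900 GB/s link vs 800G NIC", 900, 800),
    ("1.8 TB/s link vs 400G NIC", 1800, 400),
    ("3.6 TB/s link vs 400G NIC", 3600, 400),
]
ratios = {label: per_direction_ratio(bw, nic) for label, bw, nic in cases}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1f}x")
```

Under these assumptions the first two pairings come out at 4.5x and 18x, close to the ends of the quoted range, while a 3.6 TB/s link against the same 400 Gb/s NIC is wider still. Treat the ratio as a band that moves with each silicon generation rather than a constant.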

That makes domain size the master variable of this chapter. Eight GPUs (a classic HGX/DGX server) bounds TP at 8 and forces MoE experts onto the back-end. Seventy-two GPUs (a GB200 NVL72 rack) lets a frontier model run TP and a wide expert-parallel scheme entirely on the fast fabric. Five hundred seventy-six dies (Rubin Ultra Oberon) pushes the boundary across eight racks. Each step up is a real capability gain and a real cost — and the rest of this chapter is the accounting of that trade.
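
The capping effect can be sketched as a simple fit check. This is a minimal illustration, not a scheduler: `Plan` and `spills` are hypothetical names, the TP/EP degrees are made up, and each group is checked independently (whether TP and EP groups must also fit together depends on how a given framework lays out ranks).

```python
from dataclasses import dataclass

@dataclass
class Plan:
    tp: int  # tensor-parallel degree
    ep: int  # expert-parallel degree

def spills(plan: Plan, domain_size: int) -> list[str]:
    """Return the tightly coupled groups that do not fit in one scale-up domain."""
    out = []
    if plan.tp > domain_size:
        out.append("TP")
    if plan.ep > domain_size:
        out.append("EP")
    return out

plan = Plan(tp=8, ep=32)  # hypothetical MoE configuration
for domain in (8, 72, 576):
    print(domain, spills(plan, domain) or "fits")
```

With these example degrees, the 8-GPU domain holds TP but pushes EP onto the back-end, while the 72- and 576-accelerator domains hold both.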

NVIDIA's NVLink is the incumbent and the reference against which every challenger is measured, so it anchors the discussion. The architecture is two parts: NVLink, the SerDes-based per-GPU link, and NVSwitch, the crossbar ASIC that turns point-to-point links into an all-to-all switched fabric. Generation by generation, the per-GPU number is the headline: NVLink 4 (Hopper) delivered 900 GB/s, NVLink 5 (Blackwell) doubled it to 1.8 TB/s, and NVLink 6 (Rubin) doubles it again to 3.6 TB/s — over 14x the bandwidth of PCIe Gen6 (NVIDIA, 2026). NVSwitch is what makes a domain out of those links: in GB200 NVL72, nine NVSwitch trays wire 72 Blackwell GPUs into a single non-blocking domain delivering ~130 TB/s of aggregate NVLink bandwidth and ~13.4 TB of unified, coherent memory.
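
The aggregate figure is simply GPU count times per-GPU bandwidth, which makes it easy to sanity-check. A minimal sketch using only the numbers quoted above (`aggregate_tb_s` is a hypothetical helper):

```python
def aggregate_tb_s(gpus: int, per_gpu_tb_s: float) -> float:
    """Aggregate scale-up bandwidth of a domain, in TB/s."""
    return gpus * per_gpu_tb_s

nvl72_gen5 = aggregate_tb_s(72, 1.8)  # GB200 NVL72, NVLink 5
nvl72_gen6 = aggregate_tb_s(72, 3.6)  # same 72-package rack at NVLink 6 rates

print(f"NVLink 5, 72 GPUs: {nvl72_gen5:.1f} TB/s")
print(f"NVLink 6, 72 packages: {nvl72_gen6:.1f} TB/s")
```

The first product, 129.6 TB/s, is the ~130 TB/s headline. The second, 259.2 TB/s, is in line with the ~260 TB/s quoted for the Vera Rubin rack.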

The NVL domain is the unit that matters operationally. A GB200 NVL72 is one 72-GPU coherent domain in a single rack. The Vera Rubin generation keeps the 72-package rack but counts 144 dies (the rack NVIDIA now officially calls VR200 NVL72, ~260 TB/s scale-up), and Multi-Node NVLink (MNNVL) plus the IMEX (Internode Memory Exchange) service extend the coherent address space across racks — the software substrate that lets the domain outgrow a single chassis. Rubin Ultra's NVL576 'Oberon' stitches eight 72-GPU racks into one 576-GPU NVLink domain, and NVIDIA's roadmap points further to NVL1152-class all-to-all systems. The scheduler-visible consequence: the NVLink domain becomes a first-class allocatable resource — Slurm block scheduling and Kubernetes operators now place jobs on whole domains so a tightly-coupled job lands inside one coherent island rather than straddling the slow fabric. → Chapter 8.1 for the collective/parallelism mapping.
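
Treating the domain as an allocatable resource can be sketched as a best-fit choice over whole domains. This is an illustration of the idea only; it is not the Slurm or Kubernetes API, and the domain names, free-GPU counts, and `pick_domain` helper are all hypothetical.

```python
from typing import Optional

def pick_domain(free_gpus: dict[str, int], gpus_needed: int) -> Optional[str]:
    """Best fit: the domain with the fewest free GPUs that still holds the whole job."""
    candidates = [(free, name) for name, free in free_gpus.items() if free >= gpus_needed]
    if not candidates:
        # No single domain can hold the job; the caller decides whether to queue or split.
        return None
    return min(candidates)[1]

free = {"rack-a": 72, "rack-b": 40, "rack-c": 16}  # hypothetical free GPUs per NVLink domain
print(pick_domain(free, 32))   # rack-b
print(pick_domain(free, 64))   # rack-a
print(pick_domain(free, 100))  # None
```

Best fit keeps the emptiest domains available for the largest tightly coupled jobs. A production scheduler would also weigh priority, fragmentation over time, and failure state.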

The scale-up-over-Ethernet three-way

For most of NVLink's life there was no scale-up standard — if you wanted a coherent multi-GPU domain, you bought NVIDIA's. The 2025–26 inflection is that the rest of the industry has converged on open scale-up fabrics, and they have split into a three-way contest. The fork is strategic before it is technical: a coherent scale-up domain is the deepest lock-in surface in the whole stack, because the accelerator, the switch, and the collective library co-design around it. Breaking that open is the explicit goal of the challengers.

NVLink / NVLink Fusion is the vertically-integrated path. NVLink Fusion is NVIDIA's 2025 move to license the NVLink interface so third-party CPUs and custom XPUs can join an NVLink domain: a partial opening that keeps NVIDIA at the center of the fabric while courting the custom-silicon builders who would otherwise defect.

UALink (Ultra Accelerator Link) is the clean-room open standard: the 200G 1.0 spec (April 2025) defines a switched, low-latency, memory-semantic fabric for up to 1,024 accelerators, with sub-1 µs round-trip latency on <4 m reach and 200 GT/s per lane (UALink Consortium, 2025). AMD's MI400-series 'UALoE72' and MI500 'UAL256' productize it; switch silicon comes from Astera Labs and Broadcom, with hardware expected through 2026–27.

Broadcom Scale-Up Ethernet (SUE), now aligned with the OCP ESUN (Ethernet for Scale-Up Networking) effort, takes the opposite philosophical bet: reuse Ethernet's SerDes, switch silicon, and ecosystem rather than invent a new fabric. A single Tomahawk 6 (102.4 Tbps, shipping in volume March 2026) connects 512 XPUs in single-hop all-to-all as a scale-up fabric, and the same silicon does scale-out. That merchant-switch flexibility is SUE's whole argument.
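
The single-switch domain size is a radix trade: a fixed switch capacity divided across ports. The sketch below does that division for the Tomahawk 6 figure quoted above. It is arithmetic only, with the 256- and 1,024-port rows included for illustration; it makes no claim about which port configurations the silicon actually supports, and `per_port_gbps` is a hypothetical helper.

```python
SWITCH_GBPS = 102_400  # Tomahawk 6 capacity from the text: 102.4 Tbps

def per_port_gbps(ports: int, switch_gbps: int = SWITCH_GBPS) -> float:
    """Even split of switch capacity across single-hop ports."""
    return switch_gbps / ports

for ports in (256, 512, 1024):  # 512 is the text's figure; the others are illustrative
    print(f"{ports} ports -> {per_port_gbps(ports):.0f} Gb/s per port")
```

At 512 XPUs on one switch, each XPU gets 200 Gb/s of single-hop bandwidth from that switch. Reaching per-accelerator figures in the TB/s range therefore implies many such switch planes in parallel, which is a design assumption to confirm against vendor reference architectures.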

Scale-up fabric standards: the lock-in-vs-commoditization fork

| Fabric | Backer / model | Per-device BW | Max domain | Medium / reach | Strategic posture |
|---|---|---|---|---|---|
| NVLink 5 / 6 + NVSwitch | NVIDIA, vertically integrated | 1.8 TB/s (Gen5) → 3.6 TB/s (Gen6) | 72 (NVL72) → 576 (Oberon) → 1,152 (roadmap) | Copper in-rack (under ~1 m); optics/CPO multi-rack | Deepest lock-in; full co-design; vendor sets roadmap & margin |
| NVLink Fusion | NVIDIA, licensed interface | Same NVLink generation | Same NVL domain | Same as NVLink | Partial opening: 3rd-party CPU/XPU into an NVLink domain |
| UALink 200G 1.0 | UALink Consortium (AMD, Intel, Astera, Broadcom, hyperscalers) | 200 GT/s/lane; x4 → 800 GT/s | Up to 1,024 accelerators | Optimized <4 m; <1 µs RTT | Open standard, clean-room; commoditize the switch & accelerator |
| Broadcom SUE / OCP ESUN | Broadcom + OCP, merchant Ethernet | Per Tomahawk-6 radix (102.4 Tbps switch) | 512 XPUs single-hop (one TH6) | Ethernet PHY: copper, then pluggable/CPO | Reuse Ethernet ecosystem; one silicon for scale-up & scale-out |

Domain sizes and bandwidths are 2026-current vendor/consortium figures; see the key numbers for sources and vintages. 'Reach' is the in-domain medium budget before optics are mandatory. Per-device bandwidth is bidirectional aggregate.

The table is a bet on supplier power. The vertically-integrated path buys you the most mature, highest-bandwidth, best-co-designed fabric on the market today — at the price of single-vendor dependence on the most strategically important boundary in your cluster. The open paths invert that: you accept a maturity gap and an integration burden in exchange for multi-vendor switch and accelerator sourcing and a credible threat that keeps incumbent margins honest. There is no neutral choice. Even buying NVLink is a bet — that NVIDIA's roadmap stays ahead far enough, fast enough, to justify the lock-in premium. → the merchant-vs-captive silicon business model is framed in Chapter 8.3.

The memory-semantic landscape: CXL, and the TPU/ICI alternative

Two adjacent fabrics belong in the same mental model, because both are memory-semantic but neither is a drop-in scale-up replacement. CXL (Compute Express Link) is cache-coherent over PCIe physical layers; its sweet spot is memory expansion and pooling — adding or sharing DRAM/HBM capacity across hosts — not the terabyte-per-second all-to-all that training collectives demand. In an AI rack, CXL and the scale-up fabric are complementary: CXL widens the memory pool and disaggregates capacity (increasingly relevant to KV-cache tiering), while NVLink/UALink carries the high-bandwidth collective traffic. Conflating them strands bandwidth on the wrong network — using CXL for an all-reduce, or a scale-up fabric for cold-capacity pooling, both leave performance on the table.

The cleanest existence proof that a different scale-up philosophy works at scale is Google's TPU. Its ICI (Inter-Chip Interconnect) wires chips into a 3D torus (six links per chip in ±X/±Y/±Z) rather than an all-to-all crossbar. Ironwood (TPU v7) runs ICI at 9.6 Tb/s per chip and forms 64-chip cubes that scale, via reconfigurable Optical Circuit Switches (OCS), into superpods of up to 9,216 chips (144 cubes, 48 OCS units, 13,824 optical ports). That is a single coherent scale-up domain roughly two orders of magnitude larger than a 72-GPU NVLink rack, at the cost of a torus's longer worst-case hop count and a topology tuned to TPU collectives. The OCS twist matters: optics here are a circuit-switched, reconfigurable substrate, not packet switches, which lets Google route around failed cubes and reshape the torus per job. It is a fundamentally different answer to blast radius than NVIDIA's fixed crossbar. The lesson for a vendor-neutral reader: 'scale-up domain' is a function, and torus+OCS, all-to-all crossbar, and Ethernet-based fabrics are three legitimate implementations with different reach, blast-radius, and scheduling consequences. → the OCS topology recurs in scale-out in Chapter 8.5.
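
The superpod figures hang together arithmetically, and the hop-count cost of a torus is easy to quantify. The sketch below derives the per-cube and per-OCS port counts from the numbers above and computes the worst-case hop count of a wraparound torus. The 4×4×4 cube shape and the 16×24×24 superpod shape are assumptions chosen because they multiply to 64 and 9,216; they are not stated Ironwood topologies, and real slice shapes are reconfigurable.

```python
def torus_diameter(dims: tuple[int, ...]) -> int:
    """Worst-case hop count in a wraparound torus: floor(d/2) hops per dimension."""
    return sum(d // 2 for d in dims)

CHIPS, CUBE_CHIPS, OCS_UNITS, OPTICAL_PORTS = 9216, 64, 48, 13_824

cubes = CHIPS // CUBE_CHIPS                 # cubes per superpod
ports_per_cube = OPTICAL_PORTS // cubes     # optical ports attributed to each cube
ports_per_ocs = OPTICAL_PORTS // OCS_UNITS  # ports terminated on each OCS unit

print(cubes, ports_per_cube, ports_per_ocs)
print("4x4x4 cube diameter:", torus_diameter((4, 4, 4)))          # assumed cube shape
print("16x24x24 torus diameter:", torus_diameter((16, 24, 24)))   # hypothetical superpod shape
```

Under these assumed shapes the worst-case path grows from 6 hops inside a cube to 32 hops across the full torus, whereas a single-tier all-to-all crossbar keeps every pair one switch traversal apart. That difference is the hop-count price paid for the much larger domain.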

How domain size shapes training and inference

This is where the abstract 'domain size' decision turns into MFU and tokens-per-dollar. For training, the scale-up domain sets the ceiling on tensor parallelism and the practical limit on expert parallelism. Tensor parallelism shards a single layer's matmuls across GPUs and exchanges activations on every forward and backward step — a latency- and bandwidth-bound all-reduce that only stays cheap inside the scale-up fabric. Push TP past the domain boundary onto the back-end network and MFU falls hard, because the per-step collective now runs at scale-out speed. A 72-GPU domain lets a frontier model carry a high TP degree and still leave room for pipeline and data parallelism across the scale-out fabric; an 8-GPU domain caps TP at 8 and forces the rest of the parallelism budget onto slower links.
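
A rough cost model shows why the per-step collective dominates once it leaves the fast fabric. The sketch below uses the standard bandwidth term for a ring all-reduce, in which each rank sends 2(n-1)/n of the message. It ignores latency, in-switch reduction, and compute overlap, so it is a first-order estimate only. The 64 MiB message size is hypothetical, and the per-direction bandwidths assume the 1.8 TB/s figure is a bidirectional aggregate.

```python
def ring_allreduce_seconds(message_bytes: float, ranks: int, one_way_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each rank moves 2*(n-1)/n of the message."""
    return 2 * (ranks - 1) / ranks * message_bytes / one_way_bytes_per_s

MSG_BYTES = 64 * 2**20   # hypothetical 64 MiB activation all-reduce
SCALE_UP_BPS = 900e9     # 1.8 TB/s bidirectional, taken as 900 GB/s per direction
SCALE_OUT_BPS = 50e9     # 400 Gb/s NIC, about 50 GB/s per direction

fast = ring_allreduce_seconds(MSG_BYTES, 8, SCALE_UP_BPS)
slow = ring_allreduce_seconds(MSG_BYTES, 8, SCALE_OUT_BPS)
print(f"scale-up:  {fast * 1e6:.0f} us per all-reduce")
print(f"scale-out: {slow * 1e6:.0f} us per all-reduce")
print(f"slowdown:  {slow / fast:.0f}x")
```

Because the message size and rank count cancel, the slowdown in this model is just the bandwidth ratio (18x with these inputs). Since TP runs such a collective on every forward and backward step, that factor lands directly on step time unless it can be hidden behind compute.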

For inference, the domain is the enabler of wide expert parallelism in Mixture-of-Experts models. Each MoE layer routes tokens to a subset of experts via an all-to-all — the most punishing collective for a slow fabric. NVIDIA's own measurements on NVL72 show wide expert parallelism (EP32 and beyond) substantially outperforming narrow EP8 precisely because the all-to-all stays inside the coherent domain (NVIDIA Developer, 2025). The domain also reshapes prefill/decode disaggregation: a large coherent domain lets you place a prefill pool and a decode pool in the same NVLink island and stream KV-cache between them at memory speed rather than over the network — the substrate behind GB200 NVL72 + Dynamo MoE serving. The decision consequence is direct: your maximum profitable EP degree, and therefore your tokens-per-dollar on large MoE models, is bounded by how many accelerators you put in one scale-up domain. → inference archetype in Chapter 1.3; training archetype in Chapter 1.2.
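
One way to see the EP bound is to ask what share of the all-to-all stays on the fast fabric. The sketch below is a simplification that assumes tokens are routed uniformly across EP ranks and that ranks are packed contiguously into scale-up domains. Real routers are skewed and placement varies, so treat the output as an upper-level intuition rather than a measurement. `in_domain_fraction` is a hypothetical helper.

```python
def in_domain_fraction(ep_degree: int, domain_size: int) -> float:
    """Share of all-to-all traffic that stays inside one scale-up domain,
    assuming uniform routing over EP ranks packed contiguously into domains."""
    return min(ep_degree, domain_size) / ep_degree

for ep, domain in [(8, 8), (32, 8), (32, 72), (64, 72)]:
    frac = in_domain_fraction(ep, domain)
    print(f"EP{ep} on {domain}-GPU domains: {frac:.0%} of all-to-all stays in-domain")
```

Under these assumptions, EP32 on 8-GPU domains keeps only a quarter of its token traffic on the scale-up fabric, while the same EP32 inside a 72-GPU domain keeps all of it there.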

Key numbers

| Figure | What it measures | Vintage | Source |
|---|---|---|---|
| 1.8 → 3.6 TB/s | NVLink per-GPU bandwidth: Gen5 (Blackwell) → Gen6 (Rubin); Gen4 (Hopper) was 900 GB/s | 2026 | NVIDIA NVLink / Rubin platform |
| ~130 TB/s | Aggregate NVLink bandwidth in one GB200 NVL72 domain (72 GPUs, ~13.4 TB unified memory) | 2025 | NVIDIA NVLink / OCP NVL72 contribution |
| 72 → 576 → 1,152 | NVLink domain size: NVL72 → Rubin Ultra Oberon NVL576 → NVL1152 (roadmap) | 2026 (announced) | NVIDIA Vera Rubin roadmap; HPCwire |
| 1,024 | Max accelerators per UALink 200G 1.0 domain; <1 µs RTT on <4 m reach, 200 GT/s/lane | 2025 | UALink Consortium 200G 1.0; Tom's Hardware |
| 512 XPUs | Single-hop all-to-all scale-up domain on one Broadcom Tomahawk 6 (102.4 Tbps, SUE/ESUN) | 2026 | Broadcom Tomahawk 6 launch |
| 9,216 chips | Google Ironwood (TPU v7) scale-up superpod via ICI (9.6 Tb/s/chip) + OCS (48 units, 13,824 ports) | 2026 | Google Cloud; SemiAnalysis |
| Passive ~1–2 m | Copper reach at 800G/1.6T: passive DAC ~1–2 m, active (AEC) ~3–7 m, optics beyond (the domain-size wall) | 2025 | SemiAnalysis (GB200 architecture) |
| ~9 W vs ~30 W | Per-interface optical power, CPO vs traditional pluggable (the efficiency case for in-domain optics) | 2025 | NVIDIA (Scaling AI Factories with CPO) |

Copper vs optical inside the domain — and the CPO transition

The single physical fact that gates domain size is copper reach. A scale-up link runs at the same lane rates as the rest of the fabric (200G/lane and climbing), and at those rates passive copper (DAC) carries a clean signal only ~1–2 m; active electrical cable (AEC) stretches that to ~3–7 m at a power and reliability cost. Inside a single rack, that is enough: the GB200 NVL72 was deliberately engineered around copper, with roughly 5,184 in-rack NVLink cables on a spine backplane whose worst-case span is well under a metre, which is why NVIDIA fought to keep the NVL72 domain all-copper. Copper here is not a compromise; it is the right answer when it fits, saving roughly 20 kW per rack of optics power and eliminating thousands of failure-prone transceivers. → the physical-layer reach taxonomy is detailed in Chapter 8.9.
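
The reach bands above translate into a simple first-pass rule for choosing a medium by link length. This is a minimal sketch that takes the upper end of each quoted range as the cutoff; actual limits depend on lane rate, cable gauge, connector losses, and vendor qualification, so the thresholds and the `medium_for` helper are illustrative assumptions.

```python
PASSIVE_DAC_MAX_M = 2.0  # upper end of the ~1-2 m passive copper range
AEC_MAX_M = 7.0          # upper end of the ~3-7 m active electrical cable range

def medium_for(length_m: float) -> str:
    """First-pass medium choice for a scale-up link of the given length."""
    if length_m <= PASSIVE_DAC_MAX_M:
        return "passive DAC"
    if length_m <= AEC_MAX_M:
        return "AEC"
    return "optics"

for length in (0.8, 1.5, 5.0, 12.0):  # hypothetical link lengths in metres
    print(f"{length:>5.1f} m -> {medium_for(length)}")
```

A sub-metre in-rack span falls squarely in the passive band, which is the NVL72 case. Anything that has to cross several racks lands in the AEC or optics bands.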

The wall arrives when the domain outgrows the rack. A 576-GPU Oberon domain spans eight racks; rack-to-rack at NVLink-6 lane rates is past copper's reach, so those links must go optical. That is the forcing function behind NVIDIA's embrace of co-packaged optics (CPO) for scale-up: as the company puts it, use copper where you can and optics where you must (The Register, 2026). CPO moves the optical engine onto the switch package, cutting per-interface power to ~9 W versus ~30 W for a traditional pluggable (NVIDIA, 2025) — a saving that becomes existential once a domain needs thousands of optical links. NVIDIA's stated plan applies CPO to scale-up NVL576 around 2027, with a co-packaged NVLink optical ASIC following. Broadcom's Tomahawk 6 already ships in a CPO ('Davisson') variant. The moment you commit to a multi-rack coherent domain, you have also committed to an optical scale-up fabric and the CPO serviceability model — fewer field-replaceable transceivers, more board-level repair, and a different spares strategy. → CPO and the fiber plant are engineered in Chapter 8.10.
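
The power case for CPO scales linearly with the number of optical interfaces. The sketch below applies the ~9 W and ~30 W per-interface figures to a hypothetical count of 2,000 interfaces. That count is an assumption for illustration, not a published NVL576 number, and `optics_power_kw` is a hypothetical helper.

```python
CPO_W = 9         # per-interface power with co-packaged optics (from the text)
PLUGGABLE_W = 30  # per-interface power with a traditional pluggable (from the text)

def optics_power_kw(interfaces: int, watts_each: float) -> float:
    """Total optical interface power in kW."""
    return interfaces * watts_each / 1000

interfaces = 2000  # hypothetical count for a multi-rack domain
pluggable_kw = optics_power_kw(interfaces, PLUGGABLE_W)
cpo_kw = optics_power_kw(interfaces, CPO_W)
saving_kw = pluggable_kw - cpo_kw

print(f"pluggable: {pluggable_kw:.0f} kW, CPO: {cpo_kw:.0f} kW, saving: {saving_kw:.0f} kW")
print(f"reduction: {saving_kw / pluggable_kw:.0%}")
```

Whatever the real interface count turns out to be, the proportional reduction at these per-interface figures is 70%, and the absolute saving grows with every link the domain adds.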

Deep dive: why the NVL72 stayed copper — and why NVL576 cannot

The GB200 NVL72 is a master class in keeping a domain inside copper's reach, and understanding it explains why the next step is forced optical. NVIDIA placed all 72 GPUs and nine NVSwitch trays in a single rack and routed the NVLink fabric over a copper spine backplane (roughly 5,184 cables) rather than optics. The engineering rationale was threefold. Power: driving that many links over optics would add roughly 20 kW per rack of transceiver power, a steep penalty when the rack already draws ~120–132 kW and is cooling-bound. Reliability: optical transceivers are among the highest-failure-rate components in any fabric; thousands of them inside the most tightly-coupled part of the cluster would raise the interrupt rate of a synchronous job that already restarts on a single fault. Cost: copper DAC/backplane is dramatically cheaper per link than an optical interface. The whole NVL72 mechanical design (compact rack, short spine, sub-metre worst-case span) exists to keep those ~130 TB/s of links on copper.

That envelope does not survive the jump to 576. Spreading a coherent domain across eight racks means the longest scale-up links are several metres at NVLink-6 lane rates: past passive DAC, past practical AEC, and into territory where signal integrity and cable bulk make copper untenable. So the balance that favored copper in NVL72 tips toward optics in NVL576: at multi-rack scale, optics wins on signal integrity outright, and CPO's ~9 W-per-interface figure is what keeps the power penalty survivable where pluggables (~30 W) would not be. The deeper lesson is that domain size, medium, and rack power are one coupled decision: you cannot choose a 576-GPU domain, a copper fabric, and a 600 kW rack independently, because the physics binds them together. → 800 VDC / ~600 kW racks and the multi-rack optical roadmap consolidate in Chapter 16.2.

The scale-up roadmap, and where it points

The trajectory through 2027 is legible and tightly coupled to power. Per-GPU scale-up bandwidth has doubled each generation so far (900 GB/s → 1.8 TB/s → 3.6 TB/s), and NVLink 7 targets ~10.8 TB/s on Rubin Ultra, a further tripling. Domain size climbs from 72 to 144 to 576 and, on the roadmap, to NVL1152-class all-to-all systems. The medium transitions from all-copper in-rack to optical/CPO multi-rack as domains cross the rack boundary. And all of it rides on a power envelope that climbs from ~132 kW (NVL72) toward ~600 kW Kyber-class racks on an 800 VDC distribution architecture, because a bigger, faster scale-up domain is, at bottom, a denser, hotter, more power-hungry rack. These three curves (bandwidth, domain size, medium) do not move independently, and a roadmap that treats them separately will mis-plan the substrate. The full subsystem roadmap, including the 800 VDC / ~600 kW rack and the multi-rack optical-domain timeline, is consolidated in Chapter 16.2; the macro power-bound rationale is Chapter 16.1.

The collective primitives and parallelism mapping that justify fitting TP/EP inside the scale-up domain are in Chapter 8.1. The switch ASICs, NICs, and the merchant-vs-captive silicon business model behind these fabrics are in Chapter 8.3. Scale-out transport and the InfiniBand-vs-Ethernet-vs-Ultra-Ethernet contest pick up where this chapter's boundary ends in Chapter 8.4; topology and oversubscription in Chapter 8.5; SHARP's scale-out analogue and congestion control in Chapter 8.6; multi-campus scale-across in Chapter 8.8. The physical-layer reach taxonomy (DAC/AEC/optics) is Chapter 8.9, and CPO plus the fiber plant is Chapter 8.10. The blast-radius and checkpoint coupling is engineered in Chapter 9.4; the rack-power and 800 VDC substrate this fabric rides on is Chapter 16.2.