
Chapter 8.1

Network Fundamentals & AI Traffic Characterization

In a synchronous AI cluster the network is not plumbing between computers — it is the part of the computer that decides how fast every other part is allowed to run, because the whole job moves at the speed of its slowest collective and stalls on its longest tail.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Where you draw the scale-up / scale-out / scale-across boundaries — because that single partition decides which traffic rides a 1.8 TB/s memory-semantic fabric, which rides a 400–800 Gb/s packet fabric, and which crawls across a WAN, and it is the upstream choice every other networking chapter inherits.
Which parallelism strategy (TP/EP vs PP/DP) maps onto which network tier — get the mapping wrong and you push an all-reduce that wanted 1.8 TB/s of NVLink onto a 400 Gb/s NIC and watch model-FLOPS-utilization collapse.
Whether your design target is raw link rate or delivered goodput — the two diverge under tail latency, congestion, and the BSP barrier, and only one of them shows up in the training run's wall-clock time.
Whether you are building a training fabric (tightly-coupled, bandwidth-and-tail-bound, optimized for job-completion-time) or an inference fabric (loosely-coupled, latency-SLO-bound, optimized for tokens-per-second-per-dollar) — they characterize traffic differently and pull the design in opposite directions.
Which traffic-characterization artifacts (collective profile, message-size histogram, incast/tail budget) must exist before you size switches, optics, and buffers — because a fabric sized from peak link rate instead of measured traffic is over-built where it does not matter and under-built where it does.

Bandwidth drops by orders of magnitude across the three network tiers — which is why tightly-coupled parallelism must stay inside the scale-up domain.

In a single-GPU world the network is an afterthought, a way to feed data in and ship results out. In a frontier AI cluster the interconnect is a first-class component of the computer, and for the largest synchronous jobs it is the component that gates all the others. The reason is structural. A modern training run is one program spread across tens of thousands of accelerators that must agree, repeatedly and synchronously, on the value of a shared set of gradients. That agreement is a collective communication — an all-reduce, all-gather, or reduce-scatter — and until it completes, every GPU that finished its share of the work sits idle at a barrier waiting for the slowest participant. The network is not transporting bytes between computations; it is sequencing the computation itself. Misjudge it and you do not get a slow network. You get expensive accelerators running at a fraction of their FLOPS because they spend the wall-clock waiting on each other.

This chapter is the foundation for all of Part 8. It establishes the three-network model (scale-up, scale-out, scale-across) and why its internal boundaries are moving; the collective primitives and how parallelism strategies map onto each network tier; the tyranny of the tail and the BSP barrier semantics that turn a one-in-a-thousand slow link into a fleet-wide stall; and the divergence between raw link rate and delivered goodput that decides whether a fabric earns its capital. We close on the two traffic profiles — training versus inference — and on job-completion-time as the north-star metric that reframes every downstream networking decision. Each fork below names what choosing wrong costs you in goodput, dollars, or stranded megawatts.

The three-network model

An AI cluster is not one network. It is a hierarchy of at least three, each with a different bandwidth, latency, reach, and cost-per-bit, and the single most consequential network-design decision is where you draw the boundaries between them. The hierarchy spans roughly three orders of magnitude in per-GPU bandwidth and four in latency, and traffic that lands on the wrong tier is either ruinously expensive or ruinously slow.

Scale-up is the memory-semantic fabric inside a tightly-coupled domain: historically a node, now a rack-scale domain of 72 GPUs (NVL72) heading toward 576 (Rubin Ultra). It is the fastest and most expensive tier: NVLink 5 delivers ~1.8 TB/s per GPU, an NVL72 rack carries ~130 TB/s of aggregate NVLink bandwidth (72 × 1.8 TB/s), and the semantics are load/store, so a GPU reads another GPU's memory almost as if it were local. This is where you want tensor-parallel and expert-parallel traffic to live, because those collectives are bandwidth-hungry on every layer. → Chapter 8.2.

Scale-out is the packet-switched fabric that stitches domains into a cluster of tens of thousands of GPUs. It is roughly an order of magnitude slower per GPU and an order of magnitude cheaper: a 400 Gb/s NIC moves ~100 GB/s counting both directions, ~1/18th of NVLink 5's 1.8 TB/s bidirectional per-GPU figure, and 800 Gb/s narrows that to ~1/9th. The semantics are message-passing (RDMA over InfiniBand or RoCE/Ethernet), and this is the natural home for data-parallel gradient all-reduce and pipeline-parallel point-to-point. → Chapter 8.4 and Chapter 8.5.

Scale-across is the newest tier and a creature of the power-bound era: the WAN/DCI fabric that couples multiple campuses when no single site can be energized at the scale a frontier run demands. Latency jumps to milliseconds, bandwidth-per-GPU collapses, and the algorithms must change to tolerate it (asynchronous, hierarchical, or local-SGD methods, gradient compression). Three years ago this tier did not exist in production; in 2025–2026 it became the defining frontier-lab architecture because the grid, not the GPU, is the binding constraint. → Chapter 8.8; the power-bound rationale in Chapter 16.1.
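The per-GPU gap between the first two tiers can be checked with a few lines of unit conversion. This is a minimal sketch, assuming the 1.8 TB/s NVLink 5 figure counts both directions and that NIC line rates are quoted per direction; the function and variable names are illustrative only.

```python
# Per-GPU bandwidth by tier, normalized to GB/s counting both directions.
# Assumption: NVLink 5's 1.8 TB/s is a bidirectional figure, so NIC rates
# (quoted per direction in Gb/s) are doubled for a like-for-like comparison.

def nic_bidirectional_gbytes(line_rate_gbits: float) -> float:
    """Convert a per-direction NIC line rate in Gb/s to bidirectional GB/s."""
    return line_rate_gbits / 8 * 2

NVLINK5_GBYTES = 1800.0  # ~1.8 TB/s per GPU, from the text

# Ratio of scale-up to scale-out per-GPU bandwidth for each NIC generation.
ratios = {
    rate: NVLINK5_GBYTES / nic_bidirectional_gbytes(rate)
    for rate in (400, 800)
}

for rate, ratio in ratios.items():
    print(f"{rate} Gb/s NIC: NVLink 5 is ~{ratio:.0f}x faster per GPU")
```

Under these assumptions the 400 Gb/s case reproduces the ~18x figure used throughout the chapter, and 800 Gb/s halves it.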

The master fork: where the scale-up boundary sits

Set the scale-up domain size before anything else in the fabric. Every collective in your job lands on exactly one of the three tiers, and the scale-up domain size decides which. A bigger scale-up domain (8 → 72 → 576 GPUs) means more of your tensor- and expert-parallel traffic stays on the 1.8 TB/s memory-semantic fabric instead of spilling onto the ~10x-slower packet fabric — which directly lifts the tensor-parallel and expert-parallel ceilings and therefore model-FLOPS-utilization. But a bigger domain is more expensive per GPU, has a larger blast radius (one NVSwitch failure can take down the whole domain), and is approaching the copper reach wall that forces optical scale-up. The boundary you draw is simultaneously a performance decision, a reliability decision, and a cost decision, and the rest of Part 8 inherits it. Decide it from the workload's collective profile, never from a switch vendor's reference diagram.
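As a rough illustration of the fork, the check below asks whether one tensor-parallel by expert-parallel group fits inside a scale-up domain of a given size. It is a sketch under a simplifying assumption (the group occupies tp × ep GPUs and must sit in one domain); real frameworks lay out parallel groups in more ways than this, and `ParallelPlan` and `stays_on_scale_up` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelPlan:
    tp: int  # tensor-parallel degree
    ep: int  # expert-parallel degree

def stays_on_scale_up(plan: ParallelPlan, domain_gpus: int) -> bool:
    """True if one TP x EP group fits inside a single scale-up domain.

    Simplifying assumption: the group needs tp * ep GPUs in one domain.
    """
    return plan.tp * plan.ep <= domain_gpus

plan = ParallelPlan(tp=8, ep=8)  # hypothetical 64-GPU TP x EP group

# Domain sizes named in the text: HGX node, NVL72 rack, roadmapped 576.
fit = {size: stays_on_scale_up(plan, size) for size in (8, 72, 576)}

for size, ok in fit.items():
    print(f"domain of {size:>3} GPUs: {'stays on scale-up' if ok else 'spills to scale-out'}")
```

With this hypothetical plan, the 8-GPU domain forces the group onto the packet fabric while the 72- and 576-GPU domains keep it on the memory-semantic fabric.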

Collective primitives and the parallelism mapping

Distributed training is built from a small vocabulary of collective communication primitives, and almost all the traffic in a cluster is one of them. All-reduce (sum a tensor across all ranks, return the result to all) is the workhorse of data-parallel gradient synchronization. All-gather and reduce-scatter (the two phases that together make up a ring all-reduce) move sharded weights and activations under tensor and sharded-data parallelism. All-to-all is the signature of mixture-of-experts: every token is routed to its chosen experts, so every rank exchanges with every other rank, which makes it the most punishing pattern for a fabric because it stresses bisection bandwidth uniformly. Point-to-point sends move activations forward and gradients backward between adjacent pipeline stages.
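To make the relationship between the primitives concrete, here is a small single-process simulation of a ring all-reduce built from a reduce-scatter phase followed by an all-gather phase. It is a teaching sketch, not how any particular collective library implements it; ranks are modeled as Python lists and each chunk is a single number.

```python
def ring_all_reduce(buffers):
    """Sum-all-reduce over a simulated ring.

    buffers[r][c] is chunk c held by rank r; each rank holds as many
    chunks as there are ranks. Returns new per-rank buffers.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. At each step every rank sends one chunk to
    # its right-hand neighbor, which adds it in. After n-1 steps rank r
    # holds the complete sum of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2: all-gather. Each rank forwards the most recently completed
    # chunk it holds; the neighbor overwrites its partial copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data

ranks = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
result = ring_all_reduce(ranks)
print(result[0])
```

Each rank sends 2(n-1) chunks of size S/n, so the per-rank traffic approaches twice the tensor size as the ring grows, which is why the collective is bandwidth-bound for large messages.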

The engineering content is the mapping: which parallelism dimension generates which collective, and therefore which network tier it must ride. Get the mapping right and each collective lands on a fabric sized for it; get it wrong and you push a bandwidth-bound collective onto a tier that cannot feed it, and the job stalls. The four parallelism dimensions are nested precisely so they can be mapped onto the bandwidth hierarchy.

Parallelism → collective → network-tier mapping

Parallelism	Dominant collective	Frequency	Target tier	Why it lands there
Tensor (TP)	All-gather / reduce-scatter (per layer)	Every layer, every step	Scale-up (NVLink)	Highest-frequency, bandwidth-bound; only the 1.8 TB/s fabric keeps it off the critical path
Expert (EP, MoE)	All-to-all (token routing)	Every MoE layer	Scale-up; spills to scale-out as EP degree grows	Bisection-stressing; wider scale-up domains raise the EP ceiling before it spills
Pipeline (PP)	Point-to-point (activations / gradients)	Per micro-batch boundary	Scale-out (intra-cluster)	Sparse, latency-tolerant; cheap packet fabric is sufficient
Data (DP / FSDP)	All-reduce (gradients); all-gather (params)	Once per step (or sharded)	Scale-out; scale-across at multi-campus scale	Large but infrequent; tolerant of oversubscription and, with the right algorithm, WAN latency

The placement rule frontier clusters actually follow: bandwidth-hungry, every-step collectives stay on scale-up; less frequent or latency-tolerant traffic spills to scale-out and beyond. Bandwidth figures are 2025–2026 NVIDIA-class reference points; see keynumbers for sources.

The table is a placement rule. The design move is to keep the top, highest-frequency rows (TP, EP) inside the scale-up domain where 1.8 TB/s of bandwidth makes the collective nearly free, and to let the lower-frequency rows (PP, DP) spill onto the cheaper scale-out fabric where their infrequency tolerates lower bandwidth and even oversubscription. This is exactly why the scale-up domain size is the master fork: every GPU you can add to the domain is a chunk of TP/EP traffic you keep off the slow fabric. It is also why mixture-of-experts inference reshaped fabric design: wide expert parallelism wants the largest scale-up domain it can get so the all-to-all stays on NVLink. → MoE inference and EP degree in Chapter 8.2.
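A back-of-envelope model shows what the placement rule buys. The sketch below estimates the bandwidth-only time of a ring all-reduce on each tier; it ignores latency, congestion, and overlap with compute, and the message size, group size, and per-direction bandwidths (900 GB/s as half of the 1.8 TB/s NVLink figure, 50 GB/s for a 400 Gb/s NIC) are assumptions chosen for illustration.

```python
def ring_all_reduce_seconds(message_bytes: float, ranks: int, gbytes_per_s: float) -> float:
    """Idealized bandwidth-only ring all-reduce time (no latency, no congestion)."""
    # A ring all-reduce moves 2 * (ranks - 1) / ranks of the message per rank.
    bytes_on_wire = 2 * (ranks - 1) / ranks * message_bytes
    return bytes_on_wire / (gbytes_per_s * 1e9)

MESSAGE = 256e6  # hypothetical 256 MB tensor
RANKS = 8        # hypothetical group size

# Assumed per-direction bandwidths for the two tiers.
scale_up = ring_all_reduce_seconds(MESSAGE, RANKS, 900.0)
scale_out = ring_all_reduce_seconds(MESSAGE, RANKS, 50.0)

print(f"scale-up:  {scale_up * 1e3:.2f} ms")
print(f"scale-out: {scale_out * 1e3:.2f} ms")
print(f"slowdown:  {scale_out / scale_up:.0f}x")
```

A collective that runs on every layer of every step pays that slowdown thousands of times per step, which is the arithmetic behind keeping TP and EP on scale-up.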

Tail-latency tyranny and the BSP barrier

Here is the single fact that makes AI networking different from every other kind of data-center networking: a synchronous training step is a bulk-synchronous-parallel (BSP) computation, and a BSP barrier runs at the speed of its slowest participant. When ten thousand GPUs all-reduce a gradient, the collective is not done when the median link finishes — it is done when the last one does. Average bandwidth is almost irrelevant; what governs wall-clock is the tail. A fabric with superb median latency and a fat 99.9th-percentile tail will deliver worse training throughput than a fabric with mediocre median latency and a tight tail. This is the tyranny of the tail, and it inverts the intuition every web-scale network engineer brings to the problem.
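The way a rare slow link becomes a near-certain stall follows from simple probability. The sketch below assumes each participant's path is slow independently with the same probability, which is an idealization (real congestion is correlated); the function name is hypothetical.

```python
def prob_step_hits_slow_link(p_slow: float, participants: int) -> float:
    """Probability that at least one of `participants` independent paths is slow.

    A BSP barrier waits for the slowest participant, so this is also the
    probability that a given step is stretched by the tail.
    """
    return 1.0 - (1.0 - p_slow) ** participants

# One-in-a-thousand slow paths at increasing job sizes.
for n in (8, 72, 1_000, 10_000):
    print(f"{n:>6} participants: {prob_step_hits_slow_link(0.001, n):.4f}")
```

At small scale the tail event is rare; at ten thousand participants almost every step contains at least one slow path, so the tail rather than the median sets the step time.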

The mechanism that produces a fat tail is incast and the congestion it triggers. In a rail-optimized fat-tree, an all-reduce produces synchronized, many-to-one traffic — many senders hammering one receiver at the same instant — which overruns switch buffers, triggers flow control (PFC) or packet drops, and stretches the tail. One slow link, one congested switch, one mis-tuned congestion-control loop, and the barrier that should have closed in microseconds drags for milliseconds, multiplied across every step of a months-long run. Public training analyses bear this out: lifting a fabric from a congested regime to a well-engineered one can cut GPU idle time from above ~33% to below ~15%, and at the 175B-parameter scale a rise in effective latency from ~10 µs to ~1000 µs, or 0.1% packet loss, each shaves roughly 10–13% off the GPU compute-time fraction (industry training-network analyses, 2025). Those are not rounding errors — they are the difference between a profitable cluster and a stranded one.

Why 'fast links, slow job' is the recurring failure mode

The most common way to build a disappointing AI fabric is to procure it on link rate and validate it on point-to-point bandwidth. Both pass. Then the real training job — synchronized all-reduce under incast, with a BSP barrier closing every step — exposes a tail the bandwidth test never touched, and model-FLOPS-utilization comes in 15–30% below the spec sheet. The defenses are tail-oriented, not bandwidth-oriented: adaptive routing and packet spray to spread incast across paths, fast congestion control (the gap that untuned RoCE leaves wide and InfiniBand and Spectrum-X close), in-network reduction (SHARP) to collapse the collective in the switch, and deep-buffer or VOQ switches at the incast points. You validate with collective benchmarks at scale and a measured tail budget — never with iPerf. → congestion control in Chapter 8.6; switch buffering in Chapter 8.3.
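Validating against a tail budget can be as simple as summarizing measured collective completion times by percentile. This is a minimal sketch using only the Python standard library; the sample data and the 500 µs budget are made up for illustration, and `tail_report` is a hypothetical helper rather than part of any benchmark suite.

```python
import statistics

def tail_report(samples_us, p999_budget_us):
    """Summarize collective completion times and check them against a tail budget."""
    # quantiles(n=1000) returns 999 cut points; index i is the (i+1)/1000 quantile.
    cuts = statistics.quantiles(samples_us, n=1000, method="inclusive")
    report = {
        "p50": statistics.median(samples_us),
        "p99": cuts[989],
        "p99.9": cuts[998],
    }
    report["within_budget"] = report["p99.9"] <= p999_budget_us
    return report

# Hypothetical measurements: 995 fast collectives and 5 stragglers.
samples = [100.0] * 995 + [5000.0] * 5
report = tail_report(samples, p999_budget_us=500.0)
print(report)
```

In this made-up data set the median and even the 99th percentile look healthy, and only the 99.9th percentile exposes the stragglers that a synchronous barrier would wait on.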

Goodput vs raw link rate

A 400 Gb/s link does not deliver 400 Gb/s of useful gradient synchronization, and the gap between the two is where AI fabrics are won and lost. Raw link rate is the SerDes specification. Goodput is the fraction of that rate that actually advances the job after you subtract protocol overhead, congestion losses, retransmissions, load-balancing inefficiency, and time lost waiting at the BSP barrier. The discipline is to design and procure against goodput, because raw link rate is the number on the invoice and goodput is the number in the training run's wall-clock.
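One way to reason about the gap is to treat each loss term as a multiplicative factor on the raw rate. The multiplicative form is an assumption made for this sketch, not a formula from the text, and every input value below is hypothetical.

```python
def goodput_gbits(line_rate_gbits, protocol_efficiency, retransmit_fraction,
                  load_balance_efficiency, barrier_wait_fraction):
    """Estimate delivered goodput by discounting the raw link rate.

    Assumption: the loss terms are independent and multiply.
    """
    return (line_rate_gbits
            * protocol_efficiency              # headers and framing overhead
            * (1.0 - retransmit_fraction)      # bytes sent more than once
            * load_balance_efficiency          # uneven path utilization
            * (1.0 - barrier_wait_fraction))   # time idle at the BSP barrier

# Hypothetical inputs for a 400 Gb/s link.
estimate = goodput_gbits(400.0, 0.95, 0.01, 0.90, 0.15)
print(f"~{estimate:.0f} Gb/s of useful transfer on a 400 Gb/s link")
```

Even with individually modest discounts, the product lands well below the number on the invoice, which is the point of procuring against goodput.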

The protocol choice is the cleanest illustration. InfiniBand delivers ~98–99% effective throughput at ~1–2 µs latency because it was purpose-built lossless and in-order. Untuned RoCEv2 over Ethernet can sit at 5–10 µs with a long tail and effective throughput far below line rate, because RoCE's go-back-N retransmission discards everything after an out-of-order packet and retransmits from there. The 2025–2026 answer — NVIDIA Spectrum-X and, generically, Ultra Ethernet — reclaims that gap with packet spray, direct data placement, and hardware reordering, landing at ~95% effective throughput, near InfiniBand, on commodity Ethernet silicon. Meta famously tuned RoCE to match InfiniBand and trained its largest models on it. The lesson is not about which protocol wins. It is that the delivered-throughput gap between a tuned and an untuned fabric is larger than the gap between any two link generations, so tuning and topology, not link rate, are where the goodput is. → the protocol war in Chapter 8.4.
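The cost of go-back-N can be illustrated with the classic textbook ARQ approximation, in which every lost packet forces a full window to be resent. This is a generic model, not a description of any specific NIC or of RoCEv2 internals, and the loss rate and window size below are hypothetical.

```python
def go_back_n_efficiency(loss: float, window_packets: int) -> float:
    """Textbook go-back-N link efficiency: each loss resends a whole window."""
    return (1.0 - loss) / (1.0 - loss + loss * window_packets)

def selective_repeat_efficiency(loss: float) -> float:
    """Textbook selective-repeat efficiency: only the lost packet is resent."""
    return 1.0 - loss

LOSS = 0.001    # hypothetical 0.1% packet loss
WINDOW = 1000   # hypothetical packets in flight per flow

gbn = go_back_n_efficiency(LOSS, WINDOW)
sr = selective_repeat_efficiency(LOSS)
print(f"go-back-N:        {gbn:.1%}")
print(f"selective repeat: {sr:.1%}")
```

Under this model the same loss rate that costs selective retransmission almost nothing roughly halves go-back-N throughput once the in-flight window is large, which is the direction of the gap the paragraph above describes.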

Key numbers

Figure	What it measures	Year	Source
1.8 TB/s	NVLink-5 per-GPU scale-up bandwidth (NVL72 = ~130 TB/s rack aggregate); ~18x a 400 Gb/s scale-out NIC	2025	NVIDIA NVLink; SemiAnalysis
98–99% vs ~95%	InfiniBand effective throughput vs Spectrum-X/tuned-Ethernet; untuned RoCEv2 far lower	2025	SemiAnalysis 100k H100 clusters; NVIDIA
~1–2 µs	InfiniBand all-reduce latency; tuned RoCEv2 ~1.5–2.5 µs; untuned 5–10 µs	2025	SemiAnalysis; vendor benchmarks
33% → 15%	GPU idle-time reduction achievable by moving from a congested to a well-engineered AI fabric	2025	Industry AI training-network analyses
~10–13%	GPU compute-time fraction lost at 175B scale from latency 10→1000 µs, or from 0.1% packet loss	2025	AI training-network performance studies
1:1 vs 2:1–3:1	Scale-out oversubscription: non-blocking for training vs 2:1–3:1 for inference (2:1 cuts back-end cost ~31%, contested)	2025	SemiAnalysis; Juniper AI cluster design
8.4%	Share of Llama-3 training interruptions attributed to network switch/cable faults	2024	Meta (Llama 3 paper)
~90% / ~96%	Industry-average / best-in-class goodput (effective training time), the JCT north-star	2025	SemiAnalysis ClusterMAX; CoreWeave

Training vs inference traffic profiles

The two dominant workloads characterize traffic so differently that they want different fabrics. Training traffic is synchronized, bursty-at-the-barrier, bisection-hungry, and east-west: the cluster is silent during compute, then every GPU floods the fabric at once during the collective, then falls silent again. It is governed by job-completion-time, tolerant of moderate latency on the data-parallel all-reduce but utterly intolerant of a fat tail, and it demands a 1:1 non-blocking back-end because oversubscribing the all-reduce starves it. Inference traffic is the opposite: many independent requests, each fitting inside a node or small scale-up domain, governed by a latency SLO (time-to-first-token, time-per-output-token) rather than a global barrier. It is loosely coupled, so the scale-out fabric can be oversubscribed 2:1–3:1 (2:1 alone trims roughly 31% of back-end cost, a contested figure), and the saved capital goes to geo-distribution and uptime instead.

The 2025–2026 twist is that inference traffic is no longer trivial. Reasoning models emit long decode sequences, mixture-of-experts routing creates real all-to-all traffic, and prefill/decode disaggregation moves the KV-cache across the fabric — so inference, once a single-node afterthought, now exercises the scale-up domain hard. The characterization still differs from training (no global synchronous barrier, latency-SLO-bound), but the gap is narrowing, and a fabric scoped for last-generation single-node inference can be undersized for MoE reasoning serving. → inference fabric demands in Chapter 1.3; MoE all-to-all in Chapter 8.2.

Training vs inference: how the two workloads characterize traffic

Property	Training fabric	Inference fabric
Coupling	Tight — synchronous BSP barrier every step	Loose — request fits a node / small scale-up domain
Traffic shape	Bursty east-west; silent-then-flood at the collective	Steady-ish many-to-one request/response; bursty arrivals
Governing metric	Job-completion-time (goodput / MFU)	Latency SLO (TTFT, TPOT) at target throughput
Tail tolerance	None — the barrier waits for the slowest link	Bounded — tail shows up as p99 latency, not a global stall
Oversubscription	1:1 non-blocking on the back-end	2:1–3:1 acceptable (~31% cheaper back-end at 2:1, contested)
Siting consequence	Power-first; can be far from users	Latency-first; geo-distributed near users

Both can be served by either protocol family; what differs is the blocking requirement, the governing metric, and the tail tolerance.
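The oversubscription ratio in the table translates directly into how a leaf switch's ports are split between GPU-facing downlinks and spine-facing uplinks. The sketch below assumes all ports run at the same speed and uses a hypothetical 64-port leaf; it says nothing about cost, only port counts.

```python
import math

def split_leaf_ports(total_ports: int, oversubscription: float):
    """Split a leaf switch's ports into (downlinks, uplinks) for a target ratio.

    Assumes equal-speed ports. Uplinks are rounded up so the achieved
    downlink:uplink ratio never exceeds the target.
    """
    uplinks = math.ceil(total_ports / (1.0 + oversubscription))
    downlinks = total_ports - uplinks
    return downlinks, uplinks

# Hypothetical 64-port leaf at the ratios discussed in the text.
for ratio in (1, 2, 3):
    down, up = split_leaf_ports(64, ratio)
    print(f"{ratio}:1 target -> {down} downlinks, {up} uplinks (actual {down / up:.2f}:1)")
```

Moving from 1:1 to 2:1 frees about a third of the leaf's uplink ports, and with them the spine ports and optics behind those uplinks, which is where the back-end savings for inference come from.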

Job-completion-time as the north-star

Every networking decision in Part 8 ultimately answers to one metric, and it is not bandwidth, latency, or port count — it is job-completion-time (for training) and its inference twin, tokens-per-second-per-dollar at the SLO. JCT is the wall-clock from job start to a trained model, and it folds in everything the spec sheets hide: the tail, the congestion, the goodput, the failures and restarts. A fabric is good if and only if it lowers JCT per dollar of capital and per watt of power — and a fabric can have best-in-class link rate, best-in-class median latency, and still lose on JCT because its tail or its reliability is poor.

Reframing networking around JCT changes what you optimize. You stop chasing peak bisection bandwidth as a vanity number and start chasing the combination that minimizes wall-clock: tight tail, high goodput, and fast recovery from the inevitable link and switch failures (network switch and cable faults accounted for ~8.4% of interruptions in the Llama 3 training run). It also re-prioritizes all of Part 8: congestion control, in-network reduction, adaptive routing, and topology matter because they move JCT, not because they look impressive in a bake-off. This is the same shift from availability to goodput that the reliability chapter makes for the facility: the question is never 'is the link up?' but 'how much useful training did the fabric deliver this week?'. → goodput vs availability in Chapter 12.2; checkpoint/restart math behind failure recovery in Chapter 9.4.
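The link between goodput and JCT is a single division. The sketch below treats goodput as the fraction of wall-clock spent on effective training, as in the key numbers, and uses a hypothetical 90-day ideal run; `job_completion_days` is an illustrative name.

```python
def job_completion_days(ideal_compute_days: float, goodput_fraction: float) -> float:
    """Wall-clock days to finish a job whose ideal (zero-loss) runtime is given.

    goodput_fraction is the share of wall-clock spent on effective training.
    """
    return ideal_compute_days / goodput_fraction

IDEAL_DAYS = 90.0  # hypothetical run length at perfect goodput

average = job_completion_days(IDEAL_DAYS, 0.90)        # ~industry average
best_in_class = job_completion_days(IDEAL_DAYS, 0.96)  # ~best in class

print(f"90% goodput: {average:.2f} days")
print(f"96% goodput: {best_in_class:.2f} days")
print(f"difference:  {average - best_in_class:.2f} days of cluster time")
```

For a run of this hypothetical length, the six-point goodput difference is worth about six days of the whole cluster's time.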

Deep dive: why the scale-up / scale-out boundary keeps moving

The three-network model is not a fixed partition — its internal boundaries migrate with each hardware generation, and that movement is one of the most important dynamics in AI infrastructure. The pressure is bidirectional. Pushing the scale-up boundary outward: every generation, NVIDIA (and AMD with UALink, hyperscalers with their own fabrics) enlarges the scale-up domain — 8 GPUs in HGX, 72 in NVL72, a roadmapped 576 in Rubin Ultra — because keeping TP and EP traffic on the 1.8 TB/s memory-semantic fabric instead of the ~10x-slower packet fabric is the single biggest lever on model-FLOPS-utilization for large dense and MoE models. Bigger domains raise the tensor- and expert-parallel ceilings and widen the all-to-all that MoE inference depends on.

Pulling it back: the copper reach wall. NVLink at multi-terabit-per-second rates over copper reaches only a meter or two; once a domain spans multiple racks, copper cannot reach and the domain must go optical, which is more expensive, more power-hungry, and (until co-packaged optics mature) less reliable. So the scale-up boundary sits exactly where the bandwidth benefit of a larger domain still outweighs the cost, power, and blast-radius penalty of crossing into optics. Meanwhile a brand-new boundary opened at the top, scale-across, because the grid cannot energize a single site large enough for a frontier run, forcing multi-campus training over a WAN that did not exist as a training tier three years ago. The practical consequence for a facility designer: provision for the boundary to move. A hall plumbed and powered only for today's domain size strands the option to adopt the next generation's larger one. → copper-vs-optics in the domain in Chapter 8.2; the roadmap in Chapter 16.2; scale-across in Chapter 8.8.

Deep dive: characterizing traffic before you size the fabric

The recurring mistake is sizing a fabric from peak link rate and rack count instead of from measured traffic — which over-builds where it does not matter and under-builds where it does. A defensible fabric design starts from three traffic-characterization artifacts, derived from the workload's collective profile rather than a vendor reference diagram.

Collective profile. Which collectives the job runs (all-reduce, all-gather, all-to-all, point-to-point), at what frequency, and on which parallelism dimension — because that is what tells you how much traffic wants scale-up versus scale-out, and therefore the domain size and the blocking factor. An MoE model with heavy all-to-all and a dense model doing ring all-reduce produce completely different fabric demands at the same GPU count.
Message-size histogram. The distribution of message sizes, because small-message latency and large-message bandwidth are served by different switch and protocol choices, and a fabric tuned for one is mediocre at the other. Gradient all-reduce is large-message bandwidth-bound; pipeline activations and control traffic are latency-bound.
Incast and tail budget. The expected incast ratio at the collective and the p99/p99.9 tail-latency budget the BSP barrier can tolerate — because that, not average bandwidth, sizes buffers, picks the congestion-control regime, and decides whether you need deep-buffer switches, adaptive routing, or in-network reduction.

These artifacts are the network analogue of the workload profile sheet that scopes the whole facility. Produce them before you count switches and optics, and the sizing in Chapter 8.5 becomes a derivation rather than a guess.
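The artifacts can be derived from a trace of collective calls with very little code. This is a sketch over a made-up record format: `CollectiveRecord` and its fields are hypothetical, not the output schema of any real profiler, and the tail budget itself comes from measured completion times rather than from this trace.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class CollectiveRecord:
    kind: str           # e.g. "all_reduce", "all_to_all", "send_recv"
    dimension: str      # parallelism dimension: "TP", "EP", "PP", "DP"
    message_bytes: int  # payload size, must be >= 1
    fan_in: int         # senders converging on one receiver

def collective_profile(records):
    """Count calls per (parallelism dimension, collective kind)."""
    return Counter((r.dimension, r.kind) for r in records)

def message_size_histogram(records):
    """Bucket by power of two: key k covers [2**k, 2**(k+1)) bytes."""
    return Counter(r.message_bytes.bit_length() - 1 for r in records)

def max_incast(records):
    """Worst-case many-to-one ratio seen in the trace."""
    return max(r.fan_in for r in records)

# Hypothetical trace for one training step.
trace = [
    CollectiveRecord("all_reduce", "DP", 64 * 2**20, 1),
    CollectiveRecord("all_to_all", "EP", 2**20, 63),
    CollectiveRecord("all_to_all", "EP", 2**20 + 512, 63),
    CollectiveRecord("send_recv", "PP", 4096, 1),
]

profile = collective_profile(trace)
histogram = message_size_histogram(trace)
incast = max_incast(trace)
print(profile, histogram, incast, sep="\n")
```

Run over a real trace, the first counter tells you how traffic splits between tiers, the second which message-size regime dominates, and the third how hard the incast points will be hit.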

This chapter sets the foundation the rest of Part 8 builds on. The scale-up fabric and the moving domain boundary are engineered in Chapter 8.2; the switch, NIC, and DPU silicon in Chapter 8.3; the scale-out protocol war (InfiniBand vs RoCE vs Spectrum-X vs Ultra Ethernet) in Chapter 8.4; topology, sizing, and the oversubscription decision in Chapter 8.5; congestion control, adaptive routing, and in-network compute — the tail-fighting toolkit — in Chapter 8.6; and scale-across multi-campus training in Chapter 8.8. The training/inference fork that this chapter reads from the network is scoped facility-wide in Chapter 1.2 and Chapter 1.3; the goodput-over-availability reframing in Chapter 12.2; the failure-and-restart math behind JCT in Chapter 9.4; and the metric definitions (goodput, MFU, JCT, $/M-tokens) in Chapter 0.3.