Chapter 8.1
Network Fundamentals & AI Traffic Characterization
In a synchronous AI cluster the network is not plumbing between computers — it is the part of the computer that decides how fast every other part is allowed to run, because the whole job moves at the speed of its slowest collective and stalls on its longest tail.
What you'll decide here
- Where you draw the scale-up / scale-out / scale-across boundaries — because that single partition decides which traffic rides a 1.8 TB/s memory-semantic fabric, which rides a 400–800 Gb/s packet fabric, and which crawls across a WAN, and it is the upstream choice every other networking chapter inherits.
- Which parallelism strategy (TP/EP vs PP/DP) maps onto which network tier — get the mapping wrong and you push an all-reduce that wanted 1.8 TB/s of NVLink onto a 400 Gb/s NIC and watch model-FLOPS-utilization collapse.
- Whether your design target is raw link rate or delivered goodput — the two diverge under tail latency, congestion, and the BSP barrier, and only one of them shows up in the training run's wall-clock time.
- Whether you are building a training fabric (tightly-coupled, bandwidth-and-tail-bound, optimized for job-completion-time) or an inference fabric (loosely-coupled, latency-SLO-bound, optimized for tokens-per-second-per-dollar) — they characterize traffic differently and pull the design in opposite directions.
- Which traffic-characterization artifacts (collective profile, message-size histogram, incast/tail budget) must exist before you size switches, optics, and buffers — because a fabric sized from peak link rate instead of measured traffic is over-built where it does not matter and under-built where it does.
In a single-GPU world the network is an afterthought, a way to feed data in and ship results out. In a frontier AI cluster the interconnect is a first-class component of the computer, and for the largest synchronous jobs it is the component that gates all the others. The reason is structural. A modern training run is one program spread across tens of thousands of accelerators that must agree, repeatedly and synchronously, on the value of a shared set of gradients. That agreement is a collective communication — an all-reduce, all-gather, or reduce-scatter — and until it completes, every GPU that finished its share of the work sits idle at a barrier waiting for the slowest participant. The network is not transporting bytes between computations; it is sequencing the computation itself. Misjudge it and you do not get a slow network. You get expensive accelerators running at a fraction of their FLOPS because they spend the wall-clock waiting on each other.
This chapter is the foundation for all of Part 8. It establishes the three-network model (scale-up, scale-out, scale-across) and why its internal boundaries are moving; the collective primitives and how parallelism strategies map onto each network tier; the tyranny of the tail and the BSP barrier semantics that turn a one-in-a-thousand slow link into a fleet-wide stall; and the divergence between raw link rate and delivered goodput that decides whether a fabric earns its capital. We close on the two traffic profiles — training versus inference — and on job-completion-time as the north-star metric that reframes every downstream networking decision. Each fork below names what choosing wrong costs you in goodput, dollars, or stranded megawatts.
The three-network model
An AI cluster is not one network. It is a hierarchy of at least three, each with a different bandwidth, latency, reach, and cost-per-bit, and the single most consequential network-design decision is where you draw the boundaries between them. The hierarchy spans roughly three orders of magnitude in per-GPU bandwidth and four in latency, and traffic that lands on the wrong tier is either ruinously expensive or ruinously slow.
Scale-up is the memory-semantic fabric inside a tightly-coupled domain — historically a node, now a rack-scale domain of 72 GPUs (NVL72) heading toward 576 (Rubin Ultra). It is the fastest and most expensive tier: NVLink 5 delivers ~1.8 TB/s per GPU, an NVL72 rack carries ~130 TB/s of aggregate bisection, and the semantics are load/store — a GPU reads another GPU's memory almost as if it were local. This is where you want tensor-parallel and expert-parallel traffic to live, because those collectives are bandwidth-hungry on every layer. → Chapter 8.2.
Scale-out is the packet-switched fabric that stitches domains into a cluster of tens of thousands of GPUs. It is roughly an order of magnitude slower per GPU — a 400 Gb/s NIC is ~1/18th of NVLink-5 per-GPU bandwidth, 800 Gb/s closes some of that gap — and an order of magnitude cheaper. The semantics are message-passing (RDMA over InfiniBand or RoCE/Ethernet), and this is the natural home for data-parallel gradient all-reduce and pipeline-parallel point-to-point. → Chapter 8.4 and Chapter 8.5.
Scale-across is the newest tier and a creature of the power-bound era: the WAN/DCI fabric that couples multiple campuses when no single site can be energized at the scale a frontier run demands. Latency jumps to milliseconds, bandwidth-per-GPU collapses, and the algorithms must change to tolerate it (asynchronous, hierarchical, or local-SGD methods, gradient compression). Three years ago this tier did not exist in production; in 2025–2026 it became the defining frontier-lab architecture because the grid, not the GPU, is the binding constraint. → Chapter 8.8; the power-bound rationale in Chapter 16.1.
Collective primitives and the parallelism mapping
Distributed training is built from a small vocabulary of collective communication primitives, and almost all the traffic in a cluster is one of them. All-reduce (sum a tensor across all ranks, return the result to all) is the workhorse of data-parallel gradient synchronization. All-gather and reduce-scatter (the two halves of which a ring all-reduce is composed) move sharded weights and activations under tensor and sharded-data parallelism. All-to-all is the signature of mixture-of-experts: every token is routed to its chosen experts, so every rank exchanges with every other rank — the most punishing pattern for a fabric because it stresses bisection bandwidth uniformly. Point-to-point sends move activations forward and gradients backward across a pipeline stage.
The engineering content is the mapping: which parallelism dimension generates which collective, and therefore which network tier it must ride. Get the mapping right and each collective lands on a fabric sized for it; get it wrong and you push a bandwidth-bound collective onto a tier that cannot feed it, and the job stalls. The four parallelism dimensions are nested precisely so they can be mapped onto the bandwidth hierarchy.
| Parallelism | Dominant collective | Frequency | Target tier | Why it lands there |
|---|---|---|---|---|
| Tensor (TP) | All-gather / reduce-scatter (per layer) | Every layer, every step | Scale-up (NVLink) | Highest-frequency, bandwidth-bound; only the 1.8 TB/s fabric keeps it off the critical path |
| Expert (EP, MoE) | All-to-all (token routing) | Every MoE layer | Scale-up; spills to scale-out as EP degree grows | Bisection-stressing; wider scale-up domains raise the EP ceiling before it spills |
| Pipeline (PP) | Point-to-point (activations / gradients) | Per micro-batch boundary | Scale-out (intra-cluster) | Sparse, latency-tolerant; cheap packet fabric is sufficient |
| Data (DP / FSDP) | All-reduce (gradients); all-gather (params) | Once per step (or sharded) | Scale-out; scale-across at multi-campus scale | Large but infrequent; tolerant of oversubscription and, with the right algorithm, WAN latency |
The table is a placement rule. The design move is to keep the leftmost, highest-frequency rows (TP, EP) inside the scale-up domain where 1.8 TB/s of bandwidth makes the collective nearly free, and to let the lower-frequency rows (PP, DP) spill onto the cheaper scale-out fabric where their infrequency tolerates lower bandwidth and even oversubscription. This is exactly why the scale-up domain size is the master fork: every GPU you can add to the domain is a chunk of TP/EP traffic you keep off the slow fabric. It is also why mixture-of-experts inference reshaped fabric design — wide expert parallelism wants the largest scale-up domain it can get so the all-to-all stays on NVLink. → MoE inference and EP degree in Chapter 8.2.
Tail-latency tyranny and the BSP barrier
Here is the single fact that makes AI networking different from every other kind of data-center networking: a synchronous training step is a bulk-synchronous-parallel (BSP) computation, and a BSP barrier runs at the speed of its slowest participant. When ten thousand GPUs all-reduce a gradient, the collective is not done when the median link finishes — it is done when the last one does. Average bandwidth is almost irrelevant; what governs wall-clock is the tail. A fabric with superb median latency and a fat 99.9th-percentile tail will deliver worse training throughput than a fabric with mediocre median latency and a tight tail. This is the tyranny of the tail, and it inverts the intuition every web-scale network engineer brings to the problem.
The mechanism that produces a fat tail is incast and the congestion it triggers. In a rail-optimized fat-tree, an all-reduce produces synchronized, many-to-one traffic — many senders hammering one receiver at the same instant — which overruns switch buffers, triggers flow control (PFC) or packet drops, and stretches the tail. One slow link, one congested switch, one mis-tuned congestion-control loop, and the barrier that should have closed in microseconds drags for milliseconds, multiplied across every step of a months-long run. Public training analyses bear this out: lifting a fabric from a congested regime to a well-engineered one can cut GPU idle time from above ~33% to below ~15%, and at the 175B-parameter scale a rise in effective latency from ~10 µs to ~1000 µs, or 0.1% packet loss, each shaves roughly 10–13% off the GPU compute-time fraction (industry training-network analyses, 2025). Those are not rounding errors — they are the difference between a profitable cluster and a stranded one.
Goodput vs raw link rate
A 400 Gb/s link does not deliver 400 Gb/s of useful gradient synchronization, and the gap between the two is where AI fabrics are won and lost. Raw link rate is the SerDes specification. Goodput is the fraction of that rate that actually advances the job after you subtract protocol overhead, congestion losses, retransmissions, load-balancing inefficiency, and time lost waiting at the BSP barrier. The discipline is to design and procure against goodput, because raw link rate is the number on the invoice and goodput is the number in the training run's wall-clock.
The protocol choice is the cleanest illustration. InfiniBand delivers ~98–99% effective throughput at ~1–2 µs latency because it was purpose-built lossless and in-order. Untuned RoCEv2 over Ethernet can sit at 5–10 µs with a long tail and effective throughput far below line rate, because RoCE's go-back-N retransmission discards everything after an out-of-order packet and retransmits from there. The 2025–2026 answer — NVIDIA Spectrum-X and, generically, Ultra Ethernet — reclaims that gap with packet spray, direct data placement, and hardware reordering, landing at ~95% effective throughput, near InfiniBand, on commodity Ethernet silicon. Meta famously tuned RoCE to match InfiniBand and trained its largest models on it. The lesson is not about which protocol wins. It is that the delivered-throughput gap between a tuned and an untuned fabric is larger than the gap between any two link generations, so tuning and topology, not link rate, are where the goodput is. → the protocol war in Chapter 8.4.
Training vs inference traffic profiles
The two dominant workloads characterize traffic so differently that they want different fabrics. Training traffic is synchronized, bursty-at-the-barrier, bisection-hungry, and east-west: the cluster is silent during compute, then every GPU floods the fabric at once during the collective, then falls silent again. It is governed by job-completion-time, tolerant of moderate latency on the data-parallel all-reduce but utterly intolerant of a fat tail, and it demands a 1:1 non-blocking back-end because oversubscribing the all-reduce starves it. Inference traffic is the opposite: many independent requests, each fitting inside a node or small scale-up domain, governed by a latency SLO (time-to-first-token, time-per-output-token) rather than a global barrier. It is loosely coupled, so the scale-out fabric can be oversubscribed 2:1–3:1 — saving ~31% of back-end cost — and the saved capital goes to geo-distribution and uptime instead.
The 2025–2026 twist is that inference traffic is no longer trivial. Reasoning models emit long decode sequences, mixture-of-experts routing creates real all-to-all traffic, and prefill/decode disaggregation moves the KV-cache across the fabric — so inference, once a single-node afterthought, now exercises the scale-up domain hard. The characterization still differs from training (no global synchronous barrier, latency-SLO-bound), but the gap is narrowing, and a fabric scoped for last-generation single-node inference can be undersized for MoE reasoning serving. → inference fabric demands in Chapter 1.3; MoE all-to-all in Chapter 8.2.
| Property | Training fabric | Inference fabric |
|---|---|---|
| Coupling | Tight — synchronous BSP barrier every step | Loose — request fits a node / small scale-up domain |
| Traffic shape | Bursty east-west; silent-then-flood at the collective | Steady-ish many-to-one request/response; bursty arrivals |
| Governing metric | Job-completion-time (goodput / MFU) | Latency SLO (TTFT, TPOT) at target throughput |
| Tail tolerance | None — the barrier waits for the slowest link | Bounded — tail shows up as p99 latency, not a global stall |
| Oversubscription | 1:1 non-blocking on the back-end | 2:1–3:1 acceptable (~31% cheaper back-end) |
| Siting consequence | Power-first; can be far from users | Latency-first; geo-distributed near users |
Job-completion-time as the north-star
Every networking decision in Part 8 ultimately answers to one metric, and it is not bandwidth, latency, or port count — it is job-completion-time (for training) and its inference twin, tokens-per-second-per-dollar at the SLO. JCT is the wall-clock from job start to a trained model, and it folds in everything the spec sheets hide: the tail, the congestion, the goodput, the failures and restarts. A fabric is good if and only if it lowers JCT per dollar of capital and per watt of power — and a fabric can have best-in-class link rate, best-in-class median latency, and still lose on JCT because its tail or its reliability is poor.
Reframing networking around JCT changes what you optimize. You stop chasing peak bisection bandwidth as a vanity number and start chasing the combination that minimizes wall-clock: tight tail, high goodput, fast recovery from the inevitable link and switch failures (which the Llama-3 run attributed ~8.4% of interruptions to). It also re-prioritizes the whole part — congestion control, in-network reduction, adaptive routing, and topology matter because they move JCT, not because they look impressive in a bake-off. This is the same shift from availability to goodput that the reliability chapter makes for the facility: the question is never 'is the link up?' but 'how much useful training did the fabric deliver this week?'. → goodput vs availability in Chapter 12.2; checkpoint/restart math behind failure recovery in Chapter 9.4.
Deep dive: why the scale-up / scale-out boundary keeps moving
The three-network model is not a fixed partition — its internal boundaries migrate with each hardware generation, and that movement is one of the most important dynamics in AI infrastructure. The pressure is bidirectional. Pushing the scale-up boundary outward: every generation, NVIDIA (and AMD with UALink, hyperscalers with their own fabrics) enlarges the scale-up domain — 8 GPUs in HGX, 72 in NVL72, a roadmapped 576 in Rubin Ultra — because keeping TP and EP traffic on the 1.8 TB/s memory-semantic fabric instead of the ~10x-slower packet fabric is the single biggest lever on model-FLOPS-utilization for large dense and MoE models. Bigger domains raise the tensor- and expert-parallel ceilings and widen the all-to-all that MoE inference depends on.
Pulling it back: the copper reach wall. NVLink at multi-terabit-per-second over copper is good for under a meter or two; once a domain spans multiple racks, copper cannot reach and the domain must go optical — which is more expensive, more power-hungry, and (until co-packaged optics mature) less reliable. So the scale-up boundary sits exactly where the bandwidth benefit of a larger domain still outweighs the cost, power, and blast-radius penalty of crossing into optics. Meanwhile a brand-new boundary opened at the top — scale-across — because the grid cannot energize a single site large enough for a frontier run, forcing multi-campus training over a WAN that did not exist as a training tier three years ago. The practical consequence for a facility designer: provision for the boundary to move. A hall plumbed and powered only for today's domain size strands the option to adopt the next generation's larger one. → copper-vs-optics in the domain in Chapter 8.2; the roadmap in Chapter 16.2; scale-across in Chapter 8.8.
Deep dive: characterizing traffic before you size the fabric
The recurring mistake is sizing a fabric from peak link rate and rack count instead of from measured traffic — which over-builds where it does not matter and under-builds where it does. A defensible fabric design starts from three traffic-characterization artifacts, derived from the workload's collective profile rather than a vendor reference diagram.
- Collective profile. Which collectives the job runs (all-reduce, all-gather, all-to-all, point-to-point), at what frequency, and on which parallelism dimension — because that is what tells you how much traffic wants scale-up versus scale-out, and therefore the domain size and the blocking factor. An MoE model with heavy all-to-all and a dense model doing ring all-reduce produce completely different fabric demands at the same GPU count.
- Message-size histogram. The distribution of message sizes, because small-message latency and large-message bandwidth are served by different switch and protocol choices, and a fabric tuned for one is mediocre at the other. Gradient all-reduce is large-message bandwidth-bound; pipeline activations and control traffic are latency-bound.
- Incast and tail budget. The expected incast ratio at the collective and the p99/p99.9 tail-latency budget the BSP barrier can tolerate — because that, not average bandwidth, sizes buffers, picks the congestion-control regime, and decides whether you need deep-buffer switches, adaptive routing, or in-network reduction.
These artifacts are the network analogue of the workload profile sheet that scopes the whole facility. Produce them before you count switches and optics, and the sizing in Chapter 8.5 becomes a derivation rather than a guess.