Chapter 8.3
Network Silicon: Switch ASICs, NICs & DPUs
The switch ASIC, the NIC, and the DPU are three silicon decisions that set the ceiling on every fabric you can build on top of them — pick the SerDes generation, the buffer architecture, and the offload engine before you draw a topology, because the topology is downstream of all three.
What you'll decide here
- Which switch ASIC family you standardize on — merchant high-radix (Broadcom Tomahawk), merchant deep-buffer (Broadcom Jericho), or captive vertically-integrated (NVIDIA Quantum InfiniBand / Spectrum Ethernet) — because that choice sets your radix, your buffering philosophy, and your degree of vendor lock-in for the life of the cluster.
- Whether your back-end NIC is a plain RoCE/IB NIC or a SuperNIC with full transport offload — and whether the host even needs a DPU, or whether a NIC suffices, because the DPU is a per-server tax you justify with storage, security, and multi-tenant isolation, not with raw bandwidth.
- Shallow shared-buffer vs deep-buffer VOQ at each tier of the fabric: shallow where reach is short and you control congestion with PFC/ECN, deep where you cross buildings or oversubscribe and need to absorb bursts without dropping.
- Which functions you push off the host CPU and onto the DPU (storage initiator, encryption, the VPC overlay, the security policy plane) versus leaving on x86 — every offloaded function frees host cores for the workload but adds a second control plane to operate and patch.
- Whether the SerDes generation on your chosen ASIC (112G vs 224G vs the 448G horizon) is current enough to carry you through one full GPU refresh, because the SerDes lane rate, not the marketed aggregate Tbps, is the spec that actually gates port speed, reach, and the copper-vs-optics break.
The fabric chapters that bracket this one — scale-up in Chapter 8.2, scale-out protocols in Chapter 8.4, topology in Chapter 8.5 — all assume a set of silicon building blocks and reason about how to wire them together. This chapter is about the blocks themselves: the three classes of programmable silicon that the entire AI network is assembled from. The switch ASIC moves packets between ports. The NIC (and its beefed-up cousin, the SuperNIC) connects a server's accelerators to the wire and runs the RDMA transport. The DPU/IPU is a NIC with a CPU complex bolted on, sitting in the data path to offload storage, security, and virtualization from the host. Get the silicon selection right and the fabric design that follows is a series of well-posed wiring problems. Get it wrong — under-spec the SerDes, mis-match the buffer architecture to your reach, buy a DPU you have no offload for — and you have baked a ceiling into the cluster that no topology cleverness can lift.
The forks in this chapter are unusually consequential because they are early and sticky. The switch ASIC family decides your radix and your lock-in posture before a single cable is run. The SerDes generation on that ASIC decides, for the next several years, how many ports you get, how far copper reaches, and when you are forced onto optics. And the DPU decision — to deploy one at all, and what to run on it — is a per-server line item across thousands of servers that pays back only if you actually move work onto it.
The gating spec: SerDes generation, not aggregate Tbps
Marketing leads with the headline number ("102.4 Tbps switch," "800G NIC"), but the spec that actually gates everything downstream is the per-lane SerDes rate. A switch's aggregate bandwidth is just lanes times lane rate; the architecturally interesting questions (how many ports you get, how fast each one runs, how far copper reaches before you must pay for optics) are all decided by the lane. The SerDes ladder is the master clock of the entire networking industry: 50G → 100G → 200G → 400G of payload per lane, with each doubling arriving roughly every two to three years. The same rungs are often quoted by raw signaling rate as 56G, 112G, 224G, and 448G, so the step from 100G to 200G per lane is the 112G-to-224G SerDes generation, and that step is the 2025–2026 inflection (SemiAnalysis, AI networks, 2025).
Why does the lane matter more than the aggregate? Because a fixed aggregate can be sliced many ways, and the slicing is what you live with. Broadcom's Tomahawk 6 delivers 102.4 Tbps on 224G SerDes — that is 512 lanes, which you can present as 64×1.6T, 128×800G, or 256×400G ports (Broadcom, 2025). The same aggregate on an older 112G generation would need twice the lanes to hit the same port speed, blowing up package size, power, and cost. The lane rate also sets the copper cliff: at 224G, passive DAC reaches barely ~1m and active copper (AEC) ~2–3m, which is why the 224G generation forces optics out of the rack sooner and makes the co-packaged-optics transition urgent rather than optional. When you evaluate a switch or NIC, read past the aggregate to the lane — it tells you the radix you can build, the reach you get, and how close you are to the next forced optics upgrade. The physical-layer consequences are engineered in Chapter 8.2 (copper reach inside the domain) and the protocol layer in Chapter 8.4.
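The slicing arithmetic above can be checked in a few lines. This is a minimal sketch, not a vendor tool: the function names are hypothetical, and it assumes each 224G-class lane carries 200 Gb/s of payload and each 112G-class lane carries 100 Gb/s.

```python
# Illustrative arithmetic only. Assumption: a "224G" SerDes lane carries
# 200 Gb/s of payload and a "112G" lane carries 100 Gb/s.

def lane_count(aggregate_gbps: int, lane_payload_gbps: int) -> int:
    """Number of SerDes lanes needed to reach an aggregate bandwidth."""
    lanes, remainder = divmod(aggregate_gbps, lane_payload_gbps)
    if remainder:
        raise ValueError("aggregate is not a whole number of lanes")
    return lanes


def port_options(aggregate_gbps, lane_payload_gbps, port_speeds_gbps):
    """Map each port speed to (port count, lanes per port)."""
    lanes = lane_count(aggregate_gbps, lane_payload_gbps)
    options = {}
    for speed in port_speeds_gbps:
        lanes_per_port, remainder = divmod(speed, lane_payload_gbps)
        if lanes_per_port == 0 or remainder:
            continue  # this port speed cannot be built from whole lanes
        options[speed] = (lanes // lanes_per_port, lanes_per_port)
    return options


AGGREGATE_GBPS = 102_400  # 102.4 Tbps
SPEEDS = [1600, 800, 400]

current_gen = port_options(AGGREGATE_GBPS, 200, SPEEDS)
older_gen = port_options(AGGREGATE_GBPS, 100, SPEEDS)

for speed in SPEEDS:
    ports, lanes_per_port = current_gen[speed]
    _, older_lanes_per_port = older_gen[speed]
    print(f"{ports} x {speed}G: {lanes_per_port} lanes/port at 200G, "
          f"{older_lanes_per_port} lanes/port at 100G")
```

The port counts are identical across the two generations because the aggregate is held fixed; what changes is the lane count behind each port (512 lanes in total versus 1,024), which is where the package size, power, and cost penalty of the older generation comes from.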
Switch ASIC families: the three-way fork
There are effectively three families of AI switch silicon in 2026, and choosing among them is the most consequential network-silicon decision you make. They differ not just on speeds and feeds but on buffering philosophy and business model — and those two axes, more than raw bandwidth, are what you live with.
Broadcom Tomahawk is the merchant high-radix, shallow-shared-buffer line: the workhorse leaf/spine ASIC for Ethernet AI fabrics. Tomahawk 6 ships at 102.4 Tbps on 224G SerDes with on-chip cognitive routing and native co-packaged-optics options, targeting 100k–1M-XPU Ethernet fabrics (Broadcom, 2025). Its design point is maximum radix and minimum per-bit latency, with a relatively small on-chip buffer — it assumes you control congestion at the endpoints and with PFC/ECN, not by storing bytes in the switch.
Broadcom Jericho is the merchant deep-buffer, VOQ line: a routing-class ASIC built for the cases Tomahawk is deliberately not. Jericho4 ships at 51.2 Tbps on TSMC 3nm with HBM-attached deep packet buffers — up to ~160× more buffering than on-chip memory — line-rate MACsec, and a scale-across reach of over 100 km, combining four 800G ports into 3.2 Tbps "HyperPorts" and scaling to 36,000 of them (Broadcom / The Next Platform, 2025). Jericho exists to absorb bursts and cross buildings without dropping packets; it is the silicon behind the scale-across DCI fabric of Chapter 8.4.
NVIDIA Quantum (InfiniBand) and Spectrum (Ethernet) are the captive, vertically-integrated families. Quantum-X800 delivers 800G/port InfiniBand with SHARPv4 in-network reduction offload baked into the switch; Spectrum-X is the Ethernet line that, paired with ConnectX/BlueField at the endpoints, reaches ~95% effective throughput on production fabrics at xAI Colossus's 100k–200k-GPU scale (NVIDIA, 2025). The value proposition is a co-designed switch+NIC system that closes most of the gap to InfiniBand on Ethernet — at the cost of buying both ends from one vendor. The merchant-vs-captive business-model framing — who captures the margin, who controls the roadmap — is developed in Chapter 7.1.
| Family | Class | Headline silicon (2026) | Buffer architecture | Where it wins | Lock-in posture |
|---|---|---|---|---|---|
| Broadcom Tomahawk | Merchant, high-radix leaf/spine | Tomahawk 6 — 102.4 Tbps, 224G SerDes, native CPO option | Shallow shared on-chip; congestion managed at endpoints | Scale-out Clos/fat-tree, short reach, max radix, lowest $/Gb | Low — open NOS (SONiC/FBOSS), multi-vendor optics |
| Broadcom Jericho | Merchant, deep-buffer routing/fabric | Jericho4 — 51.2 Tbps, 3nm, HBM packet memory, >100 km | Deep HBM-attached VOQ; ~160× on-chip buffering | Scale-across DCI, heavy oversubscription, lossless over distance | Low — merchant, but pairs with Broadcom fabric elements |
| NVIDIA Quantum | Captive, InfiniBand | Quantum-X800 — 800G/port, SHARPv4 in-network reduction | Credit-based lossless; in-network collective offload | Reduction-heavy synchronous training; lowest tuned latency | High — IB end-to-end; SHARP needs NVIDIA NICs |
| NVIDIA Spectrum | Captive, Ethernet (Spectrum-X) | Spectrum-X / Photonics — 102.4–409.6 Tbps, CPO 2H 2026 | Shallow + adaptive routing; co-designed with ConnectX/BlueField | Ethernet AI at hyperscale (~95% effective) with packet spray | Medium-high — best with NVIDIA NIC/DPU at the endpoints |
Shallow-shared vs deep-buffer VOQ: the buffering tradeoff
The split between Tomahawk and Jericho is the cleanest example in networking of a real architectural fork with no free lunch, so it deserves its own treatment. The question is: where do you store a packet that arrives faster than its egress port can drain? Two answers, two silicon philosophies.
Shallow shared-buffer ASICs (Tomahawk-class) keep a small, fast pool of on-chip SRAM shared across all ports. The bet is that with short reach, a non-blocking topology, and good endpoint congestion control (PFC/ECN/DCQCN, adaptive routing, packet spray — the machinery of Chapter 8.6), bursts are absorbed at the source and the switch never needs to hold much. The payoff is lowest latency, highest radix, and lowest power-per-bit. The risk: when a real incast burst exceeds the shallow buffer, you must either drop (lossy) or assert backpressure (PFC), and PFC at scale brings head-of-line blocking and deadlock risk. Shallow buffering only works if the congestion-control loop is fast enough to keep the buffer from filling.
Deep-buffer VOQ ASICs (Jericho-class) attach large off-chip memory — HBM on Jericho4 — and organize it as virtual output queues: a separate logical queue per egress destination, so a congested port cannot head-of-line-block traffic bound elsewhere. The payoff is the ability to absorb enormous bursts and to run lossless over long reach — the only way to carry RoCE across >100 km of DCI without dropping. The cost is real: added latency (a packet may sit in deep buffer), higher power and die area (HBM is not free), and higher $/port. You do not want deep buffers on a short-reach leaf where they add latency you never needed; you do want them at the fabric edge that crosses buildings.
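The head-of-line-blocking argument for virtual output queues can be made concrete with a toy model. The sketch below is a deliberate simplification and not how any real ASIC schedules traffic: all packets are assumed to be queued before the first tick, egress port "A" is assumed to be congested for the whole run, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def run_fifo(arrivals, egress_ready, ticks):
    """Single shared FIFO: only the head packet may leave on each tick."""
    queue = deque(arrivals)
    delivered = defaultdict(int)
    for _ in range(ticks):
        if queue and egress_ready[queue[0]]:
            delivered[queue.popleft()] += 1
        # If the head packet's egress port is blocked, everything behind it waits.
    return dict(delivered)


def run_voq(arrivals, egress_ready, ticks):
    """One virtual output queue per egress port; each ready port drains one packet per tick."""
    voqs = defaultdict(deque)
    for dest in arrivals:
        voqs[dest].append(dest)
    delivered = defaultdict(int)
    for _ in range(ticks):
        for dest, queue in voqs.items():
            if queue and egress_ready[dest]:
                queue.popleft()
                delivered[dest] += 1
    return dict(delivered)


# One packet for the congested port "A" arrives ahead of three packets for idle port "B".
arrivals = ["A", "B", "B", "B"]
egress_ready = {"A": False, "B": True}

fifo_result = run_fifo(arrivals, egress_ready, ticks=4)
voq_result = run_voq(arrivals, egress_ready, ticks=4)

print("FIFO delivered:", fifo_result)
print("VOQ delivered: ", voq_result)
```

In the FIFO case the blocked packet for "A" sits at the head and nothing is delivered; with per-destination queues the three packets for "B" drain regardless. The model says nothing about buffer depth, which is the other half of the Jericho-class design.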
NICs and SuperNICs: the RoCE/IB offload path
The NIC is where the network meets the accelerator, and in AI fabrics it does far more than push frames. The defining feature is RDMA — remote direct memory access — which lets a GPU on one node read or write a GPU's memory on another node without involving either host CPU, the foundation that makes collective communication tolerable at scale. Two transports carry RDMA: native InfiniBand (NVIDIA ConnectX in IB mode) and RoCEv2 (RDMA over Converged Ethernet), which runs the same verbs over a routable Ethernet/UDP underlay. The NIC implements the transport in hardware; the quality of that implementation — how it handles congestion, retransmission, and packet reordering — is a large part of why one fabric reaches 95% effective throughput and another wastes a third of its bisection.
The 2026 NIC provisioning rule for back-end fabrics is roughly one SuperNIC per GPU: an 8-GPU server carries 8×400G or 8×800G back-end ports (3.2–6.4 Tb/s/node), plus a separate, smaller NIC for the front-end/storage/management plane. The term SuperNIC denotes the AI-optimized variant: full transport offload, hardware support for the adaptive routing, packet spray, and out-of-order reassembly that Ultra Ethernet and Spectrum-X require, and line-rate congestion handling. ConnectX-8 carries the current generation; NVIDIA's ConnectX-9, paired with the Rubin platform, doubles bandwidth to 800 Gb/s RDMA (NVIDIA / ServeTheHome, 2025). Packet spray is also what re-introduces a hint of vendor coupling: spraying packets across all paths only works if the receiving NIC can reassemble out-of-order delivery in hardware, so the switch and NIC must agree on the scheme, which is why Spectrum-X and UEC are switch+NIC systems, not just switches. The transport semantics that ride on top (lossless vs lossy, in-order vs out-of-order) are the subject of Chapter 8.4.
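To see why spraying puts a reassembly burden on the receiver, consider a toy software model of a reorder buffer. This is an assumption-laden sketch, not a description of any NIC's hardware: the function name and the per-packet sequence numbers are hypothetical, and real implementations differ in how they track and place out-of-order data.

```python
def reassemble(arrivals):
    """Release payloads in sequence order as soon as the next expected one is present.

    `arrivals` is an iterable of (sequence_number, payload) pairs in the order
    they reach the receiver. Returns the in-order payloads and the peak number
    of packets held while waiting for a gap to fill.
    """
    pending = {}      # sequence number -> payload, held until its turn
    next_seq = 0      # next sequence number the consumer expects
    delivered = []
    peak_buffered = 0

    for seq, payload in arrivals:
        pending[seq] = payload
        peak_buffered = max(peak_buffered, len(pending))
        # Drain every packet that is now contiguous with what was delivered.
        while next_seq in pending:
            delivered.append(pending.pop(next_seq))
            next_seq += 1

    return delivered, peak_buffered


# Four packets sprayed across different paths arrive out of order.
sprayed = [(2, "c"), (0, "a"), (3, "d"), (1, "b")]
in_order, peak = reassemble(sprayed)
print(in_order, "peak packets buffered:", peak)
```

The peak-buffered figure is the point of the exercise: the more paths the switch sprays across and the more their latencies differ, the more state the receiver must hold per flow, which is work a NIC without hardware reassembly cannot do at line rate.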
DPUs and IPUs: the offload tax and what it buys
A DPU (data processing unit; Intel's term is IPU, infrastructure processing unit) is a NIC with a programmable CPU complex, memory, and accelerators added, sitting in the data path between the host and the wire. The canonical examples are NVIDIA BlueField, AMD Pensando, Intel IPU, and the cloud-captive designs (AWS Nitro, Google's IPU work). The premise is infrastructure offload: move the storage initiator, the encryption, the virtual-network overlay, and the security policy plane off the host x86 and onto the DPU, freeing host cores for the paying workload and creating an isolation boundary the tenant cannot see past.
The 2026 flagship sets the scale of the bet. BlueField-4 pairs a 64-core Arm Neoverse V2 complex (64 billion transistors, ~6× the compute of BlueField-3) with the ConnectX-9 NIC at 800 Gb/s, 128 GB of LPDDR5, a PCIe Gen6 host interface, and an on-board SSD, shipping in 2026 both as a card and integrated into the Rubin NVL144 rack (NVIDIA / HPCwire, 2025). That is a server-class computer on a NIC — and that is exactly why the DPU is a decision, not a default. You are adding a second CPU, a second operating system, and a second control plane to every server. It pays back only if you actually run infrastructure functions on it.
Three offload domains justify a DPU, and each is treated in depth elsewhere in the guide:
- Storage. The DPU acts as an NVMe-oF initiator and runs the GPUDirect Storage data path, presenting remote flash to the GPU as if local and bypassing the host CPU entirely — and, in the BlueField-4 generation, terminating an Ethernet-attached KV-cache/context-memory tier for inference. The storage data path is built out in Chapter 9.3.
- Security and isolation. The DPU enforces the tenant VPC overlay, line-rate encryption over RDMA, and microsegmentation policy in hardware the tenant root cannot reach — the hard isolation boundary for multi-tenant GPU clouds. This is the substance of Chapter 11.6 and Chapter 11.7.
- Virtualization & the overlay. The DPU runs the VXLAN/VPC encapsulation and the software-defined network, so the host hypervisor (or bare-metal stack) is relieved of network virtualization — the model top-tier neoclouds standardize on for bare-metal-with-VPC.
| Deployment context | NIC / SuperNIC alone | Add a DPU/IPU | What tips the decision |
|---|---|---|---|
| Single-tenant training cluster | Usually sufficient — RDMA transport is in the NIC | Optional — only if storage/security offload is wanted | No tenant boundary to enforce; host cores often not the bottleneck |
| Multi-tenant GPU cloud / neocloud | Leaves isolation on the host — soft boundary | Effectively mandatory — hardware VPC + encryption boundary | Tenant must not see past the boundary; ClusterMAX-grade isolation |
| Inference fleet with disaggregated KV/storage | Host CPU runs the storage initiator — steals cores | Strong fit — NVMe-oF + context-memory tier offload | GPUDirect Storage path and KV-cache tier free host cores for serving |
| Cost-sensitive batch / internal cluster | Preferred — fewer control planes to operate | Hard to justify — added capex + a second OS to patch | No isolation or storage-offload requirement to amortize the DPU |
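The table above condenses into a small decision helper. This is only a sketch of the table's logic under two simplifying assumptions: that tenant isolation and a disaggregated storage/KV tier are the two deciding inputs, and that isolation takes precedence. The class, field, and return strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    multi_tenant: bool           # must a tenant be kept behind a hardware boundary?
    disaggregated_storage: bool  # is there an NVMe-oF or KV-cache tier to offload?


def dpu_recommendation(deployment: Deployment) -> str:
    """Return the table's verdict for a deployment; isolation is checked first."""
    if deployment.multi_tenant:
        return "DPU effectively mandatory: hardware VPC and encryption boundary"
    if deployment.disaggregated_storage:
        return "DPU is a strong fit: storage and context-memory offload"
    return "NIC/SuperNIC usually sufficient: one control plane to operate"


training_cluster = Deployment(multi_tenant=False, disaggregated_storage=False)
neocloud = Deployment(multi_tenant=True, disaggregated_storage=True)
inference_fleet = Deployment(multi_tenant=False, disaggregated_storage=True)

for name, deployment in [("single-tenant training", training_cluster),
                         ("multi-tenant neocloud", neocloud),
                         ("inference fleet", inference_fleet)]:
    print(f"{name}: {dpu_recommendation(deployment)}")
```

A real evaluation would also weigh capex and the operational cost of a second OS per server, which this helper leaves out.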
Deep dive: how SHARP turns the switch into a collective accelerator — and why it is a lock-in
The clearest example of a switch ASIC doing more than moving packets is NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), and it is worth understanding because it is simultaneously a real performance win and a real lock-in. In a normal all-reduce, every GPU sends its gradient slice to every other, and the reduction (the sum) happens at the endpoints — bandwidth scales poorly and burns GPU streaming multiprocessors (SMs) on communication instead of compute. In-network reduction moves the arithmetic into the switch: the Quantum switch ASIC sums the contributions as they pass through, so each GPU sends its data once and receives the final result, halving the data on the wire for the reduction and freeing SMs.
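The "halving" claim can be checked against a standard endpoint-based algorithm. The sketch below assumes a ring all-reduce (reduce-scatter followed by all-gather) as the baseline, which is one common choice and not necessarily what any given deployment runs, and it idealizes in-network reduction as each GPU sending its buffer exactly once. Function names are hypothetical.

```python
def ring_allreduce_bytes_sent(message_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU transmits in a ring all-reduce.

    Reduce-scatter and all-gather each move (N - 1) / N of the buffer,
    so the total is 2 * (N - 1) / N times the message size.
    """
    if num_gpus < 2:
        raise ValueError("an all-reduce needs at least two participants")
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def in_network_bytes_sent(message_bytes: float) -> float:
    """Idealized in-network reduction: each GPU sends its buffer once."""
    return message_bytes


MESSAGE_BYTES = 1e9  # a 1 GB gradient buffer, chosen only for illustration

for gpus in (8, 64, 1024):
    ring = ring_allreduce_bytes_sent(MESSAGE_BYTES, gpus)
    ratio = ring / in_network_bytes_sent(MESSAGE_BYTES)
    print(f"{gpus:5d} GPUs: ring sends {ring / 1e9:.3f} GB per GPU, "
          f"{ratio:.3f}x the in-network case")
```

The ratio is 2(N-1)/N, so it approaches 2 as the GPU count grows, which is where the roughly-half figure comes from under these assumptions.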
SHARPv4 on Quantum-X800, composed with NVLink SHARP inside the rack, cuts the SMs NCCL spends on collectives from ~16 to ≤6 in the integrated stack (NVIDIA, 2025) — a concrete, measurable reason to choose InfiniBand for reduction-heavy synchronous training. The catch is that this is a captive feature: it lives in the Quantum switch and requires NVIDIA NICs that speak the protocol. There is no merchant-silicon or open-Ethernet equivalent shipping at parity in 2026, which is why reduction-heavy pre-training is the workload where the captive stack still has its strongest claim — and why an operator who standardizes on Ethernet for everything else may still carve out an InfiniBand island for the largest training runs. The collective-offload mechanics and the Ethernet counter-arguments are developed in Chapter 8.6.
The vertical-integration question, restated as silicon
Step back and the three ASIC families resolve into one strategic fork: buy the network as a co-designed system from one vendor, or assemble it from merchant silicon. NVIDIA's pitch is that the switch (Quantum/Spectrum), the NIC (ConnectX), and the DPU (BlueField) are designed together, so features like SHARP, packet spray with hardware reassembly, and line-rate encryption work end-to-end out of the box — and the data shows the integrated Spectrum-X stack hitting ~95% effective throughput at hyperscale. The cost is that the value of the system depends on owning both ends; the moment you mix in a third-party NIC, the co-designed features degrade or disappear.
The merchant counter-case is that Broadcom (Tomahawk for radix, Jericho for depth) plus a SuperNIC of your choosing plus an open NOS (SONiC, FBOSS) gives you a multi-vendor supply chain, no single-vendor margin capture, and the freedom to mix optics and cables — at the cost of doing the integration yourself and accepting that the most aggressive co-designed features (SHARP-equivalent in-network reduction) are not yet at parity. This is why ~70% of new AI deployments now choose Ethernet/RoCEv2 (SemiAnalysis, 2026): for most fabrics the merchant path is good enough and cheaper, and the captive premium is reserved for the reduction-heavy training islands where it still pays. The business-model and margin framing of merchant-vs-captive silicon is developed in Chapter 7.1.
Deep dive: why the DPU's second control plane is the part people forget to budget
The DPU sales pitch is all about what it offloads. The part that gets under-budgeted is what it adds: a complete second computer in every server, with its own operating system, its own firmware, its own security-patch cadence, and its own failure modes. When you deploy BlueField at fleet scale, you have doubled the number of OS images you patch, the number of agents you monitor, and the number of things that can break in the data path between the host and the wire — a DPU that hangs takes the server's network with it.
This is why the DPU decision is not "is it powerful" (it obviously is — 64 Arm cores, 800 Gb/s) but "do I have enough infrastructure work to amortize a second control plane across thousands of servers." In a multi-tenant neocloud the answer is yes: the hardware isolation boundary is non-negotiable and the storage offload is real, so the operational tax is worth it — and ClusterMAX-grade ratings essentially require it. In a single-tenant training cluster with no isolation requirement and host cores to spare, a plain SuperNIC is often the better engineering choice precisely because it is one control plane, not two. The lesson generalizes: every offload frees a host resource and adds an operational surface, and the DPU is only a win when the freed resource is worth more than the added surface. The isolation case that most often tips it is in Chapter 11.6; the storage case in Chapter 9.3.
Anti-patterns
The recurring silicon mis-selections all come from optimizing one number in isolation instead of reading the gating spec and the downstream cost:
- Buying the aggregate Tbps and ignoring the SerDes generation. A switch bought one SerDes generation behind can look competitive on aggregate bandwidth, but it needs twice the lanes for every port at your target speed, so the same lane count yields half the radix, and it is unlikely to carry you through the next GPU refresh. Read the lane, not the headline.
- Deep buffers everywhere. Putting deep-buffer routing silicon on short-reach leaf switches where a shallow-buffer high-radix ASIC belongs — paying latency, power, and $/port for burst absorption the tier never needs. Match the buffer to the reach.
- A DPU with nothing to offload. Deploying BlueField-class silicon across a single-tenant cluster with no isolation requirement and no storage-offload plan — a per-server capex line and a second control plane bought for a feature set you never enable.
- Mixing a third-party NIC into a co-designed fabric. Buying a Spectrum-X or SHARP-capable switch for its integrated features, then pairing it with a generic NIC that cannot do hardware reassembly or in-network reduction — paying the captive premium and getting the merchant feature set.