The Definitive Guide to AI Data Centers

Chapter 8.3

Network Silicon: Switch ASICs, NICs & DPUs

The switch ASIC, the NIC, and the DPU are three silicon decisions that set the ceiling on every fabric you can build on top of them — pick the SerDes generation, the buffer architecture, and the offload engine before you draw a topology, because the topology is downstream of all three.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

  1. Which switch ASIC family you standardize on — merchant high-radix (Broadcom Tomahawk), merchant deep-buffer (Broadcom Jericho), or captive vertically-integrated (NVIDIA Quantum InfiniBand / Spectrum Ethernet) — because that choice sets your radix, your buffering philosophy, and your degree of vendor lock-in for the life of the cluster.
  2. Whether your back-end NIC is a plain RoCE/IB NIC or a SuperNIC with full transport offload — and whether the host even needs a DPU, or whether a NIC suffices, because the DPU is a per-server tax you justify with storage, security, and multi-tenant isolation, not with raw bandwidth.
  3. Shallow shared-buffer vs deep-buffer VOQ at each tier of the fabric: shallow where reach is short and you control congestion with PFC/ECN, deep where you cross buildings or oversubscribe and need to absorb bursts without dropping.
  4. Which functions you push off the host CPU and onto the DPU (storage initiator, encryption, the VPC overlay, the security policy plane) versus leaving on x86 — every offloaded function frees host cores for the workload but adds a second control plane to operate and patch.
  5. Whether the SerDes generation on your chosen ASIC (112G vs 224G vs the 448G horizon) is current enough to carry you through one full GPU refresh, because the SerDes lane rate, not the marketed aggregate Tbps, is the spec that actually gates port speed, reach, and the copper-vs-optics break.

The fabric chapters that bracket this one — scale-up in Chapter 8.2, scale-out protocols in Chapter 8.4, topology in Chapter 8.5 — all assume a set of silicon building blocks and reason about how to wire them together. This chapter is about the blocks themselves: the three classes of programmable silicon that the entire AI network is assembled from. The switch ASIC moves packets between ports. The NIC (and its beefed-up cousin, the SuperNIC) connects a server's accelerators to the wire and runs the RDMA transport. The DPU/IPU is a NIC with a CPU complex bolted on, sitting in the data path to offload storage, security, and virtualization from the host. Get the silicon selection right and the fabric design that follows is a series of well-posed wiring problems. Get it wrong — under-spec the SerDes, mis-match the buffer architecture to your reach, buy a DPU you have no offload for — and you have baked a ceiling into the cluster that no topology cleverness can lift.

The forks in this chapter are unusually consequential because they are early and sticky. The switch ASIC family decides your radix and your lock-in posture before a single cable is run. The SerDes generation on that ASIC decides, for the next several years, how many ports you get, how far copper reaches, and when you are forced onto optics. And the DPU decision — to deploy one at all, and what to run on it — is a per-server line item across thousands of servers that pays back only if you actually move work onto it.

The gating spec: SerDes generation, not aggregate Tbps

Marketing leads with the headline number ("102.4 Tbps switch," "800G NIC"), but the spec that actually gates everything downstream is the per-lane SerDes rate. A switch's aggregate bandwidth is just lanes times per-lane payload rate: Tomahawk 6's 102.4 Tbps is 512 lanes at 200 Gb/s each. The architecturally interesting questions (how many ports you get, how fast each one runs, how far copper reaches before you must pay for optics) are all decided by the lane. The SerDes ladder is the master clock of the entire networking industry: 50G → 100G → 200G → 400G per lane, each doubling roughly every two to three years, with the jump from 100G to 200G lanes (112G-class to 224G-class SerDes, where 224G is the raw signaling class that carries 200G of payload) being the 2025–2026 inflection (SemiAnalysis, AI networks, 2025).

Why does the lane matter more than the aggregate? Because a fixed aggregate can be sliced many ways, and the slicing is what you live with. Broadcom's Tomahawk 6 delivers 102.4 Tbps on 224G SerDes — that is 512 lanes, which you can present as 64×1.6T, 128×800G, or 256×400G ports (Broadcom, 2025). The same aggregate on an older 112G generation would need twice the lanes to hit the same port speed, blowing up package size, power, and cost. The lane rate also sets the copper cliff: at 224G, passive DAC reaches barely ~1m and active copper (AEC) ~2–3m, which is why the 224G generation forces optics out of the rack sooner and makes the co-packaged-optics transition urgent rather than optional. When you evaluate a switch or NIC, read past the aggregate to the lane — it tells you the radix you can build, the reach you get, and how close you are to the next forced optics upgrade. The physical-layer consequences are engineered in Chapter 8.2 (copper reach inside the domain) and the protocol layer in Chapter 8.4.
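The slicing arithmetic above can be sketched in a few lines. This is a minimal illustration, not a vendor tool: the function names are hypothetical, and it assumes a 224G-class lane carries 200 Gb/s of payload (which is what makes 512 lanes come out to 102.4 Tbps) and a 112G-class lane carries 100 Gb/s.

```python
def aggregate_tbps(lanes: int, lane_gbps: int) -> float:
    """Aggregate switch bandwidth: lane count times per-lane payload rate."""
    return lanes * lane_gbps / 1000


def port_options(lanes: int, lane_gbps: int, widths=(8, 4, 2)) -> dict:
    """Map lanes-per-port -> (port count, port speed in Gb/s)."""
    options = {}
    for width in widths:
        if lanes % width == 0:
            options[width] = (lanes // width, width * lane_gbps)
    return options


# Assumed payload rates: 200 Gb/s per 224G-class lane, 100 Gb/s per 112G-class lane.
for label, lane_gbps in (("224G-class", 200), ("112G-class", 100)):
    print(label, aggregate_tbps(512, lane_gbps), "Tbps")
    for width, (count, speed) in port_options(512, lane_gbps).items():
        print(f"  {width} lanes/port: {count} x {speed}G")
```

With the same 512 lanes, the older generation needs 8 lanes to build an 800G port and yields 64 of them, where the newer generation needs 4 lanes and yields 128. That is the "half the radix at your port speed" effect in concrete terms.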

Switch ASIC families: the three-way fork

There are effectively three families of AI switch silicon in 2026, and choosing among them is the most consequential network-silicon decision you make. They differ not just on speeds and feeds but on buffering philosophy and business model — and those two axes, more than raw bandwidth, are what you live with.

Broadcom Tomahawk is the merchant high-radix, shallow-shared-buffer line: the workhorse leaf/spine ASIC for Ethernet AI fabrics. Tomahawk 6 ships at 102.4 Tbps on 224G SerDes with on-chip cognitive routing and native co-packaged-optics options, targeting 100k–1M-XPU Ethernet fabrics (Broadcom, 2025). Its design point is maximum radix and minimum per-bit latency, with a relatively small on-chip buffer — it assumes you control congestion at the endpoints and with PFC/ECN, not by storing bytes in the switch.

Broadcom Jericho is the merchant deep-buffer, VOQ line: a routing-class ASIC built for the cases Tomahawk is deliberately not. Jericho4 ships at 51.2 Tbps on TSMC 3nm with HBM-attached deep packet buffers — up to ~160× more buffering than on-chip memory — line-rate MACsec, and a scale-across reach of over 100 km, combining four 800G ports into 3.2 Tbps "HyperPorts" and scaling to 36,000 of them (Broadcom / The Next Platform, 2025). Jericho exists to absorb bursts and cross buildings without dropping packets; it is the silicon behind the scale-across DCI fabric of Chapter 8.4.

NVIDIA Quantum (InfiniBand) and Spectrum (Ethernet) are the captive, vertically-integrated families. Quantum-X800 delivers 800G/port InfiniBand with SHARPv4 in-network reduction offload baked into the switch; Spectrum-X is the Ethernet line that, paired with ConnectX/BlueField at the endpoints, reaches ~95% effective throughput on production fabrics at xAI Colossus's 100k–200k-GPU scale (NVIDIA, 2025). The value proposition is a co-designed switch+NIC system that closes most of the gap to InfiniBand on Ethernet — at the cost of buying both ends from one vendor. The merchant-vs-captive business-model framing — who captures the margin, who controls the roadmap — is developed in Chapter 7.1.

Switch ASIC families → the decision fork

| Family | Class | Headline silicon (2026) | Buffer architecture | Where it wins | Lock-in posture |
| --- | --- | --- | --- | --- | --- |
| Broadcom Tomahawk | Merchant, high-radix leaf/spine | Tomahawk 6: 102.4 Tbps, 224G SerDes, native CPO option | Shallow shared on-chip; congestion managed at endpoints | Scale-out Clos/fat-tree, short reach, max radix, lowest $/Gb | Low: open NOS (SONiC/FBOSS), multi-vendor optics |
| Broadcom Jericho | Merchant, deep-buffer routing/fabric | Jericho4: 51.2 Tbps, 3nm, HBM packet memory, >100 km | Deep HBM-attached VOQ; ~160× on-chip buffering | Scale-across DCI, heavy oversubscription, lossless over distance | Low: merchant, but pairs with Broadcom fabric elements |
| NVIDIA Quantum | Captive, InfiniBand | Quantum-X800: 800G/port, SHARPv4 in-network reduction | Credit-based lossless; in-network collective offload | Reduction-heavy synchronous training; lowest tuned latency | High: IB end-to-end; SHARP needs NVIDIA NICs |
| NVIDIA Spectrum | Captive, Ethernet (Spectrum-X) | Spectrum-X / Photonics: 102.4–409.6 Tbps, CPO 2H 2026 | Shallow + adaptive routing; co-designed with ConnectX/BlueField | Ethernet AI at hyperscale (~95% effective) with packet spray | Medium-high: best with NVIDIA NIC/DPU at the endpoints |

2026-current reference points. Aggregate bandwidth and SerDes are vendor specs; effective-throughput and lock-in characterizations are practitioner framing. See the key numbers below for sources and vintages.

Shallow-shared vs deep-buffer VOQ: the buffering tradeoff

The split between Tomahawk and Jericho is the cleanest example in networking of a real architectural fork with no free lunch, so it deserves its own treatment. The question is: where do you store a packet that arrives faster than its egress port can drain? Two answers, two silicon philosophies.

Shallow shared-buffer ASICs (Tomahawk-class) keep a small, fast pool of on-chip SRAM shared across all ports. The bet is that with short reach, a non-blocking topology, and good endpoint congestion control (PFC/ECN/DCQCN, adaptive routing, packet spray — the machinery of Chapter 8.6), bursts are absorbed at the source and the switch never needs to hold much. The payoff is lowest latency, highest radix, and lowest power-per-bit. The risk: when a real incast burst exceeds the shallow buffer, you must either drop (lossy) or assert backpressure (PFC), and PFC at scale brings head-of-line blocking and deadlock risk. Shallow buffering only works if the congestion-control loop is fast enough to keep the buffer from filling.

Deep-buffer VOQ ASICs (Jericho-class) attach large off-chip memory — HBM on Jericho4 — and organize it as virtual output queues: a separate logical queue per egress destination, so a congested port cannot head-of-line-block traffic bound elsewhere. The payoff is the ability to absorb enormous bursts and to run lossless over long reach — the only way to carry RoCE across >100 km of DCI without dropping. The cost is real: added latency (a packet may sit in deep buffer), higher power and die area (HBM is not free), and higher $/port. You do not want deep buffers on a short-reach leaf where they add latency you never needed; you do want them at the fabric edge that crosses buildings.
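A back-of-envelope calculation shows why reach drives the buffer choice. To stay lossless, a switch must be able to hold whatever is still in flight after it signals the sender to pause, which is roughly one round trip at line rate. The sketch below assumes about 5 µs of one-way propagation delay per kilometre of fibre and ignores pause-processing time, so it is a lower bound rather than a sizing rule; the function names are hypothetical.

```python
# Assumption: light in fibre covers 1 km in roughly 5 microseconds (one way).
FIBRE_US_PER_KM = 5.0


def round_trip_s(distance_km: float) -> float:
    """Round-trip propagation delay over a fibre link, in seconds."""
    return 2 * distance_km * FIBRE_US_PER_KM * 1e-6


def in_flight_bytes(distance_km: float, port_gbps: float) -> float:
    """Bytes that can still arrive after a pause is sent: one RTT at line rate."""
    return port_gbps * 1e9 * round_trip_s(distance_km) / 8


for km in (0.1, 2, 100):
    mb = in_flight_bytes(km, 800) / 1e6
    print(f"{km:>5} km at 800G: about {mb:.1f} MB in flight per port")
```

Under these assumptions an 800G port needs on the order of 0.1 MB of headroom over 100 m but on the order of 100 MB over 100 km. The first fits a shared on-chip pool; the second is the regime where off-chip packet memory becomes the only option.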

NICs and SuperNICs: the RoCE/IB offload path

The NIC is where the network meets the accelerator, and in AI fabrics it does far more than push frames. The defining feature is RDMA — remote direct memory access — which lets a GPU on one node read or write a GPU's memory on another node without involving either host CPU, the foundation that makes collective communication tolerable at scale. Two transports carry RDMA: native InfiniBand (NVIDIA ConnectX in IB mode) and RoCEv2 (RDMA over Converged Ethernet), which runs the same verbs over a routable Ethernet/UDP underlay. The NIC implements the transport in hardware; the quality of that implementation — how it handles congestion, retransmission, and packet reordering — is a large part of why one fabric reaches 95% effective throughput and another wastes a third of its bisection.

The 2026 NIC provisioning rule for back-end fabrics is roughly one SuperNIC per GPU: an 8-GPU server carries 8×400G or 8×800G back-end ports (3.2–6.4 Tb/s/node), plus a separate, smaller NIC for the front-end/storage/management plane. The term SuperNIC denotes the AI-optimized variant: full transport offload, hardware support for the adaptive routing / packet-spray and out-of-order reassembly that Ultra Ethernet and Spectrum-X require, and line-rate congestion handling. ConnectX-8 carries the current generation; NVIDIA's ConnectX-9, paired with the Rubin platform, doubles bandwidth to 800 Gb/s RDMA (NVIDIA / ServeTheHome, 2025). Packet spray re-introduces a hint of vendor coupling: spraying packets across all paths only works if the receiving NIC can reassemble out-of-order delivery in hardware, so the switch and NIC must agree on the scheme. That is why Spectrum-X and UEC are switch+NIC systems, not just switches. The transport semantics that ride on top (lossless vs lossy, in-order vs out-of-order) are the subject of Chapter 8.4.
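To make the reassembly requirement concrete, here is a conceptual toy of what a receiver has to do when a sprayed message arrives out of order. It is a software illustration only: the class and its sequence-number scheme are hypothetical, and it does not describe how any particular NIC implements reordering or data placement in hardware.

```python
class ReorderBuffer:
    """Toy receiver that releases packets to the application strictly in order."""

    def __init__(self) -> None:
        self.next_seq = 0   # next sequence number the application expects
        self.pending = {}   # packets that arrived early, keyed by sequence number

    def receive(self, seq: int, payload: str) -> list:
        """Accept one packet; return every payload that is now deliverable."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


rx = ReorderBuffer()
# Packets sprayed over different paths arrive as 1, 0, 3, 2.
for seq, payload in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    print(seq, rx.receive(seq, payload))
```

The state the receiver must hold grows with how far ahead packets can arrive, which is why this work has to live in the NIC at line rate and why the spraying switch and the receiving NIC are designed as a pair.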

DPUs and IPUs: the offload tax and what it buys

A DPU (data processing unit; Intel's term is IPU, infrastructure processing unit) is a NIC with a programmable CPU complex, memory, and accelerators added, sitting in the data path between the host and the wire. The canonical examples are NVIDIA BlueField, AMD Pensando, Intel IPU, and the cloud-captive designs (AWS Nitro, Google's IPU work). The premise is infrastructure offload: move the storage initiator, the encryption, the virtual-network overlay, and the security policy plane off the host x86 and onto the DPU, freeing host cores for the paying workload and creating an isolation boundary the tenant cannot see past.

The 2026 flagship sets the scale of the bet. BlueField-4 pairs a 64-core Arm Neoverse V2 complex (64 billion transistors, ~6× the compute of BlueField-3) with the ConnectX-9 NIC at 800 Gb/s, 128 GB of LPDDR5, a PCIe Gen6 host interface, and an on-board SSD, shipping in 2026 both as a card and integrated into the Rubin NVL144 rack (NVIDIA / HPCwire, 2025). That is a server-class computer on a NIC — and that is exactly why the DPU is a decision, not a default. You are adding a second CPU, a second operating system, and a second control plane to every server. It pays back only if you actually run infrastructure functions on it.

Three offload domains justify a DPU, and each is treated in depth elsewhere in the guide:

  • Storage. The DPU acts as an NVMe-oF initiator and runs the GPUDirect Storage data path, presenting remote flash to the GPU as if local and bypassing the host CPU entirely — and, in the BlueField-4 generation, terminating an Ethernet-attached KV-cache/context-memory tier for inference. The storage data path is built out in Chapter 9.3.
  • Security and isolation. The DPU enforces the tenant VPC overlay, line-rate encryption over RDMA, and microsegmentation policy in hardware the tenant root cannot reach — the hard isolation boundary for multi-tenant GPU clouds. This is the substance of Chapter 11.6 and Chapter 11.7.
  • Virtualization & the overlay. The DPU runs the VXLAN/VPC encapsulation and the software-defined network, so the host hypervisor (or bare-metal stack) is relieved of network virtualization — the model top-tier neoclouds standardize on for bare-metal-with-VPC.

Do you actually need a DPU? The per-server decision

| Deployment context | NIC / SuperNIC alone | Add a DPU/IPU | What tips the decision |
| --- | --- | --- | --- |
| Single-tenant training cluster | Usually sufficient: RDMA transport is in the NIC | Optional: only if storage/security offload is wanted | No tenant boundary to enforce; host cores often not the bottleneck |
| Multi-tenant GPU cloud / neocloud | Leaves isolation on the host: soft boundary | Effectively mandatory: hardware VPC + encryption boundary | Tenant must not see past the boundary; ClusterMAX-grade isolation |
| Inference fleet with disaggregated KV/storage | Host CPU runs the storage initiator: steals cores | Strong fit: NVMe-oF + context-memory tier offload | GPUDirect Storage path and KV-cache tier free host cores for serving |
| Cost-sensitive batch / internal cluster | Preferred: fewer control planes to operate | Hard to justify: added capex + a second OS to patch | No isolation or storage-offload requirement to amortize the DPU |

The DPU is a per-server line item across thousands of servers. The right answer depends on what infrastructure functions you have to offload, not on bandwidth alone.

Key numbers

| Figure | What it describes | Vintage | Source |
| --- | --- | --- | --- |
| 102.4 Tbps | Broadcom Tomahawk 6 switch ASIC on 224G SerDes; native co-packaged-optics option; targets 100k–1M-XPU Ethernet fabrics | 2025 | Broadcom (Tomahawk 6 announcement) |
| 51.2 Tbps | Broadcom Jericho4 deep-buffer router; 3nm; HBM packet memory (~160× on-chip); RoCE over >100 km via 3.2 Tbps HyperPorts | 2025 | Broadcom / The Next Platform |
| 224G/lane | SerDes rate enabling 1.6T pluggables and 102.4 Tbps switching; 448G/lane samples ~2027, volume ~2028 | 2025 | SemiAnalysis (AI networks); Ciena 448G |
| 800 Gb/s | NVIDIA ConnectX-9 SuperNIC / BlueField-4 RDMA bandwidth; double ConnectX-8/BlueField-3 | 2026 (shipping) | NVIDIA / ServeTheHome / HPCwire |
| 64-core | BlueField-4 Arm Neoverse V2 complex (64B transistors, ~6× BF-3 compute); 128 GB LPDDR5; PCIe Gen6 | 2026 (announced) | NVIDIA / WWT / HPCwire |
| ~95% | Effective throughput on Spectrum-X Ethernet at xAI Colossus 100k–200k-GPU scale, zero flow-collision loss | 2025 | NVIDIA (Spectrum-X / Colossus) |
| ~70% | Share of new AI infrastructure deployments choosing Ethernet/RoCEv2 over InfiniBand as parity closes | 2026 | SemiAnalysis (AI networks) |
| 1 per GPU | Back-end SuperNIC provisioning (8×400G/800G per 8-GPU server = 3.2–6.4 Tb/s/node) + a separate front-end/storage NIC | 2025 | domain research (scale-out fabric); NVIDIA RA |

Deep dive: how SHARP turns the switch into a collective accelerator — and why it is a lock-in

The clearest example of a switch ASIC doing more than moving packets is NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), and it is worth understanding because it is simultaneously a real performance win and a real lock-in. In a normal all-reduce, every GPU sends its gradient slice to every other, and the reduction (the sum) happens at the endpoints — bandwidth scales poorly and burns GPU streaming multiprocessors (SMs) on communication instead of compute. In-network reduction moves the arithmetic into the switch: the Quantum switch ASIC sums the contributions as they pass through, so each GPU sends its data once and receives the final result, halving the data on the wire for the reduction and freeing SMs.
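The "halving" can be checked with a short calculation. The sketch below takes ring all-reduce as the endpoint-based baseline, which is an assumption on my part (it is a common choice, but the text does not name an algorithm), and uses the standard result that each of N GPUs sends 2(N-1)/N times the buffer size. With in-network reduction each GPU sends its buffer once. The function names are hypothetical.

```python
def ring_allreduce_sent(size_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * size_bytes


def in_network_sent(size_bytes: float) -> float:
    """Bytes each GPU sends when the switch performs the reduction."""
    return size_bytes


size = 1e9  # a 1 GB gradient buffer, chosen only for illustration
for n in (8, 1024):
    ratio = ring_allreduce_sent(size, n) / in_network_sent(size)
    print(f"{n} GPUs: ring sends {ratio:.3f}x the in-network volume per GPU")
```

The ratio approaches 2 as the GPU count grows, so the saving per GPU is close to half at cluster scale and somewhat less on a single 8-GPU node.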

SHARPv4 on Quantum-X800, composed with NVLink SHARP inside the rack, cuts the SMs NCCL spends on collectives from ~16 to ≤6 in the integrated stack (NVIDIA, 2025) — a concrete, measurable reason to choose InfiniBand for reduction-heavy synchronous training. The catch is that this is a captive feature: it lives in the Quantum switch and requires NVIDIA NICs that speak the protocol. There is no merchant-silicon or open-Ethernet equivalent shipping at parity in 2026, which is why reduction-heavy pre-training is the workload where the captive stack still has its strongest claim — and why an operator who standardizes on Ethernet for everything else may still carve out an InfiniBand island for the largest training runs. The collective-offload mechanics and the Ethernet counter-arguments are developed in Chapter 8.6.

The vertical-integration question, restated as silicon

Step back and the three ASIC families resolve into one strategic fork: buy the network as a co-designed system from one vendor, or assemble it from merchant silicon. NVIDIA's pitch is that the switch (Quantum/Spectrum), the NIC (ConnectX), and the DPU (BlueField) are designed together, so features like SHARP, packet spray with hardware reassembly, and line-rate encryption work end-to-end out of the box — and the data shows the integrated Spectrum-X stack hitting ~95% effective throughput at hyperscale. The cost is that the value of the system depends on owning both ends; the moment you mix in a third-party NIC, the co-designed features degrade or disappear.

The merchant counter-case is that Broadcom (Tomahawk for radix, Jericho for depth) plus a SuperNIC of your choosing plus an open NOS (SONiC, FBOSS) gives you a multi-vendor supply chain, no single-vendor margin capture, and the freedom to mix optics and cables — at the cost of doing the integration yourself and accepting that the most aggressive co-designed features (SHARP-equivalent in-network reduction) are not yet at parity. This is why ~70% of new AI deployments now choose Ethernet/RoCEv2 (SemiAnalysis, 2026): for most fabrics the merchant path is good enough and cheaper, and the captive premium is reserved for the reduction-heavy training islands where it still pays. The business-model and margin framing of merchant-vs-captive silicon is developed in Chapter 7.1.

Deep dive: why the DPU's second control plane is the part people forget to budget

The DPU sales pitch is all about what it offloads. The part that gets under-budgeted is what it adds: a complete second computer in every server, with its own operating system, its own firmware, its own security-patch cadence, and its own failure modes. When you deploy BlueField at fleet scale, you have doubled the number of OS images you patch, the number of agents you monitor, and the number of things that can break in the data path between the host and the wire — a DPU that hangs takes the server's network with it.

This is why the DPU decision is not "is it powerful" (it obviously is — 64 Arm cores, 800 Gb/s) but "do I have enough infrastructure work to amortize a second control plane across thousands of servers." In a multi-tenant neocloud the answer is yes: the hardware isolation boundary is non-negotiable and the storage offload is real, so the operational tax is worth it — and ClusterMAX-grade ratings essentially require it. In a single-tenant training cluster with no isolation requirement and host cores to spare, a plain SuperNIC is often the better engineering choice precisely because it is one control plane, not two. The lesson generalizes: every offload frees a host resource and adds an operational surface, and the DPU is only a win when the freed resource is worth more than the added surface. The isolation case that most often tips it is in Chapter 11.6; the storage case in Chapter 9.3.
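The amortization test can be written down as a rough per-server inequality. This is a sketch under stated assumptions, not a costing model: every input is a number you would have to supply, the dollar figures in the example are invented purely for illustration, and treating a hard isolation requirement as an automatic yes is a simplification of the argument above.

```python
def os_images_to_patch(servers: int, dpu_per_server: bool) -> int:
    """Each DPU adds one more operating-system image per server."""
    return servers * (2 if dpu_per_server else 1)


def dpu_pays_back(freed_value: float, dpu_capex: float, added_ops_cost: float,
                  isolation_required: bool = False) -> bool:
    """True when the freed host resource outweighs the added cost and surface."""
    if isolation_required:
        # Simplification: a hard tenant boundary decides the question on its own.
        return True
    return freed_value > dpu_capex + added_ops_cost


# Hypothetical per-server lifetime figures, for illustration only.
print(os_images_to_patch(10_000, dpu_per_server=True))
print(dpu_pays_back(freed_value=1_500, dpu_capex=2_500, added_ops_cost=500))
print(dpu_pays_back(freed_value=1_500, dpu_capex=2_500, added_ops_cost=500,
                    isolation_required=True))
```

The useful part is less the arithmetic than the shape of the inputs: the operational cost term belongs in the budget next to the card price, and it scales with server count.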

Anti-patterns

The recurring silicon mis-selections all come from optimizing one number in isolation instead of reading the gating spec and the downstream cost:

  • Buying the aggregate Tbps and ignoring the SerDes generation. A switch bought one SerDes generation behind looks competitive on aggregate bandwidth but yields half the radix at your port speed, reaches less far on copper, and forces an optics upgrade a generation early. Read the lane, not the headline.
  • Deep buffers everywhere. Putting deep-buffer routing silicon on short-reach leaf switches where a shallow-buffer high-radix ASIC belongs — paying latency, power, and $/port for burst absorption the tier never needs. Match the buffer to the reach.
  • A DPU with nothing to offload. Deploying BlueField-class silicon across a single-tenant cluster with no isolation requirement and no storage-offload plan — a per-server capex line and a second control plane bought for a feature set you never enable.
  • Mixing a third-party NIC into a co-designed fabric. Buying a Spectrum-X or SHARP-capable switch for its integrated features, then pairing it with a generic NIC that cannot do hardware reassembly or in-network reduction — paying the captive premium and getting the merchant feature set.

This chapter supplies the silicon that the rest of Part 8 wires together. The copper-reach and scale-up domain those ASICs sit inside is Chapter 8.2; the transport semantics (RoCE vs IB, lossless vs lossy, in-order vs out-of-order) that the NICs implement are Chapter 8.4; the radix-and-oversubscription math that turns these chips into a topology is Chapter 8.5; the congestion control, packet spray, and SHARP collective offload that the silicon enables are Chapter 8.6. The DPU's offloaded functions are developed where they live: storage and GPUDirect in Chapter 9.3, multi-tenant isolation in Chapter 11.6, and microsegmentation/zero-trust in Chapter 11.7. The merchant-vs-captive business model behind the family fork is Chapter 7.1; the telemetry that operates all of it is Chapter 10.6 and Chapter 14.2.