Chapter 7.1
Accelerator Landscape & Taxonomy
The accelerator is not a product you buy — it is an architectural family you marry, and the four families (merchant GPU, systolic TPU, hyperscaler XPU, inference ASIC) impose different software stacks, different scale-up fabrics, and different lock-in horizons long after the silicon is racked.
What you'll decide here
- Which accelerator family you are designing around — merchant GPU, systolic-array TPU, hyperscaler custom XPU, or inference-specialized ASIC — because that single choice sets your software stack, your scale-up fabric, and your switching cost for the life of the asset.
- Whether you are a merchant buyer (paying NVIDIA/AMD margin for a portable CUDA/ROCm stack) or are large enough to justify a captive design program (eating NRE and a multi-year tape-out lead time to escape that margin).
- Which datasheet number actually governs your workload — dense vs sparse, peak vs sustained, and at which precision — so you size against a goodput-realistic figure and not a marketing headline that you will never reach.
- How much of the value sits above the die: the Broadcom/Marvell design-partner relationship and TSMC fab/CoWoS allocation that any custom program is hostage to, regardless of whose logo is on the chip.
- Whether a single-family fleet (operational simplicity, deepest software) or a heterogeneous fleet (price leverage, supply resilience) matches your scale, your team, and your exposure to one vendor's roadmap and allocation.
Every chapter in Part 7 is downstream of one classification: what kind of accelerator is this? Not the part number — the architectural family. A merchant GPU, a systolic-array TPU, a hyperscaler XPU, and an inference-specialized ASIC are not four points on a performance spectrum; they are four different bets about how flexible the silicon should be, who writes the compiler, who owns the scale-up fabric, and how long you are locked in. Choosing a family is the master fork of the compute layer, and like the workload fork in Chapter 1.1, it propagates: the family sets the software stack you train your team on, the interconnect you plumb your racks for, the memory supply chain you are exposed to, and the depreciation argument your CFO will fight over.
This chapter defines the four families by the properties that drive consequences (programmability, who controls the toolchain, and the scale-up domain), then separates the two business models layered on top: merchant (you buy a chip and pay the designer's margin for a portable stack) versus captive (you co-design a chip and eat the NRE to escape that margin). We name the two firms, Broadcom and Marvell, through which nearly every captive program flows, and the one fab, TSMC, that every family is hostage to. We close on a costly literacy gap in accelerator procurement: reading a datasheet without being lied to by it. Dense vs sparse, peak vs sustained, and the precision games can together put a factor of four or more between a headline FLOPS number and what your workload sustains, before you have run a single token.
The four families, by what actually differs
The families are not distinguished by speed — at a given TSMC node and HBM generation, peak FLOPS converge. They are distinguished by three architectural properties that determine everything downstream:
- Programmability / generality. How far the silicon strays from a fixed dataflow. A GPU is a general SIMT machine that will run any kernel you can express; a systolic array is a near-fixed matrix-multiply dataflow that is highly efficient on the ops it was built for and inert on the ops it was not; an inference ASIC narrows further to a single serving regime.
- Who controls the toolchain. The chip is the cheap part of the lock-in. CUDA, XLA, and Neuron are the expensive part. A merchant GPU ships with a portable, decade-matured software stack you can hire for; a captive XPU ships with a compiler that exists to serve one owner's models and one cloud's instances.
- The scale-up domain. How many accelerators share a coherent, high-bandwidth memory fabric before you fall off onto a slower scale-out network. This is a family-level attribute — NVLink for NVIDIA, ICI for TPU, NeuronLink for Trainium, UALink/Ethernet for the open camp — and it caps how large a tensor- or expert-parallel shard can be without paying the network tax. The fabric engineering itself lives in Part 8; here it is a taxonomy axis.
Those three properties order the table below. The leftmost column is the choice; everything to the right is a consequence you inherit.
| Family | Canonical 2026 parts | Programmability | Toolchain control | Scale-up domain | Primary consequence |
|---|---|---|---|---|---|
| Merchant GPU | NVIDIA Blackwell B200/GB200, Rubin; AMD MI355X/MI400 | Fully general SIMT; runs any kernel | Vendor stack, but portable + hireable (CUDA / ROCm) | NVLink 72 (NVL72) → 576; UALink/Ethernet (AMD) | Highest unit margin paid for optionality, ecosystem, resale liquidity |
| Systolic TPU | Google TPU v7 Ironwood | Near-fixed matmul dataflow; superb on GEMM, weak off it | XLA / JAX, captive to one operator's cloud | ICI 3D-torus; 9,216-chip OCS pod | Best perf/watt on supported ops; rent-only, no merchant market |
| Hyperscaler XPU | AWS Trainium3, Microsoft Maia, Meta MTIA, OpenAI/Broadcom | Domain-specialized, semi-programmable | Captive SDK (Neuron, etc.); thin third-party support | NeuronLink / vendor scale-up (e.g. 144-chip UltraServer) | Anchor-tenant economics; escapes merchant margin, eats SDK-maturity tax |
| Inference ASIC | Groq LPU, SambaNova RDU, AWS Inferentia, d-Matrix, Tenstorrent | Narrowest; fixed-function or reconfigurable-dataflow for serving | Bespoke / emerging; limited framework reach | Vendor-specific; often small or none | Lowest cost-per-token in its niche; fixed-function obsolescence risk |
Each row of the table is a set of inheritances. Choose the systolic TPU and you have also chosen XLA/JAX and a Google Cloud rental relationship: there is no TPU you can buy, rack, and resell. Choose a hyperscaler XPU and you have chosen anchor-tenant economics: the part exists because one operator's internal demand justified the tape-out, and your access to it is a function of their spare capacity and their SDK's maturity, not a merchant price list. Choose an inference ASIC and you may win dramatically on cost-per-token in a narrow serving regime, and lose the moment the dominant model architecture shifts under a fixed-function design. The families do not trade off on one axis; they trade among three (flexibility, efficiency, and control), and the right answer is set by your scale and your software org, not by a FLOPS chart. → the per-family deep dives: NVIDIA in Chapter 7.2, AMD in Chapter 7.3, hyperscaler XPUs in Chapter 7.4, custom ASICs in Chapter 7.5.
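The scale-up column can be read as a hard cap on shard size. The sketch below is a rough illustration, not a sizing tool: it takes the domain sizes from the table and checks whether a hypothetical 8-way tensor-parallel by 16-way expert-parallel group stays inside one domain. The function name, the shard shape, and the assumption that the two degrees multiply into a single bandwidth-hungry group are illustrative choices, not something any vendor specifies.

```python
# Scale-up domain sizes (accelerators per high-bandwidth fabric), from the table above.
SCALE_UP_DOMAIN = {
    "NVIDIA NVL72": 72,
    "TPU v7 Ironwood pod": 9216,
    "Trainium3 UltraServer": 144,
}


def fits_in_scale_up(tensor_parallel: int, expert_parallel: int, domain_size: int) -> bool:
    """Return True if one TP x EP shard group stays inside a single scale-up domain.

    Assumption (illustrative): the group needs tensor_parallel * expert_parallel
    accelerators that all talk over the scale-up fabric.
    """
    if tensor_parallel < 1 or expert_parallel < 1:
        raise ValueError("parallel degrees must be positive integers")
    return tensor_parallel * expert_parallel <= domain_size


# Hypothetical shard shape: 8 x 16 = 128 accelerators per group.
for name, size in SCALE_UP_DOMAIN.items():
    verdict = "fits" if fits_in_scale_up(8, 16, size) else "spills onto the scale-out network"
    print(f"{name:24s} {size:>5d} chips: {verdict}")
```

With these inputs the 128-chip group exceeds a 72-GPU domain but fits the two larger ones, which is the sense in which the fabric, not the die, caps the parallelism plan.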
Why a GPU and a TPU are not the same animal
The deepest split in the taxonomy is architectural, and it is worth making concrete because it explains the perf/watt gap and the portability gap at once. A GPU is a SIMT (single-instruction, multiple-thread) machine: thousands of cores execute the same instruction across different data, with a large register file, programmable caches, and Tensor Cores bolted on for matrix math. Its virtue is generality — any kernel you can write, it will run — and its vice is that generality costs silicon area and power on control logic, scheduling, and data movement that a fixed-function part spends on math.
A TPU is built around a systolic array: a 2D grid of multiply-accumulate units through which data flows rhythmically, each cell passing partial sums to its neighbor so that operands are reused across the array without re-fetching from memory. For dense matrix multiplication, the dominant op in transformer training and serving, this is extraordinarily efficient: it minimizes the data movement that dominates the energy budget, which is why Google's TPU v7 Ironwood reaches roughly 4,614 FP8 TFLOPS per chip at a perf/watt the company markets aggressively against Blackwell (Google Cloud / TrendForce, Nov 2025). The cost is rigidity: ops that do not map cleanly to the array's dataflow (irregular sparsity, dynamic control flow, exotic attention variants) run poorly or fall back to slower paths, and the XLA compiler must statically schedule the whole graph. The GPU pays a generality tax every cycle; the TPU pays a flexibility tax only when the model drifts off the dataflow it was built for. → numerics and precision in Chapter 7.10.
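The operand reuse is easier to see in a toy simulation. The sketch below models one common textbook variant, an output-stationary array in which each cell keeps its own accumulator while A values flow rightward and B values flow downward. It is an assumption-laden illustration: the function name and the edge-fetch counter are invented here, and nothing in it describes the actual dataflow of any Google part.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    A is n x k, B is k x m. Cell (i, j) owns the accumulator for C[i][j].
    Row i of A enters at the left edge delayed by i cycles; column j of B
    enters at the top edge delayed by j cycles, so A[i][s] and B[s][j]
    arrive at cell (i, j) on the same cycle.
    Returns (C, edge_fetches), where edge_fetches counts operand reads
    from "memory" at the array edges.
    """
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    acc = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(n)]  # A operand currently held in each cell
    b_reg = [[0] * m for _ in range(n)]  # B operand currently held in each cell
    edge_fetches = 0
    for t in range(n + m + k - 2):
        # Sweep from the far corner so each cell reads its neighbor's
        # value from the previous cycle before that neighbor is updated.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    s = t - i
                    a_reg[i][0] = A[i][s] if 0 <= s < k else 0
                    edge_fetches += 1 if 0 <= s < k else 0
                else:
                    a_reg[i][j] = a_reg[i][j - 1]
                if i == 0:
                    s = t - j
                    b_reg[0][j] = B[s][j] if 0 <= s < k else 0
                    edge_fetches += 1 if 0 <= s < k else 0
                else:
                    b_reg[i][j] = b_reg[i - 1][j]
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, edge_fetches


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C, fetches = systolic_matmul(A, B)
naive_reads = 2 * len(A) * len(B[0]) * len(B)  # two operand reads per multiply-accumulate
print(C, fetches, naive_reads)
```

Each element of A and B is read from the edge once (n·k + k·m reads) and then handed cell to cell, whereas a loop that re-fetches both operands for every multiply-accumulate performs 2·n·m·k reads. That gap grows with array size and is the data-movement saving the paragraph above refers to.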
Merchant vs captive — and the firms in the middle
The business-model layer sits on top of the architectural one and is just as consequential. Merchant silicon is sold to anyone: NVIDIA and AMD design a chip, TSMC fabs it, and the buyer pays a gross margin — NVIDIA's data-center margins have run in the 70%+ range — in exchange for a portable software stack, a hireable skills market, a deep secondary market, and someone else's roadmap risk. Captive silicon is designed by the operator who will run it. Google, Amazon, Microsoft, Meta, and now OpenAI build accelerators tuned to their own models and clouds, priced internally to undercut the merchant margin they would otherwise pay NVIDIA. The prize is enormous at hyperscale: at a million-accelerator fleet, escaping a 70-point margin on every chip is the difference between a viable and an unviable cost-per-token.
But almost no operator designs the whole chip alone. The physical-design, SerDes, packaging, and tape-out expertise sits with two merchant-silicon houses, and nearly every captive program flows through one of them. Broadcom (≈55–60% of the custom-ASIC market) is the design partner behind Google's TPU line, Meta's MTIA, and OpenAI's first in-house accelerator; Marvell (≈15%) serves AWS Trainium/Inferentia and Microsoft Maia. Together they hold roughly 70–75% of the custom-ASIC market (J.P. Morgan via TrendForce; Tom's Hardware, 2025–2026). The strategic reading: 'building your own silicon' rarely means vertical independence. It means swapping NVIDIA's margin and roadmap for a Broadcom-or-Marvell design relationship and, underneath both, the same single fab.
| Dimension | Merchant (NVIDIA, AMD) | Captive XPU (TPU, Trainium, Maia, MTIA) |
|---|---|---|
| Who pays the margin | You — 70%+ data-center gross margin | You capture it internally; pay NRE + design-partner fee instead |
| Up-front cost | Purchase price only | Hundreds of $M NRE; mask sets; multi-program commitment |
| Lead time to volume | Order against an allocation queue | ~18–36 months design-to-volume per generation |
| Software | Portable, mature, hireable (CUDA / ROCm) | Captive compiler (XLA / Neuron); thin third-party support |
| Resale / secondary market | Deep — underwrites residual value and GPU-backed debt | Effectively none; asset is captive and non-fungible |
| Who it makes sense for | Almost everyone below frontier-self-build scale | Operators with own-model volume + a silicon/compiler org |
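The merchant-versus-captive trade in the table reduces to a break-even volume: how many chips it takes for the per-unit margin you escape to repay the NRE. The sketch below is a back-of-envelope model under loudly stated assumptions. The merchant price, the per-unit design-partner fee, and the $500M NRE are hypothetical placeholders, and the model ignores the compiler organization, time value of money, and schedule risk.

```python
import math


def captive_break_even_units(merchant_price, merchant_gross_margin,
                             partner_fee_per_unit, nre):
    """Units needed before a captive program's per-chip saving repays its NRE.

    Assumption (illustrative): the captive chip costs the same to manufacture
    as the merchant chip's cost of goods, plus a per-unit design-partner fee.
    """
    cost_of_goods = merchant_price * (1.0 - merchant_gross_margin)
    captive_unit_cost = cost_of_goods + partner_fee_per_unit
    saving_per_unit = merchant_price - captive_unit_cost
    if saving_per_unit <= 0:
        raise ValueError("captive unit cost is not below the merchant price")
    return math.ceil(nre / saving_per_unit)


# All inputs below are hypothetical, chosen only to make the arithmetic concrete.
units = captive_break_even_units(
    merchant_price=30_000,        # hypothetical merchant price per accelerator, USD
    merchant_gross_margin=0.70,   # the 70%+ range cited in the text
    partner_fee_per_unit=3_000,   # hypothetical design-partner fee, USD
    nre=500_000_000,              # "hundreds of $M" in the table; $500M assumed
)
print(f"break-even volume: {units:,} accelerators per generation")
```

Under these placeholder inputs the program needs on the order of tens of thousands of chips per generation just to recover NRE, which is consistent with the table's last row: the captive route only pencils out for operators with their own large model volume.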
Reading a datasheet without being lied to
A costly literacy gap in accelerator procurement is taking the headline FLOPS number at face value. Vendors quote the largest defensible figure, and the gap between that figure and what your workload sustains can be a factor of four or more before you have run a single token. Three traps recur, and each one inflates the number in a different way.
Trap 1 — dense vs sparse. The biggest headline numbers usually assume structured sparsity (commonly 2:4 — two of every four weights zeroed), which doubles the quoted matmul throughput. NVIDIA's GB200 NVL72 rack is marketed at ~1.44 ExaFLOPS FP4 with sparsity; the dense figure is half that, and most production workloads do not realize the full sparsity speed-up. AMD, by contrast, quotes MI355X FP4 at 9.2 PFLOPS without sparsity — so a naive 'their number is smaller' comparison is comparing a dense figure against a sparse one. Always normalize to dense, at the same precision, before comparing two vendors.
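Normalizing to dense is a one-line division, but writing it down makes the comparison error visible. The sketch below uses only the figures quoted in this paragraph, plus the fact that an NVL72 rack contains 72 GPUs; the class and function names are illustrative, and the 2x factor is the nominal 2:4 speed-up that vendors assume in the quote, not a measured one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotedFlops:
    part: str
    precision: str
    pflops: float
    assumes_sparsity: bool
    sparsity_speedup: float = 2.0  # nominal 2:4 structured-sparsity factor


def dense_pflops(q: QuotedFlops) -> float:
    """Strip the assumed sparsity speed-up so quotes compare like for like."""
    return q.pflops / q.sparsity_speedup if q.assumes_sparsity else q.pflops


# Figures as quoted in the text above.
nvl72_rack = QuotedFlops("GB200 NVL72 rack", "FP4", 1440.0, assumes_sparsity=True)
mi355x = QuotedFlops("MI355X", "FP4", 9.2, assumes_sparsity=False)

gpus_per_rack = 72
nvidia_sparse_per_gpu = nvl72_rack.pflops / gpus_per_rack       # headline, per GPU
nvidia_dense_per_gpu = dense_pflops(nvl72_rack) / gpus_per_rack  # normalized, per GPU

naive_ratio = nvidia_sparse_per_gpu / mi355x.pflops
dense_ratio = nvidia_dense_per_gpu / dense_pflops(mi355x)
print(f"rack dense FP4: {dense_pflops(nvl72_rack):.0f} PFLOPS")
print(f"naive ratio {naive_ratio:.2f}x vs dense-normalized ratio {dense_ratio:.2f}x")
```

On these quoted figures the apparent gap of more than 2x shrinks to under 1.1x once both sides are expressed dense, per device, at the same precision.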
Trap 2 — precision inflation. A chip's biggest FLOPS number is at its lowest-precision format. Drop from FP16 to FP8 and the number doubles; drop to FP4/FP6 and it doubles again. MI355X is a clean illustration: ~5 PFLOPS FP16, ~10.1 PFLOPS FP8, ~20.1 PFLOPS FP4/FP6 — same silicon, a 4x spread purely from the precision the marketing slide chose. If your training run needs BF16/FP8 for stability, the FP4 headline is a number you will never see. Match the quoted precision to the precision your workload actually runs at. → the precision ladder in Chapter 7.10.
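The same check can be made mechanical: look up the figure at the precision the workload actually runs, and compare it with the largest number on the slide. The sketch below uses the MI355X ladder exactly as quoted in this paragraph; the dictionary layout and function name are illustrative.

```python
# MI355X peak figures as quoted in the text above, in PFLOPS.
MI355X_LADDER = {"FP16": 5.0, "FP8": 10.1, "FP4": 20.1}


def headline_inflation(ladder: dict, workload_precision: str) -> float:
    """Ratio of the largest quoted figure to the figure the workload can use."""
    if workload_precision not in ladder:
        raise KeyError(f"no quoted figure for {workload_precision}")
    return max(ladder.values()) / ladder[workload_precision]


for precision in MI355X_LADDER:
    factor = headline_inflation(MI355X_LADDER, precision)
    print(f"workload at {precision}: headline overstates usable peak by {factor:.2f}x")
```

A workload pinned to FP16 sees roughly a 4x overstatement and one at FP8 roughly 2x, before sparsity or utilization enter the picture.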
Trap 3 — peak vs sustained. Peak FLOPS assumes every multiply-accumulate unit is fed every cycle. Real workloads are throttled by memory bandwidth, collective-communication stalls, kernel launch overhead, and thermal limits. The honest metric is Model FLOPS Utilization (MFU), sustained useful FLOPS over peak, which lands at roughly 30–50% on well-tuned training (best-in-class above 50% on Hopper). Goodput, the share of wall-clock time that survives failures and restarts (industry ~90%, best ~96%), discounts the delivered figure further. A part with a higher peak but a worse compiler and a thinner memory pipe can lose on sustained throughput to a part with a lower headline. This is why measured benchmarks diverge from spec sheets, and why a procurement RFP must demand sustained-MFU numbers on your models, not peak FLOPS on the vendor's. → realized-MFU gap and switching cost in Chapter 7.9; the governing cost-per-token metric in Chapter 7.11.
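Chaining the two discounts gives a goodput-realistic sizing figure: dense peak, times MFU, times goodput. The sketch below is a minimal illustration. The MFU and goodput values sit inside the ranges quoted above, while the two parts and their peak figures are hypothetical, chosen to show a lower-headline part winning on sustained throughput.

```python
def sustained_pflops(peak_dense_pflops: float, mfu: float, goodput: float) -> float:
    """Delivered throughput = dense peak x MFU x goodput (both fractions in (0, 1])."""
    if not (0.0 < mfu <= 1.0 and 0.0 < goodput <= 1.0):
        raise ValueError("mfu and goodput must be fractions in (0, 1]")
    return peak_dense_pflops * mfu * goodput


# Hypothetical parts: A has the bigger headline, B has the better software stack.
part_a = sustained_pflops(peak_dense_pflops=10.0, mfu=0.30, goodput=0.90)
part_b = sustained_pflops(peak_dense_pflops=8.0, mfu=0.45, goodput=0.90)

print(f"part A: {part_a:.2f} PFLOPS sustained from a 10.0 PFLOPS peak")
print(f"part B: {part_b:.2f} PFLOPS sustained from an 8.0 PFLOPS peak")
```

Part B delivers about 20% more useful throughput despite a 20% lower peak, which is why the RFP should ask for the MFU term measured on your own models.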
Deep dive: why an inference ASIC can win on cost-per-token and still be the wrong buy
Inference-specialized silicon — Groq's LPU, SambaNova's RDU, AWS Inferentia, d-Matrix, Tenstorrent and others — narrows the dataflow further than even a TPU, optimizing for one serving regime (often low-latency single-stream decode, or high-throughput batched prefill). Within that regime the results can be striking: deterministic latency, very high tokens-per-second, and a cost-per-token well below a general GPU because none of the silicon is spent on training flexibility, large register files, or speculative generality. For a stable, high-volume serving workload on a fixed model architecture, this is a genuine win.
The catch is fixed-function obsolescence risk, and it is the inference-ASIC version of the cooling-cliff one-way door. A part designed around today's dominant attention pattern, today's KV-cache layout, and today's quantization scheme is exposed when the model architecture shifts — and in this field it shifts yearly. A GPU absorbs an architectural change by recompiling a kernel; a fixed-function ASIC may need a new tape-out, which is a new ~18–36 month cycle and a new NRE bill. The reconfigurable-dataflow ASICs (SambaNova, Tenstorrent) hedge this by keeping the dataflow programmable, trading some peak efficiency for the ability to track model evolution. The decision therefore mirrors the merchant/captive fork: buy the inference ASIC only for a workload whose shape you are confident will outlive the silicon's design cycle, and keep a GPU pool for everything still moving. This is the hybrid fleet, justified. → custom-ASIC economics and the fixed-function-vs-reprogrammable trade in Chapter 7.5.
Single-family vs heterogeneous fleet
The last decision the taxonomy forces is fleet composition, and it is a simplicity-vs-resilience trade. A single-family fleet (almost always all-NVIDIA in 2026) buys operational simplicity: one software stack, one scheduler integration, one set of failure modes, one driver matrix, the deepest pool of hireable engineers, and the most liquid resale market. The price is total exposure to one vendor's roadmap, one vendor's allocation queue, and one vendor's pricing power. A heterogeneous fleet (GPUs for flexible training, a TPU or Trainium pool for steady high-volume work, an inference ASIC for a stable serving tier) buys price leverage (a credible second source disciplines the incumbent's quote), supply resilience (allocation shocks hit one family at a time), and workload-fit efficiency. The price is multiplied operational complexity: every additional family is another compiler to maintain, another set of kernels to port, another realized-MFU gap to measure, another on-call runbook.
The rule of thumb that survives contact with operators: match fleet diversity to scale and to the size of your software org. Below a few thousand accelerators, a single family almost always wins — the operational tax of a second stack exceeds the price leverage it buys. At hyperscale, heterogeneity is mandatory, because the allocation and pricing exposure of a single-vendor fleet at a million-chip scale is an existential risk, and the operator already has the compiler team to pay the integration tax. The middle is genuinely hard, and it is where most of the bad decisions get made — a mid-size operator adds a second family for price leverage it is too small to realize, and drowns in the operational complexity it was too small to absorb. → switching-cost quantification in Chapter 7.9; the full selection-and-TCO model, RFP construction, and buy-vs-rent-vs-build in Chapter 7.11.
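The rule of thumb can be framed as a break-even fleet size: the point at which the discount a second source extracts from the incumbent covers the cost of running a second stack. The sketch below is a deliberately crude model and every input is a hypothetical placeholder (unit price, discount, annual cost of the second-stack team, asset life). It ignores supply-resilience value and any MFU lost on the second family.

```python
import math


def break_even_fleet_size(unit_price, discount_fraction,
                          second_stack_cost_per_year, asset_life_years):
    """Smallest fleet at which price leverage repays the second stack's ops tax.

    Assumption (illustrative): the discount applies once, to the purchase price
    of every accelerator, while the ops tax recurs each year of asset life.
    """
    saving_per_unit = unit_price * discount_fraction
    if saving_per_unit <= 0:
        raise ValueError("discount must produce a positive per-unit saving")
    total_ops_tax = second_stack_cost_per_year * asset_life_years
    return math.ceil(total_ops_tax / saving_per_unit)


# Hypothetical inputs, for illustration only.
fleet = break_even_fleet_size(
    unit_price=30_000,                     # hypothetical USD per accelerator
    discount_fraction=0.10,                # hypothetical leverage from a credible second source
    second_stack_cost_per_year=5_000_000,  # hypothetical compiler/kernel/on-call team cost
    asset_life_years=4,                    # hypothetical depreciation horizon
)
print(f"second family breaks even at about {fleet:,} accelerators")
```

With these placeholders the break-even lands in the several-thousand range, in line with the rule of thumb above. The useful exercise is substituting your own discount and team-cost estimates and seeing which side of the line your fleet falls on.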
Deep dive: the taxonomy as a supply-chain map, not just an architecture map
It is tempting to read the four families purely as engineering categories. The more useful reading in 2026 is as a supply-chain dependency map, because that is what actually gates delivery. Trace any 2026 frontier accelerator back through its stack and you converge on the same chokepoints regardless of family. The logic die: TSMC N3/N3P. The advanced packaging that stitches logic to memory: TSMC CoWoS, the most-cited binding constraint on AI compute through 2030. The memory: a three-supplier HBM oligopoly (SK hynix, Samsung, Micron), with HBM3E sold out and HBM4 ramping into a structural shortage. The captive-program design IP: Broadcom or Marvell. The merchant alternative: NVIDIA or AMD, themselves at the front of the same TSMC queue.
The consequence for a strategist is that family choice does not diversify your upstream risk as much as it appears to. Switching from NVIDIA GPUs to a Broadcom-designed custom ASIC changes your margin structure and your software stack, but it does not move you off TSMC wafers, off CoWoS substrate, or off the HBM oligopoly — it may even put you deeper into the same queue behind the merchant vendors who pre-booked capacity. Genuine supply diversification comes from second-sourcing at the packaging and memory layer, not the architecture layer, which is why those layers — not the choice of GPU vs XPU — are treated as the real allocation gate in Chapter 7.6, Chapter 7.7, and the procurement strategy in Chapter 2.3.