Chapter 7.1

Accelerator Landscape & Taxonomy

The accelerator is not a product you buy — it is an architectural family you marry, and the four families (merchant GPU, systolic TPU, hyperscaler XPU, inference ASIC) impose different software stacks, different scale-up fabrics, and different lock-in horizons long after the silicon is racked.

GOODPUT · POWER-BOUND

What you'll decide here

Which accelerator family you are designing around — merchant GPU, systolic-array TPU, hyperscaler custom XPU, or inference-specialized ASIC — because that single choice sets your software stack, your scale-up fabric, and your switching cost for the life of the asset.
Whether you are a merchant buyer (paying NVIDIA/AMD margin for a portable CUDA/ROCm stack) or are large enough to justify a captive design program (eating NRE and a multi-year tape-out lead time to escape that margin).
Which datasheet number actually governs your workload — dense vs sparse, peak vs sustained, and at which precision — so you size against a goodput-realistic figure and not a marketing headline that you will never reach.
How much of the value sits above the die: the Broadcom/Marvell design-partner relationship and TSMC fab/CoWoS allocation that any custom program is hostage to, regardless of whose logo is on the chip.
Whether a single-family fleet (operational simplicity, deepest software) or a heterogeneous fleet (price leverage, supply resilience) matches your scale, your team, and your exposure to one vendor's roadmap and allocation.

Every chapter in Part 7 is downstream of one classification: what kind of accelerator is this? Not the part number — the architectural family. A merchant GPU, a systolic-array TPU, a hyperscaler XPU, and an inference-specialized ASIC are not four points on a performance spectrum; they are four different bets about how flexible the silicon should be, who writes the compiler, who owns the scale-up fabric, and how long you are locked in. Choosing a family is the master fork of the compute layer, and like the workload fork in Chapter 1.1, it propagates: the family sets the software stack you train your team on, the interconnect you plumb your racks for, the memory supply chain you are exposed to, and the depreciation argument your CFO will fight over.

This chapter defines the four families by the properties that drive consequences (programmability, who controls the toolchain, and the scale-up domain), then separates the two business models layered on top: merchant (you buy a chip and pay the designer's margin for a portable stack) versus captive (you co-design a chip and eat the NRE to escape that margin). We name the two firms, Broadcom and Marvell, through which nearly every captive program flows, and the one fab, TSMC, that every family is hostage to. We close on a costly literacy gap in accelerator procurement: reading a datasheet without being lied to by it. Sparsity assumptions, low-precision formats, and peak-versus-sustained accounting can together inflate a headline FLOPS number by 4x or more before you have run a single token.

The four families, by what actually differs

The families are not distinguished by speed — at a given TSMC node and HBM generation, peak FLOPS converge. They are distinguished by three architectural properties that determine everything downstream:

Programmability / generality. How far the silicon strays from a fixed dataflow. A GPU is a general SIMT machine that will run any kernel you can express; a systolic array is a near-fixed matrix-multiply dataflow that is highly efficient on the ops it was built for and inert on the ops it was not; an inference ASIC narrows further to a single serving regime.
Who controls the toolchain. The chip is the cheap part of the lock-in. CUDA, XLA, and Neuron are the expensive part. A merchant GPU ships with a portable, decade-matured software stack you can hire for; a captive XPU ships with a compiler that exists to serve one owner's models and one cloud's instances.
The scale-up domain. How many accelerators share a coherent, high-bandwidth memory fabric before you fall off onto a slower scale-out network. This is a family-level attribute — NVLink for NVIDIA, ICI for TPU, NeuronLink for Trainium, UALink/Ethernet for the open camp — and it caps how large a tensor- or expert-parallel shard can be without paying the network tax. The fabric engineering itself lives in Part 8; here it is a taxonomy axis.

Those three properties order the table below. The leftmost column is the choice; everything to the right is a consequence you inherit.

The master fork: rent a portable stack, or own a captive one

If you take one decision from this chapter, take this one. Merchant silicon (NVIDIA, AMD) sells you a chip plus a portable software stack and a roadmap you do not control: you pay a gross margin reported in the 70%+ range for optionality, hireable skills, and the freedom to resell the asset into a deep secondary market. Captive silicon (Google TPU, AWS Trainium, Microsoft Maia, Meta MTIA) is co-designed by the operator who will run it, tuned to that operator's models and cloud, and priced to undercut merchant margin internally. It also carries non-recurring engineering in the hundreds of millions of dollars, an 18–36 month design-to-volume lead time, a compiler that only its owner fully trusts, and effectively zero resale market. The fork is not 'which is faster.' It is: do you have the model volume and the software org to amortize a captive program, or are you better off renting NVIDIA's ecosystem and its resale liquidity? Unless you operate at roughly the scale of training your own frontier model, the answer is almost always merchant. → economics of the build threshold in Chapter 7.5.

The four accelerator families → what you inherit

Family	Canonical 2026 parts	Programmability	Toolchain control	Scale-up domain	Primary consequence
Merchant GPU	NVIDIA Blackwell B200/GB200, Rubin; AMD MI355X/MI400	Fully general SIMT; runs any kernel	Vendor stack, but portable + hireable (CUDA / ROCm)	NVLink 72 (NVL72) → 576; UALink/Ethernet (AMD)	Highest unit margin paid for optionality, ecosystem, resale liquidity
Systolic TPU	Google TPU v7 Ironwood	Near-fixed matmul dataflow; superb on GEMM, weak off it	XLA / JAX, captive to one operator's cloud	ICI 3D-torus; 9,216-chip OCS pod	Best perf/watt on supported ops; rent-only, no merchant market
Hyperscaler XPU	AWS Trainium3, Microsoft Maia, Meta MTIA, OpenAI/Broadcom	Domain-specialized, semi-programmable	Captive SDK (Neuron, etc.); thin third-party support	NeuronLink / vendor scale-up (e.g. 144-chip UltraServer)	Anchor-tenant economics; escapes merchant margin, eats SDK-maturity tax
Inference ASIC	Groq LPU, SambaNova RDU, AWS Inferentia, d-Matrix, Tenstorrent	Narrowest; fixed-function or reconfigurable-dataflow for serving	Bespoke / emerging; limited framework reach	Vendor-specific; often small or none	Lowest cost-per-token in its niche; fixed-function obsolescence risk

Representative 2026-current parts per family. Scale-up domain is the coherent high-bandwidth fabric size, a family attribute engineered in Part 8. The Toolchain control column is the lock-in axis, deepened in Chapter 7.9.

Each row of the table is a set of inheritances. Choose the systolic TPU and you have also chosen XLA/JAX and a Google Cloud rental relationship — there is no TPU you can buy, rack, and resell. Choose a hyperscaler XPU and you have chosen anchor-tenant economics: the part exists because one operator's internal demand justified the tape-out, and your access to it is a function of their spare capacity and their SDK's maturity, not a merchant price list. Choose an inference ASIC and you may win dramatically on cost-per-token in a narrow serving regime — and lose the moment the dominant model architecture shifts under a fixed-function design. The families do not trade off on one axis; they trade flexibility for efficiency for control, and the right answer is set by your scale and your software org, not by a FLOPS chart. → the per-family deep dives: NVIDIA in Chapter 7.2, AMD in Chapter 7.3, hyperscaler XPUs in Chapter 7.4, custom ASICs in Chapter 7.5.
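The scale-up column can be read as a hard cap on how large a parallel shard group may grow before it spills onto the scale-out network. The sketch below is a minimal illustration of that check, using the domain sizes listed in the table. The function name and the simplification that a shard group occupies tensor-parallel × expert-parallel accelerators are assumptions made for illustration, not a description of any vendor's scheduler.

```python
# Scale-up domain sizes as listed in the table (accelerators per coherent fabric).
SCALE_UP_DOMAIN = {
    "NVL72": 72,
    "TPU v7 Ironwood pod": 9216,
    "Trainium3 UltraServer": 144,
}


def fits_in_scale_up(domain: str, tensor_parallel: int, expert_parallel: int = 1) -> bool:
    """Return True if one TP x EP shard group stays inside a single scale-up domain.

    Assumption: the group needs tensor_parallel * expert_parallel accelerators
    on the same coherent fabric. Real layouts can be more nuanced.
    """
    group_size = tensor_parallel * expert_parallel
    return group_size <= SCALE_UP_DOMAIN[domain]


print(fits_in_scale_up("NVL72", tensor_parallel=8, expert_parallel=8))   # 64 <= 72
print(fits_in_scale_up("NVL72", tensor_parallel=8, expert_parallel=16))  # 128 > 72
```

A group that fails this check still runs, but part of its traffic crosses the slower scale-out network, which is the network tax the taxonomy axis describes.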

Why a GPU and a TPU are not the same animal

The deepest split in the taxonomy is architectural, and it is worth making concrete because it explains the perf/watt gap and the portability gap at once. A GPU is a SIMT (single-instruction, multiple-thread) machine: thousands of cores execute the same instruction across different data, with a large register file, programmable caches, and Tensor Cores bolted on for matrix math. Its virtue is generality — any kernel you can write, it will run — and its vice is that generality costs silicon area and power on control logic, scheduling, and data movement that a fixed-function part spends on math.

A TPU is built around a systolic array: a 2D grid of multiply-accumulate units through which data flows rhythmically, each cell passing partial sums to its neighbor so that operands are reused across the array without re-fetching from memory. For dense matrix multiplication, the dominant op in transformer training and serving, this is extraordinarily efficient: it minimizes the data movement that dominates the energy budget, which is why Google's TPU v7 Ironwood reaches roughly 4,614 FP8 TFLOPS per chip at a perf/watt the company markets aggressively against Blackwell (Google Cloud / TrendForce, Nov 2025). The cost is rigidity: ops that do not map cleanly to the array's dataflow (irregular sparsity, dynamic control flow, exotic attention variants) run poorly or fall back to slower paths, and the XLA compiler must statically schedule the whole graph. The GPU pays a generality tax every cycle; the TPU pays a rigidity tax only when the model drifts off the dataflow it was built for. → numerics and precision in Chapter 7.10.
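To make the operand-reuse argument concrete, here is a toy cycle-by-cycle simulation of a systolic matrix multiply in plain Python. It is a sketch under stated assumptions: it models an output-stationary array (each cell keeps its own accumulator while operands stream past), chosen for brevity, whereas the description above has partial sums moving between cells. It is not a model of any real TPU, and the fetch counts are a simplified illustration of reuse.

```python
def systolic_matmul(a, b):
    """Simulate C = A @ B on an n x m grid of multiply-accumulate cells.

    Assumption: output-stationary dataflow. Rows of A enter from the left edge
    and columns of B from the top edge, each skewed by one cycle per row or
    column, so cell (i, j) sees the operand pair with index s = t - i - j at cycle t.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]
    cycles = n + m + k - 2  # last pair reaches the far-corner cell at t = n + m + k - 3
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    # Each operand enters the array once at an edge and is then passed cell to cell.
    edge_fetches = n * k + k * m
    # A naive loop nest reads two operands from memory for every multiply.
    naive_fetches = 2 * n * m * k
    return acc, cycles, edge_fetches, naive_fetches


result, cycles, edge, naive = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19, 22], [43, 50]]
print(cycles, edge, naive)
```

The gap between `edge_fetches` and `naive_fetches` grows with the array size, which is the data-movement saving the paragraph attributes to the systolic design.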

Figure	What it measures	As of	Source
~55–60% / ~15%	Broadcom / Marvell share of the custom AI ASIC market; a design-partner duopoly (~70–75% combined)	2025–2026	J.P. Morgan via TrendForce; Tom's Hardware ASIC State of Play
~$30B	High-end custom AI ASIC market size, ~30% annual growth	2025	J.P. Morgan (H. Sur)
~1.44 EFLOPS	GB200 NVL72 rack, FP4 with sparsity (dense and other-precision figures are lower: the datasheet-reading trap)	2025	NVIDIA GB200 NVL72
4,614 FP8 TFLOPS	Per-chip TPU v7 Ironwood; 9,216-chip OCS pod = 42.5 FP8 ExaFLOPS	2025	Google Cloud; TrendForce
2.52 PFLOPS	Per-chip Trainium3 MXFP8, 144 GB HBM3E; ~4x perf/watt vs Trn2 UltraServer	Dec 2025	AWS (re:Invent / Trn3 UltraServers)
9.2 PFLOPS / 288 GB	AMD MI355X FP4 (no sparsity) and HBM3E capacity; 1,400 W peak board power	2025	AMD MI355X datasheet; Tom's Hardware
5–6 yr vs 2–3 yr	GPU book life vs frontier-economic life; the depreciation fight the family choice feeds (CONTESTED)	2026	CNBC / SemiAnalysis synthesis
~$283–318k	All-in cost per merchant 8-GPU (H100-class) server, excl. storage; the cost base a captive program aims to undercut	2025	SemiAnalysis AI Neocloud Playbook

Merchant vs captive — and the firms in the middle

The business-model layer sits on top of the architectural one and is just as consequential. Merchant silicon is sold to anyone: NVIDIA and AMD design a chip, TSMC fabs it, and the buyer pays a gross margin — NVIDIA's data-center margins have run in the 70%+ range — in exchange for a portable software stack, a hireable skills market, a deep secondary market, and someone else's roadmap risk. Captive silicon is designed by the operator who will run it. Google, Amazon, Microsoft, Meta, and now OpenAI build accelerators tuned to their own models and clouds, priced internally to undercut the merchant margin they would otherwise pay NVIDIA. The prize is enormous at hyperscale: at a million-accelerator fleet, escaping a 70-point margin on every chip is the difference between a viable and an unviable cost-per-token.

But almost no operator designs the whole chip alone. The physical-design, SerDes, packaging, and tape-out expertise sits with two merchant-silicon houses, and nearly every captive program flows through one of them. Broadcom (≈55–60% of the custom-ASIC market) is the design partner behind Google's TPU line, Meta's MTIA, and OpenAI's first in-house accelerator; Marvell (≈15%) serves AWS Trainium/Inferentia and Microsoft Maia. On those shares, the two together hold roughly 70–75% of the custom-ASIC market (J.P. Morgan via TrendForce; Tom's Hardware, 2025–2026). The strategic reading: 'building your own silicon' rarely means vertical independence. It means swapping NVIDIA's margin and roadmap for a Broadcom-or-Marvell design relationship and, underneath both, the same single fab.

Merchant vs captive → the business-model fork

Dimension	Merchant (NVIDIA, AMD)	Captive XPU (TPU, Trainium, Maia, MTIA)
Who pays the margin	You — 70%+ data-center gross margin	You capture it internally; pay NRE + design-partner fee instead
Up-front cost	Purchase price only	Hundreds of $M NRE; mask sets; multi-program commitment
Lead time to volume	Order against an allocation queue	~18–36 months design-to-volume per generation
Software	Portable, mature, hireable (CUDA / ROCm)	Captive compiler (XLA / Neuron); thin third-party support
Resale / secondary market	Deep — underwrites residual value and GPU-backed debt	Effectively none; asset is captive and non-fungible
Who it makes sense for	Almost everyone below frontier-self-build scale	Operators with own-model volume + a silicon/compiler org

The fork beneath the architectural one. NRE and lead-time figures are 2026 practitioner ranges; the build-justification threshold is quantified in Chapter 7.5.
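The amortization question behind the fork can be sketched as simple break-even arithmetic: how many chips must a captive program ship before the per-unit saving repays the NRE? The function name and every dollar figure below are hypothetical placeholders, and the sketch deliberately ignores the compiler-team cost, the lead-time delay, and the lost resale value that the table lists.

```python
import math


def captive_break_even_units(nre_usd, merchant_price_usd, captive_unit_cost_usd):
    """Chips needed before cumulative per-unit savings cover the NRE.

    Returns None when the captive part is not cheaper per unit, since the
    program then never breaks even on hardware cost alone.
    """
    saving_per_unit = merchant_price_usd - captive_unit_cost_usd
    if saving_per_unit <= 0:
        return None
    return math.ceil(nre_usd / saving_per_unit)


# Hypothetical inputs: $500M NRE, $30k merchant price, $12k captive unit cost.
units = captive_break_even_units(500_000_000, 30_000, 12_000)
print(units)  # 27778
```

Under these placeholder inputs the program needs tens of thousands of deployed chips per generation just to recover the NRE, which is why the table reserves the captive path for operators with their own model volume.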

TSMC is the universal dependency under every family

Whatever family you choose and whichever business model you adopt, the leading-edge die is almost certainly fabbed by TSMC — N4/N3/N3P for the 2025–2026 generation, with Trainium3 already on N3P. The architectural rivalry between GPU and TPU, the commercial rivalry between merchant and captive, and the design-partner contest between Broadcom and Marvell all resolve into the same wafer-start queue and, one step later, the same CoWoS advanced-packaging line. This is why supply, not architecture, is the binding constraint of the 2026 era: the differentiator that matters most is not whose dataflow is cleverer but whose name is higher on TSMC's allocation list for wafers and CoWoS substrate. The packaging gate is engineered in Chapter 7.7; the HBM oligopoly that rides on it in Chapter 7.6; and the procurement strategy that treats allocation as the real scarce asset in Chapter 2.3.

Reading a datasheet without being lied to

The costly literacy gap in accelerator procurement is the habit of taking the headline FLOPS number at face value. Vendors quote the largest defensible figure, and the gap between that figure and what your workload sustains can be a factor of four or more before you have run a single token. Three traps recur, and each one inflates the number in a different way.

Trap 1 — dense vs sparse. The biggest headline numbers usually assume structured sparsity (commonly 2:4 — two of every four weights zeroed), which doubles the quoted matmul throughput. NVIDIA's GB200 NVL72 rack is marketed at ~1.44 ExaFLOPS FP4 with sparsity; the dense figure is half that, and most production workloads do not realize the full sparsity speed-up. AMD, by contrast, quotes MI355X FP4 at 9.2 PFLOPS without sparsity — so a naive 'their number is smaller' comparison is comparing a dense figure against a sparse one. Always normalize to dense, at the same precision, before comparing two vendors.

Trap 2 — precision inflation. A chip's biggest FLOPS number is at its lowest-precision format. Drop from FP16 to FP8 and the number doubles; drop to FP4/FP6 and it doubles again. MI355X is a clean illustration: ~5 PFLOPS FP16, ~10.1 PFLOPS FP8, ~20.1 PFLOPS FP4/FP6 — same silicon, a 4x spread purely from the precision the marketing slide chose. If your training run needs BF16/FP8 for stability, the FP4 headline is a number you will never see. Match the quoted precision to the precision your workload actually runs at. → the precision ladder in Chapter 7.10.

Trap 3 — peak vs sustained. Peak FLOPS assumes every multiply-accumulate unit is fed every cycle. Real workloads are throttled by memory bandwidth, collective-communication stalls, kernel launch overhead, and thermal limits. The honest metric is Model FLOPS Utilization (MFU) — sustained useful FLOPS over peak — which lands at roughly 30–50% on well-tuned training (best-in-class above 50% on Hopper), and the goodput that survives failures and restarts is lower still (industry ~90%, best ~96%). A part with a higher peak but a worse compiler and a thinner memory pipe can lose on sustained throughput to a part with a lower headline. This is why measured benchmarks diverge from spec sheets, and why a procurement RFP must demand sustained-MFU numbers on your models, not peak FLOPS on the vendor's. → realized-MFU gap and switching cost in Chapter 7.9; the governing cost-per-token metric in Chapter 7.11.
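The three traps compose, and a short sketch shows how far a headline can fall once each is unwound. The function below walks a quoted figure down to a sustained estimate. The headline inputs come from the figures quoted in this chapter (1.44 EFLOPS sparse FP4 across 72 GPUs, and 9.2 PFLOPS dense FP4 for MI355X). The halving per precision step follows the rule of thumb stated in Trap 2 and is not a datasheet value, and the MFU and goodput values are illustrative points inside the ranges given in Trap 3, applied identically to both parts purely for comparison.

```python
def realistic_pflops(quoted_pflops, assumes_sparsity, precision_steps_up,
                     mfu, goodput, sparsity_speedup=2.0):
    """Walk a headline PFLOPS figure down through the three datasheet traps."""
    # Trap 1: remove the structured-sparsity multiplier to get a dense figure.
    dense = quoted_pflops / sparsity_speedup if assumes_sparsity else quoted_pflops
    # Trap 2: assume throughput halves for each step up the precision ladder
    # (for example FP4 -> FP8 is one step, FP4 -> FP16 is two).
    at_workload_precision = dense / (2 ** precision_steps_up)
    # Trap 3: scale peak by utilization and by goodput that survives failures.
    return at_workload_precision * mfu * goodput


# Per-GPU share of the NVL72 headline: 1,440 PFLOPS / 72 GPUs, sparse FP4.
gpu_a = realistic_pflops(1440 / 72, True, precision_steps_up=1, mfu=0.40, goodput=0.90)
# MI355X headline as quoted in this chapter: 9.2 PFLOPS FP4, no sparsity.
gpu_b = realistic_pflops(9.2, False, precision_steps_up=1, mfu=0.40, goodput=0.90)

print(f"headline ratio:   {(1440 / 72) / 9.2:.2f}x")
print(f"normalized ratio: {gpu_a / gpu_b:.2f}x")
```

The headline ratio suggests one part is more than twice the other. After normalizing to dense FP8 and applying the same assumed utilization, the two land close together, which is the comparison an RFP should be built on.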

Deep dive: why an inference ASIC can win on cost-per-token and still be the wrong buy

Inference-specialized silicon — Groq's LPU, SambaNova's RDU, AWS Inferentia, d-Matrix, Tenstorrent and others — narrows the dataflow further than even a TPU, optimizing for one serving regime (often low-latency single-stream decode, or high-throughput batched prefill). Within that regime the results can be striking: deterministic latency, very high tokens-per-second, and a cost-per-token well below a general GPU because none of the silicon is spent on training flexibility, large register files, or speculative generality. For a stable, high-volume serving workload on a fixed model architecture, this is a genuine win.

The catch is fixed-function obsolescence risk, and it is the inference-ASIC version of the cooling-cliff one-way door. A part designed around today's dominant attention pattern, today's KV-cache layout, and today's quantization scheme is exposed when the model architecture shifts — and in this field it shifts yearly. A GPU absorbs an architectural change by recompiling a kernel; a fixed-function ASIC may need a new tape-out, which is a new ~18–36 month cycle and a new NRE bill. The reconfigurable-dataflow ASICs (SambaNova, Tenstorrent) hedge this by keeping the dataflow programmable, trading some peak efficiency for the ability to track model evolution. The decision therefore mirrors the merchant/captive fork: buy the inference ASIC only for a workload whose shape you are confident will outlive the silicon's design cycle, and keep a GPU pool for everything still moving. This is the hybrid fleet, justified. → custom-ASIC economics and the fixed-function-vs-reprogrammable trade in Chapter 7.5.
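The obsolescence argument becomes clearer when hardware cost per token is computed over the years the part is actually useful. The sketch below is a minimal illustration: the function name, prices, throughputs, and lifetimes are all hypothetical, and power, hosting, and software costs are left out.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def capex_per_million_tokens(capex_usd, tokens_per_second, useful_years):
    """Hardware cost per million tokens, amortized over the useful life only."""
    lifetime_tokens = tokens_per_second * useful_years * SECONDS_PER_YEAR
    return capex_usd / lifetime_tokens * 1e6


# Hypothetical parts: the ASIC is cheaper and faster on today's model.
gpu = capex_per_million_tokens(30_000, 2_000, useful_years=4)
asic_full_life = capex_per_million_tokens(20_000, 4_000, useful_years=4)
# Same ASIC, but an architecture shift strands it after one year.
asic_stranded = capex_per_million_tokens(20_000, 4_000, useful_years=1)

print(f"GPU, 4 yr:  ${gpu:.4f} per million tokens")
print(f"ASIC, 4 yr: ${asic_full_life:.4f} per million tokens")
print(f"ASIC, 1 yr: ${asic_stranded:.4f} per million tokens")
```

With these placeholder numbers the ASIC wins clearly if it serves its full life and loses to the GPU if the model architecture moves after a year, so the purchase hinges on confidence in the workload's shape more than on the benchmark.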

Single-family vs heterogeneous fleet

The last decision the taxonomy forces is fleet composition, and it is a goodput-vs-resilience trade. A single-family fleet — almost always all-NVIDIA in 2026 — buys operational simplicity: one software stack, one scheduler integration, one set of failure modes, one driver matrix, the deepest pool of hireable engineers, and the most liquid resale market. The price is total exposure to one vendor's roadmap, one vendor's allocation queue, and one vendor's pricing power. A heterogeneous fleet — GPUs for flexible training, a TPU or Trainium pool for steady high-volume work, an inference ASIC for a stable serving tier — buys price leverage (a credible second source disciplines the incumbent's quote), supply resilience (allocation shocks hit one family at a time), and workload-fit efficiency. The price is multiplied operational complexity: every additional family is another compiler to maintain, another set of kernels to port, another realized-MFU gap to measure, another on-call runbook.

The rule of thumb that survives contact with operators: match fleet diversity to scale and to the size of your software org. Below a few thousand accelerators, a single family almost always wins — the operational tax of a second stack exceeds the price leverage it buys. At hyperscale, heterogeneity is mandatory, because the allocation and pricing exposure of a single-vendor fleet at a million-chip scale is an existential risk, and the operator already has the compiler team to pay the integration tax. The middle is genuinely hard, and it is where most of the bad decisions get made — a mid-size operator adds a second family for price leverage it is too small to realize, and drowns in the operational complexity it was too small to absorb. → switching-cost quantification in Chapter 7.9; the full selection-and-TCO model, RFP construction, and buy-vs-rent-vs-build in Chapter 7.11.
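The rule of thumb can be framed as a comparison between the price leverage a second family buys and the staffing cost of running a second stack. The sketch below is only an illustration of that comparison: the function name, the discount rate, the headcounts, and the cost per engineer are hypothetical, and supply-resilience benefits are not priced in.

```python
def second_family_net_benefit(annual_spend_usd, price_leverage,
                              extra_engineers, cost_per_engineer_usd):
    """Annual savings from price leverage minus the staffing cost of a second stack."""
    savings = annual_spend_usd * price_leverage
    operational_tax = extra_engineers * cost_per_engineer_usd
    return savings - operational_tax


# Hypothetical small operator: $20M/yr spend, 5% leverage, 10 extra engineers.
small = second_family_net_benefit(20_000_000, 0.05, 10, 400_000)
# Hypothetical hyperscaler: $5B/yr spend, 5% leverage, 50 extra engineers.
large = second_family_net_benefit(5_000_000_000, 0.05, 50, 400_000)

print(f"small operator: {small:+,.0f} USD/yr")
print(f"hyperscaler:    {large:+,.0f} USD/yr")
```

Under these placeholder inputs the small operator loses money on the second family while the hyperscaler gains. The hard middle is wherever the two terms are of similar size.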

Deep dive: the taxonomy as a supply-chain map, not just an architecture map

It is tempting to read the four families purely as engineering categories. The more useful reading in 2026 is as a supply-chain dependency map, because that is what actually gates delivery. Trace any 2026 frontier accelerator back through its stack and you converge on the same chokepoints regardless of family. The logic die: TSMC N3/N3P. The advanced packaging that stitches logic to memory: TSMC CoWoS, the most-cited binding constraint on AI compute through 2030. The memory: a three-supplier HBM oligopoly (SK hynix, Samsung, Micron), with HBM3E sold out and HBM4 ramping into a structural shortage. The captive-program design IP: Broadcom or Marvell. The merchant alternative: NVIDIA or AMD, themselves at the front of the same TSMC queue.

The consequence for a strategist is that family choice does not diversify your upstream risk as much as it appears to. Switching from NVIDIA GPUs to a Broadcom-designed custom ASIC changes your margin structure and your software stack, but it does not move you off TSMC wafers, off CoWoS substrate, or off the HBM oligopoly — it may even put you deeper into the same queue behind the merchant vendors who pre-booked capacity. Genuine supply diversification comes from second-sourcing at the packaging and memory layer, not the architecture layer, which is why those layers — not the choice of GPU vs XPU — are treated as the real allocation gate in Chapter 7.6, Chapter 7.7, and the procurement strategy in Chapter 2.3.
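The dependency-map reading can be written down as a small data structure and compared layer by layer. The sketch below encodes only the dependencies named in this section. The dictionary keys and the helper name are illustrative, not a standard schema.

```python
HBM_SUPPLIERS = "HBM (SK hynix / Samsung / Micron)"

# Upstream dependencies per accelerator option, as described in this section.
DEPENDENCIES = {
    "NVIDIA merchant GPU": {
        "designer": "NVIDIA",
        "fab": "TSMC",
        "packaging": "TSMC CoWoS",
        "memory": HBM_SUPPLIERS,
    },
    "Broadcom-designed custom ASIC": {
        "designer": "Broadcom",
        "fab": "TSMC",
        "packaging": "TSMC CoWoS",
        "memory": HBM_SUPPLIERS,
    },
}


def changed_layers(option_a, option_b):
    """List the supply-chain layers that differ between two options."""
    a, b = DEPENDENCIES[option_a], DEPENDENCIES[option_b]
    return sorted(layer for layer in a if a[layer] != b[layer])


print(changed_layers("NVIDIA merchant GPU", "Broadcom-designed custom ASIC"))
# ['designer']
```

Only the design layer changes. The fab, packaging, and memory layers are shared, which is the sense in which a family switch does not diversify upstream risk.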

Each family gets a full treatment downstream: NVIDIA's generational cadence in Chapter 7.2, AMD's open-challenger position and the ROCm-maturity tax in Chapter 7.3, the hyperscaler XPUs (TPU, Trainium, Maia, MTIA) and anchor-tenant economics in Chapter 7.4, and the custom-ASIC build threshold (NRE, lead time, minimum volume) in Chapter 7.5. The binding-constraint layers under all four follow: HBM in Chapter 7.6 and advanced packaging in Chapter 7.7. The software lock-in that the family choice commits you to, and the realized-MFU gap, are covered in Chapter 7.9; the precision ladder behind the datasheet traps is in Chapter 7.10. The selection, TCO, RFP, and buy-vs-rent-vs-build decision that consumes this whole taxonomy is in Chapter 7.11; the depreciation argument the merchant/captive fork feeds is in Chapter 1.8; and the consolidated per-generation perf/watt and cost-per-token roadmap is in Chapter 16.2.