
Chapter 7.4

Hyperscaler XPUs: TPU, Trainium/Inferentia, Maia, MTIA

When the company that owns the model also owns the silicon, the accelerator stops being a product you buy and becomes a cost structure you rent into — and the real decision is no longer FLOPS-per-dollar but whether you can tolerate a software stack and a supply chain you do not control.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Whether your workload is stable enough — a fixed model architecture served at volume — to amortize the lock-in of a single-vendor XPU stack (XLA/JAX on TPU, Neuron on Trainium), or whether you need the portability that only merchant GPUs and CUDA still guarantee.
Whether you can access an XPU at all: TPU, Trainium, Maia, and MTIA are captive — rentable through one cloud or not rentable at all — so the procurement question is which hyperscaler you are willing to anchor to, not which chip you prefer.
Whether the perf/watt and tokens/$/W advantage of an inference-optimized XPU is large and durable enough to justify re-tooling your serving stack, against a roadmap whose cadence and software maturity you cannot audit from outside.
Whether to build your own silicon at all — the model-owner-builds-silicon path (OpenAI/Broadcom, Anthropic-on-Trainium) trades a multi-year, nine-figure NRE and a software-team headcount for supply independence and a structurally lower cost-per-token.
Which scale-up philosophy you are implicitly buying: TPU's optically-switched 3D torus, Trainium's NeuronLink all-to-all, or a merchant-GPU NVLink/UALink fabric — because the interconnect, not the chip, sets the failure blast radius and the largest model you can train coherently.

The previous two chapters covered silicon you can buy: NVIDIA and AMD sell merchant accelerators to anyone with a purchase order. This chapter is about silicon you can only rent, or only build — the hyperscaler XPUs. Google's TPU, AWS's Trainium and Inferentia, Microsoft's Maia, and Meta's MTIA are not products on a price list. They are vertically-integrated cost structures: a chip co-designed with a compiler, wired into a proprietary fabric, deployed in racks the same company owns, serving models the same company (or its anchor tenant) runs. The 2026 reality is that roughly 28% of AI server shipments are now ASIC-based — the highest share since 2023 — and custom-ASIC unit growth is running near 45% year-over-year, nearly triple the rate of merchant GPUs (TrendForce / Tom's Hardware, May 2026). The merchant-GPU monopoly is not collapsing, but the inference half of the market is quietly defecting.

For most operators the XPU is not a thing you select — it is a thing you inherit when you pick a cloud. You do not buy a TPU; you rent a Cloud TPU slice and accept XLA. You do not buy a Trainium; you rent a Trn instance and accept the Neuron SDK. So the forks here are second-order: stable-workload-vs-portability, captive-vs-merchant, rent-the-anchor's-economics-vs-build-your-own. This chapter traces each and ends on the structural trend that ties them together — the model owner deciding that the cheapest way to serve its own tokens is to design the chip that serves them.

Why a hyperscaler builds its own chip

The motive is never raw FLOPS — merchant GPUs win on peak FLOPS and will keep winning, because NVIDIA spreads its R&D across the entire market. The motive is the intersection of three pressures that only a model-owner feels at full force. First, cost-per-token at volume. When you serve hundreds of billions of tokens a day against a model architecture you control and rarely change, a fixed-function accelerator tuned for exactly that shape beats a general-purpose GPU on tokens/$/W — and at hyperscale, single-digit-percent efficiency gains are nine-figure annual line items. Second, supply independence. NVIDIA allocation is the binding constraint of the era (see Chapter 7.6 on HBM and Chapter 2.3 on long-lead procurement); a hyperscaler with its own tape-out and its own TSMC wafer-and-CoWoS allocation is not standing in NVIDIA's queue. Third, the NVIDIA margin. NVIDIA's gross margin is the hyperscaler's cost; eliminating it on the highest-volume, most-predictable workloads is the single largest TCO lever a model-owner has.

The consequence — the reason most operators should not build — is that all three pressures require scale to pay off. Custom silicon carries a $100M-plus NRE per generation, a multi-year lead time, a hard minimum-volume threshold below which per-unit cost is worse than just renting GPUs, and a permanent software-engineering tax to keep the compiler competitive (see Chapter 7.5 for the ASIC break-even math). Only an entity that owns the model, owns the deployment, and ships the workload at hyperscale clears that bar. Everyone else rents the result.
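The minimum-volume threshold can be made concrete with a small break-even calculation. This is a minimal sketch, not the Chapter 7.5 model: only the $100M NRE floor comes from the text, while the compiler-team cost and both per-chip costs are hypothetical placeholders to replace with your own figures.

```python
import math

def break_even_units(fixed_cost_usd, merchant_cost_per_chip, custom_cost_per_chip):
    """Chip volume at which custom silicon's all-in per-unit cost matches the merchant option.

    All-in custom cost per chip = fixed_cost_usd / volume + custom_cost_per_chip,
    so break-even is the volume where that equals merchant_cost_per_chip.
    """
    saving_per_chip = merchant_cost_per_chip - custom_cost_per_chip
    if saving_per_chip <= 0:
        return math.inf  # no volume ever repays the fixed cost
    return fixed_cost_usd / saving_per_chip

NRE_USD = 100e6            # lower bound from the text: $100M-plus per generation
COMPILER_TEAM_USD = 50e6   # hypothetical: software cost over the generation's life
MERCHANT_COST = 30_000.0   # hypothetical: lifetime cost per merchant GPU
CUSTOM_COST = 12_000.0     # hypothetical: lifetime cost per custom chip

units = break_even_units(NRE_USD + COMPILER_TEAM_USD, MERCHANT_COST, CUSTOM_COST)
print(f"break-even volume: {units:,.0f} chips")
```

Below that volume the amortized fixed cost pushes the custom chip's per-unit cost above the merchant option, which is the sense in which only hyperscale deployments clear the bar.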

The master fork: stable-shape-at-volume vs portability

This is the one decision that determines whether an XPU is right for you. If your workload is a fixed model architecture served at high, sustained volume (a known transformer shape, a stable serving stack, months between architecture changes), then a fixed-function XPU's tokens/$/W advantage compounds and the lock-in is a cost you pay once. If your workload is exploratory, architecture-churning, or must run portably across vendors, the XPU's compiler rigidity becomes a tax on every change: a new attention variant or a custom kernel that runs on a GPU in an afternoon can take weeks to land on XLA or Neuron, or may not lower efficiently at all. Decide which side you are on before you re-tool a serving stack: porting onto an XPU is roughly a quarter of engineering work, and porting back off after you have built around its quirks is worse.
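One way to test which side of the fork a workload sits on is to compare the payback time of the port with how long the architecture is expected to stay stable. The sketch below is illustrative only: the function names and every dollar figure are hypothetical, and the cost ratio stands in for whatever tokens/$/W advantage you actually measure.

```python
def payback_months(port_cost_usd, monthly_gpu_spend_usd, xpu_cost_ratio):
    """Months of serving needed for XPU savings to repay a one-time port.

    xpu_cost_ratio is XPU cost per token divided by GPU cost per token
    (a value below 1.0 means the XPU is cheaper).
    """
    monthly_saving = monthly_gpu_spend_usd * (1.0 - xpu_cost_ratio)
    if monthly_saving <= 0:
        return float("inf")  # the XPU is not cheaper, so the port never pays back
    return port_cost_usd / monthly_saving

def port_pays_off(payback, stable_months):
    """True if the architecture stays fixed long enough to recover the port cost."""
    return payback < stable_months

# Hypothetical inputs: a $1.5M port, $1M/month of GPU serving, XPU at 70% of GPU cost.
months = payback_months(1_500_000, 1_000_000, 0.7)
print(f"payback: {months:.1f} months")
print("stable for 12 months:", port_pays_off(months, 12))
print("re-architected every 3 months:", port_pays_off(months, 3))
```

The same port that pays back comfortably for a stable serving workload never recovers its cost when the architecture changes faster than the payback period.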

Google TPU: the most mature XPU, and the deepest lock-in

The TPU is the only hyperscaler XPU with a decade of generations behind it, and it is the existence proof that the model-owner-builds-silicon thesis works at frontier scale — Gemini is trained and served on TPUs end to end. The 2026 flagship is Ironwood (TPU v7), made generally available at Google Cloud Next in 2026 and explicitly positioned as 'the first Google TPU for the age of inference.' Each chip delivers 4,614 FP8 TFLOPS and 192 GB of HBM3E at 7.37 TB/s — a 6x memory jump over Trillium (TPU v6), the largest single-generation memory expansion in TPU history — at roughly 2x the perf/watt of the prior generation (Google Cloud; SemiAnalysis, Nov 2025).

The architectural signature that separates TPU from every GPU fabric is the interconnect. TPUs are wired as a 3D torus over the Inter-Chip Interconnect (ICI, 9.6 Tb/s per chip on Ironwood), and the torus is stitched together by optical circuit switches (OCS) — Google's own MEMS-mirror optical switches that physically reconfigure which chips are neighbors. An Ironwood pod scales to 9,216 chips delivering 42.5 FP8 ExaFLOPS with ~1.77 PB of pooled HBM. The OCS decouples the logical topology a job sees from the physical wiring, which buys two things a switched fat-tree cannot: per-job topology-on-demand (a job requests a torus shape, the OCS provisions it), and fast routing-around-failures (a dead chip is optically bypassed rather than failing the slice). This is why Google can run ~9,200-chip coherent domains without the all-to-all switch silicon that dominates a comparable GPU cluster — the optics are the scale-up fabric. → Chapter 8.2 treats OCS-vs-switched scale-up in full.

The lock-in is correspondingly deep. There is no CUDA on a TPU; the path is JAX or TensorFlow lowered through XLA, the compiler that turns a high-level graph into TPU machine code. XLA is genuinely excellent for the dense and MoE transformer shapes Google cares about — but it is a compiler-first model, not a kernel-first one. You express computation and trust XLA to schedule it; you do not hand-write a PTX-equivalent kernel for a novel op the way a CUDA engineer does. For a stable architecture this is a feature (the compiler optimizes across the whole graph); for a research workload chasing a new operator every week it is friction. And TPUs are captive to Google Cloud — you cannot buy one, cannot colocate one, and your exit option is a full re-port to another stack. → Chapter 7.9 quantifies the XLA-vs-CUDA switching cost.

AWS Trainium / Inferentia and the anchor-tenant economics

AWS runs a two-chip split: Inferentia for inference and Trainium for training, both programmed through the Neuron SDK, which sits under PyTorch and JAX. The 2026 story is Trainium3, announced at re:Invent 2025. Each Trainium3 chip, the first 3nm AI accelerator AWS has shipped, delivers 2.52 FP8 PFLOPS with 144 GB of HBM3E at 4.9 TB/s. The system unit is the Trn3 UltraServer: up to 144 chips for 362 MXFP8 PFLOPS, 20.7 TB of HBM, and 706 TB/s of aggregate memory bandwidth, delivering ~4.4x the performance and 4x the perf/watt of the Trn2 UltraServer (AWS, Dec 2025). The scale-up fabric is NeuronLink-v4 with a NeuronSwitch-v1 all-to-all topology at 2 TB/s per chip. AWS is borrowing NVIDIA's playbook for switched scale-up inside the server, paired with an open-Ethernet scale-out (EFA).

What makes Trainium strategically different from TPU is the anchor-tenant model. Google builds TPUs primarily for itself; AWS builds Trainium substantially for Anthropic. Project Rainier — the Trainium2 cluster in St. Joseph County, Indiana, an $11B AWS site — went from announcement (re:Invent, Dec 2024) to a live cluster of ~500,000 Trainium2 chips in under twelve months, and by 2026 Anthropic and AWS report running over one million Trainium2 chips to train and serve Claude, with the partnership committed toward up to 5 GW of compute (AWS; Anthropic, 2026). External adopters such as Uber have cited cost savings on the order of 50% versus NVIDIA on suitable workloads. The anchor tenant de-risks the silicon: a guaranteed million-chip buyer amortizes the NRE before the first external customer ever rents a Trn instance. The consequence for everyone else: you are renting into economics that were sized for someone else's model, on a Neuron stack whose roadmap follows the anchor's needs, not yours.

Hyperscaler XPU comparison (2026-current)

XPU	Flagship (2026)	Per-chip peak	HBM / BW	Scale-up fabric	Software	Access model	Primary workload
Google TPU	Ironwood (v7)	4,614 FP8 TFLOPS	192 GB HBM3E / 7.37 TB/s	3D torus over ICI + optical circuit switch (OCS); 9,216-chip pod, 42.5 ExaFLOPS	JAX / TF via XLA	Captive — Google Cloud only	Inference-first (also trains Gemini)
AWS Trainium	Trainium3 (Trn3)	2.52 FP8 PFLOPS	144 GB HBM3E / 4.9 TB/s	NeuronLink-v4 all-to-all (NeuronSwitch-v1, 2 TB/s/chip); 144-chip UltraServer	Neuron SDK (PyTorch/JAX)	Captive — AWS only; anchor: Anthropic	Training + inference (Claude)
AWS Inferentia	Inferentia2	Inference-tuned (lower TFLOPS, high efficiency)	32 GB HBM / moderate	NeuronLink (smaller domains)	Neuron SDK	Captive — AWS only	Cost-optimized online inference
Microsoft Maia	Maia 200	>10 PFLOPS FP4 / 5 PFLOPS FP8	216 GB HBM3E / 7 TB/s	Ethernet-based; wider rack + Sidekick closed-loop liquid	Maia SDK / Triton path	Captive — Azure first-party	Serving GPT-class + Copilot
Meta MTIA	MTIA 300-series	Recsys/inference-tuned	HBM (3nm, CoWoS from 300-gen)	Internal Meta fabric	PyTorch-native internal stack	Not rentable — Meta internal only	Recommendation + GenAI inference

Per-chip and per-pod/system figures come from vendor primaries and independent analysis; see the key numbers later in this chapter for sources and vintages. 'Captive' = rentable through one cloud only, not purchasable. Maia/MTIA figures are first-party-deployment specs, not rentable instances.
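The system-level figures quoted for Ironwood and Trainium3 follow directly from the per-chip specs multiplied by chip count. A short cross-check, assuming decimal units (1 TB = 1,000 GB) and using only numbers from this chapter:

```python
# Per-chip specs and chip counts as quoted in this chapter.
SYSTEMS = {
    "Ironwood pod": {"chips": 9216, "fp8_pflops": 4.614, "hbm_gb": 192, "hbm_tbps": 7.37},
    "Trn3 UltraServer": {"chips": 144, "fp8_pflops": 2.52, "hbm_gb": 144, "hbm_tbps": 4.9},
}

def aggregate(spec):
    """Scale per-chip figures to the whole scale-up domain."""
    n = spec["chips"]
    return {
        "fp8_pflops": n * spec["fp8_pflops"],
        "hbm_tb": n * spec["hbm_gb"] / 1000.0,  # decimal units assumed
        "hbm_tbps": n * spec["hbm_tbps"],
    }

totals = {name: aggregate(spec) for name, spec in SYSTEMS.items()}
for name, t in totals.items():
    print(f"{name}: {t['fp8_pflops']:,.1f} FP8 PFLOPS, "
          f"{t['hbm_tb']:,.1f} TB HBM, {t['hbm_tbps']:,.1f} TB/s aggregate HBM bandwidth")
```

The Ironwood totals come to about 42.5 ExaFLOPS and 1.77 PB, and the Trainium3 totals to about 362.9 PFLOPS, 20.7 TB, and 706 TB/s, so the quoted 362 PFLOPS appears to be the same product rounded down.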

The 'Access model' column governs strategy: every XPU here is captive, and two of them (MTIA, and effectively the first-party tiers of Maia) are not rentable at all. The choice of XPU is therefore downstream of a choice of cloud, and the choice of cloud is downstream of a choice of software stack you are willing to live inside. The perf/watt numbers matter, but the fork is which ecosystem you are entering.

Maia, MTIA, and the model-owner-builds-silicon wave

The third cohort is the newest and the clearest signal of where the industry is heading. Microsoft Maia 200 deployed in early 2026: TSMC 3nm, 140B+ transistors, >10 PFLOPS FP4 / 5 PFLOPS FP8, and 216 GB HBM3E at 7 TB/s in a 750 W envelope, which Microsoft claims delivers ~30% better performance-per-dollar than the best hardware in its existing fleet. Maia ships with a co-designed system: a wider rack and the Sidekick closed-loop liquid-cooling sidecar that lets Azure retrofit Maia into existing halls without re-plumbing the building (a deliberate density-ramp hedge; see Chapter 5.10 on liquid retrofits). In 2026 Maia 200 serves GPT-class models for OpenAI and powers Microsoft 365 Copilot. Meta MTIA took the most aggressive roadmap stance of the group: in early 2026 Meta disclosed four new generations (MTIA 300 through 500) for deployment through 2027, moving to 3nm with CoWoS packaging. This is silicon built for Meta's recommendation and GenAI inference and never sold to anyone.

The most consequential entrant is the one that is not a cloud at all. OpenAI, partnered with Broadcom in a ~$10B program, taped out its first custom inference ASIC ('Jalapeño') in roughly nine months, targeting deployment from late 2026 (VentureBeat / Tom's Hardware, 2026). This is the model-owner-builds-silicon trend in its purest form: the entity that owns the most-served model in the world deciding that the cheapest tokens are the ones running on a chip it designed. And it reveals the quiet kingmaker — Broadcom is the design-and-SerDes partner behind Google's TPU, Meta's MTIA, Microsoft's Maia, and OpenAI's Jalapeño (with Marvell playing a similar role elsewhere). The merchant-silicon disruption is not a swarm of independent chip startups; it is a handful of model owners renting Broadcom's and Marvell's packaging, SerDes, and physical-design expertise to convert their workload knowledge into captive silicon. → Chapter 7.5 on the merchant-silicon ASIC economics; Chapter 8.3 on the SerDes and switch-ASIC supply chain underneath all of it.

Why the XPU wave is an inference wave, and why that is power-bound

Notice what every 2026 flagship optimizes for. Ironwood is 'for the age of inference.' Maia 200 and OpenAI's Jalapeño are inference chips. MTIA is recommendation-and-inference. Even Trainium's stated goal is 'the best token economics for agentic, reasoning, and video.' This is not a coincidence — it is the inference-share shift (inference is now ~2/3 of AI compute; see Chapter 1.3) meeting the power wall. Inference at hyperscale is a tokens-per-megawatt game, and when grid capacity binds (Chapter 3.2 on speed-to-power), perf/watt governs silicon selection over raw $/chip. A fixed-function XPU tuned for one model shape extracts more tokens from the same megawatt than a general-purpose GPU — and the megawatt, not the chip, is the scarce input. The XPU wave is, at bottom, a response to the power-bound era.
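The tokens-per-megawatt argument reduces to one multiplication: with a fixed power budget, throughput is power times energy efficiency, and chip price does not appear in the formula. A minimal sketch follows; the tokens-per-joule values and the 50 MW budget are hypothetical, chosen only to show the shape of the comparison.

```python
def tokens_per_second(power_budget_mw, tokens_per_joule):
    """Throughput of a power-capped site: watts x tokens/joule = tokens/second."""
    return power_budget_mw * 1e6 * tokens_per_joule

POWER_BUDGET_MW = 50.0       # hypothetical grid allocation for the site
GPU_TOKENS_PER_JOULE = 2.0   # hypothetical facility-level efficiency
XPU_TOKENS_PER_JOULE = 3.0   # hypothetical: 1.5x the perf/watt on one model shape

gpu_tps = tokens_per_second(POWER_BUDGET_MW, GPU_TOKENS_PER_JOULE)
xpu_tps = tokens_per_second(POWER_BUDGET_MW, XPU_TOKENS_PER_JOULE)

print(f"GPU fleet: {gpu_tps:,.0f} tokens/s")
print(f"XPU fleet: {xpu_tps:,.0f} tokens/s")
print(f"extra tokens per day from the same megawatts: {(xpu_tps - gpu_tps) * 86_400:,.0f}")
```

Once the grid connection is the binding constraint, a cheaper chip cannot add tokens; only a higher tokens-per-joule figure can.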

Key number	What it measures	Vintage	Source
192 GB	HBM3E per Ironwood (TPU v7) chip @ 7.37 TB/s; 4,614 FP8 TFLOPS; 6x memory vs Trillium	2026 (GA)	Google Cloud (Ironwood blog); SemiAnalysis TPUv7
9,216	chips per Ironwood pod = 42.5 FP8 ExaFLOPS, ~1.77 PB pooled HBM, OCS 3D-torus	2026	Google Cloud; TrendForce; The Register
2.52 PFLOPS	FP8 per Trainium3 chip; 144 GB HBM3E @ 4.9 TB/s; first 3nm AWS accelerator	Dec 2025	AWS (Trn3 launch); Tom's Hardware
144 chips	per Trn3 UltraServer = 362 MXFP8 PFLOPS, 20.7 TB HBM, 4x perf/watt vs Trn2	Dec 2025	AWS About-AWS / EC2 Trn3
>1M	Trainium2 chips on Project Rainier serving and training Claude (from ~500k at activation, <12 mo)	2026	AWS; Anthropic; DCD
216 GB	HBM3E per Microsoft Maia 200 @ 7 TB/s; >10 PFLOPS FP4 / 5 PFLOPS FP8, 750W, +30% perf/$	early 2026	Microsoft Azure; Tom's Hardware
~9 months	tape-out time for OpenAI/Broadcom 'Jalapeño' inference ASIC; deploy from late 2026	2026	VentureBeat; Tom's Hardware
~28%	ASIC share of AI server shipments in 2026 (highest since 2023); custom-ASIC units +~45% YoY	2026	TrendForce via Tom's Hardware

The decision: rent an XPU, build one, or stay on merchant GPUs

For an operator who is not a frontier model owner, the practical question collapses to three options, and the right one is a function of workload stability and your willingness to enter a single vendor's gravity well.

Stay on merchant GPUs (NVIDIA/AMD). The default, and correct for any workload that churns architecture, must run portably, or needs the deepest kernel ecosystem. You pay the NVIDIA margin and stand in the allocation queue, but you keep CUDA velocity and a multi-vendor exit. → Chapter 7.2, Chapter 7.3.

Rent an XPU through its captive cloud. Right when you have a stable high-volume inference workload and the XPU's tokens/$/W beats GPUs by enough to repay the port — a TPU slice for a JAX-native serving stack, a Trn/Inf instance for a PyTorch model that lowers cleanly through Neuron. The consequence is cloud lock-in: your serving stack now assumes that vendor's compiler, fabric, and roadmap, and your exit is a re-port. → Chapter 7.9, Chapter 7.11.

Build your own. Reserved for entities that own the model, ship at hyperscale, and can clear the $100M+ NRE, the multi-year lead time, and the permanent compiler-team headcount — almost always via a Broadcom/Marvell partnership rather than from scratch. The payoff is supply independence and a structurally lower cost-per-token; the risk is fixed-function rigidity against a model architecture that may move under you. → Chapter 7.5.

Rent-XPU vs build-silicon vs merchant-GPU — the fork and its downstream cost

Path	Up-front cost	Software burden	Supply posture	Exit cost	Best-fit
Merchant GPU (buy/rent)	Capex or opex; pays NVIDIA margin	Lowest — CUDA/ROCm ecosystem	Stands in allocation queue	Lowest — multi-vendor, portable	Architecture churn, portability, frontier research
Rent XPU (TPU/Trn/Inf)	Opex only; below-GPU $/token at volume	Moderate — XLA or Neuron port	Inherits the cloud's allocation	High — re-port to exit the cloud	Stable high-volume workload on one cloud
Build own silicon	$100M+ NRE/gen + multi-yr lead + compiler team	Highest — own the full stack	Independent — own wafer/CoWoS allocation	Sunk — committed to the architecture	Model owner at hyperscale (Google, Anthropic-via-AWS, OpenAI)

Heuristic decision frame for the rent-XPU vs build-silicon vs merchant-GPU fork. The costs shown are directional; the real break-even volume for building is workload-specific (see Chapter 7.5).

Deep dive: why TPU's OCS torus and Trainium's all-to-all are opposite bets — and what each costs you

The two most mature XPUs made opposite scale-up bets, and the difference is instructive because it is the same fork merchant-GPU buyers face between NVLink-style all-to-all and a torus. Google's TPU is a 3D torus stitched by optical circuit switches. Each chip talks directly only to its torus neighbors over ICI; the OCS layer physically reconfigures which chips are neighbors, so a job requests a logical topology and the optics provision it. The win: ~9,216-chip coherent domains with no central all-to-all switch silicon, per-job topology-on-demand, and optical bypass of failed chips. The cost: a torus has higher hop-count and more constrained collective patterns than a fully-connected switch, so the compiler (XLA) must be exceptionally good at mapping collectives onto the torus — which it is, for the shapes Google runs, and which is exactly why TPU is hard to use for shapes it does not run.
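The hop-count cost of a torus can be quantified by counting the shortest wraparound distance between chips. The sketch below computes the diameter and mean hop count of a 3D torus. The 16 x 24 x 24 shape is purely hypothetical, picked because it multiplies to 9,216; the source does not state the dimensions of a real Ironwood pod, and the OCS can provision other shapes per job.

```python
from itertools import product

def ring_distance(a, b, size):
    """Shortest distance between two positions on a ring with wraparound links."""
    d = abs(a - b)
    return min(d, size - d)

def torus_hops(dims):
    """Return (diameter, mean hop count) over all pairs of distinct torus nodes.

    A torus is vertex-symmetric, so distances measured from the origin node
    are representative of every other node.
    """
    total = 0
    count = 0
    diameter = 0
    for coord in product(*(range(n) for n in dims)):
        hops = sum(ring_distance(0, c, n) for c, n in zip(coord, dims))
        if hops:  # skip the origin itself
            total += hops
            count += 1
            diameter = max(diameter, hops)
    return diameter, total / count

for shape in [(4, 4, 4), (16, 24, 24)]:  # the second shape is a hypothetical 9,216-chip layout
    diameter, mean = torus_hops(shape)
    print(f"{shape}: diameter {diameter} hops, mean {mean:.2f} hops")
```

A switched all-to-all domain keeps every pair of chips one switch traversal apart regardless of size, which is the uniformity the torus trades away in exchange for eliminating the central switch silicon.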

AWS's Trainium took the NVIDIA page: a NeuronSwitch all-to-all fabric (NeuronLink-v4, 2 TB/s/chip) inside the UltraServer. The win: low-hop, uniform any-to-any bandwidth across the 144-chip domain, which makes tensor- and expert-parallel mapping straightforward and forgiving — closer to how a GPU cluster behaves. The cost: switch silicon and the copper/optics reach that bounds how large the all-to-all domain can grow before you fall back to scale-out (EFA) for the rest. Neither bet is wrong; they encode different convictions about whether the compiler or the fabric should absorb the complexity. For the operator, the practical read-through is the same as the merchant-GPU case in Chapter 8.2: the scale-up domain size — not the per-chip FLOPS — sets the largest model you can train coherently and the blast radius when a chip dies.

Deep dive: the Neuron / XLA software maturity tax (the gap behind the paper FLOPS)

The published peak FLOPS of an XPU is the easy number. The hard number is realized MFU (model FLOPS utilization), the fraction of peak you actually achieve on your model, and that is governed by compiler maturity, not silicon. Here the XPUs split sharply by age. XLA has roughly a decade of maturity: for dense and MoE transformers in JAX/TF, realized efficiency on TPU is competitive with a well-tuned CUDA stack, because Google has had years to co-evolve the compiler with its own models. Neuron is younger and narrower: it lowers PyTorch and JAX well for the model families AWS and Anthropic care about, but a novel operator, an unusual attention pattern, or a custom kernel can hit a wall, either failing to lower efficiently or requiring AWS to add support on a roadmap you do not control. Independent analysis (SemiAnalysis, Trainium3 deep dive) repeatedly frames Neuron's software maturity, not the chip, as the gating factor versus NVIDIA.

The consequence for the decision: the perf/watt and tokens/$/W advantages quoted in a vendor keynote assume the workload already lowers well on that XPU. Budget the port as a real engineering project, validate realized MFU on your model before committing volume, and treat any custom-kernel dependency as a red flag for a fixed-function XPU. A 4x-perf/watt chip running your model at half its achievable MFU is not a 4x chip. → Chapter 7.9 quantifies the realized-MFU gap across CUDA / ROCm / XLA / Neuron.
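The discount from paper advantage to realized advantage is simple arithmetic, sketched here under stated assumptions. The 4x figure echoes the example in the text; the MFU values are hypothetical, and the function name is illustrative rather than any vendor's metric.

```python
def realized_advantage(vendor_ratio, mfu_realized, mfu_assumed):
    """Scale a vendor-quoted advantage by how much of the assumed MFU you actually reach.

    vendor_ratio: quoted perf/watt (or tokens/$/W) multiple versus the baseline.
    mfu_assumed:  MFU implied by the vendor's benchmark workload.
    mfu_realized: MFU measured on your own model after the port.
    """
    if not 0 < mfu_assumed <= 1 or not 0 <= mfu_realized <= 1:
        raise ValueError("MFU values must be fractions of peak between 0 and 1")
    return vendor_ratio * (mfu_realized / mfu_assumed)

# Hypothetical: the keynote number assumes 40% MFU, your model reaches 20%.
effective = realized_advantage(vendor_ratio=4.0, mfu_realized=0.20, mfu_assumed=0.40)
print(f"quoted 4.0x, realized {effective:.1f}x")
```

Measuring `mfu_realized` on your own model is the validation step the text recommends before committing volume.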

The captive-supply trap

The most underestimated risk of an XPU is not the software port — it is that you cannot diversify the supply. A merchant-GPU shop hedges allocation pain across NVIDIA, AMD, and the secondary market. An XPU shop has exactly one supplier of the silicon, the fabric, the compiler, and the rack — the cloud that owns it. If that cloud reprioritizes capacity toward its anchor tenant (Anthropic on AWS, Gemini on Google, OpenAI/Copilot on Azure), your slice is the one that gets squeezed, and you have no second source to fail over to because the chip does not exist anywhere else. Before anchoring a production workload to a captive XPU, price the cost of a forced re-port back to GPUs under a capacity crunch — and keep that exit path warm if the workload is revenue-critical.

This chapter sits inside the accelerator selection arc. The merchant accelerators these XPUs compete with are in Chapter 7.2 (NVIDIA) and Chapter 7.3 (AMD); the taxonomy that frames GPU-vs-XPU is Chapter 7.1. The ASIC economics — NRE, lead time, minimum volume, and the Broadcom/Marvell merchant-silicon model — are quantified in Chapter 7.5. The HBM and CoWoS allocation gate that every one of these chips depends on is Chapter 7.6 and Chapter 7.7; the software lock-in (CUDA / ROCm / XLA / Neuron) and the realized-MFU gap is Chapter 7.9; the full TCO and procurement decision is Chapter 7.11. The scale-up fabrics — TPU's OCS torus vs NeuronLink/NVLink/UALink all-to-all — are engineered in Chapter 8.2, and the SerDes/switch-ASIC supply chain underneath them is Chapter 8.3. The inference-share and power-bound forces driving the XPU wave trace back to Chapter 1.3 and Chapter 3.2; long-lead procurement and anchor-tenant supply strategy is Chapter 2.3.