Chapter 7.4
Hyperscaler XPUs: TPU, Trainium/Inferentia, Maia, MTIA
When the company that owns the model also owns the silicon, the accelerator stops being a product you buy and becomes a cost structure you rent into — and the real decision is no longer FLOPS-per-dollar but whether you can tolerate a software stack and a supply chain you do not control.
What you'll decide here
- Whether your workload is stable enough (a fixed model architecture served at volume) to amortize the lock-in of a single-vendor XPU stack (XLA/JAX on TPU, Neuron on Trainium), or whether you need the cross-cloud portability and multi-vendor exit that only merchant GPUs and the CUDA/ROCm ecosystem still offer.
- Whether you can access an XPU at all: TPU, Trainium, Maia, and MTIA are captive — rentable through one cloud or not rentable at all — so the procurement question is which hyperscaler you are willing to anchor to, not which chip you prefer.
- Whether the perf/watt and tokens/$/W advantage of an inference-optimized XPU is large and durable enough to justify re-tooling your serving stack, against a roadmap whose cadence and software maturity you cannot audit from outside.
- Whether to build your own silicon at all — the model-owner-builds-silicon path (OpenAI/Broadcom, Anthropic-on-Trainium) trades a multi-year, nine-figure NRE and a software-team headcount for supply independence and a structurally lower cost-per-token.
- Which scale-up philosophy you are implicitly buying: TPU's optically-switched 3D torus, Trainium's NeuronLink all-to-all, or a merchant-GPU NVLink/UALink fabric — because the interconnect, not the chip, sets the failure blast radius and the largest model you can train coherently.
The previous two chapters covered silicon you can buy: NVIDIA and AMD sell merchant accelerators to anyone with a purchase order. This chapter is about silicon you can only rent, or only build — the hyperscaler XPUs. Google's TPU, AWS's Trainium and Inferentia, Microsoft's Maia, and Meta's MTIA are not products on a price list. They are vertically-integrated cost structures: a chip co-designed with a compiler, wired into a proprietary fabric, deployed in racks the same company owns, serving models the same company (or its anchor tenant) runs. The 2026 reality is that roughly 28% of AI server shipments are now ASIC-based — the highest share since 2023 — and custom-ASIC unit growth is running near 45% year-over-year, nearly triple the rate of merchant GPUs (TrendForce / Tom's Hardware, May 2026). The merchant-GPU monopoly is not collapsing, but the inference half of the market is quietly defecting.
For most operators the XPU is not a thing you select; it is a thing you inherit when you pick a cloud. You do not buy a TPU; you rent a Cloud TPU slice and accept XLA. You do not buy a Trainium; you rent a Trn instance and accept the Neuron SDK. So the forks here are second-order: a stable workload versus portability, captive silicon versus merchant silicon, and renting the anchor tenant's economics versus building your own chip. This chapter traces each and ends on the structural trend that ties them together: the model owner deciding that the cheapest way to serve its own tokens is to design the chip that serves them.
Why a hyperscaler builds its own chip
The motive is never raw FLOPS — merchant GPUs win on peak FLOPS and will keep winning, because NVIDIA spreads its R&D across the entire market. The motive is the intersection of three pressures that only a model-owner feels at full force. First, cost-per-token at volume. When you serve hundreds of billions of tokens a day against a model architecture you control and rarely change, a fixed-function accelerator tuned for exactly that shape beats a general-purpose GPU on tokens/$/W — and at hyperscale, single-digit-percent efficiency gains are nine-figure annual line items. Second, supply independence. NVIDIA allocation is the binding constraint of the era (see Chapter 7.6 on HBM and Chapter 2.3 on long-lead procurement); a hyperscaler with its own tape-out and its own TSMC wafer-and-CoWoS allocation is not standing in NVIDIA's queue. Third, the NVIDIA margin. NVIDIA's gross margin is the hyperscaler's cost; eliminating it on the highest-volume, most-predictable workloads is the single largest TCO lever a model-owner has.
The consequence — the reason most operators should not build — is that all three pressures require scale to pay off. Custom silicon carries a $100M-plus NRE per generation, a multi-year lead time, a hard minimum-volume threshold below which per-unit cost is worse than just renting GPUs, and a permanent software-engineering tax to keep the compiler competitive (see Chapter 7.5 for the ASIC break-even math). Only an entity that owns the model, owns the deployment, and ships the workload at hyperscale clears that bar. Everyone else rents the result.
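The minimum-volume threshold can be made concrete with a small sketch. The function below is illustrative only: it assumes one custom chip replaces one GPU for the target workload, and every input except the $100M NRE floor is a hypothetical placeholder rather than a figure from this chapter.

```python
def break_even_units(nre_usd, software_usd, gpu_unit_cost_usd, asic_unit_cost_usd):
    """Deployed units at which the custom-ASIC path costs the same as buying GPUs.

    Assumes one ASIC replaces one GPU for the target workload, which is a
    simplification: real comparisons are per token served, not per chip.
    """
    saving_per_unit = gpu_unit_cost_usd - asic_unit_cost_usd
    if saving_per_unit <= 0:
        return float("inf")  # no per-unit saving, so the fixed cost is never repaid
    return (nre_usd + software_usd) / saving_per_unit


# Hypothetical inputs; only the $100M NRE floor comes from the text.
NRE_USD = 100e6
SOFTWARE_USD = 20e6          # assumed compiler-team cost over the chip's life
GPU_UNIT_COST_USD = 30_000   # assumed all-in cost of a merchant GPU
ASIC_UNIT_COST_USD = 12_000  # assumed all-in cost of the custom part

units = break_even_units(NRE_USD, SOFTWARE_USD, GPU_UNIT_COST_USD, ASIC_UNIT_COST_USD)
print(f"break-even at roughly {units:,.0f} units")
```

Below the break-even volume the fixed costs dominate and renting GPUs is cheaper. The point of the sketch is the shape of the relationship, not the specific numbers.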
Google TPU: the most mature XPU, and the deepest lock-in
The TPU is the only hyperscaler XPU with a decade of generations behind it, and it is the existence proof that the model-owner-builds-silicon thesis works at frontier scale: Gemini is trained and served on TPUs end to end. The 2026 flagship is Ironwood (TPU v7), unveiled at Google Cloud Next in April 2025, made generally available in November 2025, and explicitly positioned as 'the first Google TPU for the age of inference.' Each chip delivers 4,614 FP8 TFLOPS and 192 GB of HBM3E at 7.37 TB/s. That is 6x the HBM capacity of Trillium (TPU v6), the largest single-generation memory expansion in TPU history, at roughly 2x the perf/watt of the prior generation (Google Cloud; SemiAnalysis, Nov 2025).
The architectural signature that separates TPU from every GPU fabric is the interconnect. TPUs are wired as a 3D torus over the Inter-Chip Interconnect (ICI, 9.6 Tb/s per chip on Ironwood), and the torus is stitched together by optical circuit switches (OCS) — Google's own MEMS-mirror optical switches that physically reconfigure which chips are neighbors. An Ironwood pod scales to 9,216 chips delivering 42.5 FP8 ExaFLOPS with ~1.77 PB of pooled HBM. The OCS decouples the logical topology a job sees from the physical wiring, which buys two things a switched fat-tree cannot: per-job topology-on-demand (a job requests a torus shape, the OCS provisions it), and fast routing-around-failures (a dead chip is optically bypassed rather than failing the slice). This is why Google can run ~9,200-chip coherent domains without the all-to-all switch silicon that dominates a comparable GPU cluster — the optics are the scale-up fabric. → Chapter 8.2 treats OCS-vs-switched scale-up in full.
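The pod-level figures follow directly from the per-chip numbers quoted above. A quick cross-check, using only values stated in the text and decimal unit prefixes, might look like this:

```python
# Per-chip and pod-size figures as quoted in the text for Ironwood (TPU v7).
CHIPS_PER_POD = 9_216
FP8_TFLOPS_PER_CHIP = 4_614
HBM_GB_PER_CHIP = 192

# Decimal prefixes: 1 ExaFLOPS = 1e6 TFLOPS, 1 PB = 1e6 GB.
pod_exaflops = CHIPS_PER_POD * FP8_TFLOPS_PER_CHIP / 1e6
pod_hbm_pb = CHIPS_PER_POD * HBM_GB_PER_CHIP / 1e6

print(f"pod peak: {pod_exaflops:.1f} FP8 ExaFLOPS")
print(f"pod HBM:  {pod_hbm_pb:.2f} PB")
```

Both results agree with the quoted 42.5 ExaFLOPS and ~1.77 PB. These are peak, linearly summed figures; they say nothing about what a real job sustains across the torus.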
The lock-in is correspondingly deep. There is no CUDA on a TPU; the path is JAX or TensorFlow lowered through XLA, the compiler that turns a high-level graph into TPU machine code. XLA is genuinely excellent for the dense and MoE transformer shapes Google cares about — but it is a compiler-first model, not a kernel-first one. You express computation and trust XLA to schedule it; you do not hand-write a PTX-equivalent kernel for a novel op the way a CUDA engineer does. For a stable architecture this is a feature (the compiler optimizes across the whole graph); for a research workload chasing a new operator every week it is friction. And TPUs are captive to Google Cloud — you cannot buy one, cannot colocate one, and your exit option is a full re-port to another stack. → Chapter 7.9 quantifies the XLA-vs-CUDA switching cost.
AWS Trainium / Inferentia and the anchor-tenant economics
AWS runs a two-chip split: Inferentia for inference and Trainium for training, both programmed through the Neuron SDK, which sits under PyTorch and JAX. The 2026 story is Trainium3, announced at re:Invent 2025. Each Trainium3 chip, the first 3nm AI accelerator AWS has shipped, delivers 2.52 FP8 PFLOPS with 144 GB of HBM3E at 4.9 TB/s. The system unit is the Trn3 UltraServer: up to 144 chips for 362 MXFP8 PFLOPS, 20.7 TB of HBM, and 706 TB/s of aggregate memory bandwidth, delivering ~4.4x the performance and 4x the perf/watt of the Trn2 UltraServer (AWS, Dec 2025). The scale-up fabric is NeuronLink-v4 with a NeuronSwitch-v1 all-to-all topology at 2 TB/s per chip. AWS is following NVIDIA's playbook here: a proprietary switched scale-up domain inside the UltraServer, paired with an open-Ethernet scale-out network (EFA) beyond it.
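The UltraServer totals are straightforward multiples of the per-chip figures, which is a useful sanity check when reading any vendor's system-level numbers. A minimal sketch using only the values quoted above:

```python
# Per-chip Trainium3 figures and UltraServer size as quoted in the text.
CHIPS_PER_ULTRASERVER = 144
FP8_PFLOPS_PER_CHIP = 2.52
HBM_GB_PER_CHIP = 144
HBM_TBPS_PER_CHIP = 4.9

total_pflops = CHIPS_PER_ULTRASERVER * FP8_PFLOPS_PER_CHIP
total_hbm_tb = CHIPS_PER_ULTRASERVER * HBM_GB_PER_CHIP / 1000  # GB -> TB (decimal)
total_hbm_tbps = CHIPS_PER_ULTRASERVER * HBM_TBPS_PER_CHIP

print(f"compute:       {total_pflops:.1f} PFLOPS")
print(f"HBM capacity:  {total_hbm_tb:.1f} TB")
print(f"HBM bandwidth: {total_hbm_tbps:.1f} TB/s")
```

The products land within rounding of the quoted 362 PFLOPS, 20.7 TB, and 706 TB/s, which confirms that the aggregate bandwidth figure is summed HBM bandwidth rather than fabric bandwidth.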
What makes Trainium strategically different from TPU is the anchor-tenant model. Google builds TPUs primarily for itself; AWS builds Trainium substantially for Anthropic. Project Rainier — the Trainium2 cluster in St. Joseph County, Indiana, an $11B AWS site — went from announcement (re:Invent, Dec 2024) to a live cluster of ~500,000 Trainium2 chips in under twelve months, and by 2026 Anthropic and AWS report running over one million Trainium2 chips to train and serve Claude, with the partnership committed toward up to 5 GW of compute (AWS; Anthropic, 2026). External adopters such as Uber have cited cost savings on the order of 50% versus NVIDIA on suitable workloads. The anchor tenant de-risks the silicon: a guaranteed million-chip buyer amortizes the NRE before the first external customer ever rents a Trn instance. The consequence for everyone else: you are renting into economics that were sized for someone else's model, on a Neuron stack whose roadmap follows the anchor's needs, not yours.
| XPU | Flagship (2026) | Per-chip peak compute | HBM capacity / bandwidth | Scale-up fabric | Software | Access model | Primary workload |
|---|---|---|---|---|---|---|---|
| Google TPU | Ironwood (v7) | 4.61 PFLOPS FP8 (4,614 TFLOPS) | 192 GB HBM3E / 7.37 TB/s | 3D torus over ICI + optical circuit switch (OCS); 9,216-chip pod, 42.5 FP8 ExaFLOPS | JAX / TF via XLA | Captive: Google Cloud only | Inference-first (also trains Gemini) |
| AWS Trainium | Trainium3 (Trn3) | 2.52 PFLOPS FP8 | 144 GB HBM3E / 4.9 TB/s | NeuronLink-v4 all-to-all (NeuronSwitch-v1, 2 TB/s/chip); 144-chip UltraServer | Neuron SDK (PyTorch/JAX) | Captive: AWS only; anchor tenant: Anthropic | Training + inference (Claude) |
| AWS Inferentia | Inferentia2 | Inference-tuned (lower TFLOPS, high efficiency) | 32 GB HBM / moderate bandwidth | NeuronLink (smaller domains) | Neuron SDK | Captive: AWS only | Cost-optimized online inference |
| Microsoft Maia | Maia 200 | >10 PFLOPS FP4 / 5 PFLOPS FP8 | 216 GB HBM3E / 7 TB/s | Ethernet-based; wider rack + Sidekick closed-loop liquid cooling | Maia SDK / Triton path | Captive: Azure first-party | Serving GPT-class + Copilot |
| Meta MTIA | MTIA 300-series | Recsys/inference-tuned | HBM (3nm, CoWoS from 300-gen) | Internal Meta fabric | PyTorch-native internal stack | Not rentable: Meta internal only | Recommendation + GenAI inference |
The 'Access model' column governs strategy: every XPU here is captive, and two of them (MTIA, and effectively the first-party tiers of Maia) are not rentable at all. The choice of XPU is therefore downstream of a choice of cloud, and the choice of cloud is downstream of a choice of software stack you are willing to live inside. The perf/watt numbers matter, but the fork is which ecosystem you are entering.
Maia, MTIA, and the model-owner-builds-silicon wave
The third cohort is the newest and the clearest signal of where the industry is heading. Microsoft Maia 200 deployed in early 2026: TSMC 3nm, 140B+ transistors, >10 PFLOPS FP4 / 5 PFLOPS FP8, and 216 GB HBM3E at 7 TB/s in a 750 W envelope. Microsoft claims it delivers ~30% better performance-per-dollar than the best hardware in its existing fleet. Maia ships with a co-designed system: a wider rack and the Sidekick closed-loop liquid-cooling sidecar, which lets Azure retrofit Maia into existing halls without re-plumbing the building (a deliberate density-ramp hedge; see Chapter 5.10 on liquid retrofits). In 2026 Maia 200 serves GPT-class models for OpenAI and powers Microsoft 365 Copilot.
Meta's MTIA took the most aggressive roadmap stance of the cohort: in early 2026 Meta disclosed four new generations (MTIA 300 through 500) for deployment through 2027, moving to 3nm with CoWoS packaging. This is silicon built for Meta's recommendation and GenAI inference and never sold to anyone.
The most consequential entrant is the one that is not a cloud at all. OpenAI, partnered with Broadcom in a ~$10B program, taped out its first custom inference ASIC ('Jalapeño') in roughly nine months, targeting deployment from late 2026 (VentureBeat / Tom's Hardware, 2026). This is the model-owner-builds-silicon trend in its purest form: the entity that owns the most-served model in the world deciding that the cheapest tokens are the ones running on a chip it designed. And it reveals the quiet kingmaker — Broadcom is the design-and-SerDes partner behind Google's TPU, Meta's MTIA, Microsoft's Maia, and OpenAI's Jalapeño (with Marvell playing a similar role elsewhere). The merchant-silicon disruption is not a swarm of independent chip startups; it is a handful of model owners renting Broadcom's and Marvell's packaging, SerDes, and physical-design expertise to convert their workload knowledge into captive silicon. → Chapter 7.5 on the merchant-silicon ASIC economics; Chapter 8.3 on the SerDes and switch-ASIC supply chain underneath all of it.
The decision: rent an XPU, build one, or stay on merchant GPUs
For an operator who is not a frontier model owner, the practical question collapses to three options, and the right one is a function of workload stability and your willingness to enter a single vendor's gravity well.
Stay on merchant GPUs (NVIDIA/AMD). The default, and correct for any workload that churns architecture, must run portably, or needs the deepest kernel ecosystem. You pay the NVIDIA margin and stand in the allocation queue, but you keep CUDA velocity and a multi-vendor exit. → Chapter 7.2, Chapter 7.3.
Rent an XPU through its captive cloud. Right when you have a stable high-volume inference workload and the XPU's tokens/$/W beats GPUs by enough to repay the port — a TPU slice for a JAX-native serving stack, a Trn/Inf instance for a PyTorch model that lowers cleanly through Neuron. The consequence is cloud lock-in: your serving stack now assumes that vendor's compiler, fabric, and roadmap, and your exit is a re-port. → Chapter 7.9, Chapter 7.11.
Build your own. Reserved for entities that own the model, ship at hyperscale, and can clear the $100M+ NRE, the multi-year lead time, and the permanent compiler-team headcount — almost always via a Broadcom/Marvell partnership rather than from scratch. The payoff is supply independence and a structurally lower cost-per-token; the risk is fixed-function rigidity against a model architecture that may move under you. → Chapter 7.5.
| Path | Up-front cost | Software burden | Supply posture | Exit cost | Best-fit |
|---|---|---|---|---|---|
| Merchant GPU (buy/rent) | Capex or opex; pays NVIDIA margin | Lowest — CUDA/ROCm ecosystem | Stands in allocation queue | Lowest — multi-vendor, portable | Architecture churn, portability, frontier research |
| Rent XPU (TPU/Trn/Inf) | Opex only; below-GPU $/token at volume | Moderate — XLA or Neuron port | Inherits the cloud's allocation | High — re-port to exit the cloud | Stable high-volume workload on one cloud |
| Build own silicon | $100M+ NRE/gen + multi-yr lead + compiler team | Highest — own the full stack | Independent — own wafer/CoWoS allocation | Sunk — committed to the architecture | Model owner at hyperscale (Google, Anthropic-via-AWS, OpenAI) |
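For the rent-an-XPU path, "beats GPUs by enough to repay the port" reduces to a payback calculation. The sketch below is an assumption-laden illustration: the port cost and monthly spend are hypothetical, and the 0.5 cost ratio simply reuses the roughly 50% savings figure that external adopters have cited for suitable workloads.

```python
def payback_months(port_cost_usd, monthly_gpu_spend_usd, xpu_cost_ratio):
    """Months until cumulative serving savings repay a one-time port.

    xpu_cost_ratio is the XPU serving cost as a fraction of the GPU cost
    for the same tokens (0.5 means the XPU costs half as much).
    """
    monthly_saving = monthly_gpu_spend_usd * (1.0 - xpu_cost_ratio)
    if monthly_saving <= 0:
        return float("inf")  # the XPU is not cheaper, so the port never pays back
    return port_cost_usd / monthly_saving


# Hypothetical workload: a $1.5M engineering port against $400k/month of GPU serving.
months = payback_months(
    port_cost_usd=1_500_000,
    monthly_gpu_spend_usd=400_000,
    xpu_cost_ratio=0.5,
)
print(f"port repaid after about {months:.1f} months")
```

The calculation ignores the exit cost in the table above. If the workload's architecture is likely to change before the payback point, the re-port back to GPUs should be added to `port_cost_usd`.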
Deep dive: why TPU's OCS torus and Trainium's all-to-all are opposite bets — and what each costs you
The two most mature XPUs made opposite scale-up bets, and the difference is instructive because it is the same fork merchant-GPU buyers face between NVLink-style all-to-all and a torus. Google's TPU is a 3D torus stitched by optical circuit switches. Each chip talks directly only to its torus neighbors over ICI; the OCS layer physically reconfigures which chips are neighbors, so a job requests a logical topology and the optics provision it. The win: ~9,216-chip coherent domains with no central all-to-all switch silicon, per-job topology-on-demand, and optical bypass of failed chips. The cost: a torus has higher hop-count and more constrained collective patterns than a fully-connected switch, so the compiler (XLA) must be exceptionally good at mapping collectives onto the torus — which it is, for the shapes Google runs, and which is exactly why TPU is hard to use for shapes it does not run.
AWS's Trainium followed NVIDIA's lead: a NeuronSwitch all-to-all fabric (NeuronLink-v4, 2 TB/s/chip) inside the UltraServer. The win: low-hop, uniform any-to-any bandwidth across the 144-chip domain, which makes tensor- and expert-parallel mapping straightforward and forgiving, closer to how a GPU cluster behaves. The cost: switch silicon, plus the copper/optics reach that bounds how large the all-to-all domain can grow before you fall back to scale-out (EFA) for the rest. Neither bet is wrong; they encode different convictions about whether the compiler or the fabric should absorb the complexity. For the operator, the practical read-through is the same as the merchant-GPU case in Chapter 8.2: the scale-up domain size, not the per-chip FLOPS, sets the largest model you can train coherently and the blast radius when a chip dies.
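The hop-count cost of a torus can be estimated with a few lines of code. This is a rough sketch under stated assumptions: the 16 x 24 x 24 shape is a hypothetical factorization of 9,216 chips (the text does not give real pod dimensions), and it models only nearest-neighbor hops, not OCS reconfiguration or the collective algorithms a compiler would actually schedule.

```python
from itertools import product


def ring_distance(coord, size):
    """Shortest hop count from position 0 to `coord` on a ring of `size` nodes."""
    return min(coord, size - coord)


def torus_hop_stats(dims):
    """Return (diameter, mean_hops) for a wraparound torus with the given dimensions.

    A torus looks the same from every node, so measuring from the origin is enough.
    """
    total_hops = 0
    diameter = 0
    for coord in product(*(range(size) for size in dims)):
        hops = sum(ring_distance(c, size) for c, size in zip(coord, dims))
        total_hops += hops
        diameter = max(diameter, hops)
    node_count = 1
    for size in dims:
        node_count *= size
    return diameter, total_hops / (node_count - 1)  # mean over the other nodes


# Hypothetical 3D shape whose product is 9,216; real pod dimensions are not given in the text.
POD_DIMS = (16, 24, 24)
diameter, mean_hops = torus_hop_stats(POD_DIMS)
print(f"torus {POD_DIMS}: diameter {diameter} hops, mean {mean_hops:.1f} hops")
print("switched all-to-all: every pair is one switch traversal apart")
```

Under these assumptions the farthest pair of chips is dozens of hops apart, while a switched domain keeps every pair one traversal apart. That gap is the complexity the compiler has to hide on a torus, and the switch silicon has to pay for on an all-to-all fabric.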
Deep dive: the Neuron / XLA software maturity tax (the gap behind the paper FLOPS)
The published peak FLOPS of an XPU is the easy number. The hard number is realized MFU (model FLOPs utilization), the fraction of peak you actually achieve on your model, and that is governed largely by compiler maturity rather than by the silicon. Here the XPUs split sharply by age. XLA has roughly a decade of maturity: for dense and MoE transformers in JAX/TF, realized efficiency on TPU is competitive with a well-tuned CUDA stack, because Google has had years to co-evolve the compiler with its own models. Neuron is younger and narrower: it lowers PyTorch and JAX well for the model families AWS and Anthropic care about, but a novel operator, an unusual attention pattern, or a custom kernel can hit a wall, either failing to lower efficiently or requiring AWS to add support on a roadmap you do not control. Independent analysis (SemiAnalysis, Trainium3 deep dive) repeatedly frames Neuron's software maturity, not the chip, as the gating factor versus NVIDIA.
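One common way to estimate realized MFU for dense-transformer training is the approximation of about 6 FLOPs per parameter per token. The sketch below uses it with hypothetical model size and throughput; it ignores attention FLOPs, and the peak should be whichever figure matches the precision you actually run (the quoted Trainium3 FP8 peak is used here only as an example).

```python
def training_mfu(params, tokens_per_sec_per_chip, peak_flops_per_chip):
    """Estimate MFU as achieved FLOPs over peak, using ~6 FLOPs/param/token.

    The 6x factor is a widely used approximation for dense training
    (forward plus backward pass); it omits attention-specific FLOPs.
    """
    achieved_flops = 6.0 * params * tokens_per_sec_per_chip
    return achieved_flops / peak_flops_per_chip


PARAMS = 70e9                      # hypothetical dense model size
TOKENS_PER_SEC_PER_CHIP = 2_400    # hypothetical measured throughput
PEAK_FLOPS_PER_CHIP = 2.52e15      # Trainium3 FP8 peak as quoted in the text

mfu = training_mfu(PARAMS, TOKENS_PER_SEC_PER_CHIP, PEAK_FLOPS_PER_CHIP)
print(f"estimated MFU: {mfu:.0%}")
```

The useful habit is measuring `tokens_per_sec_per_chip` on your own model rather than reading it from a benchmark, since that is the term compiler maturity actually moves.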
The consequence for the decision: the perf/watt and tokens/$/W advantages quoted in a vendor keynote assume the workload already lowers well on that XPU. Budget the port as a real engineering project, validate realized MFU on your model before committing volume, and treat any custom-kernel dependency as a red flag for a fixed-function XPU. A 4x-perf/watt chip running your model at half its achievable MFU is not a 4x chip. → Chapter 7.9 quantifies the realized-MFU gap across CUDA / ROCm / XLA / Neuron.
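The closing arithmetic can be written down as a one-line discount. This is a simplified sketch: it assumes the vendor's headline ratio was measured at some achievable MFU and scales it linearly by the utilization you actually reach, and the MFU values shown are hypothetical.

```python
def effective_advantage(paper_ratio, realized_mfu, achievable_mfu):
    """Scale a headline perf/watt ratio by the share of achievable MFU you reach."""
    if not 0 < achievable_mfu <= 1 or not 0 <= realized_mfu <= 1:
        raise ValueError("MFU values must be fractions between 0 and 1")
    return paper_ratio * (realized_mfu / achievable_mfu)


# A 4x headline chip, with your model reaching half of the achievable MFU.
advantage = effective_advantage(paper_ratio=4.0, realized_mfu=0.20, achievable_mfu=0.40)
print(f"effective advantage: {advantage:.1f}x")
```

With these inputs the 4x chip works out to a 2x chip, before counting the engineering cost of the port itself.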