Chapter 7.5

Custom ASICs & the Merchant-Silicon Disruption

Custom silicon is not a technology decision; it is a volume-and-flexibility bet. Above a sustained-demand threshold the per-token economics of a fixed-function ASIC crush a merchant GPU, but below it you have spent hundreds of millions in NRE and 18–36 months of lead time to ship a chip your roadmap has already obsoleted.

GOODPUT · POWER-BOUND

What you'll decide here

Whether your inference demand is large enough, durable enough, and architecturally stable enough to clear the custom-silicon breakeven — because the NRE and lead time are sunk the moment you tape out, and only sustained hyperscale volume amortizes them.
How much flexibility you are willing to surrender for efficiency — a fixed-function ASIC wins on tokens/$/W only for the workload it was hardened against, and the model architecture it assumes can shift under it within one training cycle.
Whether you are buying merchant GPUs, co-designing an ASIC with Broadcom or Marvell, or building a full in-house design team — three points on a control-vs-burden curve with very different fixed-cost and time-to-silicon profiles.
What fraction of your fleet is fixed-function versus reprogrammable — the hybrid-fleet ratio that hedges architecture-shift risk while still capturing the ASIC cost advantage on the stable, high-volume slice.
Who actually owns the supply-chain risk you create — every custom program lands on the same TSMC advanced-node and CoWoS/HBM allocation as the merchant GPUs you were trying to escape.

The previous four chapters treated accelerators as products you select. This chapter treats silicon as something you can commission — and asks the only question that matters before you do: is the workload big enough, durable enough, and stable enough to pay back a chip you have to design, validate, and manufacture before it earns a cent? The merchant-GPU model (buy NVIDIA, inherit CUDA and the allocation queue) is the default for a reason: someone else paid the fixed cost. A custom ASIC inverts that bargain. You absorb the non-recurring engineering (NRE) and the lead time up front, in exchange for a chip that does exactly your workload at a lower cost per token — and nothing else. The disruption of the 2025–2026 era is that, for the largest inference buyers, that trade has flipped from speculative to obviously correct, and the merchant-silicon design houses (Broadcom, Marvell) have turned chip design into a service you can buy.

This chapter builds the custom-silicon economics — NRE, lead time, minimum-volume thresholds, the breakeven against a merchant GPU — then confronts the fixed-function-risk-vs-flexibility fork that determines whether the bet survives an architecture shift, and closes on the hybrid fleet that most large operators actually run. A custom ASIC is the highest-leverage and least-reversible procurement decision in Part 7. Get the volume threshold right and you have the lowest cost-per-token compute available; get it wrong and you own a stranded mask set and a chip that lost to the next merchant generation before it shipped. → the XPU programs themselves live in Chapter 7.4; the merchant GPUs you are deciding against are in Chapter 7.2 and Chapter 7.3.

The economics that justify custom silicon

A custom ASIC has two cost components that a merchant GPU hides from you because they are already amortized across NVIDIA's millions of units: NRE (the one-time cost to design, verify, and tape out the chip) and per-unit recurring cost (wafer, HBM, packaging, test — the same supply chain the merchant GPU draws from). The whole case for building rests on a single inequality: does the per-unit cost advantage, multiplied by your deployed volume, exceed the NRE plus the opportunity cost of the lead time? Below that threshold you are subsidizing a science project; above it you are minting margin.
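The inequality in the paragraph above can be written out explicitly. This is a sketch under stated assumptions, not a formula from the source: the symbols are introduced here for illustration, with $N$ the deployed unit count, $c_{\mathrm{merchant}}$ and $c_{\mathrm{custom}}$ the lifetime cost per unit of equivalent throughput on each path, and $C_{\mathrm{delay}}$ the opportunity cost of the lead time.

```latex
N \,\bigl(c_{\mathrm{merchant}} - c_{\mathrm{custom}}\bigr) \;>\; \mathrm{NRE} + C_{\mathrm{delay}}
\quad\Longleftrightarrow\quad
N \;>\; N^{*} = \frac{\mathrm{NRE} + C_{\mathrm{delay}}}{c_{\mathrm{merchant}} - c_{\mathrm{custom}}},
\qquad c_{\mathrm{merchant}} > c_{\mathrm{custom}}
```

Read this way, $N^{*}$ is the minimum-volume threshold: a larger NRE or a longer delay raises it, and a larger per-unit advantage lowers it.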

Start with NRE, because it is the number that disciplines the decision. At the leading edge the fixed cost of a new design has exploded: a 3 nm chip costs roughly $590M to design end-to-end, against ~$416M at 5 nm and ~$217M at 7 nm (Silicon Analysts, 2026). Most of that is not the mask set — a 3 nm reticle set is ~$15M (vs ~$6.5M at 5 nm) — but the verification, validation, IP licensing, and physical-design labor that a leading-edge tape-out demands. The practitioner shorthand the industry actually quotes for a complete AI-accelerator program is ~$300M–$1B+ in NRE, depending on node, die size, chiplet count, and how much IP is bought versus built (domain synthesis; Tom's Hardware, 2026). The cost of being wrong is steep: a single respin at a leading node runs $50–$100M and delays the product 6–12 months — which is why pre-silicon verification, not the mask, is where the money and the schedule risk concentrate.
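The claim that the mask set is a small slice of the bill can be checked from the quoted figures alone. The sketch below uses only the numbers in the paragraph above; comparing the respin range against the 3 nm design cost is an assumption made here, since the text says only "leading node".

```python
# Figures quoted in the text above, in USD millions (Silicon Analysts, 2026).
DESIGN_COST_M = {"3nm": 590.0, "5nm": 416.0}
MASK_SET_M = {"3nm": 15.0, "5nm": 6.5}


def mask_share(node: str) -> float:
    """Mask-set cost as a fraction of end-to-end design cost for a node."""
    return MASK_SET_M[node] / DESIGN_COST_M[node]


def respin_share(node: str, respin_cost_m: float) -> float:
    """One respin's cost as a fraction of the node's design cost."""
    return respin_cost_m / DESIGN_COST_M[node]


if __name__ == "__main__":
    for node in DESIGN_COST_M:
        print(f"{node}: mask set = {mask_share(node):.1%} of design cost")
    # Assumption: the $50-100M respin range is compared against the 3 nm figure.
    for respin in (50.0, 100.0):
        print(f"3nm respin at ${respin:.0f}M = {respin_share('3nm', respin):.1%}")
```

On these figures the mask set is about 2.5% of design cost at 3 nm and about 1.6% at 5 nm, while a single respin is roughly 8% to 17% of the 3 nm design cost.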

Now the lead time. From committed architecture to silicon in a rack is 18–36 months, and that clock is the silent killer of weak custom programs. The merchant vendors ship a new generation roughly annually; a 30-month custom program is therefore racing a moving target that is two generations more efficient per watt by the time your chip lands. The lead time is not just schedule risk — it is obsolescence risk, and it is the reason custom silicon only makes sense for a workload you are confident will still exist, at scale, three years out.
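The moving-target arithmetic can be sketched in a few lines. The 12-month cadence reflects the "roughly annually" in the text; the per-generation efficiency gain is a purely hypothetical parameter, and both function names are illustrative.

```python
def generations_elapsed(lead_time_months: float, cadence_months: float = 12.0) -> int:
    """Whole merchant generations that ship while the custom chip is in development."""
    if cadence_months <= 0:
        raise ValueError("cadence_months must be positive")
    return int(lead_time_months // cadence_months)


def merchant_efficiency_gain(
    lead_time_months: float,
    gain_per_generation: float,
    cadence_months: float = 12.0,
) -> float:
    """Compounded perf/W multiple the merchant line gains over the lead time.

    gain_per_generation is hypothetical: 0.5 means each generation is
    assumed to be 50% more efficient per watt than the one before it.
    """
    n = generations_elapsed(lead_time_months, cadence_months)
    return (1.0 + gain_per_generation) ** n


if __name__ == "__main__":
    for months in (18, 30, 36):
        print(months, "months ->", generations_elapsed(months), "merchant generations")
```

Under these assumptions an 18-month program sees one merchant generation ship, a 30-month program two, and a 36-month program three. The custom chip has to beat the merchant part it will actually meet at launch, not the one it was specified against.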

The master fork: do you clear the volume threshold?

Build custom silicon only when the workload clears the volume threshold. A custom ASIC pays back only at sustained hyperscale volume against a stable, high-duty workload. The arithmetic: if your chip delivers a 30–50%+ cost-per-token advantage over the merchant GPU it replaces, and you deploy enough units that the aggregate annual saving exceeds the $300M–$1B+ NRE, you break even inside the first year of deployment, and every year after is pure structural advantage (Tom's Hardware; AWS/Google benchmarks, 2026). At hyperscale that threshold is easily cleared: AWS cites Trainium at 30–40% better price-performance than comparable GPU-based instances on AWS, and Google's TPU line claims multiples of price-performance over comparable GPU instances on its own workloads. But the same chip deployed at tens of thousands of units instead of hundreds of thousands never amortizes: the per-unit saving is real, but the volume is too thin to recover the fixed cost within the chip's economic life. This is why custom silicon is a hyperscaler-and-frontier-lab phenomenon and a trap for everyone else. The volume question is the threshold; everything downstream is a consequence. → the cost-per-token TCO model is built in Chapter 7.11.
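To make the threshold concrete, here is a minimal payback sketch. The $15,000 annual cost per merchant unit, the $600M NRE, and the two fleet sizes are hypothetical inputs chosen for illustration; the 40% advantage sits inside the 30–50%+ range quoted above. It also assumes one custom unit replaces one merchant unit at equal throughput.

```python
def payback_years(
    units: int,
    merchant_cost_per_unit_year: float,
    cost_advantage: float,
    nre: float,
) -> float:
    """Years of deployment needed for the aggregate saving to repay the NRE.

    cost_advantage is the fractional cost-per-token saving versus merchant
    (0.40 means the custom chip serves the same tokens for 40% less).
    """
    annual_saving = units * merchant_cost_per_unit_year * cost_advantage
    if annual_saving <= 0:
        raise ValueError("no saving, so the NRE is never repaid")
    return nre / annual_saving


if __name__ == "__main__":
    NRE = 600e6                # hypothetical, inside the $300M-$1B+ range
    COST_PER_UNIT = 15_000.0   # hypothetical all-in annual cost per merchant unit
    for fleet in (300_000, 30_000):
        years = payback_years(fleet, COST_PER_UNIT, 0.40, NRE)
        print(f"{fleet:>7} units: payback in {years:.2f} years")
```

With these inputs a 300,000-unit fleet repays the NRE in about four months, while a 30,000-unit fleet needs more than three years. The per-unit saving is identical in both cases; only the volume changes.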

Merchant GPU vs co-designed ASIC vs full in-house silicon

Dimension	Buy merchant GPU	Co-design with a partner	Full in-house team
Up-front NRE	~$0 (amortized in unit price)	Shared/serviced; ~$300M–$1B+ program	Highest; ~$300M–$1B+ plus a standing org
Time to first silicon	Order against allocation queue	18–30 months	24–36 months + team build-out
Cost-per-token at scale	Baseline	~30–50%+ below merchant for the target workload	~30–50%+ below; best if volume is enormous
Flexibility	Highest — runs any model, any framework	Bounded by the hardened workload	Bounded by the hardened workload
Architecture-shift risk	Vendor absorbs it	You own it; respin is $50–100M	You own it; respin is $50–100M
Software burden	CUDA mature	Your compiler/runtime stack (XLA/Neuron-class)	Your compiler/runtime stack, fully owned
Best-fit buyer	Anyone; default for variable workloads	Hyperscaler/frontier lab with a stable, huge workload	Hyperscaler at extreme, durable volume

The three procurement points on the control-vs-burden curve. NRE and lead-time figures are 2026 practitioner ranges (Silicon Analysts; Tom's Hardware; SemiAnalysis); 'partner' means a Broadcom/Marvell-class ASIC design service.

The table is a fixed-cost ladder. Merchant GPU: zero NRE, maximum flexibility, vendor eats the obsolescence risk — you pay for all of that in the unit price and the allocation queue. Full in-house silicon: maximum control and the lowest unit cost at extreme volume, paid for with the deepest fixed cost and the longest clock. The co-design middle is the structural innovation of this era and deserves its own treatment.

The merchant-silicon disruption: chip design as a service

The reason custom silicon went from a Google-only curiosity to an industry-wide wave is that you no longer need Google's silicon org to do it. Broadcom and Marvell turned ASIC design into a service: they bring the hardened IP (high-speed SerDes, PCIe and scale-up interconnect blocks, memory controllers, packaging methodology) and the proven path through TSMC's advanced nodes and CoWoS, and the customer brings the compute architecture and the workload. This is the merchant-silicon disruption: not that hyperscalers build chips, but that buying a custom chip became a procurement line item rather than a decade-long capability build.

The market structure that resulted is a near-duopoly. Custom ASICs are projected at ~27.8% of AI-server shipments in 2026, growing ~44.6% year-over-year — nearly triple the ~16% growth of merchant GPUs (TrendForce, 2026). Behind that wave, Broadcom holds ~70%+ of the custom-accelerator design-services market and Marvell ~20–25%, together ~95% (industry synthesis, 2026). Broadcom anchors Google's TPU and Meta's MTIA programs and reported AI revenue up ~106% with a multi-tens-of-billions backlog; Marvell anchors AWS Trainium and Microsoft Maia and guides toward ~$11B in AI ASIC revenue. The consequence for a strategist: the custom-silicon path is real and serviceable, but the on-ramp runs through two companies, and both of them — and every chip they design — sit on the same TSMC advanced-node and CoWoS/HBM allocation as the NVIDIA GPUs you were trying to escape. Building custom does not exit the supply chain; it re-enters it from a different door. → the upstream allocation gate is Chapter 7.6 (HBM) and Chapter 7.7 (packaging); procurement strategy is Chapter 2.3.

Figure	What it measures	Source (year)
~27.8%	Custom ASIC share of AI-server shipments in 2026; growing ~44.6% YoY (≈3x merchant-GPU growth)	TrendForce (2026)
~70% / ~20–25%	Broadcom / Marvell share of the custom-accelerator design-services market (≈95% combined)	Tom's Hardware; Hashrate Index synthesis (2026)
~$590M	End-to-end design cost of a 3 nm chip (vs ~$416M at 5 nm, ~$217M at 7 nm)	Silicon Analysts (2026)
~$300M–$1B+	Practitioner NRE range for a full AI-accelerator program (node/die/chiplet dependent)	Tom's Hardware; domain synthesis (2026)
18–36 mo	Committed-architecture-to-rack lead time for a custom accelerator	SemiAnalysis; domain synthesis (2026)
$50–100M	Cost of a single leading-node respin; 6–12 month product slip	SemiAnalysis, verification economics (2026)
~30–50%+	Cost-per-token / price-performance advantage cited by custom-silicon programs vs merchant GPU	AWS Trainium; Google TPU benchmarks; Tom's Hardware (2026)
~$11B / +106%	Marvell 2026 AI-ASIC revenue guide; Broadcom AI-revenue growth	Marvell; Broadcom earnings (2026)

Fixed-function risk vs reprogrammable flexibility

The cost advantage of a custom ASIC comes from the same property that creates its central risk: it is hardened. A merchant GPU is a general matrix engine that runs whatever the compiler emits; a custom inference ASIC bakes in assumptions about precision, attention pattern, memory hierarchy, and collective shape, and spends the transistors it saved on flexibility to do your workload faster and cooler. That is exactly why it wins on tokens/$/W — and exactly why it is exposed when the workload moves.

The fork is sharp. Fixed-function efficiency: harden the datapath to the model architecture you serve today (a specific MoE shape, a specific KV-cache layout, FP8/FP4 microscaling) and you capture the full cost-per-token advantage, but only for that architecture. Reprogrammable flexibility: keep enough generality that a new attention mechanism, a new precision format, or a shift from dense to wide-MoE does not strand the silicon, and you give back some of the efficiency that justified building at all. The downstream cost of choosing wrong is asymmetric and unforgiving: a model-architecture shift that lands after tape-out cannot be patched in firmware, and a respin costs $50–100M and 6–12 months. The industry's own history is the cautionary tale: fixed-function accelerators that assumed a model shape have been left behind when the research frontier moved, while the parts that retained a programmable core for the matrix math survived the transition.

The architecture-shift trap

The most expensive way to be right about custom silicon and still lose: harden a chip against the model architecture of 2026 and ship it into the workload mix of 2028. The merchant GPU's apparent inefficiency — spending transistors on generality — is in fact an option on the future that the vendor is selling you. A fixed-function ASIC is the opposite: a leveraged bet that the workload is stable for the chip's whole 18–36-month gestation plus its 2–3-year economic life. The defensible posture is to harden only the parts of the datapath that the research frontier has actually settled (the matrix multiply, the dominant precision, the memory-bandwidth-bound decode path) and keep a programmable core for the parts that are still moving (attention variants, routing, new microscaling formats). Pay the flexibility premium where the future is uncertain; bank the efficiency only where it is not. → precision-format volatility is Chapter 7.10; the software stack that absorbs some of this risk is Chapter 7.9.
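The "harden only what is settled" posture can be expressed as a simple partition. The sketch below is illustrative only: the block names and their settled/moving labels are taken from the paragraph above, while the `DatapathBlock` type and both functions are hypothetical names introduced here. The exposure window simply adds the 18–36-month gestation to the 2–3-year economic life.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatapathBlock:
    name: str
    settled: bool  # True if the research frontier has converged on it


def hardening_plan(blocks: list[DatapathBlock]) -> dict[str, str]:
    """Harden settled blocks; keep moving ones on a programmable core."""
    return {b.name: "harden" if b.settled else "programmable" for b in blocks}


def exposure_window_months(gestation_months: int, economic_life_months: int) -> int:
    """Total span over which the workload must stay stable for the bet to pay."""
    return gestation_months + economic_life_months


# Labels follow the text: settled versus still-moving parts of the datapath.
BLOCKS = [
    DatapathBlock("matrix multiply", settled=True),
    DatapathBlock("dominant precision", settled=True),
    DatapathBlock("bandwidth-bound decode path", settled=True),
    DatapathBlock("attention variants", settled=False),
    DatapathBlock("routing", settled=False),
    DatapathBlock("new microscaling formats", settled=False),
]

if __name__ == "__main__":
    for name, choice in hardening_plan(BLOCKS).items():
        print(f"{name}: {choice}")
    print("exposure:", exposure_window_months(18, 24), "to",
          exposure_window_months(36, 36), "months")
```

On the text's own ranges the stability bet spans 42 to 72 months, which is why the settled/moving classification deserves more scrutiny than any single efficiency number.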

Deep dive: why inference is the natural home of fixed-function silicon (and training is not)

The fixed-function-vs-flexibility fork resolves differently for training and inference, and understanding why is the key to scoping a custom program. Training is where the architecture is still being discovered: a frontier pre-training or RL run is, definitionally, an experiment, and the model shape, optimizer, and parallelism strategy change run-to-run. Hardening silicon against a target that is by construction unstable is a category error — which is why training remains the most defensible redoubt of the flexible merchant GPU, and why even the strongest custom programs keep a reprogrammable core for the trainer. The cost advantage is real but the obsolescence risk is maximal.

Inference at scale is the inverse. Once a model is deployed to serve production traffic, its architecture is frozen for the life of that deployment — the weights do not change, the attention pattern does not change, the precision is fixed at quantization time. That frozen, high-duty, high-volume workload is precisely what a fixed-function ASIC is built to exploit: the chip and the workload have the same stability horizon. This is why the custom-silicon wave is overwhelmingly an inference wave, and why it accelerated exactly as inference overtook training as the dominant share of AI compute — the economic gravity moved to the one workload whose stability matches the chip's gestation. The strategist's rule: harden against frozen production inference, rent flexibility for the moving frontier. → the workload archetypes that drive this split are Chapter 7.1; the inference economics that reward it are Chapter 7.11.

The hybrid fleet

No serious operator runs an all-custom or all-merchant fleet, because the two failure modes are opposite and the optimal hedge is a mix. An all-merchant fleet leaves the structural cost-per-token advantage on the table for the workloads that could capture it; an all-custom fleet is one architecture shift away from a stranded write-down. The hybrid fleet resolves the tension by routing each workload to the silicon whose stability and volume it matches.

The partition follows directly from the prior sections. Frozen, high-volume production inference — the stable serving tier — goes to fixed-function custom silicon, where it captures the 30–50%+ cost-per-token advantage. The moving frontier — pre-training, RL, research, and any model whose architecture is still in flux — stays on flexible merchant GPUs, where the generality is worth its premium because it is an option on a future you cannot yet specify. The bridge cases — new models still ramping toward stable volume, or workloads with uncertain longevity — stay on merchant GPUs until they prove durable enough to migrate to custom. The ratio between these tiers is itself the hedge: it is tuned to how much of your demand is genuinely frozen-and-huge versus how much is still moving, and it is re-decided each generation rather than committed once.
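The partition rule above can be sketched as a small router. This is an illustrative model, not an operator's actual policy: the `Workload` fields, the example workloads, and their demand shares are all hypothetical, and the rule encodes only what the text states (frozen, high-volume, proven-durable work goes to custom silicon; everything else stays on merchant GPUs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    architecture_frozen: bool
    high_volume: bool
    proven_durable: bool
    demand_share: float  # fraction of total fleet demand (hypothetical)


def assign_tier(w: Workload) -> str:
    """Route a workload to 'custom' only if it is frozen, huge, and durable."""
    if w.architecture_frozen and w.high_volume and w.proven_durable:
        return "custom"
    return "merchant"


def custom_fraction(workloads: list[Workload]) -> float:
    """Share of total demand that lands on custom silicon under the rule."""
    total = sum(w.demand_share for w in workloads)
    if total <= 0:
        raise ValueError("total demand must be positive")
    custom = sum(w.demand_share for w in workloads if assign_tier(w) == "custom")
    return custom / total


# Hypothetical fleet: stable serving tier, moving frontier, and a bridge case.
FLEET = [
    Workload("production serving", True, True, True, 0.5),
    Workload("pre-training and RL", False, True, True, 0.3),
    Workload("new model still ramping", True, False, False, 0.2),
]

if __name__ == "__main__":
    for w in FLEET:
        print(f"{w.name}: {assign_tier(w)}")
    print(f"custom fraction: {custom_fraction(FLEET):.0%}")
```

Re-running the same rule each generation with updated flags is what makes the ratio a position to re-balance: the bridge case migrates to custom only when its `high_volume` and `proven_durable` fields flip to true.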

The consequence for procurement is that the build-vs-buy decision is not binary and not permanent. It is a continuous allocation problem: migrate a workload to custom silicon only once it crosses the volume-and-stability threshold, keep the merchant GPU as both the default and the escape hatch, and treat the custom fraction of the fleet as a position you re-balance — not a one-way door. The operators winning the cost-per-token race in 2026 are not the ones who built the most custom silicon; they are the ones who put the right workloads on it and kept everything else flexible. → fleet composition and heterogeneous procurement is Chapter 7.11; the refresh and depreciation cadence that governs when custom silicon is retired is Chapter 14.9.

Custom silicon does not escape the binding constraint

The common assumption is that building your own chip frees you from NVIDIA's allocation queue and pricing power. It does not. Every custom AI accelerator in this chapter — TPU, Trainium, Maia, MTIA, and whatever your design partner builds for you — competes for the same TSMC advanced-node wafers, the same CoWoS packaging capacity, and the same HBM stacks as the merchant GPUs. The bottleneck on AI compute is not the GPU vendor; it is the packaging and HBM upstream, and custom silicon contends for it on equal footing. What you gain by building is control over the architecture and the cost-per-token; what you do not gain is escape from the physics of the supply chain. A custom program that has not secured its CoWoS and HBM allocation is a $590M design with nowhere to be manufactured. → the allocation gate is Chapter 7.6 and Chapter 7.7.
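One way to see why a finished design does not guarantee output is to treat shippable volume as the minimum across the shared upstream resources. The sketch below is a simplification with hypothetical quantities: it assumes one compute die and one CoWoS slot per packaged unit, and the allocation numbers are invented for illustration.

```python
def buildable_units(
    good_dies: int,
    cowos_slots: int,
    hbm_stacks: int,
    hbm_stacks_per_unit: int,
) -> int:
    """Units that can actually ship: the scarcest upstream input sets the cap."""
    if hbm_stacks_per_unit <= 0:
        raise ValueError("hbm_stacks_per_unit must be positive")
    return min(good_dies, cowos_slots, hbm_stacks // hbm_stacks_per_unit)


def binding_constraint(
    good_dies: int,
    cowos_slots: int,
    hbm_stacks: int,
    hbm_stacks_per_unit: int,
) -> str:
    """Name of the resource that limits output (first one listed wins ties)."""
    caps = {
        "wafers": good_dies,
        "cowos": cowos_slots,
        "hbm": hbm_stacks // hbm_stacks_per_unit,
    }
    return min(caps, key=caps.get)


if __name__ == "__main__":
    # Hypothetical allocation: plenty of dies, less packaging, even less HBM.
    alloc = dict(good_dies=200_000, cowos_slots=120_000,
                 hbm_stacks=600_000, hbm_stacks_per_unit=8)
    print(buildable_units(**alloc), "units, limited by", binding_constraint(**alloc))
```

In this hypothetical case 200,000 good dies yield only 75,000 shippable units because HBM is the binding input. The same function applies unchanged to a merchant GPU, which is the point: both draw on the same pools.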

The XPU programs this chapter generalizes — TPU, Trainium/Inferentia, Maia, MTIA — are profiled in Chapter 7.4; the merchant GPUs you decide against are Chapter 7.2 (NVIDIA) and Chapter 7.3 (AMD); the accelerator taxonomy that frames the GPU-vs-ASIC axis is Chapter 7.1. The upstream allocation that every custom program shares lives in Chapter 7.6 (HBM) and Chapter 7.7 (packaging); the software stack that determines whether a custom chip is usable is Chapter 7.9; the precision-format volatility that hardening must hedge is Chapter 7.10. The cost-per-token TCO model that scores the whole decision is Chapter 7.11; the procurement and lead-time supply chain is Chapter 2.3; the depreciation and refresh cadence that retires custom silicon is Chapter 14.9; and the business-model economics that this NRE bet ultimately answers to are Chapter 1.8.