Chapter 7.6

HBM: The Binding Constraint on AI Compute

An accelerator is a memory system with some math attached. In 2026 the math is cheap and the memory is sold out, so HBM, not the GPU die, is the line that decides how many chips ship and how much each one costs.

POWER-BOUND | GOODPUT | DENSITY-RAMP

What you'll decide here

Whether your accelerator selection is governed by FLOPS or by delivered HBM bandwidth and capacity — because for inference and long-context workloads the memory wall, not the matrix engine, sets your token economics.
How exposed your build is to the three-supplier HBM oligopoly and the 2026 sold-out allocation — and whether you have a place in the queue or are buying on the spot margin.
Whether you optimize the part for capacity (fit the model + KV cache) or bandwidth (feed the cores), and what that costs you in stack count, package area, and DRAM-market spillover.
How HBM capacity per stack (12-Hi vs 16-Hi) and HBM/CoWoS allocation jointly gate your achievable density ramp — these are the same upstream gate, decided two years before your rack ships.
Whether to hedge HBM4 timing risk by buying mature HBM3E parts now versus waiting for the bandwidth step that may not have volume until you need it.

Computing performance used to track the logic die: more transistors, more clocks, more FLOPS. The accelerator era inverted that. A modern accelerator, especially one serving inference, spends much of its time waiting on memory, with its arithmetic units starved far more often than saturated. The part of the package that costs the most, ships the slowest, and gates the entire supply chain is the stack of DRAM bolted to the die's side. High-Bandwidth Memory (HBM) is the binding constraint on AI compute in 2026, and the operators who understand that buy chips by the terabyte-per-second and the gigabyte, not by the petaFLOP.

This chapter treats HBM as a decision surface, not a spec sheet. We walk the generations (HBM3E to HBM4 to HBM4E) and the three-supplier oligopoly that makes every one of them a scarce, allocated good. We frame HBM as a top-three BOM line and force the capacity-vs-bandwidth optimization that every accelerator architect actually fights over. We trace the 2026 supply crisis and its spillover into the broader DRAM market. And we connect HBM upward to the procurement game (Chapter 2.3) and sideways to the packaging that physically determines how many stacks fit beside a die (Chapter 7.7).

Why memory, not math, is the constraint

The reason HBM exists is a number called arithmetic intensity: the ratio of floating-point operations to bytes moved from memory. Dense matrix multiply — the core of training — has high arithmetic intensity and can keep the cores busy. But autoregressive decode, the dominant cost of modern inference, generates one token at a time and must re-read the entire KV cache and weight set on every step. Its arithmetic intensity is very low: the accelerator is bandwidth-bound, and the math engine idles waiting for bytes. Since inference is now roughly two-thirds of AI compute (Chapter 1.3), the workload that pays the bills is precisely the one that lives or dies on memory bandwidth.
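The bandwidth-bound claim can be made concrete with a roofline-style estimate. The sketch below is a minimal illustration, not a benchmark. The peak-compute figure, the model size, and the batch sizes are hypothetical placeholders; the ~8 TB/s bandwidth is the HBM3E-class per-GPU figure used in this chapter. It counts weight traffic only and ignores KV-cache reads, which would lower the intensity further.

```python
def ridge_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) above which a kernel becomes compute-bound."""
    return peak_flops / mem_bw_bytes_per_s


def decode_intensity(batch: int, bytes_per_param: float) -> float:
    """Weight-only decode step: every parameter is read once and used for one
    multiply-add (2 FLOPs) per sequence in the batch."""
    return 2.0 * batch / bytes_per_param


def max_decode_steps_per_s(mem_bw_bytes_per_s: float, weight_bytes: float) -> float:
    """Upper bound on decode steps per second when each step streams all weights."""
    return mem_bw_bytes_per_s / weight_bytes


# Assumed, illustrative values (not vendor specifications).
PEAK_FLOPS = 2.0e15   # hypothetical dense peak, FLOP/s
HBM_BW = 8.0e12       # ~8 TB/s, the HBM3E-class per-GPU figure in this chapter
WEIGHT_BYTES = 70e9   # hypothetical 70B-parameter model at 1 byte per parameter

ridge = ridge_point(PEAK_FLOPS, HBM_BW)
for batch in (1, 16, 64):
    ai = decode_intensity(batch, bytes_per_param=1.0)
    regime = "bandwidth-bound" if ai < ridge else "compute-bound"
    print(f"batch={batch:3d}  intensity={ai:6.1f} FLOPs/byte  ridge={ridge:.0f}  {regime}")

print(f"ceiling: {max_decode_steps_per_s(HBM_BW, WEIGHT_BYTES):.0f} decode steps/s")
```

Under these assumptions every batch size shown sits below the ridge point, so the step rate is capped by bandwidth divided by bytes streamed, regardless of how much peak compute the part advertises.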

HBM is the answer to that wall. Instead of a few DDR channels reaching out across a motherboard, HBM stacks 8, 12, or 16 DRAM dies vertically, connects them through the silicon with thousands of through-silicon vias (TSVs), and places the whole stack within millimeters of the compute die on a shared interposer. The result is a memory interface 2,048 bits wide per stack in HBM4 — versus 64 bits for a DDR5 channel — delivering terabytes per second at a fraction of the energy-per-bit of off-package DRAM. That proximity is also the trap: HBM only works as part of an advanced package, which means it is gated by the same scarce CoWoS capacity as the die it feeds (Chapter 7.7).
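The width advantage translates to bandwidth through a one-line formula: interface width times per-pin data rate, divided by eight bits per byte. The sketch below applies it to the interface widths named above. The per-pin rates are assumptions chosen as plausible speed bins that reproduce this chapter's ~1.2 TB/s and ~2 TB/s per-stack figures, not quotes from a datasheet.

```python
def peak_bandwidth_gbs(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) x per-pin rate (Gb/s) / 8 bits per byte."""
    return interface_bits * pin_rate_gbps / 8.0


# (interface width in bits, assumed per-pin rate in Gb/s)
INTERFACES = {
    "DDR5 channel (64-bit, assumed 6.4 Gb/s)": (64, 6.4),
    "HBM3E stack (1024-bit, assumed 9.6 Gb/s)": (1024, 9.6),
    "HBM4 stack (2048-bit, assumed 8.0 Gb/s)": (2048, 8.0),
}

for name, (bits, rate) in INTERFACES.items():
    print(f"{name}: {peak_bandwidth_gbs(bits, rate):,.1f} GB/s")
```

Note that the assumed HBM4 pin rate is lower than the assumed HBM3E one: in this sketch the generational gain comes from doubling the width, which is the point the paragraph above makes.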

The generations: HBM3E to HBM4 to HBM4E

HBM advances on a JEDEC-anchored cadence, and 2026 sits exactly on the seam between two generations, which is itself a procurement decision (buy the mature part now or wait for the bandwidth step). HBM3E is the workhorse of everything shipping in volume in 2026: 8-Hi and 12-Hi stacks at up to roughly 1.2 TB/s per stack, delivering the ~8 TB/s per GPU you see on a Blackwell B200/B300, an AMD MI355X, or a Google Ironwood (nearer 1 TB/s per stack on an eight-stack part than the top speed bin). HBM4 is the generational step that doubles the interface to a 2,048-bit-per-stack PHY, lifting baseline per-stack bandwidth to roughly 2 TB/s. An NVIDIA Rubin-class part with eight stacks is quoted at ~22 TB/s per GPU, about 2.8x HBM3E, which implies roughly 2.75 TB/s per stack, well above that baseline. HBM4E follows as the speed-bin and capacity refresh, with Samsung shipping industry-first samples at 3.6 TB/s per stack in mid-2026.

The 2026 inflection is not the bandwidth alone — it is who controls the base die. HBM4 moves the logic base die at the bottom of the stack from a commodity DRAM process to an advanced logic node (TSMC and Samsung foundry involvement), turning the base die into a semi-custom interface co-designed with the accelerator. That deepens the coupling between memory vendor, foundry, and GPU designer, and it is the structural reason HBM4 qualification is slower and more entangled than a normal node shrink. Note also the engineering subtlety beneath the marketing: despite years of hybrid-bonding hype, mainstream HBM4 12-Hi largely stayed on advanced microbump (MR-MUF) joining in 2026, with copper-to-copper hybrid bonding deferred toward the taller 16-Hi and HBM4E parts where the thermal and gap budget finally forces it.

HBM generation comparison — the decision-relevant deltas

Generation	Per-stack BW	Per-GPU BW (8 stacks)	Stack height / capacity	Interface	2026 status	Indicative $/stack
HBM3	~0.8 TB/s	~6.4 TB/s	8-Hi / 16-24 GB	1024-bit	Legacy (H100-class)	~$200
HBM3E	~1.2 TB/s	~8 TB/s	8/12-Hi / 24-36 GB	1024-bit	Volume workhorse; sold out	~$300
HBM4	~2.0 TB/s	~22 TB/s	12/16-Hi / 36-48 GB	2048-bit + logic base die	Mass production from H1-H2 2026	~$500 (est.)
HBM4E	~3.6 TB/s	~28+ TB/s	16-Hi+ / 48-64 GB	2048-bit, custom base die	Sampling (Samsung first, mid-2026)	Higher; not yet contracted

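Per-stack and per-GPU figures are representative 2026 reference points; exact numbers vary by supplier bin and accelerator SKU. Per-GPU assumes 8 stacks (Blackwell/Rubin class) and reflects quoted product bandwidth, so it need not equal 8x the per-stack column: shipping parts run below or above the representative per-stack bin. Pricing is an order-of-magnitude synthesis of street and contract figures, not a quote.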

The table is a roadmap with a fork at every row. The bandwidth step from HBM3E to HBM4 is the single largest per-GPU memory improvement of the decade, but it lands into a supply environment where the previous generation is already sold out — so the question is never just "is HBM4 better" (it is) but "can I get HBM4 in volume on the schedule my deployment needs, or do I lock allocation on mature HBM3E and accept the bandwidth ceiling?" That is the decision the next section is built around.

The three-supplier oligopoly

There are exactly three companies on Earth that can make HBM at scale: SK hynix, Samsung, and Micron. That is the whole market. SK hynix holds the dominant share — roughly 50-55% overall, and supply-chain estimates put it at 60-70% of NVIDIA Rubin HBM4 volume — having been first to qualify each recent generation and first into the NVIDIA flagship socket. Samsung is the swing supplier, capturing roughly a quarter to a third of Rubin volume and racing to leapfrog on HBM4E; Micron is the smallest but a real third source, qualified across HBM3E and now HBM4. In June 2026, NVIDIA publicly certified all three for Vera Rubin HBM4 — a deliberate multi-sourcing move, because a single-supplier flagship is an unacceptable supply risk at gigawatt scale.

An oligopoly this concentrated has two consequences that flow straight into your build. First, pricing power sits with the supplier: HBM is sold on long-term contracts, allocated quarters or years ahead, and the spot margin for late entrants is punishing. Second, qualification is a moat: getting a new HBM supplier or generation into a shipping accelerator takes 12-18 months of co-engineering and reliability burn-in, so the field of who-can-supply-whom moves slowly and is largely locked by the time you are placing orders. The practical upshot for an operator is simple — you do not negotiate HBM, you queue for it, and your place in the queue was determined by the accelerator vendor's allocation, not yours (Chapter 2.3).

The 2026 supply crisis is structural, not a blip

By the end of 2025 all three suppliers had already sold out their entire 2026 HBM output. SK hynix's CFO stated flatly that 2026 supply was gone; Micron confirmed 2025 and 2026 capacity fully booked. HBM3E carried a ~20% price hike into 2026, and the demand-supply gap is variously estimated near 30%. This is not a transient shortage that clears in two quarters. It is the result of HBM needing roughly 3x the wafer area per gigabyte that commodity DRAM does, against a fab base that takes ~18-24 months to expand. If your 2026-2027 deployment plan assumes you can buy incremental HBM-bearing accelerators on demand, it is wrong. Lock allocation early, or design for the parts you can actually get. See the procurement and allocation strategy in Chapter 2.3.

HBM as a top-three BOM line

The economics make the constraint concrete. A modern accelerator carries 8 to 12 HBM stacks. At HBM3E pricing of roughly $300 per stack, that is ~$2,400-3,600 of memory on a single package; at HBM4's estimated ~$500 per stack, an eight-stack Rubin-class part carries ~$4,000 of HBM before the GPU die, the interposer, or the substrate is counted. On many accelerator bills of materials, HBM is the largest line item after the compute die itself, and it can exceed the die once HBM's yield drag is counted. It is also rising faster than any other component as bandwidth and stack height climb.

This reframes accelerator selection (Chapter 7.11). Two parts with similar FLOPS can differ by thousands of dollars of HBM, and that delta shows up directly in cost-per-token. It also reframes the vendor's own roofline math: every extra TB/s of bandwidth a designer buys costs real BOM, so architects fight a continuous optimization between spending the memory budget on capacity (more, taller stacks to fit bigger models and longer context) or bandwidth (faster stacks to feed the cores). That fork is the heart of the part's personality.
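The BOM arithmetic in this section can be reproduced and extended to cost per gigabyte and cost per TB/s. The sketch below is illustrative only: prices and per-stack bandwidths are the indicative figures from the comparison table, the capacities are the upper ends of the table's ranges, and the `HbmStack` class is a hypothetical helper introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmStack:
    name: str
    price_usd: float      # indicative per-stack price (order-of-magnitude, not a quote)
    capacity_gb: float    # upper end of the table's capacity range (assumption)
    bandwidth_tbs: float  # representative per-stack bandwidth

    def usd_per_gb(self) -> float:
        return self.price_usd / self.capacity_gb

    def usd_per_tbs(self) -> float:
        return self.price_usd / self.bandwidth_tbs


def package_hbm_usd(stack: HbmStack, stacks: int) -> float:
    """HBM cost on one package: stack count times per-stack price."""
    return stack.price_usd * stacks


HBM3E = HbmStack("HBM3E", price_usd=300.0, capacity_gb=36.0, bandwidth_tbs=1.2)
HBM4 = HbmStack("HBM4", price_usd=500.0, capacity_gb=48.0, bandwidth_tbs=2.0)

for part, count in ((HBM3E, 8), (HBM3E, 12), (HBM4, 8)):
    print(f"{part.name} x{count}: ${package_hbm_usd(part, count):,.0f} of HBM per package")

for part in (HBM3E, HBM4):
    print(f"{part.name}: ${part.usd_per_gb():.2f}/GB, ${part.usd_per_tbs():.0f} per TB/s")
```

On these reference points the two generations cost about the same per TB/s, while HBM4 costs somewhat more per gigabyte. That conclusion is only as good as the indicative prices it rests on.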

Capacity vs bandwidth: the architect's fork

Given a fixed package area, a fixed interposer reach, and a fixed memory budget, an accelerator designer cannot maximize both HBM capacity and HBM bandwidth: they trade against each other through stack count, stack height, and the speed bin chosen. The right answer is set entirely by the dominant workload, and getting it wrong produces a part that is technically impressive and commercially mismatched.

Optimize for capacity when the workload is large-model inference, long-context, or KV-cache-heavy agentic serving. Here the binding limit is whether the weights plus the KV cache for a useful batch size fit in HBM at all. Insufficient capacity forces you to shard the model across more GPUs (raising cost-per-token through communication overhead) or to evict KV cache (raising latency). This is why per-GPU HBM capacity has climbed so aggressively (H100 80 GB, H200 141 GB, B200 192 GB, B300 288 GB, toward Rubin Ultra's ~1 TB on HBM4E) and why 16-Hi stacks matter: the extra dies per stack are pure capacity.

Optimize for bandwidth when the workload is decode-bound throughput at moderate model size, where the model already fits and the cores are starved for bytes rather than for capacity. Here the faster speed bin and the wider HBM4 interface earn their BOM.

The consequence of mismatching this fork is expensive and quiet, because the part still runs — it just runs uneconomically. A capacity-optimized part used for bandwidth-bound decode leaves bandwidth on the table and serves fewer tokens per second than its FLOPS suggest. A bandwidth-optimized part used for a model that does not fit forces sharding and pays a communication tax on every step. The precision and KV-cache choices in Chapter 7.10 are the software-side lever on exactly this tradeoff: quantizing weights and KV cache to FP8/FP4 is, in effect, a way to buy back HBM capacity and bandwidth you could not get in silicon.
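The capacity side of the fork, and the way quantization buys it back, can be sketched as a fit check. The model shape, context length, batch size, and 90% usable-HBM fraction below are all hypothetical values chosen for illustration; the 192 GB capacity is the B200 figure cited in this chapter, and GB means 10^9 bytes. The KV-cache formula is the standard one for attention with a fixed number of KV heads.

```python
import math


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: float) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token, per sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def gpus_needed(total_bytes: float, hbm_bytes_per_gpu: float,
                usable_fraction: float = 0.9) -> int:
    """Minimum GPU count for the working set to fit, ignoring sharding overheads."""
    return math.ceil(total_bytes / (hbm_bytes_per_gpu * usable_fraction))


# Hypothetical 70B-parameter model serving 16 sequences at 32k context.
PARAMS = 70e9
SHAPE = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32768, batch=16)
HBM_PER_GPU = 192e9  # B200-class capacity from this chapter

for label, nbytes in (("FP16", 2), ("FP8", 1)):
    weights = PARAMS * nbytes
    kv = kv_cache_bytes(**SHAPE, bytes_per_elem=nbytes)
    total = weights + kv
    print(f"{label}: weights {weights / 1e9:.0f} GB + KV {kv / 1e9:.1f} GB "
          f"-> {gpus_needed(total, HBM_PER_GPU)} GPU(s)")
```

In this example halving the bytes per element moves the working set from two GPUs to one, which is the "buy back capacity in software" effect described above. Real deployments also need activation and framework memory that this sketch leaves out.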

Capacity-optimized vs bandwidth-optimized HBM configuration

Axis	Capacity-optimized	Bandwidth-optimized
Lever	Taller stacks (16-Hi), more GB/stack	Faster speed bin, wider/newer-gen PHY
Best-fit workload	Large-model / long-context / agentic inference; big KV cache	Decode-bound throughput; moderate model size
Binding limit it relieves	Model + KV cache fits without sharding	Cores no longer starve for bytes per decode step
Cost it adds	Stack-height yield drag; hybrid-bonding tool gate	Higher per-stack price; tighter power/thermal budget
Failure mode if mismatched	Wasted GB on a bandwidth-bound job	Forced sharding + comms tax when the model won't fit
Software-side hedge	KV-cache quantization (→ 7.10)	Weight quantization to FP8/FP4 (→ 7.10)

The same memory budget pulls in two directions. Which side you land on is set by the dominant workload, not by what is technically maximal.

Figure	What it measures	As of	Sources
3 suppliers	SK hynix (~50-55%, 60-70% of Rubin HBM4), Samsung (~25-30%), Micron (rest): the entire HBM market	2026	TrendForce; Momoview; Yahoo/Reuters (Rubin certification)
Sold out 2026	All three suppliers' entire 2026 HBM output pre-booked by end-2025; ~30% demand-supply gap	2026	SK hynix / Micron statements; SemiAnalysis
~8 → ~22 TB/s	Per-GPU HBM bandwidth, HBM3E (B200/B300 class) → HBM4 (Rubin, 8 stacks): ~2.8x	2026	NVIDIA; Tom's Hardware; TrendForce
~$300 → ~$500	Indicative HBM3E → HBM4 per-stack price; 8-12 stacks per accelerator	2026	Momoview / TrendForce synthesis
~20% hike	HBM3E contract price increase into 2026; DRAM spot ~+90% Q1-2026 QoQ on spillover	2026	TrendForce
~3x wafers	HBM consumes ~3x the DRAM wafer area per GB of commodity DDR5: the structural crowding-out driver	2026	Tom's Hardware; Tech Times
~30% / ~8%	HBM as share of DRAM revenue (~30%) vs share of DRAM bits (~8%): why fabs prioritize it	2026	Tom's Hardware; tech-insider
80 → 288 GB → ~1 TB	Per-GPU HBM capacity: H100 80, B200 192, B300 288 (HBM3E), Rubin Ultra ~1 TB (HBM4E)	2026 (roadmap)	NVIDIA Developer

DRAM spillover: when AI memory eats the world's RAM

The HBM crisis does not stay in the data center. HBM needs roughly 3x the wafer area per gigabyte that commodity DRAM does, and an HBM module sells for $60-100 against $5-10 for the equivalent commodity DDR5, so rational memory makers convert wafer starts away from consumer and server DRAM toward HBM as fast as qualification allows. HBM reached roughly 30% of all DRAM revenue while occupying only ~8% of DRAM bit output, a margin concentration that makes the conversion economically irresistible. The result in 2026 was a broad memory shortage: DDR5 spot prices surged (DRAM spot rose roughly 90% quarter-over-quarter in Q1 2026 on the spillover), server DRAM budgets blew out, and even non-AI buyers found themselves competing for capacity that had been redirected to feed accelerators.
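The crowding-out logic follows from three of the figures above: the ~3x wafer-area multiple, the ~8% bit share, and the ~30% revenue share. The sketch below derives what they jointly imply. It is back-of-envelope arithmetic that treats all non-HBM DRAM as one commodity pool with uniform wafer area per bit, which is a simplifying assumption.

```python
def wafer_share(hbm_bit_share: float, area_multiple: float) -> float:
    """Fraction of DRAM wafer area consumed by HBM, given its share of bits and
    how many times more wafer area it needs per bit than commodity DRAM."""
    hbm_area = hbm_bit_share * area_multiple
    commodity_area = 1.0 - hbm_bit_share
    return hbm_area / (hbm_area + commodity_area)


def revenue_per_bit_ratio(hbm_rev_share: float, hbm_bit_share: float) -> float:
    """How many times more revenue a bit of HBM earns than a bit of commodity DRAM."""
    hbm = hbm_rev_share / hbm_bit_share
    commodity = (1.0 - hbm_rev_share) / (1.0 - hbm_bit_share)
    return hbm / commodity


BIT_SHARE, REV_SHARE, AREA_MULTIPLE = 0.08, 0.30, 3.0  # figures from this chapter

area = wafer_share(BIT_SHARE, AREA_MULTIPLE)
per_bit = revenue_per_bit_ratio(REV_SHARE, BIT_SHARE)
per_wafer = per_bit / AREA_MULTIPLE  # revenue per unit of wafer area, HBM vs commodity

print(f"HBM wafer-area share: {area:.1%}")
print(f"Revenue per bit, HBM vs commodity: {per_bit:.1f}x")
print(f"Revenue per wafer area, HBM vs commodity: {per_wafer:.1f}x")
```

Even after paying the wafer-area penalty, a wafer of HBM earns more than a wafer of commodity DRAM on these figures, which is why the conversion continues as long as qualification allows.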

For an AI operator, this spillover is a second-order cost that is easy to miss at scoping time. The host memory for your GPU servers — the LPDDR5X or DDR5/MRDIMM that pairs with the accelerators (Chapter 7.8) — is drawn from the same crowded-out commodity pool, so HBM scarcity inflates not just the accelerator BOM but the host BOM beside it. A build that budgeted host memory at 2024 prices and assumed easy availability is exposed on a line item nobody flagged. The lesson is that HBM is not a contained component decision; it reprices the entire memory stack of the machine, and the ripple reaches your CPU sockets and your storage cache.

Deep dive: why HBM stack height is a yield and tooling cliff, not a slope

It is tempting to read "8-Hi → 12-Hi → 16-Hi" as a smooth capacity ramp. It is not. Each added die in the stack multiplies the ways the stack can fail and tightens a thermal and mechanical budget that was already marginal. The TSVs must align through every die; warpage accumulates with height; and heat generated low in the stack must escape through every die above it, so the lower dies, farthest from the heat spreader, run hottest, and each added die lengthens that thermal path. Pushing from 12-Hi to 16-Hi is where the joining technology itself has to change: advanced microbump (MR-MUF) reflow runs out of vertical gap budget, and copper-to-copper hybrid bonding, which eliminates the solder bump, shrinks the die-to-die gap, and cuts joint thermal resistance ~20%, becomes mandatory rather than optional.
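The "multiplies the ways the stack can fail" point can be illustrated with a simple compounding model. The sketch below assumes each die join succeeds independently with the same probability, and the 99% per-join yield is a hypothetical number, not a reported figure. It captures only the compounding part of the problem; the change of joining technology at 16-Hi is a separate discontinuity that a smooth model like this cannot show.

```python
def stack_yield(per_join_yield: float, dies: int) -> float:
    """Probability that every die in the stack joins correctly, assuming
    independent joins with identical yield (a simplifying assumption)."""
    return per_join_yield ** dies


def dies_per_good_stack(per_join_yield: float, dies: int) -> float:
    """Expected DRAM dies consumed per good stack when a failed stack is scrapped whole."""
    return dies / stack_yield(per_join_yield, dies)


PER_JOIN_YIELD = 0.99  # hypothetical

for height in (8, 12, 16):
    y = stack_yield(PER_JOIN_YIELD, height)
    consumed = dies_per_good_stack(PER_JOIN_YIELD, height)
    print(f"{height}-Hi: stack yield {y:.1%}, {consumed:.1f} dies consumed per good stack")
```

Because a single bad join scraps every known-good die beneath it, the cost of a yield loss grows with stack height as well as its probability.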

The catch is that hybrid bonding is gated by a tiny global installed base of the required bonding tools — on the order of a hundred machines worldwide in 2026 — so the 16-Hi capacity step is throttled not by DRAM fab capacity but by a packaging-tool bottleneck that takes years to relieve. This is why a designer choosing capacity-optimization (taller stacks) is implicitly betting on tool availability, and why the per-GPU capacity headline numbers on a 2027 roadmap carry more execution risk than the bandwidth numbers. The full packaging treatment — interposer area, reticle stitching, stack-count-per-package — lives in Chapter 7.7; here the point is that stack height is a discontinuity with a tooling cliff under it, not a dial you turn freely.

HBM and CoWoS: the same upstream gate

It is a recurring mistake to treat the HBM shortage and the packaging shortage as two separate problems. They are one gate. HBM only delivers its bandwidth because it sits on a 2.5D advanced package — TSMC's CoWoS family and its equivalents — that places the stacks within interposer reach of the compute die. So an accelerator cannot ship unless both the HBM stacks exist and there is CoWoS interposer area and assembly capacity to mount them. Through 2026 both were sold out, and the accelerator vendors had locked large shares of CoWoS capacity (NVIDIA holding roughly half) years in advance.

The consequence for procurement is that the binding constraint on how many accelerators reach your floor is decided upstream of the GPU vendor's order book — at the memory supplier and the OSAT/foundry packaging line — and decided 18-24 months before your rack ships. This is the engineering reason the procurement chapter (Chapter 2.3) treats HBM/CoWoS allocation as the real lead-time driver above assembly, and why design-for-substitution (an accelerator SKU that can take HBM3E or HBM4, a known second supplier) is worth more than raw integration speed. The interposer area available to a package also sets how many HBM stacks can physically sit beside the die — the direct link from packaging geometry to the capacity-vs-bandwidth fork above. That stack-count-per-package engineering is the subject of Chapter 7.7.
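The "one gate" argument reduces to a minimum over two allocations. The sketch below is a toy model with hypothetical allocation numbers: shippable accelerators are capped by whichever runs out first, HBM stacks divided by stacks per package or CoWoS package slots. The function and type names are invented for this illustration.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    shippable: int
    binding_gate: str


def shippable_accelerators(hbm_stacks: int, stacks_per_package: int,
                           cowos_slots: int) -> GateResult:
    """Accelerators that can ship given HBM stack allocation and CoWoS package slots."""
    by_hbm = hbm_stacks // stacks_per_package  # whole packages the HBM can populate
    if by_hbm < cowos_slots:
        return GateResult(by_hbm, "HBM")
    if cowos_slots < by_hbm:
        return GateResult(cowos_slots, "CoWoS")
    return GateResult(by_hbm, "both")


# Hypothetical allocations for an eight-stack part.
print(shippable_accelerators(hbm_stacks=100_000, stacks_per_package=8, cowos_slots=10_000))
print(shippable_accelerators(hbm_stacks=64_000, stacks_per_package=8, cowos_slots=10_000))
```

Adding allocation on the non-binding side changes nothing in this model, which is the practical reason to track both allocations together rather than as separate shortages.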

The HBM fork, stated plainly

For a 2026-2027 build the HBM decision reduces to three coupled choices. (1) Generation timing: lock mature HBM3E allocation now and accept its bandwidth ceiling, or wait for HBM4 volume and accept schedule and qualification risk. (2) Capacity vs bandwidth: match the part's HBM personality to whether your dominant workload is capacity-bound (big models, long context, KV cache) or bandwidth-bound (decode throughput) — and use quantization (Chapter 7.10) as the software hedge on whichever you under-buy. (3) Allocation exposure: secure a place in the supplier queue via the accelerator vendor's allocation (Chapter 2.3), or design for substitution so you are not hostage to one supplier or one generation. Decide all three before you commit power, because the HBM lead time, not your construction schedule, is the long pole.

Deep dive: HBM4's logic base die changes the competitive map

HBM3E and earlier put a relatively dumb DRAM-process die at the bottom of the stack as the interface layer. HBM4 moves that base die onto an advanced logic node, fabricated with foundry involvement (TSMC for SK hynix; Samsung's own foundry for Samsung). This sounds like a manufacturing detail; it is a strategic shift. A logic base die can host more of the memory controller, signal conditioning, and even custom logic co-designed with the specific accelerator — turning HBM from a standard catalog part into a semi-custom subsystem negotiated between the memory vendor, the foundry, and the GPU designer.

Three consequences follow. First, qualification slows and tightens, because the base die is now a co-engineered interface rather than a drop-in — part of why HBM4 timing carries real risk. Second, the foundry enters the HBM value chain, deepening the dependence of every HBM4 part on the same TSMC/Samsung capacity that the compute dies already compete for. Third, differentiation moves into the stack: HBM4E and beyond open the door to genuinely custom base dies per customer, which favors the largest accelerator buyers who can fund a custom interface and further entrenches the oligopoly's pricing power. For an operator, the takeaway is that HBM4 is less a commodity than HBM3E was, and the gap between who can get the best memory and who cannot is widening — a supply-security and concentration risk that bleeds into the geopolitical exposure of Korean HBM and Taiwanese packaging.

Where this leaves the operator

HBM is the clearest case in this entire guide of a component decision that is really a strategy decision. You cannot buy your way out of the shortage at the spot margin, you cannot accelerate the fabs, and you cannot move the qualification timeline — so the only levers you actually hold are which part you select for its memory personality, how early you secure allocation, and how much software flexibility you build to substitute precision for memory you could not get in silicon. Operators who internalize that HBM, not the GPU die, is the binding constraint make three moves the others miss: they read accelerators memory-first, they treat allocation as a years-ahead commitment rather than an order, and they design fleets that can absorb whatever HBM generation and supplier actually shows up.

HBM is the upstream gate that the procurement and allocation strategy in Chapter 2.3 is built around; the packaging that sets how many stacks physically fit beside the die — interposer area, CoWoS families, hybrid bonding, stack-count-per-package — is engineered in Chapter 7.7. The host-memory spillover lands in Chapter 7.8; the precision and KV-cache quantization that trade software for HBM capacity and bandwidth are in Chapter 7.10; the cost-per-token framing that makes memory the governing accelerator-selection metric is in Chapter 7.11. The inference workload that makes bandwidth the binding number is characterized in Chapter 1.3, and the consolidated compute/memory roadmap is in Chapter 16.2.