Chapter 7.6
HBM: The Binding Constraint on AI Compute
An accelerator is a memory system with some math attached. In 2026 the math is cheap and the memory is sold out, so HBM, not the GPU die, is the line that decides how many chips ship and how much each one costs.
What you'll decide here
- Whether your accelerator selection is governed by FLOPS or by delivered HBM bandwidth and capacity — because for inference and long-context workloads the memory wall, not the matrix engine, sets your token economics.
- How exposed your build is to the three-supplier HBM oligopoly and the 2026 sold-out allocation — and whether you have a place in the queue or are buying on the spot margin.
- Whether you optimize the part for capacity (fit the model + KV cache) or bandwidth (feed the cores), and what that costs you in stack count, package area, and DRAM-market spillover.
- How HBM capacity per stack (12-Hi vs 16-Hi) and HBM/CoWoS allocation jointly gate your achievable density ramp — these are the same upstream gate, decided two years before your rack ships.
- Whether to hedge HBM4 timing risk by buying mature HBM3E parts now versus waiting for a bandwidth step that may not be available in volume by the time you need it.
Computing performance used to track the logic die: more transistors, more clocks, more FLOPS. The accelerator era inverted that. A modern training or inference GPU spends most of its cycles waiting on memory, with the arithmetic units starved far more often than saturated. The part of the package that costs the most, ships the slowest, and gates the entire supply chain is the stack of DRAM mounted beside the die. High-Bandwidth Memory (HBM) is the binding constraint on AI compute in 2026, and the operators who understand that buy chips by the terabyte-per-second and the gigabyte, not by the petaFLOP.
This chapter treats HBM as a decision surface, not a spec sheet. We walk the generations (HBM3E to HBM4 to HBM4E) and the three-supplier oligopoly that makes every one of them a scarce, allocated good. We frame HBM as a top-three BOM line and force the capacity-vs-bandwidth optimization that every accelerator architect actually fights over. We trace the 2026 supply crisis and its spillover into the broader DRAM market. And we connect HBM upward to the procurement game (Chapter 2.3) and sideways to the packaging that physically determines how many stacks fit beside a die (Chapter 7.7).
Why memory, not math, is the constraint
The reason HBM exists is a number called arithmetic intensity: the ratio of floating-point operations to bytes moved from memory. Dense matrix multiply — the core of training — has high arithmetic intensity and can keep the cores busy. But autoregressive decode, the dominant cost of modern inference, generates one token at a time and must re-read the entire KV cache and weight set on every step. Its arithmetic intensity is very low: the accelerator is bandwidth-bound, and the math engine idles waiting for bytes. Since inference is now roughly two-thirds of AI compute (Chapter 1.3), the workload that pays the bills is precisely the one that lives or dies on memory bandwidth.
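The arithmetic-intensity argument above can be made concrete with a roofline-style estimate. The following is a minimal sketch, not a benchmark: the accelerator figures (`PEAK_FLOPS`, `MEM_BW`), the 70-billion-parameter model, and the one-byte weight format are all assumed for illustration, and KV-cache traffic is ignored, so real decode rates would be lower.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline bound: the lower of the compute ceiling and bandwidth * intensity."""
    return min(peak_flops, mem_bw * intensity)


# Hypothetical accelerator (assumed figures, not a specific product).
PEAK_FLOPS = 2.0e15  # 2 PFLOP/s of dense low-precision math
MEM_BW = 8.0e12      # 8 TB/s of HBM bandwidth

# Intensity at which the math units would just be saturated.
ridge = PEAK_FLOPS / MEM_BW

# Batch-1 decode: each weight byte is read once per token and used for
# about 2 FLOPs (one multiply, one add), assuming 1-byte weights.
params = 70e9  # hypothetical model size
decode_ai = arithmetic_intensity(flops=2 * params, bytes_moved=1 * params)

# Fraction of peak math the decode step can actually use.
utilization = attainable_flops(decode_ai, PEAK_FLOPS, MEM_BW) / PEAK_FLOPS

# Upper bound on single-stream decode rate if weight reads dominate.
tokens_per_s = MEM_BW / (1 * params)

print(f"ridge point:      {ridge:.0f} FLOP/byte")
print(f"decode intensity: {decode_ai:.0f} FLOP/byte")
print(f"math utilization: {utilization:.1%}")
print(f"decode ceiling:   {tokens_per_s:.0f} tokens/s per stream")
```

Under these assumptions the decode step sits two orders of magnitude below the ridge point, so adding FLOPS changes nothing while adding bandwidth raises the ceiling almost linearly. Batching raises the intensity, which is why serving stacks work so hard to batch.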
HBM is the answer to that wall. Instead of a few DDR channels reaching out across a motherboard, HBM stacks 8, 12, or 16 DRAM dies vertically, connects them through the silicon with thousands of through-silicon vias (TSVs), and places the whole stack within millimeters of the compute die on a shared interposer. The result is a memory interface 2,048 bits wide per stack in HBM4, versus 64 data bits for a DDR5 DIMM (two 32-bit subchannels), delivering terabytes per second at a fraction of the energy-per-bit of off-package DRAM. That proximity is also the trap: HBM only works as part of an advanced package, which means it is gated by the same scarce CoWoS capacity as the die it feeds (Chapter 7.7).
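The link between interface width and bandwidth is simple arithmetic: peak bandwidth is the number of data pins times the per-pin data rate. The sketch below illustrates it. The per-pin rates are assumptions chosen to land near the approximate figures quoted in this chapter, not values taken from it, and shipping parts vary by speed bin.

```python
def peak_bandwidth_tbps(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in TB/s: data pins * Gb/s per pin, divided by 8 bits per byte."""
    return interface_bits * pin_rate_gbps / 8 / 1000


# Per-pin data rates below are assumed for illustration only.
hbm3e_stack = peak_bandwidth_tbps(interface_bits=1024, pin_rate_gbps=9.6)
hbm4_stack = peak_bandwidth_tbps(interface_bits=2048, pin_rate_gbps=8.0)
ddr5_64bit = peak_bandwidth_tbps(interface_bits=64, pin_rate_gbps=6.4)

print(f"HBM3E stack:      {hbm3e_stack:.2f} TB/s")
print(f"HBM4 stack:       {hbm4_stack:.2f} TB/s")
print(f"DDR5, 64 bits:    {ddr5_64bit:.4f} TB/s")
print(f"HBM4 stack / DDR5: {hbm4_stack / ddr5_64bit:.0f}x")
```

Note that in this sketch HBM4 gets its advantage from width rather than from faster pins: the assumed HBM4 pin rate is lower than the assumed HBM3E rate, yet the stack delivers more because the interface is twice as wide.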
The generations: HBM3E to HBM4 to HBM4E
HBM advances on a JEDEC-anchored cadence, and 2026 sits exactly on the seam between two generations, which is itself a procurement decision (buy the mature part now or wait for the bandwidth step). HBM3E is the workhorse of everything shipping in volume in 2026: 8-Hi and 12-Hi stacks at up to roughly 1.2 TB/s per stack, delivering the ~8 TB/s per GPU you see on a Blackwell B200/B300, an AMD MI355X, or a Google Ironwood. HBM4 is the generational step that doubles the interface to a 2,048-bit-per-stack PHY, lifting baseline per-stack bandwidth to roughly 2 TB/s. An NVIDIA Rubin-class part with eight stacks is quoted at ~22 TB/s per GPU, about 2.8x HBM3E; that works out to roughly 2.75 TB/s per stack, so the per-GPU figure assumes a speed bin well above the 2 TB/s baseline. HBM4E follows as the speed-bin and capacity refresh, with Samsung shipping industry-first samples at 3.6 TB/s per stack in mid-2026.
The 2026 inflection is not the bandwidth alone — it is who controls the base die. HBM4 moves the logic base die at the bottom of the stack from a commodity DRAM process to an advanced logic node (TSMC and Samsung foundry involvement), turning the base die into a semi-custom interface co-designed with the accelerator. That deepens the coupling between memory vendor, foundry, and GPU designer, and it is the structural reason HBM4 qualification is slower and more entangled than a normal node shrink. Note also the engineering subtlety beneath the marketing: despite years of hybrid-bonding hype, mainstream HBM4 12-Hi largely stayed on advanced microbump (MR-MUF) joining in 2026, with copper-to-copper hybrid bonding deferred toward the taller 16-Hi and HBM4E parts where the thermal and gap budget finally forces it.
| Generation | Per-stack BW (approx.) | Per-GPU BW (8 stacks, as quoted) | Stack height / capacity | Interface | 2026 status | Indicative $/stack |
|---|---|---|---|---|---|---|
| HBM3 | ~0.8 TB/s | ~6.4 TB/s | 8/12-Hi / 16-24 GB | 1024-bit | Legacy (H100-class) | ~$200 |
| HBM3E | ~1.2 TB/s | ~8 TB/s | 8/12-Hi / 24-36 GB | 1024-bit | Volume workhorse; sold out | ~$300 |
| HBM4 | ~2.0 TB/s (baseline) | ~22 TB/s (Rubin-class, faster speed bin) | 12/16-Hi / 36-48 GB | 2048-bit + logic base die | Mass production ramping from H1 into H2 2026 | ~$500 (est.) |
| HBM4E | ~3.6 TB/s | ~28+ TB/s | 16-Hi+ / 48-64 GB | 2048-bit, custom base die | Sampling (Samsung first, mid-2026) | Higher; not yet contracted |
The table is a roadmap with a fork at every row. The bandwidth step from HBM3E to HBM4 is the single largest per-GPU memory improvement of the decade, but it lands into a supply environment where the previous generation is already sold out — so the question is never just "is HBM4 better" (it is) but "can I get HBM4 in volume on the schedule my deployment needs, or do I lock allocation on mature HBM3E and accept the bandwidth ceiling?" That is the decision the next section is built around.
The three-supplier oligopoly
There are exactly three companies on Earth that can make HBM at scale: SK hynix, Samsung, and Micron. That is the whole market. SK hynix holds the dominant share — roughly 50-55% overall, and supply-chain estimates put it at 60-70% of NVIDIA Rubin HBM4 volume — having been first to qualify each recent generation and first into the NVIDIA flagship socket. Samsung is the swing supplier, capturing roughly a quarter to a third of Rubin volume and racing to leapfrog on HBM4E; Micron is the smallest but a real third source, qualified across HBM3E and now HBM4. In June 2026, NVIDIA publicly certified all three for Vera Rubin HBM4 — a deliberate multi-sourcing move, because a single-supplier flagship is an unacceptable supply risk at gigawatt scale.
An oligopoly this concentrated has two consequences that flow straight into your build. First, pricing power sits with the supplier: HBM is sold on long-term contracts, allocated quarters or years ahead, and the spot margin for late entrants is punishing. Second, qualification is a moat: getting a new HBM supplier or generation into a shipping accelerator takes 12-18 months of co-engineering and reliability burn-in, so the field of who-can-supply-whom moves slowly and is largely locked by the time you are placing orders. The practical upshot for an operator is simple — you do not negotiate HBM, you queue for it, and your place in the queue was determined by the accelerator vendor's allocation, not yours (Chapter 2.3).
HBM as a top-three BOM line
The economics make the constraint concrete. A modern accelerator carries 8 to 12 HBM stacks. At HBM3E pricing of roughly $300 per stack, that is ~$2,400-3,600 of memory on a single package; at HBM4's estimated ~$500 per stack, an eight-stack Rubin-class part carries ~$4,000 of HBM before the GPU die, the interposer, or the substrate is counted. On many accelerator bills of materials, HBM is the single largest line item after the compute die itself — frequently larger than the die when you account for HBM's yield drag — and it is rising faster than any other component as bandwidth and stack height climb.
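The per-package arithmetic above is easy to reproduce and extend to cost per unit of bandwidth and capacity. The sketch below uses only the indicative figures quoted in this chapter (price per stack, approximate per-stack bandwidth, 12-Hi capacity); the `HbmStack` class and its method names are hypothetical helpers, and the prices are estimates rather than contract figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmStack:
    name: str
    price_usd: float       # indicative price per stack
    bandwidth_tbps: float  # approximate per-stack bandwidth
    capacity_gb: float     # 12-Hi capacity

    def usd_per_gb(self) -> float:
        return self.price_usd / self.capacity_gb

    def usd_per_tbps(self) -> float:
        return self.price_usd / self.bandwidth_tbps


def package_hbm_cost(stack: HbmStack, n_stacks: int) -> float:
    """HBM cost on one package, before die, interposer, or substrate."""
    return stack.price_usd * n_stacks


HBM3E = HbmStack("HBM3E", price_usd=300.0, bandwidth_tbps=1.2, capacity_gb=36.0)
HBM4 = HbmStack("HBM4", price_usd=500.0, bandwidth_tbps=2.0, capacity_gb=36.0)

for stack, counts in ((HBM3E, (8, 12)), (HBM4, (8,))):
    for n in counts:
        print(f"{stack.name} x{n}: ${package_hbm_cost(stack, n):,.0f}")

for stack in (HBM3E, HBM4):
    print(f"{stack.name}: ${stack.usd_per_gb():.2f}/GB, "
          f"${stack.usd_per_tbps():.0f} per TB/s")
```

On these indicative numbers the two generations cost about the same per TB/s, while HBM4 costs more per gigabyte at equal 12-Hi capacity. If that holds in real contracts, the HBM4 premium is a payment for bandwidth, not for capacity.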
This reframes accelerator selection (Chapter 7.11). Two parts with similar FLOPS can differ by thousands of dollars of HBM, and that delta shows up directly in cost-per-token. It also reframes the vendor's own roofline math: every extra TB/s of bandwidth a designer buys costs real BOM, so architects fight a continuous optimization between spending the memory budget on capacity (more, taller stacks to fit bigger models and longer context) or bandwidth (faster stacks to feed the cores). That fork is the heart of the part's personality.
Capacity vs bandwidth: the architect's fork
Given a fixed package area, a fixed interposer reach, and a fixed memory budget, an accelerator designer cannot maximize both HBM capacity and HBM bandwidth without bound — they trade against each other through stack count, stack height, and the speed bin chosen. The right answer is set entirely by the dominant workload, and getting it wrong produces a part that is technically impressive and commercially mismatched.
Optimize for capacity when the workload is large-model inference, long-context, or KV-cache-heavy agentic serving: here the binding limit is whether the weights plus the KV cache for a useful batch size fit in HBM at all. Insufficient capacity forces you to shard the model across more GPUs (raising cost-per-token through communication overhead) or to evict KV cache (raising latency). This is why per-GPU HBM capacity has climbed so aggressively (H100 80 GB, H200 141 GB, B200 192 GB, B300 288 GB, toward Rubin Ultra's ~1 TB on HBM4E) and why 16-Hi stacks matter: at the same interface width, the four extra dies over a 12-Hi stack are pure capacity. Optimize for bandwidth when the workload is decode-bound throughput at moderate model size, where the model already fits and the cores are limited by how fast bytes arrive rather than by how many can be stored; here the faster speed bin and the wider HBM4 interface earn their BOM.
The consequence of mismatching this fork is expensive and quiet, because the part still runs — it just runs uneconomically. A capacity-optimized part used for bandwidth-bound decode leaves bandwidth on the table and serves fewer tokens per second than its FLOPS suggest. A bandwidth-optimized part used for a model that does not fit forces sharding and pays a communication tax on every step. The precision and KV-cache choices in Chapter 7.10 are the software-side lever on exactly this tradeoff: quantizing weights and KV cache to FP8/FP4 is, in effect, a way to buy back HBM capacity and bandwidth you could not get in silicon.
| Axis | Capacity-optimized | Bandwidth-optimized |
|---|---|---|
| Lever | Taller stacks (16-Hi), more GB/stack | Faster speed bin, wider/newer-gen PHY |
| Best-fit workload | Large-model / long-context / agentic inference; big KV cache | Decode-bound throughput; moderate model size |
| Binding limit it relieves | Model + KV cache fits without sharding | Cores no longer starve for bytes per decode step |
| Cost it adds | Stack-height yield drag; hybrid-bonding tool gate | Higher per-stack price; tighter power/thermal budget |
| Failure mode if mismatched | Wasted GB on a bandwidth-bound job | Forced sharding + comms tax when the model won't fit |
| Software-side hedge | KV-cache quantization (→ 7.10) | Weight quantization to FP8/FP4 (→ 7.10) |
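The capacity side of the fork comes down to a fit check: do the weights plus the KV cache fit in the HBM available, and if not, how many GPUs does the shard need? The sketch below is a rough estimate under stated assumptions. The model shape (80 layers, 8 KV heads, head dimension 128), the 128K context, the batch of 16, and the 90% usable-HBM fraction are all hypothetical, and the calculation ignores activations and the communication cost of sharding.

```python
import math


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch: int, bytes_per_elem: int) -> int:
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_elem


def gpus_needed(params: float, bytes_per_weight: int, kv_bytes: int,
                hbm_gb: float, usable_fraction: float = 0.9) -> int:
    """Minimum GPU count so that weights + KV cache fit in usable HBM."""
    total_bytes = params * bytes_per_weight + kv_bytes
    usable_bytes = hbm_gb * 1e9 * usable_fraction
    return math.ceil(total_bytes / usable_bytes)


# Hypothetical model and serving point (assumptions, not a real deployment).
PARAMS = 70e9
SHAPE = dict(layers=80, kv_heads=8, head_dim=128, context_len=131_072, batch=16)

for hbm_gb in (192, 288):  # per-GPU capacities quoted in this chapter
    for label, nbytes in (("FP16", 2), ("FP8", 1)):
        kv = kv_cache_bytes(**SHAPE, bytes_per_elem=nbytes)
        n = gpus_needed(PARAMS, nbytes, kv, hbm_gb)
        print(f"{hbm_gb} GB HBM, {label}: KV cache {kv / 1e9:.0f} GB -> {n} GPU(s)")
```

With these assumed numbers the KV cache, not the weights, dominates the footprint, and halving the bytes per element roughly halves the GPU count. That is the sense in which quantization buys back capacity that was not available in silicon.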
DRAM spillover: when AI memory eats the world's RAM
The HBM crisis does not stay in the data center. Because HBM consumes roughly 3x the wafer area per gigabyte that commodity DRAM does, and because an HBM module sells for $60-100 against $5-10 for the equivalent commodity DDR5, rational memory makers convert wafer starts away from consumer and server DRAM toward HBM as fast as qualification allows. HBM reached roughly 30% of all DRAM revenue while occupying only ~8% of DRAM bit output, a margin concentration that makes the conversion economically irresistible. The result in 2026 was a broad memory shortage: DDR5 spot prices surged sharply (commodity module prices multiplying over 2025-2026), server DRAM budgets blew out, and even non-AI buyers found themselves competing for capacity that had been redirected to feed accelerators.
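The revenue-share and bit-share figures above imply a revenue-per-bit premium, and combining that with the wafer-area ratio gives a rough revenue-per-wafer comparison. The sketch below does that arithmetic. It is a back-of-envelope estimate only: it treats the three quoted figures as mutually comparable, and it measures revenue rather than margin, since the chapter gives no cost data.

```python
HBM_REVENUE_SHARE = 0.30       # HBM share of DRAM revenue (quoted above)
HBM_BIT_SHARE = 0.08           # HBM share of DRAM bit output (quoted above)
WAFER_AREA_PER_GB_RATIO = 3.0  # HBM wafer area per GB vs commodity DRAM (quoted above)


def revenue_per_bit_ratio(revenue_share: float, bit_share: float) -> float:
    """Revenue per bit of HBM relative to revenue per bit of all other DRAM."""
    hbm = revenue_share / bit_share
    other = (1 - revenue_share) / (1 - bit_share)
    return hbm / other


per_bit = revenue_per_bit_ratio(HBM_REVENUE_SHARE, HBM_BIT_SHARE)

# HBM needs ~3x the wafer area per bit, so revenue per wafer scales down by 3.
per_wafer = per_bit / WAFER_AREA_PER_GB_RATIO

print(f"revenue per bit, HBM vs other DRAM:   {per_bit:.2f}x")
print(f"revenue per wafer, HBM vs other DRAM: {per_wafer:.2f}x")
```

Even after paying the wafer-area penalty, the estimate stays above 1x, which is the arithmetic behind the wafer-start conversion described above.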
For an AI operator, this spillover is a second-order cost that is easy to miss at scoping time. The host memory for your GPU servers — the LPDDR5X or DDR5/MRDIMM that pairs with the accelerators (Chapter 7.8) — is drawn from the same crowded-out commodity pool, so HBM scarcity inflates not just the accelerator BOM but the host BOM beside it. A build that budgeted host memory at 2024 prices and assumed easy availability is exposed on a line item nobody flagged. The lesson is that HBM is not a contained component decision; it reprices the entire memory stack of the machine, and the ripple reaches your CPU sockets and your storage cache.
Deep dive: why HBM stack height is a yield and tooling cliff, not a slope
It is tempting to read "8-Hi → 12-Hi → 16-Hi" as a smooth capacity ramp. It is not. Each added die in the stack multiplies the ways the stack can fail and tightens a thermal and mechanical budget that was already marginal. The TSVs must align through every die; warpage accumulates with height; and the heat generated at the bottom of the stack must escape through the dies above it, so the top dies run hottest exactly where the package is most thermally constrained. Pushing from 12-Hi to 16-Hi is where the joining technology itself has to change: advanced microbump (MR-MUF) reflow runs out of vertical gap budget, and copper-to-copper hybrid bonding — which eliminates the solder bump, shrinks the die-to-die gap, and cuts joint thermal resistance ~20% — becomes mandatory rather than optional.
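The "multiplies the ways the stack can fail" point can be illustrated with a simple compound-yield model, in which a stack is good only if every layer is good. This is a deliberately crude sketch: the per-layer yields are hypothetical, failures are assumed independent, and the model captures only the multiplicative part of the problem, not the thermal limits or the change of joining technology that make 16-Hi a discontinuity.

```python
def stack_yield(per_layer_yield: float, n_dies: int) -> float:
    """Probability that all n_dies layers are good, assuming independent failures."""
    return per_layer_yield ** n_dies


def cost_multiplier(per_layer_yield: float, n_dies: int) -> float:
    """Cost per good stack relative to a stack with perfect yield."""
    return 1.0 / stack_yield(per_layer_yield, n_dies)


# Hypothetical per-layer yields (die plus its bond), for illustration only.
for per_layer in (0.99, 0.97):
    for height in (8, 12, 16):
        y = stack_yield(per_layer, height)
        m = cost_multiplier(per_layer, height)
        print(f"per-layer {per_layer:.2f}, {height}-Hi: "
              f"stack yield {y:.1%}, cost per good stack x{m:.2f}")
```

The useful observation is sensitivity: a small drop in per-layer yield costs far more at 16-Hi than at 8-Hi, so taller stacks demand a better joining process, not just more of the same one.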
The catch is that hybrid bonding is gated by a tiny global installed base of the required bonding tools — on the order of a hundred machines worldwide in 2026 — so the 16-Hi capacity step is throttled not by DRAM fab capacity but by a packaging-tool bottleneck that takes years to relieve. This is why a designer choosing capacity-optimization (taller stacks) is implicitly betting on tool availability, and why the per-GPU capacity headline numbers on a 2027 roadmap carry more execution risk than the bandwidth numbers. The full packaging treatment — interposer area, reticle stitching, stack-count-per-package — lives in Chapter 7.7; here the point is that stack height is a discontinuity with a tooling cliff under it, not a dial you turn freely.
HBM and CoWoS: the same upstream gate
It is a recurring mistake to treat the HBM shortage and the packaging shortage as two separate problems. They are one gate. HBM only delivers its bandwidth because it sits on a 2.5D advanced package — TSMC's CoWoS family and its equivalents — that places the stacks within interposer reach of the compute die. So an accelerator cannot ship unless both the HBM stacks exist and there is CoWoS interposer area and assembly capacity to mount them. Through 2026 both were sold out, and the accelerator vendors had locked large shares of CoWoS capacity (NVIDIA holding roughly half) years in advance.
The consequence for procurement is that the binding constraint on how many accelerators reach your floor is decided upstream of the GPU vendor's order book — at the memory supplier and the OSAT/foundry packaging line — and decided 18-24 months before your rack ships. This is the engineering reason the procurement chapter (Chapter 2.3) treats HBM/CoWoS allocation as the real lead-time driver above assembly, and why design-for-substitution (an accelerator SKU that can take HBM3E or HBM4, a known second supplier) is worth more than raw integration speed. The interposer area available to a package also sets how many HBM stacks can physically sit beside the die — the direct link from packaging geometry to the capacity-vs-bandwidth fork above. That stack-count-per-package engineering is the subject of Chapter 7.7.
Deep dive: HBM4's logic base die changes the competitive map
HBM3E and earlier put a relatively dumb DRAM-process die at the bottom of the stack as the interface layer. HBM4 moves that base die onto an advanced logic node, fabricated with foundry involvement (TSMC for SK hynix; Samsung's own foundry for Samsung). This sounds like a manufacturing detail; it is a strategic shift. A logic base die can host more of the memory controller, signal conditioning, and even custom logic co-designed with the specific accelerator — turning HBM from a standard catalog part into a semi-custom subsystem negotiated between the memory vendor, the foundry, and the GPU designer.
Three consequences follow. First, qualification slows and tightens, because the base die is now a co-engineered interface rather than a drop-in — part of why HBM4 timing carries real risk. Second, the foundry enters the HBM value chain, deepening the dependence of every HBM4 part on the same TSMC/Samsung capacity that the compute dies already compete for. Third, differentiation moves into the stack: HBM4E and beyond open the door to genuinely custom base dies per customer, which favors the largest accelerator buyers who can fund a custom interface and further entrenches the oligopoly's pricing power. For an operator, the takeaway is that HBM4 is less a commodity than HBM3E was, and the gap between who can get the best memory and who cannot is widening — a supply-security and concentration risk that bleeds into the geopolitical exposure of Korean HBM and Taiwanese packaging.
Where this leaves the operator
HBM is the clearest case in this entire guide of a component decision that is really a strategy decision. You cannot buy your way out of the shortage at the spot margin, you cannot accelerate the fabs, and you cannot move the qualification timeline — so the only levers you actually hold are which part you select for its memory personality, how early you secure allocation, and how much software flexibility you build to substitute precision for memory you could not get in silicon. Operators who internalize that HBM, not the GPU die, is the binding constraint make three moves the others miss: they read accelerators memory-first, they treat allocation as a years-ahead commitment rather than an order, and they design fleets that can absorb whatever HBM generation and supplier actually shows up.