Chapter 7.2

NVIDIA Accelerators: Hopper → Blackwell → Vera Rubin → Rubin Ultra → Feynman

NVIDIA's accelerator roadmap is not a spec sheet you read but a power-and-density ramp you are forced to design against — each annual generation moves the unit of purchase from the chip to the rack to the multi-rack pod, and committing to the wrong rung sets your cooling plant, power architecture, and refresh economics for years.

POWER-BOUND · DENSITY-RAMP · GOODPUT

What you'll decide here

Which generation you actually buy into — Hopper, Blackwell, or Rubin — and therefore the rack power envelope (40 kW → 132 kW → 190+ kW) your facility substrate must already accommodate.
Whether your unit of purchase is the GPU, the HGX board, or the rack-scale NVL system (NVL72/144/576) — because the scale-up domain you buy is the scale-up domain you are stuck with until refresh.
Whether to ride the annual cadence at every step (Hopper → Blackwell → Blackwell Ultra → Rubin → Rubin Ultra → Feynman) or skip generations — and how to amortize a 2–3 year economic life against a 1-year obsolescence clock.
For inference at long context, whether to adopt disaggregated serving (Rubin CPX context GPUs + Rubin generation GPUs) or stay monolithic — a fork that changes your BOM, your fabric, and your cost per token.
Whether the 800 VDC / Kyber transition is a bridge you design toward now (reserved busbar, floor loading, water) or a wall you hit later when a 600 kW rack will not fit the hall you built.

Per-rack power climbs ~15x from Hopper to Rubin Ultra (~40 kW to ~600 kW), and each step forces a power-and-cooling substrate decision you cannot cheaply undo.

NVIDIA sells a cadence: an annual rhythm of accelerator generations, each one re-drawing the rack, the fabric, and the power chain underneath it. The decision this chapter forces is not "which GPU is fastest" but "which rung of the ramp am I committing my building to, and what does the next rung cost me if I guessed wrong." Since 2022 the relevant unit of purchase has migrated upward: from the H100 as a board, to the GB200 NVL72 as a 132 kW rack you buy whole, to the Vera Rubin NVL144 and the Rubin Ultra Kyber NVL576 as multi-rack pods plumbed for 800 VDC. Each migration is a one-way door for the facility that hosts it. You can defer the silicon; you cannot defer the floor loading, the water, and the interconnection slot the silicon implies.

This chapter walks the roadmap as a sequence of decisions and their downstream costs. We trace the per-GPU specs across Hopper → Blackwell → Blackwell Ultra → Vera Rubin → Rubin Ultra → Feynman; we explain why the NVL system — not the GPU — became the unit of procurement, and why the size of the scale-up domain you buy (8 → 72 → 144 → 576 GPUs) is a strategic commitment, not a datasheet line; we cover the disaggregated-inference fork that Rubin CPX introduces; and we treat the annual cadence as the lever that compresses competitors' design windows and your own depreciation schedule simultaneously. Per-GPU NVLink bandwidth appears here as a datasheet attribute; the NVLink/NVSwitch fabric that aggregates it has its canonical home in Chapter 8.2.

The master fork: you are buying a power envelope, not a FLOPS number

The instinct is to compare generations on peak FLOPS. That is the marketing-number trap (Chapter 7.1): the headline figures are sparse FP4 with all the asterisks stripped, and they tell you almost nothing about what you must build. The number that actually cascades through your facility is rack power. An H100 air-cooled rack lands near 40 kW; a GB200 NVL72 draws ~132 kW and mandates direct-to-chip liquid cooling; a GB300 NVL72 pushes ~140 kW; the Vera Rubin VR200 NVL144 lands in the ~190–230 kW band; and the Rubin Ultra Kyber NVL576 targets ~600 kW on an 800 VDC bus. That is a 15x escalation in rack power between the 2022 and H2 2027 parts, roughly five years.
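The escalation is easier to see as step multiples. The following is a minimal sketch using the rack-power figures quoted above; the Vera Rubin value takes the midpoint of the ~190–230 kW band, an assumption made only to get a single number.

```python
# Approximate rack power (kW) per generation, taken from the figures in the text.
# The Vera Rubin entry uses the midpoint of the ~190-230 kW band (an assumption).
RACK_KW = {
    "H100 (air)": 40,
    "GB200 NVL72": 132,
    "GB300 NVL72": 140,
    "VR200 NVL144": 210,
    "Rubin Ultra Kyber NVL576": 600,
}

def step_multiples(rack_kw):
    """Return (generation, kW, multiple vs previous, multiple vs first) tuples."""
    names = list(rack_kw)
    first = rack_kw[names[0]]
    rows = []
    prev = first
    for name in names:
        kw = rack_kw[name]
        rows.append((name, kw, kw / prev, kw / first))
        prev = kw
    return rows

for name, kw, vs_prev, vs_first in step_multiples(RACK_KW):
    print(f"{name:26s} {kw:4d} kW  x{vs_prev:.2f} vs previous  x{vs_first:.1f} vs H100")
```

Under these inputs the largest single step is Hopper to Blackwell (3.3x), which is also the step that crosses from air to liquid cooling.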

The consequence: a hall scoped for the previous generation's density cannot absorb the next one without a substrate it does not have. You do not get to "upgrade" from a 40 kW air hall to a 132 kW liquid hall by swapping racks — the floor loading is wrong (a wet NVL72 is ~1.36 t / 3,000 lb), the plenum is wrong, the electrical headroom is wrong, and there is no facility water. The density wall (Chapter 5.1) and the DLC default (Chapter 5.4) are downstream of which rung of this ramp you bought into. The power curve governs; the compute curve follows.
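To make the mismatch concrete, consider a fixed IT power budget. The sketch below assumes a hypothetical 10 MW hall (not a figure from this chapter) and asks how many racks of each class that budget feeds, ignoring cooling overhead and every non-electrical constraint.

```python
import math

HALL_IT_KW = 10_000  # hypothetical 10 MW IT budget, for illustration only

# Rack power (kW) from the text.
RACK_KW = {"H100 air rack": 40, "GB200 NVL72": 132, "Kyber NVL576": 600}

def racks_supported(hall_kw, rack_kw):
    """Whole racks a given IT power budget can feed (power-only view)."""
    return math.floor(hall_kw / rack_kw)

for name, kw in RACK_KW.items():
    print(f"{name:14s}: {racks_supported(HALL_IT_KW, kw):3d} racks")
```

The same electrical budget that feeds 250 air-cooled racks feeds 75 NVL72 racks or 16 Kyber racks. Heat and weight concentrate into a much smaller footprint, so floor loading and facility water, not just total power, decide whether the hall can host them.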

The unit-of-purchase fork: GPU vs HGX board vs NVL rack

Decide what you are actually buying before you compare specs. The GPU (a Hopper or Blackwell SXM module) is a component you slot into an OEM server — maximal flexibility, but you own the integration risk. The HGX board (8 GPUs + NVSwitch on a baseboard) is the inference-and-mainstream unit: an HGX B200 server fits a 30–60 kW rack, air or liquid, and gives you an 8-GPU scale-up domain. The NVL rack-scale system (GB200 NVL72, VR200 NVL144, Kyber NVL576) is a different animal entirely — you buy the rack, the NVLink spine, the NVSwitch trays, the busbar, and the liquid loop as one SKU, and you inherit a 72-, 144-, or 576-GPU scale-up domain you cannot resize. The fork matters because the scale-up domain is the lever that sets your tensor-parallel and expert-parallel ceilings (Chapter 8.2). Training and wide-MoE inference want the biggest domain you can afford; latency-bound small-model inference is wasting money on it. Buy the domain your workload uses, not the one on the keynote slide.

Hopper → Blackwell → Vera Rubin → Rubin Ultra → Feynman: the per-GPU arc

Hopper (H100, 2022 / H200, 2024) is the generation most of the installed base still runs. H100 ships 80 GB HBM3 at ~3.35 TB/s, ~700 W TDP, FP8 Transformer Engine, NVLink 4 at 900 GB/s per GPU. H200 is the same compute die with 141 GB HBM3E at ~4.8 TB/s — a memory-bandwidth refresh that disproportionately helps inference decode. Hopper is air-coolable, which is exactly why it became the default and why the jump to Blackwell broke so many facility assumptions.
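Why a bandwidth-only refresh helps decode can be seen with a standard back-of-envelope bound: at batch size 1, each generated token requires streaming the model weights from HBM once, so tokens per second cannot exceed memory bandwidth divided by weight bytes. The sketch below assumes a hypothetical 70B-parameter model stored at 1 byte per parameter and ignores KV-cache reads, batching, and multi-GPU sharding.

```python
def decode_tokens_per_s_bound(mem_bw_tb_s, params_billion, bytes_per_param=1.0):
    """Upper bound on batch-1 decode rate: bandwidth / bytes of weights read per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return (mem_bw_tb_s * 1e12) / weight_bytes

MODEL_B = 70  # hypothetical model size in billions of parameters (illustrative)

h100 = decode_tokens_per_s_bound(3.35, MODEL_B)  # H100: ~3.35 TB/s
h200 = decode_tokens_per_s_bound(4.8, MODEL_B)   # H200: ~4.8 TB/s

print(f"H100 bound: {h100:.1f} tok/s, H200 bound: {h200:.1f} tok/s, ratio {h200 / h100:.2f}x")
```

The ratio tracks the bandwidth ratio (about 1.43x) whatever model size is chosen, which is the sense in which H200 is a memory-bandwidth refresh on an unchanged compute die.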

Blackwell (B200 / GB200, 2024–2025) is a dual-die GPU — two reticle-limited compute dies on one package behaving as a single CUDA device over a 10 TB/s die-to-die link — with 192 GB HBM3E, a second-generation Transformer Engine adding native FP4, and NVLink 5 at 1.8 TB/s per GPU. The GB200 superchip pairs two Blackwell GPUs with one Grace CPU over NVLink-C2C. Blackwell Ultra (B300 / GB300, 2025) lifts HBM to 288 GB and adds steady-power and transient-mitigation features (capacitor energy storage, ramp smoothing) that exist because a 140 kW rack toggling between idle and full all-reduce is a grid problem (Chapter 7.12).

Vera Rubin (VR200, H2 2026) is the next platform, not just a chip. The Rubin GPU is again dual-die on a 4-reticle CoWoS-L interposer (~336 billion transistors, about 1.6x Blackwell's ~208 billion) with 288 GB HBM4 across 8 stacks at up to ~22 TB/s, sixth-generation Tensor Cores, and NVLink 6 at 3.6 TB/s per GPU. The Vera CPU is NVIDIA's custom Arm successor to Grace. The rack-scale unit is the NVL144, marketed as 144 because it counts compute dies (72 dual-die packages). NVIDIA quotes it at ~3.3x the GB300 NVL72 on inference, with ~3.6 EF FP4 inference / ~1.2 EF FP8 training per rack and ~260 TB/s of scale-up NVLink bandwidth. These are announced figures; full production is targeted for H2 2026.

Rubin Ultra (H2 2027) is where the unit of purchase jumps again. It packs four compute dies per package (~100 PFLOPS FP4, 1 TB HBM4e per package) and deploys in the Kyber NVL576 rack: 144 quad-die packages = 576 GPU compute dies, ~600 kW per rack on 800 VDC, ~15 EF FP4 inference / ~5 EF FP8 training, ~365 TB total memory. Feynman (2028) is the next architecture on the roadmap: TSMC A16 (1.6 nm) with backside power delivery, with the NVLink/NVSwitch and ConnectX/Spectrum generations advancing in lockstep. The cadence is explicit: a new generation every year (Blackwell 2024, Blackwell Ultra 2025, Rubin 2026, Rubin Ultra 2027, Feynman 2028).
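The Rubin Ultra rack figures follow from the per-package numbers. A quick consistency check, using only values quoted above:

```python
PACKAGES_PER_RACK = 144       # Kyber NVL576: quad-die packages per rack
DIES_PER_PACKAGE = 4
PFLOPS_FP4_PER_PACKAGE = 100  # ~100 PFLOPS FP4 per package
HBM_TB_PER_PACKAGE = 1        # 1 TB HBM4e per package

compute_dies = PACKAGES_PER_RACK * DIES_PER_PACKAGE
rack_ef_fp4 = PACKAGES_PER_RACK * PFLOPS_FP4_PER_PACKAGE / 1000  # PFLOPS -> EFLOPS
rack_hbm_tb = PACKAGES_PER_RACK * HBM_TB_PER_PACKAGE

print(compute_dies, rack_ef_fp4, rack_hbm_tb)
```

The product gives 576 dies and 14.4 EF, which rounds to the quoted ~15 EF. HBM alone sums to 144 TB, so the ~365 TB total-memory figure must include memory beyond the GPU HBM stacks (presumably CPU-attached memory; the figures above do not break it down).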

NVIDIA accelerator generations — the per-GPU and per-rack arc

Generation	GPU memory	Mem BW	NVLink/GPU	TDP/GPU	Rack unit	Rack power	Availability
Hopper H100	80 GB HBM3	~3.35 TB/s	900 GB/s (NVLink 4)	~700 W	HGX 8-GPU / DGX	~40 kW (air)	2022
Hopper H200	141 GB HBM3E	~4.8 TB/s	900 GB/s (NVLink 4)	~700 W	HGX 8-GPU	~40 kW (air)	2024
Blackwell GB200	192 GB HBM3E	~8 TB/s	1.8 TB/s (NVLink 5)	~1.0–1.2 kW	NVL72 rack	~120–132 kW (DLC)	2024–2025
Blackwell Ultra GB300	288 GB HBM3E	~8 TB/s	1.8 TB/s (NVLink 5)	~1.4 kW	NVL72 rack	~140 kW (DLC)	2025
Vera Rubin VR200	288 GB HBM4	~22 TB/s	3.6 TB/s (NVLink 6)	~1.8 kW	NVL144 rack	~190–230 kW (DLC)	H2 2026 (announced)
Rubin Ultra	1 TB HBM4e/pkg	(4-die pkg)	(NVLink 6+)	~2.3 kW	Kyber NVL576	~600 kW (800 VDC)	H2 2027 (announced)
Feynman	HBM4e+ (TBD)	TBD	(NVLink 7)	TBD	Kyber-class	≥600 kW (roadmap)	2028 (roadmap)

Per-GPU figures are NVIDIA datasheet / roadmap; 2026+ rows are announced, not shipping. The table carries no FLOPS column; FLOPS figures quoted elsewhere in this chapter are peak FP4 sparse (the marketing precision) unless stated otherwise; see Chapter 7.1 for the dense-vs-sparse, peak-vs-sustained discount. Rack-power and HBM figures cross-checked against provenance.js.

The rack-power column governs the table, not the FLOPS headline. The compute numbers grow impressively, but they are the easy part: TSMC and the HBM vendors deliver them on schedule. The hard part, the part that strands capital, is the rightmost columns: the rack unit changes shape (board → 72-GPU rack → 144 → 576), the power per rack escalates by an order of magnitude, and the cooling and voltage architecture flip underneath. A facility that bought into Blackwell at 132 kW and liquid cooling is one substrate decision (reserved busbar capacity, water headroom, floor loading) away from Rubin; a facility that bought into Hopper at 40 kW and air is a demolition-and-rebuild away. The generation you choose is the substrate you commit to.

Why the NVL system became the unit of purchase

Through Hopper, the unit was the 8-GPU HGX board and the scale-up domain was 8 GPUs wide. Blackwell broke that model: the GB200 NVL72 fuses 72 Blackwell GPUs and 36 Grace CPUs into a single NVLink domain — 18 compute trays and 9 NVSwitch trays connected by a copper NVLink spine carrying ~130 TB/s of aggregate rack bandwidth across more than 5,000 in-rack copper cables — so that all 72 GPUs address each other at full NVLink speed as one coherent memory fabric (~13.4 TB of unified memory, ~1.44 EF FP4 sparse). You do not assemble this from parts; you buy the rack as a SKU. The reason this matters strategically is that the scale-up domain size is now a purchasing decision that sets your parallelism ceilings until your next refresh.

The consequence runs in both directions. A wide domain (72 → 144 → 576 GPUs) lets you fit tensor-parallel and pipeline-parallel groups, and especially wide expert-parallel MoE inference, entirely inside the NVLink fabric — where bandwidth is ~5–10x the scale-out NIC — instead of spilling collectives onto the slower back-end network. Wide-EP MoE serving (e.g., EP32 vs EP8) is the canonical workload that the big domain unlocks (Chapter 8.2). But a wide domain you do not use is stranded capital: a latency-bound 8B-parameter inference service pinned to a 72-GPU NVLink rack is paying for a fabric it never lights up. Buy the domain the workload consumes — the NVL system is a commitment, not a default.
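One way to test whether a workload actually uses a wide domain is to check whether its tightly coupled parallel group fits inside it. This is a minimal sketch, assuming the group size is simply the product of the tensor-parallel and expert-parallel degrees (a simplification; real placements depend on the serving or training framework) and using the domain sizes as marketed in this chapter.

```python
# Scale-up domain sizes named in the text, as marketed (144 and 576 count dies).
DOMAIN_SIZES = [8, 72, 144, 576]

def smallest_fitting_domain(tp, ep, domains=DOMAIN_SIZES):
    """Return the smallest domain that holds a TP x EP group, or None if none does."""
    group = tp * ep
    for size in sorted(domains):
        if group <= size:
            return size
    return None

def group_share_of_domain(tp, ep, domain):
    """Fraction of the domain's GPUs that one replica of the group occupies."""
    return (tp * ep) / domain

print(smallest_fitting_domain(tp=1, ep=8))    # EP8 fits an 8-GPU HGX board
print(smallest_fitting_domain(tp=1, ep=32))   # EP32 needs the 72-GPU rack
print(group_share_of_domain(tp=1, ep=1, domain=72))  # single-GPU service on an NVL72
```

A share close to 1/72 means the rack is running many independent replicas that never exchange traffic over the NVLink fabric between them, which is the stranded-capital case described above.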

Deep dive: NVLink per-GPU bandwidth as a datasheet attribute (and where the fabric lives)

Every generation advertises a per-GPU NVLink number — 900 GB/s on Hopper (NVLink 4), 1.8 TB/s on Blackwell (NVLink 5), 3.6 TB/s on Rubin (NVLink 6) — and it is tempting to treat it like memory bandwidth, a property of the chip. It is not. The per-GPU figure is the injection bandwidth into a switched fabric; what you actually get depends on the NVSwitch generation, the domain size, and the topology that aggregates it. On an NVL72 the 1.8 TB/s per GPU aggregates to ~130 TB/s of rack scale-up bandwidth; on Rubin NVL144 the 3.6 TB/s per GPU aggregates to ~260 TB/s per rack; Rubin Ultra's eight-Kyber-rack pod reaches ~10 PB/s all-to-all. The datasheet attribute is real and comparable across vendors (it is roughly an order of magnitude above the scale-out NIC), but the design decisions it drives — switch-tray count, copper-vs-optical reach, NVLink-SHARP in-network reduction, domain partitioning — are fabric decisions.

So we record the per-GPU number here, in the accelerator chapter, because it is a property you compare when choosing silicon. We engineer the fabric that consumes it — NVSwitch topology, NVLink-SHARP collective offload, the copper-reach wall that is pushing Rubin Ultra toward optical scale-up, and how the scale-up domain is partitioned and scheduled — in Chapter 8.2. Treat the two as a split: the chip chapter owns the attribute; the network chapter owns the system.
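The rack aggregates quoted above are the per-GPU figure multiplied by the number of GPU packages in the domain. A short check, assuming the NVL144 rack holds 72 dual-die packages as stated earlier in this chapter:

```python
def rack_scale_up_tb_s(packages, per_gpu_tb_s):
    """Aggregate injection bandwidth: GPU packages x per-GPU NVLink bandwidth."""
    return packages * per_gpu_tb_s

nvl72 = rack_scale_up_tb_s(72, 1.8)    # Blackwell, NVLink 5
nvl144 = rack_scale_up_tb_s(72, 3.6)   # Rubin, NVLink 6 (72 dual-die packages)

print(f"NVL72: {nvl72:.1f} TB/s, NVL144: {nvl144:.1f} TB/s")
```

This is a sum of injection bandwidths, not a guarantee of delivered all-to-all throughput; what a collective actually achieves depends on the switch topology.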

The disaggregated-inference fork: Rubin CPX

Rubin introduces a second, quieter fork that reshapes the inference BOM. Long-context inference has two phases with opposite hardware profiles: the context (prefill) phase reads and encodes the entire input — compute-bound, hungry for FLOPS, light on memory bandwidth — while the generation (decode) phase emits tokens one at a time, memory-bandwidth-bound and latency-sensitive, leaning on HBM and the KV cache. A monolithic GPU sized for decode (expensive HBM) is overpaying to do prefill; a GPU sized for prefill is starved on memory for decode. NVIDIA's answer is Rubin CPX — a context-phase accelerator with ~30 PFLOPS NVFP4, 3x attention acceleration over GB300, and crucially 128 GB of GDDR7 rather than HBM, which SemiAnalysis estimates is roughly 5x more cost-effective per byte than HBM for this compute-bound role.

The decision: for million-token-context workloads, do you adopt disaggregated serving — a pool of Rubin CPX GPUs doing prefill feeding a pool of Rubin (HBM) GPUs doing decode, coupled over the fabric via KV-cache transfer — or stay monolithic? The Vera Rubin NVL144 CPX rack packages the answer: 144 Rubin GPUs + 144 Rubin CPX GPUs + 36 Vera CPUs, ~100 TB fast memory, ~1.7 PB/s memory bandwidth, ~8 EF NVFP4. Disaggregation wins on cost-per-token at long context because you stop paying HBM prices for compute-bound prefill; it costs you a more complex serving stack (separate pools, KV-cache transport, careful ratio tuning) and a fabric that must move KV cache between phases efficiently (Chapter 10.11). For short-context, latency-flat workloads the disaggregation overhead does not pay — this is a long-context-specific fork.
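The "careful ratio tuning" can be sketched as a throughput balance: the prefill pool must ingest prompt tokens as fast as requests arrive, and the decode pool must emit output tokens at the same request rate. All workload and per-GPU throughput numbers below are hypothetical placeholders, not measured or vendor figures.

```python
import math

def pool_sizes(req_per_s, prompt_tokens, output_tokens,
               prefill_tok_s_per_gpu, decode_tok_s_per_gpu):
    """GPUs needed in each pool so neither phase becomes the bottleneck."""
    prefill_gpus = math.ceil(req_per_s * prompt_tokens / prefill_tok_s_per_gpu)
    decode_gpus = math.ceil(req_per_s * output_tokens / decode_tok_s_per_gpu)
    return prefill_gpus, decode_gpus

# Hypothetical long-context workload and per-GPU throughputs (illustrative only).
prefill, decode = pool_sizes(
    req_per_s=2,
    prompt_tokens=500_000,
    output_tokens=2_000,
    prefill_tok_s_per_gpu=50_000,
    decode_tok_s_per_gpu=500,
)
print(prefill, decode)
```

With these placeholder inputs the prompt-heavy workload needs more prefill than decode capacity. Shortening the prompt shrinks the prefill pool toward the point where a separate pool is not worth its overhead, which matches the long-context-only framing above.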

~132 kW: GB200 NVL72 rack draw (~115 kW liquid + ~17 kW air); ~1.36 t wet, 18 compute + 9 NVSwitch trays. (2025; NVIDIA OCP / Introl)
~600 kW: Rubin Ultra Kyber NVL576 rack on 800 VDC; 144 quad-die packages = 576 compute dies. (H2 2027, announced; NVIDIA GTC; The Next Platform; Tom's Hardware)
3.6 TB/s: NVLink 6 per-GPU bandwidth (Rubin), against 900 GB/s on Hopper and 1.8 TB/s on Blackwell; ~260 TB/s per NVL144 rack. (H2 2026, announced; NVIDIA Vera Rubin POD blog)
288 GB: HBM4 per Rubin GPU at ~22 TB/s; trajectory H100 80 GB → H200 141 GB → B200 192 GB → B300 288 GB → Rubin Ultra 1 TB. (2026; NVIDIA Developer, Rubin platform)
~336 B: transistors per Rubin GPU (dual-die, 4-reticle CoWoS-L); 1.6x Blackwell's ~208 B. (2026, announced; NVIDIA GTC 2026; Barrack AI breakdown)
128 GB: GDDR7 on the Rubin CPX context GPU (~30 PFLOPS NVFP4, 3x attention); ~5x more cost-effective than HBM for prefill. (end-2026, announced; NVIDIA Newsroom; SemiAnalysis)
annual: architecture cadence: Blackwell 2024, Blackwell Ultra 2025, Rubin 2026, Rubin Ultra 2027, Feynman 2028. (2026; NVIDIA roadmap; The Next Platform)
2–3 yr: accelerated economic life vs 5–6 yr book life; the annual cadence compresses the obsolescence clock. (2025; Goldman Sachs; SemiAnalysis synthesis)

The annual cadence as a strategic weapon

The cadence is not merely a delivery schedule — it is a competitive instrument, and it cuts two ways. Against competitors, a yearly architecture compresses the window any challenger has to close a gap: by the time an AMD or a custom-ASIC roadmap matches Blackwell, Rubin is shipping, and the comparison resets (Chapter 7.3, Chapter 7.5). The software moat (CUDA, the NCCL/Dynamo/TensorRT stack) compounds this — a one-year hardware tick gives the ecosystem a fresh target every twelve months. Against the buyer, the same cadence is a depreciation accelerant: a frontier accelerator's economic life is now 2–3 years against a 5–6 year book life, because next year's part does the same work at materially lower cost-per-token. The cadence that protects NVIDIA's lead also shortens your amortization runway.

This sets up the buyer's real decision: ride every generation, or skip? Riding each step keeps you at the lowest available cost per token but means continuous capital outlay and the operational churn of new power, cooling, and fabric envelopes every year. Skipping a generation (e.g., Hopper → Rubin, bypassing Blackwell) reduces churn and lets one substrate investment serve longer, at the cost of running a generation behind on token economics during the gap. The deciding variables are your residual-value assumption (whether used GPUs hold enough value to backstop the refresh; see Chapter 7.11) and whether your facility substrate can even accept the generation you would skip to. You cannot skip from a 40 kW air hall to a 600 kW Kyber pod; the skip is only available if you provisioned the substrate for it.
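The tension between economic life and book life can be put in numbers with straight-line depreciation. The sketch below normalizes purchase cost to 1.0 and assumes zero salvage value in the book schedule; the lives chosen are the endpoints of the ranges quoted above.

```python
def book_value_remaining(age_years, book_life_years):
    """Fraction of purchase cost still on the books under straight-line depreciation."""
    return max(0.0, 1.0 - age_years / book_life_years)

# Refresh at the end of a 2- or 3-year economic life, against a 5- or 6-year book life.
for economic_life in (2, 3):
    for book_life in (5, 6):
        remaining = book_value_remaining(economic_life, book_life)
        print(f"refresh at {economic_life} yr, book life {book_life} yr: "
              f"{remaining:.0%} of cost still undepreciated")
```

Across these combinations between 40% and 67% of the purchase cost is still on the books at refresh. That remainder has to be covered by resale value or written down, which is why the residual-value assumption carries so much of the ride-versus-skip decision.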

The 800 VDC / Kyber wall is a substrate decision you make now

The Rubin-Ultra-era ~600 kW Kyber rack does not run on the 415/480 VAC power architecture that serves Blackwell-class halls — it requires an 800 VDC distribution path (disaggregated sidecar power, DC busbar, supercapacitor ride-through). That is not a rack-swap; it is a power-chain re-architecture. If there is any chance your facility hosts a Rubin-Ultra-or-later generation, the irreversible substrate decisions — reserved electrical capacity and switchgear room, busbar and pipe-rack space, floor loading for 600 kW racks, and facility water at the implied heat load — must be made at scoping time, not when the SKU is announced. Design toward 800 VDC at scoping (Chapter 4.7 on the DC transition). The buyers who get caught are the ones who scoped to the rack they were buying instead of the ramp they were entering.
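A rough sense of why voltage matters comes from I = P / V: for a fixed rack power, conductor current falls in proportion to distribution voltage. The sketch treats the rack as an ideal DC load and compares 800 VDC with a 54 VDC busbar. The 54 V figure is an assumption introduced here for comparison (a common in-rack busbar voltage), not a number from this chapter.

```python
def dc_current_amps(power_kw, volts):
    """Ideal DC load current: I = P / V (ignores conversion losses)."""
    return power_kw * 1000 / volts

RACK_KW = 600  # Rubin Ultra Kyber rack, per the text

i_800 = dc_current_amps(RACK_KW, 800)
i_54 = dc_current_amps(RACK_KW, 54)   # 54 V is an assumed comparison point

print(f"800 VDC: {i_800:.0f} A, 54 VDC: {i_54:.0f} A, ratio {i_54 / i_800:.1f}x")
```

Because resistive loss scales with the square of current, the copper needed to carry a 600 kW rack at low voltage is one reason the higher-voltage bus reads as a requirement rather than an optimization.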

Ride-every-generation vs skip-a-generation — the refresh fork

Strategy	Token-economics position	Capital cadence	Substrate/ops churn	Best fit
Ride every generation	Always at the frontier of cost-per-token	Continuous, annual outlay	High — new power/cooling/fabric envelope each year	Frontier labs; neoclouds competing on price/token
Skip one generation	One step behind during the gap	Lumpy, every ~2 years	Moderate — one substrate serves two cycles	Enterprises with stable workloads; substrate-constrained sites
Hold (run to economic end)	Falls behind; relies on residual demand	Minimal until forced refresh	Lowest — until a hard substrate/density wall	Batch/offline inference; depreciation-sensitive operators

Heuristic, not a rule; the right answer is set by residual-value assumptions, substrate readiness, and token-economics sensitivity. Quantitative NPV in Chapter 7.11.

Deep dive: why "NVL144" counts 144 and the dual-die accounting trap

The naming will trip up anyone reading the roadmap as a procurement spec. GB200 NVL72 means 72 Blackwell GPUs in the NVLink domain — and each of those GPUs is itself a dual-die package, so the rack contains 144 compute dies behaving as 72 CUDA devices. Vera Rubin NVL144 does not mean twice as many packages; it means NVIDIA changed the accounting to count compute dies — 144 dies = 72 dual-die Rubin packages — so the NVL144 rack has the same 72-package footprint as NVL72, not double. Rubin Ultra Kyber NVL576 then means 576 compute dies = 144 quad-die packages, a genuine 2x in package count over NVL72 and a 4x in die count.

Why this matters in practice: if you size power, cooling, and fabric by reading "144" as "twice the GPUs of 72," you will mis-budget. The honest comparison is die-to-die and rack-to-rack power: NVL72 at ~132 kW, Rubin NVL144 at ~190–230 kW (same 72-package footprint, higher per-package power), Kyber NVL576 at ~600 kW (double the packages, an 800 VDC rack). Always reduce the marketing nomenclature to (packages per rack) × (dies per package) × (per-package power) before you put a number in a design-basis document. The cross-vendor version of this discipline — dense vs sparse, peak vs sustained, the marketing-number trap — is the subject of Chapter 7.1.
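The reduction described above can be written down directly. In this sketch the package and die counts come from this section, and the per-package power comes from the TDP column of the generation table (upper bound for GB200, band midpoint for the Rubin rack). The GPU share of rack power is therefore an estimate, with the remainder covering CPUs, switches, power conversion, and everything else.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RackSku:
    name: str
    packages: int          # GPU packages per rack
    dies_per_package: int
    package_kw: float      # approximate per-package TDP (kW)
    rack_kw: float         # approximate rack power (kW)

    @property
    def dies(self) -> int:
        return self.packages * self.dies_per_package

    @property
    def gpu_kw(self) -> float:
        return self.packages * self.package_kw

SKUS = [
    RackSku("GB200 NVL72", 72, 2, 1.2, 132),
    RackSku("VR200 NVL144", 72, 2, 1.8, 210),   # 210 = midpoint of ~190-230 kW (assumption)
    RackSku("Kyber NVL576", 144, 4, 2.3, 600),
]

for s in SKUS:
    print(f"{s.name:13s} packages={s.packages:3d} dies={s.dies:3d} "
          f"GPU power={s.gpu_kw:6.1f} kW of ~{s.rack_kw:.0f} kW rack")
```

Read this way, NVL72 and NVL144 have the same package count and the same die count; what changes between them is per-package power.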

What the roadmap commits, and what it leaves reversible

The discipline that separates a defensible accelerator strategy from a fragile one is the same one that governs facility scoping (Chapter 1.1): sort the decisions by the cost of changing your mind. The roadmap makes some things reversible and some things irreversible, and they are not the ones people assume.

Reversible (defer, re-decide at refresh): the specific accelerator generation within a power/cooling envelope — a hall plumbed for 132 kW liquid can take GB200, GB300, and likely an early Rubin SKU without re-architecting; the HGX-vs-NVL choice for an inference fleet; the disaggregation decision for inference serving; the ride-vs-skip refresh cadence.
Irreversible (commit at scoping): whether the hall is plumbed for liquid at all; the floor-loading basis for 3,000-lb-plus wet racks; the electrical capacity and voltage path (415/480 VAC vs an 800 VDC future); and reserved physical headroom — busbar runs, pipe racks, switchgear room — for the next density step. These are the decisions where the option premium is cheap to pay now and very expensive to retrofit later.
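The irreversible items above amount to a checklist that a candidate rack either passes or fails. A minimal sketch follows; the class names, field names, and hall limits are hypothetical and illustrative, not a design standard, while the NVL72 power and weight are the figures from this chapter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hall:
    liquid_cooled: bool
    max_rack_kw: float
    max_rack_tonnes: float
    voltage: str            # e.g. "415VAC" or "800VDC"

@dataclass(frozen=True)
class Rack:
    name: str
    needs_liquid: bool
    rack_kw: float
    wet_tonnes: float
    voltage: str

def blockers(hall: Hall, rack: Rack) -> list[str]:
    """List the substrate constraints that prevent this hall from hosting this rack."""
    issues = []
    if rack.needs_liquid and not hall.liquid_cooled:
        issues.append("no facility liquid loop")
    if rack.rack_kw > hall.max_rack_kw:
        issues.append("insufficient per-rack electrical capacity")
    if rack.wet_tonnes > hall.max_rack_tonnes:
        issues.append("floor loading too low")
    if rack.voltage != hall.voltage:
        issues.append("wrong distribution voltage")
    return issues

# Hall limits are hypothetical; NVL72 power and weight are from the text.
air_hall = Hall(liquid_cooled=False, max_rack_kw=40, max_rack_tonnes=1.0, voltage="415VAC")
dlc_hall = Hall(liquid_cooled=True, max_rack_kw=150, max_rack_tonnes=1.5, voltage="415VAC")
nvl72 = Rack("GB200 NVL72", needs_liquid=True, rack_kw=132, wet_tonnes=1.36, voltage="415VAC")

print(blockers(air_hall, nvl72))
print(blockers(dlc_hall, nvl72))
```

Every blocker the air hall reports is a building-level change rather than a rack-level one, which is the sense in which these decisions are irreversible at refresh time.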

The strategic move is the same as everywhere in this guide: convert irreversible decisions into reversible ones while the option is cheap. Reserve the busbar capacity and water for a density step-up you have not committed to; buy into the NVL domain your workload uses today but provision the substrate for the domain you might buy in two years. The roadmap is a ramp. Design the rungs you can reach into the building before you need them.

The accelerator landscape and the datasheet-reading discipline that frames this chapter are in Chapter 7.1; the open challenger and hyperscaler XPUs that the annual cadence is aimed at are in Chapter 7.3 and Chapter 7.4; the HBM and packaging constraints that gate every generation are in Chapter 7.6 and Chapter 7.7; on-package power delivery and transient mitigation in Chapter 7.12; the rack-as-integration-unit in Chapter 7.13; and accelerator selection, TCO, and the refresh economics in Chapter 7.11. The NVLink/NVSwitch fabric that consumes the per-GPU bandwidth recorded here is engineered in Chapter 8.2; scale-out topology and oversubscription in Chapter 8.5; disaggregated inference serving in Chapter 10.11. The density wall and DLC default the rack-power ramp forces are in Chapter 5.1 and Chapter 5.4; the 800 VDC transition in Chapter 4.7; and the reversible-vs-irreversible scoping discipline in Chapter 1.1.