Chapter 7.2
NVIDIA Accelerators: Hopper → Blackwell → Vera Rubin → Rubin Ultra → Feynman
NVIDIA's accelerator roadmap is not a spec sheet you read but a power-and-density ramp you are forced to design against — each annual generation moves the unit of purchase from the chip to the rack to the multi-rack pod, and committing to the wrong rung sets your cooling plant, power architecture, and refresh economics for years.
What you'll decide here
- Which generation you actually buy into — Hopper, Blackwell, or Rubin — and therefore the rack power envelope (40 kW → 132 kW → 190+ kW) your facility substrate must already accommodate.
- Whether your unit of purchase is the GPU, the HGX board, or the rack-scale NVL system (NVL72/144/576) — because the scale-up domain you buy is the scale-up domain you are stuck with until refresh.
- Whether to ride the annual cadence at every step (Hopper → Blackwell → Blackwell Ultra → Rubin → Rubin Ultra → Feynman) or skip generations — and how to amortize a 2–3 year economic life against a 1-year obsolescence clock.
- For inference at long context, whether to adopt disaggregated serving (Rubin CPX context GPUs + Rubin generation GPUs) or stay monolithic — a fork that changes your BOM, your fabric, and your cost per token.
- Whether the 800 VDC / Kyber transition is a bridge you design toward now (reserved busbar, floor loading, water) or a wall you hit later when a 600 kW rack will not fit the hall you built.
NVIDIA sells a cadence: an annual rhythm of accelerator generations, each one re-drawing the rack, the fabric, and the power chain underneath it. The decision this chapter forces is not "which GPU is fastest" but "which rung of the ramp am I committing my building to, and what does the next rung cost me if I guessed wrong." Since 2022 the relevant unit of purchase has migrated upward: from the H100 as a board, to the GB200 NVL72 as a 132 kW rack you buy whole, to the Vera Rubin NVL144 as a ~190–230 kW liquid-cooled rack, and on to the Rubin Ultra Kyber NVL576 at ~600 kW per rack on 800 VDC, deployed as multi-rack pods. Each migration is a one-way door for the facility that hosts it. You can defer the silicon; you cannot defer the floor loading, the water, and the interconnection slot the silicon implies.
This chapter walks the roadmap as a sequence of decisions and their downstream costs. We trace the per-GPU specs across Hopper → Blackwell → Blackwell Ultra → Vera Rubin → Rubin Ultra → Feynman; we explain why the NVL system — not the GPU — became the unit of procurement, and why the size of the scale-up domain you buy (8 → 72 → 144 → 576 GPUs) is a strategic commitment, not a datasheet line; we cover the disaggregated-inference fork that Rubin CPX introduces; and we treat the annual cadence as the lever that compresses competitors' design windows and your own depreciation schedule simultaneously. Per-GPU NVLink bandwidth appears here as a datasheet attribute; the NVLink/NVSwitch fabric that aggregates it has its canonical home in Chapter 8.2.
The master fork: you are buying a power envelope, not a FLOPS number
The instinct is to compare generations on peak FLOPS. That is the marketing-number trap (Chapter 7.1): the headline figures are sparse FP4 with all the asterisks stripped, and they tell you almost nothing about what you must build. The number that actually cascades through your facility is rack power. An H100 air-cooled rack lands near 40 kW; a GB200 NVL72 draws ~132 kW and mandates direct-to-chip liquid; a GB300 NVL72 pushes ~140 kW; the Vera Rubin VR200 NVL144 lands in the ~190–230 kW band; and the Rubin Ultra Kyber NVL576 targets ~600 kW on an 800 VDC bus. That is a 15x escalation in rack power across roughly five years (2022 to H2 2027).
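The escalation can be checked directly from the figures above. The following is a minimal Python sketch that uses only the rack-power values quoted in this section; taking the midpoint of the VR200 band is an assumption made for illustration, and the function name is hypothetical.

```python
# Approximate rack power per generation in kW, as quoted in this section.
# Assumption: where the text gives a band (VR200), the midpoint is used.
RACK_KW = {
    "H100 (air)": 40.0,
    "GB200 NVL72": 132.0,
    "GB300 NVL72": 140.0,
    "VR200 NVL144": (190.0 + 230.0) / 2,
    "Rubin Ultra Kyber NVL576": 600.0,
}


def escalation(rack_kw: dict[str, float], baseline: str) -> dict[str, float]:
    """Return each rack's power as a multiple of the baseline rack."""
    base = rack_kw[baseline]
    return {name: kw / base for name, kw in rack_kw.items()}


ratios = escalation(RACK_KW, "H100 (air)")
for name, ratio in ratios.items():
    print(f"{name:26s} {RACK_KW[name]:6.0f} kW  {ratio:5.2f}x")
```

Each step on its own looks incremental; it is the compounded ratio against the air-cooled baseline that a facility has to absorb.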
The consequence: a hall scoped for the previous generation's density cannot absorb the next one without a substrate it does not have. You do not get to "upgrade" from a 40 kW air hall to a 132 kW liquid hall by swapping racks — the floor loading is wrong (a wet NVL72 is ~1.36 t / 3,000 lb), the plenum is wrong, the electrical headroom is wrong, and there is no facility water. The density wall (Chapter 5.1) and the DLC default (Chapter 5.4) are downstream of which rung of this ramp you bought into. The power curve governs; the compute curve follows.
Hopper → Blackwell → Vera Rubin → Rubin Ultra → Feynman: the per-GPU arc
Hopper (H100, 2022 / H200, 2024) is the generation most of the installed base still runs. H100 ships 80 GB HBM3 at ~3.35 TB/s, ~700 W TDP, FP8 Transformer Engine, NVLink 4 at 900 GB/s per GPU. H200 is the same compute die with 141 GB HBM3E at ~4.8 TB/s — a memory-bandwidth refresh that disproportionately helps inference decode. Hopper is air-coolable, which is exactly why it became the default and why the jump to Blackwell broke so many facility assumptions.
Blackwell (B200 / GB200, 2024–2025) is a dual-die GPU — two reticle-limited compute dies on one package behaving as a single CUDA device over a 10 TB/s die-to-die link — with 192 GB HBM3E, a second-generation Transformer Engine adding native FP4, and NVLink 5 at 1.8 TB/s per GPU. The GB200 superchip pairs two Blackwell GPUs with one Grace CPU over NVLink-C2C. Blackwell Ultra (B300 / GB300, 2025) lifts HBM to 288 GB and adds steady-power and transient-mitigation features (capacitor energy storage, ramp smoothing) that exist because a 140 kW rack toggling between idle and full all-reduce is a grid problem (Chapter 7.12).
Vera Rubin (VR200, H2 2026) is the next platform, not just a chip. The Rubin GPU is again dual-die on a 4-reticle CoWoS-L interposer (~336 billion transistors, 1.6x Blackwell) with 288 GB HBM4 across 8 stacks at up to ~22 TB/s, sixth-generation Tensor Cores, and NVLink 6 at 3.6 TB/s per GPU. The Vera CPU is NVIDIA's custom Arm successor to Grace. The rack-scale unit is the NVL144, so named because it counts compute dies (72 dual-die packages, 144 dies). Per rack it delivers ~3.6 EF FP4 inference and ~1.2 EF FP8 training, roughly 3.3x the GB300 NVL72 on inference, with ~260 TB/s of scale-up NVLink bandwidth.
Rubin Ultra (H2 2027) is where the unit of purchase jumps again. It packs four compute dies per package (~100 PFLOPS FP4, 1 TB HBM4e per package) and deploys in the Kyber NVL576 rack: 144 quad-die packages = 576 GPU compute dies, ~600 kW per rack on 800 VDC, ~15 EF FP4 inference / ~5 EF FP8 training, ~365 TB total memory. Feynman (2028) is the next architecture on the roadmap: TSMC A16 (1.6 nm) with backside power delivery, with NVLink/NVSwitch and ConnectX/Spectrum generations advancing in lockstep. The cadence is explicit and locked: a new generation every year, alternating a new architecture with an Ultra refresh.
| Generation | GPU memory | Mem BW | NVLink/GPU | TDP/GPU | Rack unit | Rack power | Availability |
|---|---|---|---|---|---|---|---|
| Hopper H100 | 80 GB HBM3 | ~3.35 TB/s | 900 GB/s (NVLink 4) | ~700 W | HGX 8-GPU / DGX | ~40 kW (air) | 2022 |
| Hopper H200 | 141 GB HBM3E | ~4.8 TB/s | 900 GB/s (NVLink 4) | ~700 W | HGX 8-GPU | ~40 kW (air) | 2024 |
| Blackwell GB200 | 192 GB HBM3E | ~8 TB/s | 1.8 TB/s (NVLink 5) | ~1.0–1.2 kW | NVL72 rack | ~120–132 kW (DLC) | 2024–2025 |
| Blackwell Ultra GB300 | 288 GB HBM3E | ~8 TB/s | 1.8 TB/s (NVLink 5) | ~1.4 kW | NVL72 rack | ~140 kW (DLC) | 2025 |
| Vera Rubin VR200 | 288 GB HBM4 | ~22 TB/s | 3.6 TB/s (NVLink 6) | ~1.8 kW | NVL144 rack | ~190–230 kW (DLC) | H2 2026 (announced) |
| Rubin Ultra | 1 TB HBM4e/pkg | (4-die pkg) | (NVLink 6+) | ~2.3 kW | Kyber NVL576 | ~600 kW (800 VDC) | H2 2027 (announced) |
| Feynman | HBM4e+ (TBD) | TBD | (NVLink 7) | TBD | Kyber-class | ≥600 kW (roadmap) | 2028 (roadmap) |
The rack-power column governs the table, not the FLOPS column. The compute numbers grow impressively, but they are the easy part — TSMC and HBM deliver them on schedule. The hard part, the part that strands capital, is the rightmost columns: the rack unit changes shape (board → 72-GPU rack → 144 → 576), the power per rack escalates an order of magnitude, and the cooling and voltage architecture flip underneath. A facility that bought into Blackwell at 132 kW and liquid cooling is one substrate decision (reserved busbar capacity, water headroom, floor loading) away from Rubin; a facility that bought into Hopper at 40 kW and air is a demolition-and-rebuild away. The generation you choose is the substrate you commit to.
Why the NVL system became the unit of purchase
Through Hopper, the unit was the 8-GPU HGX board and the scale-up domain was 8 GPUs wide. Blackwell broke that model: the GB200 NVL72 fuses 72 Blackwell GPUs and 36 Grace CPUs into a single NVLink domain — 18 compute trays and 9 NVSwitch trays connected by a copper NVLink spine carrying ~130 TB/s of aggregate rack bandwidth across more than 5,000 in-rack copper cables — so that all 72 GPUs address each other at full NVLink speed as one coherent memory fabric (~13.4 TB of unified memory, ~1.44 EF FP4 sparse). You do not assemble this from parts; you buy the rack as a SKU. The reason this matters strategically is that the scale-up domain size is now a purchasing decision that sets your parallelism ceilings until your next refresh.
The consequence runs in both directions. A wide domain (72 → 144 → 576 GPUs) lets you fit tensor-parallel and pipeline-parallel groups, and especially wide expert-parallel MoE inference, entirely inside the NVLink fabric — where bandwidth is ~5–10x the scale-out NIC — instead of spilling collectives onto the slower back-end network. Wide-EP MoE serving (e.g., EP32 vs EP8) is the canonical workload that the big domain unlocks (Chapter 8.2). But a wide domain you do not use is stranded capital: a latency-bound 8B-parameter inference service pinned to a 72-GPU NVLink rack is paying for a fabric it never lights up. Buy the domain the workload consumes — the NVL system is a commitment, not a default.
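A rough way to see whether a parallel group stays inside the NVLink fabric is to check how many whole groups fit in one scale-up domain. The sketch below is a deliberate simplification: it considers GPU count only and ignores memory capacity, scheduling, and mixed parallelism. The helper name `pack_groups` is hypothetical.

```python
def pack_groups(domain_gpus: int, group_gpus: int) -> tuple[int, int]:
    """Return (whole groups that fit in one NVLink domain, GPUs left over)."""
    if domain_gpus <= 0 or group_gpus <= 0:
        raise ValueError("sizes must be positive")
    return divmod(domain_gpus, group_gpus)


# Compare an 8-GPU HGX domain with a 72-GPU NVL72 domain for EP8 and EP32.
for domain in (8, 72):
    for ep in (8, 32):
        groups, left_over = pack_groups(domain, ep)
        where = "inside NVLink" if groups else "spills to scale-out"
        print(f"domain={domain:3d} EP{ep:<2d} -> {groups} group(s), "
              f"{left_over} GPU(s) left over, {where}")
```

Under this simplification EP32 cannot be hosted inside an 8-GPU domain at all, while a 72-GPU domain holds two EP32 groups with 8 GPUs left over. The left-over count is one crude signal of stranded capacity when the group size and the domain size do not divide evenly.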
Deep dive: NVLink per-GPU bandwidth as a datasheet attribute (and where the fabric lives)
Every generation advertises a per-GPU NVLink number — 900 GB/s on Hopper (NVLink 4), 1.8 TB/s on Blackwell (NVLink 5), 3.6 TB/s on Rubin (NVLink 6) — and it is tempting to treat it like memory bandwidth, a property of the chip. It is not. The per-GPU figure is the injection bandwidth into a switched fabric; what you actually get depends on the NVSwitch generation, the domain size, and the topology that aggregates it. On an NVL72 the 1.8 TB/s per GPU aggregates to ~130 TB/s of rack scale-up bandwidth; on Rubin NVL144 the 3.6 TB/s per GPU aggregates to ~260 TB/s per rack; Rubin Ultra's eight-Kyber-rack pod reaches ~10 PB/s all-to-all. The datasheet attribute is real and comparable across vendors (it is roughly an order of magnitude above the scale-out NIC), but the design decisions it drives — switch-tray count, copper-vs-optical reach, NVLink-SHARP in-network reduction, domain partitioning — are fabric decisions.
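The rack-level figures follow from the per-GPU injection bandwidth multiplied by the number of injecting packages. This is a minimal sketch of that arithmetic; treating the 3.6 TB/s Rubin figure as per dual-die package (72 per NVL144 rack) is an assumption that happens to match the ~260 TB/s quoted above, and the function name is hypothetical.

```python
def aggregate_tb_s(packages: int, per_gpu_tb_s: float) -> float:
    """Sum of per-GPU NVLink injection bandwidth across one scale-up domain."""
    return packages * per_gpu_tb_s


# GB200 NVL72: 72 GPU packages at 1.8 TB/s each (NVLink 5).
nvl72 = aggregate_tb_s(72, 1.8)

# Vera Rubin NVL144: assumed 72 dual-die packages at 3.6 TB/s each (NVLink 6).
nvl144 = aggregate_tb_s(72, 3.6)

print(f"NVL72  aggregate: {nvl72:.1f} TB/s")
print(f"NVL144 aggregate: {nvl144:.1f} TB/s")
```

The sum is an upper bound on injection capacity, not a delivered all-to-all rate; what a collective actually achieves depends on the switch topology and is a fabric question.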
So we record the per-GPU number here, in the accelerator chapter, because it is a property you compare when choosing silicon. We engineer the fabric that consumes it — NVSwitch topology, NVLink-SHARP collective offload, the copper-reach wall that is pushing Rubin Ultra toward optical scale-up, and how the scale-up domain is partitioned and scheduled — in Chapter 8.2. Treat the two as a split: the chip chapter owns the attribute; the network chapter owns the system.
The disaggregated-inference fork: Rubin CPX
Rubin introduces a second, quieter fork that reshapes the inference BOM. Long-context inference has two phases with opposite hardware profiles: the context (prefill) phase reads and encodes the entire input — compute-bound, hungry for FLOPS, light on memory bandwidth — while the generation (decode) phase emits tokens one at a time, memory-bandwidth-bound and latency-sensitive, leaning on HBM and the KV cache. A monolithic GPU sized for decode (expensive HBM) is overpaying to do prefill; a GPU sized for prefill is starved on memory for decode. NVIDIA's answer is Rubin CPX — a context-phase accelerator with ~30 PFLOPS NVFP4, 3x attention acceleration over GB300, and crucially 128 GB of GDDR7 rather than HBM, which SemiAnalysis estimates is roughly 5x more cost-effective per byte than HBM for this compute-bound role.
The decision: for million-token-context workloads, do you adopt disaggregated serving — a pool of Rubin CPX GPUs doing prefill feeding a pool of Rubin (HBM) GPUs doing decode, coupled over the fabric via KV-cache transfer — or stay monolithic? The Vera Rubin NVL144 CPX rack packages the answer: 144 Rubin GPUs + 144 Rubin CPX GPUs + 36 Vera CPUs, ~100 TB fast memory, ~1.7 PB/s memory bandwidth, ~8 EF NVFP4. Disaggregation wins on cost-per-token at long context because you stop paying HBM prices for compute-bound prefill; it costs you a more complex serving stack (separate pools, KV-cache transport, careful ratio tuning) and a fabric that must move KV cache between phases efficiently (Chapter 10.11). For short-context, latency-flat workloads the disaggregation overhead does not pay — this is a long-context-specific fork.
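The rack-level NVFP4 figure can be roughly reconstructed from the component numbers given in this chapter. The sketch below assumes that the CPX NVFP4 figure and the NVL144 FP4 inference figure are additive peak numbers, that the "144 Rubin GPUs" are 72 dual-die packages with 288 GB of HBM4 each, and that 1 TB = 1000 GB. The remainder of the ~100 TB fast-memory figure is not broken down in the text, so it is not modeled here.

```python
# Component figures quoted in this chapter.
CPX_COUNT = 144            # Rubin CPX GPUs per NVL144 CPX rack
CPX_PFLOPS_NVFP4 = 30.0    # per CPX GPU
CPX_GDDR7_GB = 128         # per CPX GPU
RUBIN_RACK_EF_FP4 = 3.6    # Vera Rubin NVL144 FP4 inference, per rack
RUBIN_PACKAGES = 72        # assumption: 144 dies = 72 dual-die packages
RUBIN_HBM4_GB = 288        # per Rubin package

cpx_ef = CPX_COUNT * CPX_PFLOPS_NVFP4 / 1000.0   # PFLOPS -> EFLOPS
total_ef = cpx_ef + RUBIN_RACK_EF_FP4            # assumption: figures add

gddr7_tb = CPX_COUNT * CPX_GDDR7_GB / 1000.0
hbm4_tb = RUBIN_PACKAGES * RUBIN_HBM4_GB / 1000.0

print(f"CPX compute:   {cpx_ef:.2f} EF NVFP4")
print(f"Rack compute:  {total_ef:.2f} EF (text quotes ~8 EF)")
print(f"GDDR7 on CPX:  {gddr7_tb:.1f} TB")
print(f"HBM4 on Rubin: {hbm4_tb:.1f} TB")
```

On these assumptions more than half of the rack's headline FP4 compute sits on GDDR7-backed CPX parts, which is the cost argument for disaggregation in numeric form.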
The annual cadence as a strategic weapon
The cadence is not merely a delivery schedule — it is a competitive instrument, and it cuts two ways. Against competitors, a yearly architecture compresses the window any challenger has to close a gap: by the time an AMD or a custom-ASIC roadmap matches Blackwell, Rubin is shipping, and the comparison resets (Chapter 7.3, Chapter 7.5). The software moat (CUDA, the NCCL/Dynamo/TensorRT stack) compounds this — a one-year hardware tick gives the ecosystem a fresh target every twelve months. Against the buyer, the same cadence is a depreciation accelerant: a frontier accelerator's economic life is now 2–3 years against a 5–6 year book life, because next year's part does the same work at materially lower cost-per-token. The cadence that protects NVIDIA's lead also shortens your amortization runway.
This sets up the buyer's real decision: ride every generation, or skip? Riding each step keeps you at the frontier of cost-per-token but means continuous capital outlay and the operational churn of new power, cooling, and fabric envelopes every year. Skipping a generation (e.g., Hopper → Rubin, bypassing Blackwell) reduces churn and lets one substrate investment serve longer, at the cost of running a generation behind on token economics during the gap. The deciding variables are your residual-value assumption (whether used GPUs hold enough value to backstop the refresh; see Chapter 7.11) and whether your facility substrate can even accept the generation you would skip to. You cannot skip from a 40 kW air hall to a 600 kW Kyber pod; the skip is only available if you provisioned the substrate for it.
| Strategy | Token-economics position | Capital cadence | Substrate/ops churn | Best fit |
|---|---|---|---|---|
| Ride every generation | Always at the frontier of cost-per-token | Continuous, annual outlay | High — new power/cooling/fabric envelope each year | Frontier labs; neoclouds competing on price/token |
| Skip one generation | One step behind during the gap | Lumpy, every ~2 years | Moderate — one substrate serves two cycles | Enterprises with stable workloads; substrate-constrained sites |
| Hold (run to economic end) | Falls behind; relies on residual demand | Minimal until forced refresh | Lowest — until a hard substrate/density wall | Batch/offline inference; depreciation-sensitive operators |
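The gap between economic life and book life can be made concrete with a simple depreciation calculation. The sketch below assumes straight-line depreciation to zero salvage value, which is an illustrative simplification and not a statement about any operator's accounting policy; the function name is hypothetical.

```python
def undepreciated_fraction(economic_life_years: float,
                           book_life_years: float) -> float:
    """Fraction of purchase price still on the books at economic end-of-life.

    Assumes straight-line depreciation to zero over the book life.
    """
    if not 0 < economic_life_years <= book_life_years:
        raise ValueError("need 0 < economic life <= book life")
    return 1.0 - economic_life_years / book_life_years


# Ranges from the text: 2-3 year economic life, 5-6 year book life.
for econ in (2, 3):
    for book in (5, 6):
        frac = undepreciated_fraction(econ, book)
        print(f"economic={econ}y book={book}y -> {frac:.0%} still undepreciated")
```

Across those ranges, 40% to about 67% of the purchase price is still undepreciated when the part stops being competitive on cost-per-token. That remainder is what the residual-value assumption has to cover.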
Deep dive: why "NVL144" counts 144 and the dual-die accounting trap
The naming will trip up anyone reading the roadmap as a procurement spec. GB200 NVL72 means 72 Blackwell GPUs in the NVLink domain — and each of those GPUs is itself a dual-die package, so the rack contains 144 compute dies behaving as 72 CUDA devices. Vera Rubin NVL144 does not mean twice as many packages; it means NVIDIA changed the accounting to count compute dies — 144 dies = 72 dual-die Rubin packages — so the NVL144 rack has the same 72-package footprint as NVL72, not double. Rubin Ultra Kyber NVL576 then means 576 compute dies = 144 quad-die packages, a genuine 2x in package count over NVL72 and a 4x in die count.
Why this matters in practice: if you size power, cooling, and fabric by reading "144" as "twice the GPUs of 72," you will mis-budget. The honest comparison is die-to-die count and rack-to-rack power: NVL72 at ~132 kW, Rubin NVL144 at ~190–230 kW (same 72-package footprint, higher per-package power), Kyber NVL576 at ~600 kW (double the packages, an 800 VDC rack). Always reduce the marketing nomenclature to three separate numbers before you put anything in a design-basis document: packages per rack, dies per package, and per-package power. Die count is packages × dies per package; accelerator power is packages × per-package power, because the TDP figures are quoted per package, not per die. The cross-vendor version of this discipline (dense vs sparse, peak vs sustained, the marketing-number trap) is the subject of Chapter 7.1.
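The reduction from marketing name to design-basis numbers can be written down as a small data structure. This is a minimal sketch using the per-package TDP and rack-power figures from the generation table; taking the upper end of the GB200 TDP range and the midpoint of the VR200 rack band are assumptions, and the `RackSpec` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSpec:
    name: str
    packages: int           # GPU packages in the NVLink domain
    dies_per_package: int
    package_kw: float       # TDP per package, not per die
    rack_kw: float          # total rack power

    @property
    def dies(self) -> int:
        return self.packages * self.dies_per_package

    @property
    def accelerator_kw(self) -> float:
        return self.packages * self.package_kw


RACKS = [
    RackSpec("GB200 NVL72", 72, 2, 1.2, 132.0),      # upper end of ~1.0-1.2 kW
    RackSpec("VR200 NVL144", 72, 2, 1.8, 210.0),     # midpoint of 190-230 kW
    RackSpec("Rubin Ultra Kyber NVL576", 144, 4, 2.3, 600.0),
]

for r in RACKS:
    other_kw = r.rack_kw - r.accelerator_kw   # everything that is not a GPU package
    print(f"{r.name:26s} packages={r.packages:3d} dies={r.dies:3d} "
          f"GPU={r.accelerator_kw:6.1f} kW  other={other_kw:6.1f} kW")
```

The output makes the naming trap visible: NVL72 and NVL144 have the same package and die counts, and only per-package power moves. The "other" column is simply the remainder of the rack budget, which this section does not break down.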
What the roadmap commits, and what it leaves reversible
The discipline that separates a defensible accelerator strategy from a fragile one is the same one that governs facility scoping (Chapter 1.1): sort the decisions by the cost of changing your mind. The roadmap makes some things reversible and some things irreversible, and they are not the ones people assume.
- Reversible (defer, re-decide at refresh): the specific accelerator generation within a power/cooling envelope — a hall plumbed for 132 kW liquid can take GB200, GB300, and likely an early Rubin SKU without re-architecting; the HGX-vs-NVL choice for an inference fleet; the disaggregation decision for inference serving; the ride-vs-skip refresh cadence.
- Irreversible (commit at scoping): whether the hall is plumbed for liquid at all; the floor-loading basis for 3,000-lb-plus wet racks; the electrical capacity and voltage path (415/480 VAC vs an 800 VDC future); and reserved physical headroom — busbar runs, pipe racks, switchgear room — for the next density step. These are the decisions where the option premium is cheap to pay now and very expensive to retrofit later.
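The irreversible items above can be treated as a checklist that a candidate rack either passes or fails. The following sketch is illustrative only: the `Hall` and `RackRequirement` types, the `blockers` function, and both example halls (including their floor limits) are hypothetical. The rack figures come from this chapter, and the Kyber wet mass is left unset because the text does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hall:
    max_rack_kw: float
    has_facility_water: bool
    floor_limit_kg_per_rack: float   # hypothetical design value
    has_800vdc: bool


@dataclass(frozen=True)
class RackRequirement:
    name: str
    rack_kw: float
    needs_liquid: bool
    wet_mass_kg: Optional[float]     # None when the text gives no figure
    needs_800vdc: bool


def blockers(hall: Hall, rack: RackRequirement) -> list[str]:
    """List the substrate constraints that stop this rack fitting this hall."""
    issues = []
    if rack.rack_kw > hall.max_rack_kw:
        issues.append(f"power: needs {rack.rack_kw:.0f} kW, "
                      f"hall supports {hall.max_rack_kw:.0f} kW per rack")
    if rack.needs_liquid and not hall.has_facility_water:
        issues.append("cooling: no facility water for direct-to-chip liquid")
    if rack.wet_mass_kg is not None and rack.wet_mass_kg > hall.floor_limit_kg_per_rack:
        issues.append(f"floor: rack is {rack.wet_mass_kg:.0f} kg, "
                      f"limit is {hall.floor_limit_kg_per_rack:.0f} kg")
    if rack.needs_800vdc and not hall.has_800vdc:
        issues.append("voltage: no 800 VDC distribution")
    return issues


AIR_HALL = Hall(40.0, False, 1000.0, False)      # hypothetical Hopper-era hall
LIQUID_HALL = Hall(140.0, True, 1500.0, False)   # hypothetical Blackwell-era hall

NVL72 = RackRequirement("GB200 NVL72", 132.0, True, 1360.0, False)
KYBER = RackRequirement("Rubin Ultra Kyber NVL576", 600.0, True, None, True)

for hall_name, hall in (("air hall", AIR_HALL), ("liquid hall", LIQUID_HALL)):
    for rack in (NVL72, KYBER):
        found = blockers(hall, rack)
        print(f"{hall_name} + {rack.name}: {found or 'fits'}")
```

Every blocker in this sketch corresponds to a commit-at-scoping decision, which is why an empty list for the next generation is worth paying for before the racks are ordered.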
The strategic move is the same as everywhere in this guide: convert irreversible decisions into reversible ones while the option is cheap. Reserve the busbar capacity and water for a density step-up you have not committed to; buy into the NVL domain your workload uses today but provision the substrate for the domain you might buy in two years. The roadmap is a ramp. Build the substrate for the rungs you expect to reach before you need them.