
Chapter 7.3

AMD Instinct & the Open Challenger

AMD's hardware has caught up with NVIDIA on memory and FLOPS and, with MI400/Helios, on rack-scale fabric. The second-source decision is therefore no longer about silicon; it is about whether your workload can pay the ROCm-maturity tax and whether you believe an open scale-up stack will be production-proven before your cluster depreciates.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

Whether your dominant workload is memory-capacity-bound inference (where AMD's larger HBM and fewer-GPU sharding is a structural win today) or tightly-coupled training (where the realized-MFU gap and immature collectives still favor CUDA).
Whether you are buying single-node MI300X/MI325X/MI355X boxes — where AMD competes node-for-node now — or committing to rack-scale MI400/Helios, where you are betting on UALink-over-Ethernet maturing before NVL144/NVL576 lock-in hardens.
How much engineering payroll you will spend closing the ROCm software gap, and whether that cost, amortized across your fleet, stays below the 15-30% hardware-price discount AMD is offering to buy its way in.
Whether a credible second source is worth pursuing for supply leverage and allocation insurance alone — even at a TCO that only breaks even — given NVIDIA's allocation pain.
Which decisions are reversible (porting a portable inference stack to ROCm) versus irreversible (committing a training campus's scale-up fabric to an open standard whose switches are still sampling).

For most of the modern accelerator era the AMD question answered itself: the hardware was competitive on paper, the software was not, and a frontier operator could not afford to discover the difference on a production training run. That is no longer the safe default. As of mid-2026 AMD ships a part — the MI355X — that matches or beats NVIDIA's Blackwell generation on HBM capacity and FP4 throughput, has a hyperscaler anchor (the OpenAI 6 GW agreement) and a sovereign-cloud anchor (Oracle's 50,000-GPU MI450 commitment) underwriting its roadmap, and is bringing a 72-GPU open rack — Helios, built on Meta's OCP Open Rack Wide — to challenge the one place NVIDIA was uncontested: the rack-scale coherent domain. The silicon argument for a second source is largely won. The argument that remains is about realized performance, software maturity, and the timing of an open fabric standard — and that argument is where the money is actually made or lost.

This chapter walks the product line — MI300X, MI325X, MI355X, and the rack-scale MI400/MI450 family inside Helios — and the scale-up bet AMD is making with UALink-over-Ethernet. The practitioner's fork is where AMD wins today, and what the ROCm-maturity tax costs you when it does not. Paper FLOPS and HBM datasheets are necessary but not sufficient; the binding question is goodput — the fraction of that paper performance your software actually extracts — and whether the hardware discount survives contact with the porting bill.

The product line: catching up on memory first, FLOPS second, fabric last

AMD's strategy reads cleanly off the product cadence: lead with the axis NVIDIA was weakest on — memory capacity per GPU — because that is the axis that decides how many accelerators a large model needs to be sharded across, and therefore the axis that most directly drives inference cost-per-token. The MI300X (CDNA 3, launched December 2023) shipped with 192 GB of HBM3 at 5.3 TB/s and a 750 W TDP against the H100's 80 GB — a 2.4x capacity advantage that let a 70B-class model fit on a single GPU and a 405B-class model shard across one node instead of two. That is not a benchmark-chart win; it is a topology win, because fewer GPUs per model replica means less collective traffic, fewer failure domains, and a smaller KV-cache footprint per accelerator.

The MI325X (still CDNA 3, Q4 2024) widened the memory lead to 256 GB of HBM3E at 6 TB/s and 1,000 W, a mid-cycle capacity bump aimed squarely at NVIDIA's H200 (141 GB). The MI355X (CDNA 4, TSMC N3P, mid-2025) is the part that closed the compute gap: 288 GB of HBM3E at 8 TB/s, native FP6 and FP4 datatypes, roughly 20 PFLOPS of FP4 and ~10 PFLOPS of FP8 dense, at a 1,400 W liquid-cooled board power. That puts HBM capacity ahead of a Blackwell B200 GPU and FP4 throughput in the same class; for the first time, AMD is not conceding either of the two headline numbers. The precision story matters downstream: native FP4/FP6 is what makes the memory advantage pay off on modern inference, and it ties directly into the quantization tradeoffs in Chapter 7.10.

AMD Instinct generation-over-generation vs the NVIDIA reference point

Part	Arch / node	HBM capacity	HBM bandwidth	FP4 (dense)	Board power	NVIDIA reference
MI300X	CDNA 3 / N5	192 GB HBM3	5.3 TB/s	n/a (no native FP4)	750 W	H100 (80 GB) / H200 (141 GB)
MI325X	CDNA 3 / N5	256 GB HBM3E	6.0 TB/s	n/a (no native FP4)	1,000 W	H200 (141 GB)
MI355X	CDNA 4 / N3P	288 GB HBM3E	8.0 TB/s	~20 PFLOPS	1,400 W (DLC)	GB200 / B200 (Blackwell)
MI450X (Helios)	CDNA 'Next' / N2	432 GB HBM4	~19.6 TB/s	~40 PFLOPS	not stated (liquid-cooled, rack-scale)	Rubin / NVL144 (announced)

Datasheet (paper) figures, 2026-current. The NVIDIA column is the contemporaneous competitive reference, not a claim of equivalence: realized performance diverges from paper (see the goodput discussion and the key numbers below). FP4 figures are dense unless noted.

The fourth row is the one that changes the strategic picture. Single-GPU and single-node parts let AMD compete box-for-box — but NVIDIA's moat from the GB200 generation onward was never the box, it was the rack: a 72-GPU NVLink-coherent domain that software treats as one giant accelerator, which no competitor could match because no competitor had a scale-up fabric. The MI400/MI450 family inside Helios is AMD's answer. Each MI450X carries 432 GB of HBM4 at ~19.6 TB/s; a Helios rack fuses 72 of them into a single scale-up domain delivering ~260 TB/s of aggregate scale-up bandwidth, ~31 TB of HBM4, ~1.4 EFLOPS FP8 and ~2.9 EFLOPS FP4. On the headline rack numbers, that is NVL144-class. The catch is how the 72 GPUs are fused.
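The rack-level figures follow almost directly from the per-GPU numbers, which makes them easy to sanity-check when reading vendor slides. The sketch below simply multiplies the per-GPU figures quoted above by 72. The names `GpuSpec` and `rack_totals` are illustrative, and decimal units (1 TB = 1,000 GB) are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    hbm_gb: float      # HBM capacity per GPU, GB
    hbm_tbps: float    # HBM bandwidth per GPU, TB/s
    fp4_pflops: float  # paper FP4 throughput per GPU, PFLOPS


def rack_totals(gpu: GpuSpec, gpus_per_rack: int) -> dict:
    """Aggregate per-GPU paper figures to rack level (decimal units)."""
    return {
        "hbm_tb": gpu.hbm_gb * gpus_per_rack / 1000,
        "hbm_bw_tbps": gpu.hbm_tbps * gpus_per_rack,
        "fp4_eflops": gpu.fp4_pflops * gpus_per_rack / 1000,
    }


# Per-GPU MI450X paper figures as quoted in this chapter.
mi450x = GpuSpec(hbm_gb=432, hbm_tbps=19.6, fp4_pflops=40)
helios = rack_totals(mi450x, gpus_per_rack=72)

# Scale-up bandwidth per GPU implied by the ~260 TB/s rack aggregate.
scale_up_per_gpu_tbps = 260 / 72

print(helios, round(scale_up_per_gpu_tbps, 2))
```

The products land on about 31.1 TB of HBM4 and 2.88 EFLOPS of FP4, consistent with the quoted ~31 TB and ~2.9 EFLOPS, and the aggregate implies roughly 3.6 TB/s of scale-up bandwidth per GPU.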

The scale-up bet: UALink-over-Ethernet, and the switch that is still sampling

NVIDIA's NVLink is a proprietary, production-proven, purpose-built scale-up switch fabric. AMD cannot ship that, so it is betting on the open alternative: UALink, a consortium standard (AMD, Intel, Google, Meta, Microsoft, Broadcom, and others) finalized as UALink 1.0 with 200 GT/s per lane, load/store memory semantics, sub-1-microsecond round trips, and up to 1,024 accelerators in a domain. On paper UALink is a genuine peer to NVLink. The problem is timing. True UALink switch silicon (from Astera Labs, Marvell, and others) is still sampling, with availability slipping toward late 2026 or 2027, which is after Helios needs to ship. So the first Helios racks run UALink's load/store protocol tunneled over Ethernet (UALink-over-Ethernet, UALoE) on Broadcom Tomahawk-class switches, not over native UALink switches.

The tradeoff is sharp. Tunneling over Ethernet gets AMD a 72-GPU coherent domain on schedule, riding the mature, multi-vendor Ethernet PHY and switch ecosystem instead of waiting on a brand-new switch ASIC. The downstream cost is that you are running a memory-semantic scale-up fabric over a substrate that was not designed for it, with a software and reliability stack that is younger than NVLink's by several production generations. NVLink's value was never the bandwidth number alone — it was that the failure modes, the firmware, the collective libraries, and the diagnostics were beaten into shape across two or three large deployments. An open fabric inherits none of that maturity for free. The full scale-up fabric comparison — NVLink vs UALink vs UALoE, domain sizing, and the copper/optics reach implications — lives in Chapter 8.2; here the point is narrower: committing a training campus's scale-up fabric to UALoE in 2026 is a bet that an open standard matures on your depreciation schedule.

The fork: node-for-node now, or rack-scale on faith

Two AMD decisions sit at very different risk tiers, and conflating them is the most common mistake. Buying MI300X/MI325X/MI355X nodes for memory-bound inference is a low-regret, largely reversible move: the parts win on HBM capacity today, the workload is loosely coupled so the scale-up fabric barely matters, and a portable inference stack ports to ROCm in weeks. Committing to rack-scale MI400/Helios for tightly-coupled training is a different animal: you are underwriting an open scale-up fabric (UALoE today, native UALink switches later) whose production reliability is unproven at the scale you need, against an NVLink incumbent that has already survived the GB200 ramp's backplane pain. Decide which bet you are actually making. The node bet you can take on a quarter's notice; the rack bet you take on a multi-year procurement and cannot cheaply unwind.

Where AMD wins today

The honest answer is workload-specific, and it splits along the training/inference line that governs everything else in Chapter 7.11. AMD's structural advantage is memory-capacity-bound inference. When the constraint is fitting a large model and its KV cache into HBM at a given concurrency, capacity-per-GPU is the lever, and AMD has led on it every generation. A 405B-class or 670B-class model that needs two H100 nodes can fit on one MI300X node; the larger HBM lets you run higher batch sizes before you spill, raising throughput-per-GPU. SemiAnalysis's independent inference benchmarking found exactly this texture: at small batch sizes and on the largest models (Llama 405B, DeepSeek V3/R1-class), MI300X delivers competitive or superior performance-per-dollar versus H100 — on the order of a ~20% cost-per-token advantage in those regimes — precisely because the memory advantage and AMD's hardware discount stack.

The mirror image is where AMD does not win: latency-sensitive, throughput-tuned inference against the H200, and tightly-coupled training. The same independent benchmarks show the H200 consistently delivering lower latency than MI300X across many configurations — not because of silicon, but because NVIDIA's TensorRT-LLM and CUDA collective stack extract more of the paper FLOPS than AMD's vLLM-on-ROCm path did at test time. That is the realized-MFU gap, and it is the entire reason the datasheet does not settle the question.

The ROCm-maturity tax

Here is the cost AMD's price discount is buying down. Paper FLOPS are an upper bound; goodput is what you pay for. Across the public record, MI300X has realized somewhere in the range of ~37-66% of H100/H200 effective performance on inference workloads despite carrying ~1.5x the paper FLOPS and more HBM bandwidth — the gap is software: less-optimized GEMM and attention kernels, a younger RCCL collective library, framework features that land on CUDA first, and a long tail of operators that work but are not yet tuned. ROCm 7.x has narrowed this materially (better vLLM/SGLang support, improved attention kernels, broader framework coverage), and AMD's day-zero model enablement has improved — but the gap is structural, not cosmetic: CUDA has a ~15-year head start, a deeper third-party library ecosystem, and the gravitational pull that every new model and kernel ships against it first. The full software-lock-in treatment, including how to quantify switching cost, is in Chapter 7.9.
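One way to read the two ratios together, under the simplifying assumption that paper FLOPS is the right denominator for the workload: a part with ~1.5x the paper FLOPS that delivers 37-66% of the competitor's realized performance is extracting a much smaller share of its own datasheet. The helper below is illustrative arithmetic only (the function name is made up here), not a benchmark methodology.

```python
def relative_extraction(realized_ratio: float, paper_ratio: float) -> float:
    """Share of paper performance extracted, relative to the reference part.

    realized_ratio: realized performance / reference realized performance
    paper_ratio:    paper FLOPS / reference paper FLOPS
    """
    if paper_ratio <= 0:
        raise ValueError("paper_ratio must be positive")
    return realized_ratio / paper_ratio


# Ratios as quoted in the text: 37-66% realized, ~1.5x paper FLOPS.
low = relative_extraction(0.37, 1.5)
high = relative_extraction(0.66, 1.5)
print(f"{low:.2f} to {high:.2f}")
```

Under that assumption the ROCm path was extracting roughly a quarter to under half of the datasheet fraction that the CUDA path extracted. Memory-bound inference is limited by bandwidth rather than FLOPS, so treat this as a framing device rather than a measurement.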

The tax is paid in three currencies, and a TCO model that ignores any of them flatters AMD. First, engineering payroll: the FTEs who port, tune, and maintain the ROCm path, which is real money that only amortizes if spread across enough fleet. Second, realized utilization: every point of MFU you fail to extract inflates your effective cost-per-token, eating into the hardware discount. Third, schedule and risk: the new-model enablement lag and the thinner operational tooling translate into slower time-to-production and more debugging at scale. The discipline is to convert the discount and the tax into the same unit (cost-per-million-tokens at your realized utilization, not the datasheet's) and only then compare. The full model is in Chapter 7.11.
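A minimal sketch of that conversion, assuming a flat per-GPU-hour cost and a measured tokens-per-second figure. Every input below is hypothetical and chosen only to show the mechanics, and the function names are illustrative rather than taken from any TCO tool.

```python
def amortized_eng_cost(annual_payroll: float, fleet_gpus: int,
                       hours_per_year: float = 8760.0) -> float:
    """Spread porting and tuning payroll across every GPU-hour in the fleet."""
    if fleet_gpus <= 0:
        raise ValueError("fleet_gpus must be positive")
    return annual_payroll / (fleet_gpus * hours_per_year)


def cost_per_million_tokens(gpu_hour_cost: float,
                            tokens_per_sec_per_gpu: float,
                            eng_cost_per_gpu_hour: float = 0.0) -> float:
    """All-in cost per 1M tokens at measured (not datasheet) throughput."""
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return (gpu_hour_cost + eng_cost_per_gpu_hour) / tokens_per_hour * 1_000_000


# Hypothetical inputs for illustration only.
incumbent = cost_per_million_tokens(gpu_hour_cost=2.00,
                                    tokens_per_sec_per_gpu=1000)
challenger = cost_per_million_tokens(
    gpu_hour_cost=2.00 * (1 - 0.25),       # assumed 25% hardware discount
    tokens_per_sec_per_gpu=1000 * 0.80,    # assumed 80% of incumbent throughput
    eng_cost_per_gpu_hour=amortized_eng_cost(2_000_000, fleet_gpus=4096),
)
small_fleet = cost_per_million_tokens(
    gpu_hour_cost=2.00 * (1 - 0.25),
    tokens_per_sec_per_gpu=1000 * 0.80,
    eng_cost_per_gpu_hour=amortized_eng_cost(2_000_000, fleet_gpus=512),
)
print(f"incumbent {incumbent:.3f}, challenger {challenger:.3f}, "
      f"challenger on 512 GPUs {small_fleet:.3f} (USD per Mtok)")
```

With these made-up inputs the discount survives a 20% throughput deficit when the payroll is spread over 4,096 GPUs, and does not survive when the same payroll is spread over 512. The point is the shape of the calculation, not the numbers.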

The second-source decision: when AMD pays off, and when it does not

Workload	Binding constraint	AMD structural fit	ROCm tax severity	Net TCO outcome (2026)
Memory-bound inference (large/MoE models)	HBM capacity & bandwidth per GPU	Strong — leads on capacity every gen	Low–moderate (vLLM/SGLang mature)	Favorable to neutral
Latency-tuned inference vs H200	Kernel & serving-stack maturity	Hardware fine; software trails	Moderate–high (TensorRT-LLM edge)	Neutral to unfavorable
Frontier tightly-coupled training	Collectives + scale-up fabric maturity	Catching up (Helios/UALoE unproven)	High (RCCL, fabric reliability)	Unfavorable today; watch 2027
Supply leverage / allocation hedge	Second-source availability	Strong — real alternative supply	N/A (strategic, not perf)	Worth it even at TCO break-even

A decision matrix, not a scorecard. 'ROCm tax severity' is the practitioner-observed software penalty, narrowing with ROCm 7.x but workload-dependent. TCO outcome assumes AMD's 15-30% hardware discount is on the table.

288 GB

MI355X HBM3E capacity @ 8 TB/s; ~20 PFLOPS FP4, 1,400 W DLC (CDNA 4, N3P)

Source (2025): AMD Instinct MI355X product brief / datasheet

192 GB

MI300X HBM3 @ 5.3 TB/s, 750 W (CDNA 3), 2.4x H100 capacity; MI325X 256 GB @ 6 TB/s, 1,000 W

Source (2024): AMD Instinct MI300/MI325 product pages

432 GB

MI450X HBM4 @ ~19.6 TB/s per GPU; Helios = 72 GPUs, ~31 TB HBM4, ~260 TB/s scale-up, 2.9 EFLOPS FP4

Source (2026): AMD Helios / Advancing AI; TechPowerUp

37–66%

MI300X realized inference performance vs H100/H200 despite ~1.5x paper FLOPS (the ROCm goodput gap)

Source (2025): SemiAnalysis AMD vs NVIDIA inference benchmark

~20%

MI300X cost-per-token advantage vs H100 SXM in small-batch / largest-model inference regimes

Source (2025): SemiAnalysis inference benchmark (cost/Mtoken)

15–30%

AMD hardware-price discount vs comparable NVIDIA SXM parts (the second-source incentive)

Source (2025): SemiAnalysis; domain research key numbers

6 GW

AMD–OpenAI Instinct supply agreement over 5 yr; first 1 GW (MI450) H2 2026; up to 160M AMD shares to OpenAI

Source (2025): AMD / TechCrunch / DCD

50,000

Oracle OCI MI450 GPU commitment, public H2 2026, expanding 2027 (the sovereign-cloud anchor tenant)

Source (2025): DCD; Tom's Hardware

Deep dive: why the memory advantage is a topology advantage, not just a bigger number

It is tempting to read AMD's HBM lead as a spec-sheet bragging right. It is more than that, because memory capacity per GPU sets the minimum sharding factor for a given model, and the sharding factor propagates into nearly everything downstream. Consider a dense 405B-parameter model: weights alone are ~405 GB at FP8 and ~810 GB at BF16, before KV cache and activation overhead. An 8-GPU H100 node holds 640 GB in total, so the BF16 replica must span two nodes and the FP8 replica leaves only ~235 GB across the node for KV cache and activations. An 8-GPU MI300X node holds 1,536 GB, so the replica fits in a single node at either precision, and on 288 GB MI355X (2,304 GB per node) you have headroom for a larger KV cache and higher concurrency within the node.
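The sharding floor can be sketched in a few lines. This assumes weights dominate the footprint and that a fixed fraction of each GPU's HBM is held back for KV cache, activations, and runtime overhead; the 30% reserve is an arbitrary planning figure, and the function name is illustrative.

```python
import math


def min_gpus_per_replica(params_b: float, bytes_per_param: float,
                         hbm_gb: float, reserve_fraction: float = 0.0) -> int:
    """Smallest GPU count whose usable HBM holds the model weights.

    params_b:         parameter count in billions
    reserve_fraction: share of each GPU's HBM held back for KV cache,
                      activations, and runtime overhead (assumed knob)
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    weights_gb = params_b * bytes_per_param   # 1e9 params * bytes = decimal GB
    usable_gb = hbm_gb * (1.0 - reserve_fraction)
    return math.ceil(weights_gb / usable_gb)


for name, hbm in [("H100", 80), ("MI300X", 192), ("MI355X", 288)]:
    fp8 = min_gpus_per_replica(405, 1, hbm, reserve_fraction=0.3)
    bf16 = min_gpus_per_replica(405, 2, hbm, reserve_fraction=0.3)
    print(f"{name}: FP8 needs >= {fp8} GPUs, BF16 needs >= {bf16} GPUs")
```

With the assumed 30% reserve, the BF16 replica needs 15 H100s, which crosses an 8-GPU node boundary, but 7 MI300X, which does not. Practical tensor-parallel degrees are usually powers of two, so real deployments round up further.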

The consequences cascade. Fewer GPUs per replica means less scale-out collective traffic per token (the all-reduces and all-to-alls that dominate distributed inference shrink), which is why AMD can compete on inference even where its scale-up fabric trails — the loosely-coupled inference workload simply doesn't lean on the fabric the way training does. It means a smaller failure blast radius (a replica that fits in one node fails as one node, not two). And it means higher achievable batch size before KV-cache spill, which directly raises throughput-per-GPU and lowers cost-per-token. This is why AMD's structural win is specifically memory-bound inference: it is the workload where capacity-per-GPU is the binding constraint and the scale-up-fabric immaturity is least exposed. For wide-MoE models the same logic applies to fitting more experts per GPU. The capacity-vs-bandwidth optimization and the HBM supply oligopoly that gate all of this are in Chapter 7.6.
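The batch-size effect can be made concrete with the standard KV-cache sizing formula: two vectors (K and V) per layer per KV head per token. The model shape below is an assumption (a Llama-3.1-405B-like configuration with 126 layers, 8 KV heads, and head dimension 128), and the sketch ignores activations, fragmentation, and paged allocation, so it gives an upper bound rather than a deployable number.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV-cache bytes per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(node_hbm_gb: float, weights_gb: float,
                             context_tokens: int, per_token_bytes: int) -> int:
    """Sequences of a given context length that fit once weights are resident."""
    free_bytes = (node_hbm_gb - weights_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (context_tokens * per_token_bytes))


# Assumed shape and FP16 KV cache; weights at FP8 taken as ~405 GB per the text.
per_tok = kv_bytes_per_token(layers=126, kv_heads=8, head_dim=128,
                             bytes_per_elem=2)
for name, node_gb in [("8x H100", 8 * 80),
                      ("8x MI300X", 8 * 192),
                      ("8x MI355X", 8 * 288)]:
    n = max_concurrent_sequences(node_gb, 405, context_tokens=32_768,
                                 per_token_bytes=per_tok)
    print(f"{name}: up to {n} concurrent 32k-token sequences")
```

Under these assumptions each 32k-token sequence costs about 16.9 GB of KV cache, so the concurrency ceiling per node scales with the HBM left over after weights, which is where the capacity lead turns into throughput-per-GPU.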

Deep dive: anchor-tenant economics and why OpenAI + Oracle de-risk the roadmap

A challenger accelerator's biggest risk is the chicken-and-egg of ecosystem investment, not the silicon: software houses won't optimize for a platform with no volume, and buyers won't commit volume to a platform with thin software. The way out is anchor tenants who commit enough volume to fund the software flywheel. AMD secured two in 2025. The OpenAI agreement is for 6 GW of Instinct compute over five years, with the first 1 GW (MI450) landing in H2 2026, and, critically, a warrant for up to 160 million AMD shares (~10% of the company) that aligns OpenAI's incentive with AMD's success. The Oracle OCI commitment is 50,000 MI450 GPUs, publicly available from H2 2026 and expanding in 2027.

Why this matters to a buyer who is neither OpenAI nor Oracle: anchor tenants fund the exact thing the ROCm tax is made of. A 6 GW workload forces AMD and the framework ecosystem to harden kernels, collectives, and serving stacks at frontier scale — work that then flows downstream to every smaller buyer as a more mature ROCm. The anchor deals are, in effect, the market pricing in that AMD's software gap will close, and paying AMD to close it. The strategic-supply argument — that a credible second source is worth pursuing for allocation leverage even at TCO break-even, because it caps the incumbent's pricing power and insures against the allocation pain documented in Chapter 2.3 — gets materially stronger once two anchor tenants have de-risked the roadmap. That is procurement strategy, treated in full in Chapter 7.11.

The density-and-power consequence

AMD's catch-up on FLOPS came the same way NVIDIA's did — by spending power. The MI300X's 750 W grew to the MI325X's 1,000 W and the MI355X's 1,400 W, and MI400/Helios is a liquid-cooled rack-scale part from the outset. The downstream consequence is identical to the NVIDIA density-ramp story in Part 5: at MI355X's 1,400 W the part is past the air-cooling cliff and direct-to-chip liquid is mandatory, and Helios — built on Meta's Open Rack Wide double-wide form factor with sidecar 400/800 VDC power — is a liquid, rack-scale integration unit, not a box you rack by hand. A buyer evaluating AMD on cost-per-token must therefore evaluate it on the same facility basis as NVIDIA: liquid cooling plant, high-density power chain, and reinforced floors. There is no air-cooled shortcut to the competitive AMD parts. The one genuinely differentiating facility choice is the open OCP rack lineage: Helios deliberately rides Meta's OCP contributions (Open Rack Wide, Mt. Diablo power), which is a hedge against single-vendor rack lock-in but does not change the underlying density/cooling physics. The cross-vendor cost-per-token model that decides all of this is built in Chapter 7.11; the consolidated per-generation roadmap is in Chapter 16.2.
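The power ramp is easy to quantify from the board-power figures quoted in this chapter. The sketch assumes 8-GPU nodes and counts GPU board power only (no CPUs, NICs, pumps, or conversion losses), so facility numbers will be higher; the function names are illustrative.

```python
# Board power (W) and HBM capacity (GB) per part, as quoted in this chapter.
PARTS = {
    "MI300X": (750, 192),
    "MI325X": (1000, 256),
    "MI355X": (1400, 288),
}


def node_gpu_power_kw(board_w: float, gpus_per_node: int = 8) -> float:
    """GPU-only node power in kW; excludes CPUs, NICs, pumps, and PSU losses."""
    return board_w * gpus_per_node / 1000


def hbm_gb_per_kw(board_w: float, hbm_gb: float) -> float:
    """HBM capacity delivered per kW of GPU board power."""
    return hbm_gb / (board_w / 1000)


for name, (watts, hbm) in PARTS.items():
    print(f"{name}: {node_gpu_power_kw(watts):.1f} kW per 8-GPU node, "
          f"{hbm_gb_per_kw(watts, hbm):.0f} GB HBM per kW")
```

GPU board power alone rises from 6 kW to 11.2 kW per assumed 8-GPU node across the three parts, while HBM capacity per kW holds at 256 GB through the MI325X and then falls to roughly 206 GB on the MI355X. The FLOPS catch-up was bought with power faster than memory grew.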

Do not benchmark the datasheet

The costliest error in evaluating AMD is accepting paper FLOPS and paper HBM bandwidth as the comparison. The lesson of every credible cross-vendor benchmark, and the hard-won lesson of the MI300X-vs-H100 training comparisons, is that measured performance diverges sharply from spec, and that the divergence comes from software, environment, and tuning, all of which favor the incumbent at test time. Set your acceptance bar on your realized utilization, on your models, with the framework versions and serving stack you will actually run, and run it for long enough to surface the operational tail (silent data corruption checks, collective stalls, new-model enablement lag). An AMD evaluation that ends at the datasheet has measured the best case AMD will never deliver and the worst case CUDA never hits.
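For a training acceptance test, realized utilization can be computed from measured throughput using the common estimate of about 6 FLOPs per parameter per token for a dense transformer (forward plus backward, attention FLOPs ignored). The sketch below is one way to set the bar. The model size, throughput, peak FLOPS, and floor are placeholders, not measurements or vendor figures.

```python
def training_mfu(params: float, tokens_per_sec: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization from measured cluster throughput.

    Uses the ~6 * params FLOPs-per-token estimate for dense transformer
    training; peak_flops_per_gpu is the paper figure at the precision used.
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops = 6.0 * params * tokens_per_sec
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


def passes_acceptance(mfu: float, floor: float) -> bool:
    """Acceptance check against a floor chosen by the buyer."""
    return mfu >= floor


# Placeholder values: 70B parameters, 8 GPUs, 1e15 paper FLOPS each,
# 5,000 tokens/s measured across the node.
mfu = training_mfu(params=70e9, tokens_per_sec=5000,
                   num_gpus=8, peak_flops_per_gpu=1.0e15)
print(f"MFU = {mfu:.1%}, passes 30% floor: {passes_acceptance(mfu, 0.30)}")
```

Running the same calculation on both vendors, with the same model and the stack you intend to deploy, turns the datasheet comparison into a goodput comparison.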

The 2026 read

The defensible position in mid-2026 is neither the bull's nor the bear's. AMD is a real second source, the first one the industry has had, and for memory-bound inference it is frequently the right economic choice on its own merits, before you even count the strategic value of supply leverage. For tightly-coupled frontier training it is still the higher-risk choice, gated less by silicon than by RCCL/collective maturity and by an open scale-up fabric (UALoE today, native UALink switches in late 2026 or 2027) that has not yet been proven at the scale and reliability NVLink has. The open questions that decide AMD's trajectory are whether ROCm closes the realized-MFU gap and whether UALink hardware matures fast enough to make multi-vendor rack integration a genuine procurement hedge, and the anchor-tenant deals are the market's bet that both happen. The practitioner's move is to capture the discount where the tax is already low, pursue the second source for leverage everywhere it breaks even, and keep the irreversible commitments (a training campus's scale-up fabric) on a shorter leash than the reversible ones (a portable inference stack) until the open fabric has shipped at scale.

AMD is the open foil to the NVIDIA roadmap in Chapter 7.2 and the hyperscaler XPUs in Chapter 7.4. The scale-up fabric bet — NVLink vs UALink vs UALink-over-Ethernet, domain sizing, and reach — is engineered in Chapter 8.2. The HBM capacity advantage that underpins AMD's inference win, and the supply oligopoly that gates it, are in Chapter 7.6; the packaging that sets stack-count-per-package in Chapter 7.7. The ROCm-vs-CUDA realized-MFU gap and switching-cost quantification live in Chapter 7.9; the FP4/FP6 precision story that makes the memory lead pay off in Chapter 7.10. The cost-per-token TCO model and the buy-vs-rent-vs-build / second-source procurement strategy that actually decide AMD-vs-NVIDIA are in Chapter 7.11; the allocation-leverage argument connects to procurement in Chapter 2.3; the consolidated 2026→2030 roadmap in Chapter 16.2.