Guide › Compute, Silicon & System Integration › 7.10

Chapter 7.10

Precision, Quantization & the Compute-Memory Tradeoff

Precision is the cheapest performance lever in the building and the most dangerous: every step down the ladder roughly doubles tensor-core throughput and halves the memory footprint, but it spends accuracy headroom you cannot always get back — so the engineering question is never 'how low can we go' but 'how low, with which scaling scheme, before this specific workload crosses its quality floor.'

POWER-BOUNDGOODPUTDENSITY-RAMP

What you'll decide here

Where on the precision ladder (BF16 → FP8 → FP6 → FP4) each phase of your workload runs — and that the answer differs between pre-training, the matmuls inside a forward pass, the master weights, and the KV-cache.
Which microscaling format you standardize on for FP4 — NVFP4 (16-value blocks, FP8 scale) vs MXFP4 (32-value blocks, power-of-two scale) — because it is silently baked into the silicon you buy and the checkpoints you ship.
Whether you commit to FP4 training now (Pareto-optimal at fixed compute, but with a recipe tax and a narrow stability margin) or hold the line at FP8 training and reserve FP4 for inference only.
How aggressively you quantize the KV-cache — the memory tier that, not weights, now governs how many concurrent users a decode node can hold — and at what point KV quantization starts degrading long-context recall.
Whether your accuracy gate is a perplexity delta or a downstream task-eval suite, signed off per model family, because the format that passes one can fail the other.

There is a number on every accelerator datasheet that is larger than all the others, and it is almost always an FP4 number. The GB200 NVL72 headline of ~1.44 ExaFLOPS, the Rubin NVL72's 3.6 ExaFLOPS, the MI355X's ~20 PFLOPS — these are FP4-with-sparsity peaks, and they exist because precision is the one lever that multiplies throughput without adding a single transistor of compute area or a single watt of power. A tensor core that does one BF16 multiply-accumulate per cycle does roughly two in FP8 and four in FP4, in the same silicon, at the same clock. Halving the bit-width also halves the bytes you move across HBM, NVLink, and the network — and in an inference world that is memory-bandwidth-bound, that second effect often matters more than the first.

So precision is free performance — except it is not. Every bit you remove from a number narrows its dynamic range or its mantissa resolution, and somewhere down the ladder the model's outputs start to drift: a perplexity bump, a few points off MMLU, a long-context retrieval that quietly fails. This chapter walks that fork. We climb the precision ladder rung by rung (FP32 → BF16 → FP8 → FP6 → FP4), explain why microscaling — and the NVFP4-vs-MXFP4 split inside it — is the thing that made FP4 viable at all, weigh FP4 training against holding at FP8, and treat the KV-cache as the precision decision that most directly buys or burns inference goodput.

The precision ladder: what each rung actually buys

A floating-point format is a budget split between exponent bits (dynamic range — how large and small a number it can represent) and mantissa bits (precision — how finely it resolves values within that range). The history of AI numerics is the history of discovering that neural networks need a lot of range but tolerate surprisingly little precision, and then ruthlessly cutting the mantissa to cash that in.

FP32 (1-8-23) was the training default through ~2017 and is now a niche: a place to accumulate, a fallback for the few operations that are numerically fragile (some normalizations, some loss scaling). TF32 (NVIDIA's 1-8-10 internal tensor-core format) was a transitional hack — FP32's range, a truncated mantissa — that bought a free speedup with no code change. BF16 (1-8-7) was the breakthrough that defined the modern era: it keeps FP32's full 8-bit exponent (so loss scaling and gradient dynamics behave like FP32) while throwing away two-thirds of the mantissa. BF16 is still the safe, boring default for training stability and the reference everything else is measured against. FP16 (1-5-10) trades range for precision and is why the industry needed loss scaling; outside legacy inference paths it has largely lost to BF16.

The interesting rungs are below 16 bits, where a raw format no longer has enough range to survive on its own and must be paired with a scaling scheme. FP8 comes in two flavors — E4M3 (more precision) for forward-pass activations and weights, E5M2 (more range) for gradients — and by 2026 FP8 is the production default for inference and, increasingly, for the bulk of training matmuls. FP6 (typically E3M2/E2M3) is the awkward middle rung: AMD's CDNA4 and some inference stacks support it, and it occasionally wins as a quality-preserving step for weights that FP4 degrades too far, but it lacks FP4's clean 2x throughput story, so it remains a specialist tool rather than a tier. FP4 (E2M1 — two exponent bits, one mantissa bit, sixteen representable values) is the current frontier and the reason this chapter exists: it is where the throughput is, and where the accuracy cliff lives.

The precision ladder — range, throughput, and where each rung lives in 2026

Format	Bits (S-E-M)	Rel. tensor throughput	Needs scaling	Primary 2026 role
FP32	1-8-23	0.25x	No	Accumulation; numerically fragile ops
TF32	1-8-10	0.5x	No	Legacy free speedup; mostly superseded
BF16	1-8-7	1x (reference)	No	Training stability default; master weights
FP8 (E4M3/E5M2)	1-4-3 / 1-5-2	~2x	Per-tensor / per-block	Production inference; most training matmuls
FP6 (E3M2/E2M3)	1-3-2 / 1-2-3	~2x	Per-block (microscaling)	Specialist quality step (CDNA4, some stacks)
FP4 (E2M1)	1-2-1	~4x	Microscaling (mandatory)	Inference frontier; emerging FP4 training

Throughput multipliers are relative to BF16 on the same tensor core (dense), and are silicon-architecture approximations, not exact per-SKU figures. 'Scaling' = whether the format needs an external scale to have usable range. FP4/FP6 figures assume Blackwell/CDNA4-class hardware.

The table is a sequence of one-way savings. Each downward step roughly doubles tensor throughput and halves operand bytes, but the 'needs scaling' column is the catch. At FP8 and below, the raw format's dynamic range is too narrow to represent the spread of values inside a real tensor (activations have outliers; weights have long tails). The fix is to factor each tensor into a low-precision payload plus a higher-precision scale, and the granularity of that scale — per-tensor, per-channel, or per-small-block — is the single most important quality lever below 8 bits. That granularity question is what microscaling answers, and it is where NVFP4 and MXFP4 part ways.

Microscaling: the scheme that made FP4 usable

The problem with a single scale factor for a whole tensor is that one outlier value forces the scale large, which crushes every small value into the same few FP4 codes — you spend all sixteen representable values describing the outlier and have nothing left for the signal. Microscaling (MX) solves this by giving every small block of elements its own scale. A block of values that happen to be small gets a small scale and fine resolution; a block with an outlier gets a large scale locally, without contaminating the rest of the tensor. This is the idea, standardized by the Open Compute Project's MX formats, that turned FP4 from a research curiosity into a shippable inference format.

The fork inside microscaling is the one practitioners actually have to decide, because it is wired into the hardware and the checkpoint format. MXFP4 (the OCP open standard) uses 32-element blocks with an E8M0 scale — a pure power-of-two exponent, so every scale snaps to the nearest 2ⁿ. NVFP4 (NVIDIA's Blackwell-native format) uses 16-element blocks with an E4M3 (FP8) scale — half the block size for more localized adaptation, plus a real mantissa in the scale so it can land between powers of two, backed by a second-level FP32 per-tensor scale for global range. The consequence is concrete: NVIDIA's own data shows the E4M3 scale cuts per-block mean-squared quantization error from 0.72 (E8M0) to 0.08 — an 88% error reduction from the scale format alone, before counting the smaller block size (NVIDIA, Introducing NVFP4, 2025).

NVFP4 vs MXFP4 — the FP4 format fork

Attribute	MXFP4 (OCP)	NVFP4 (NVIDIA)	Consequence of choosing
Block size	32 elements	16 elements	Smaller block = finer local adaptation, more scale overhead
Scale format	E8M0 (power-of-two)	E4M3 (FP8, fractional)	E4M3 cuts per-block MSE ~88% vs E8M0
Second-level scale	None	Per-tensor FP32	NVFP4 extends global dynamic range
Effective bits/value	~4.25	~4.5	NVFP4 pays slight extra memory for accuracy
Ecosystem	Open, vendor-neutral (AMD, Intel, OCP)	NVIDIA Blackwell-native	Portability vs best-on-NVIDIA accuracy
Inference accuracy	Larger drop at 4-bit	≤1% drop vs FP8 on DeepSeek-R1	Quality margin before re-tuning

Both are 4-bit E2M1 payloads; the difference is entirely in the scaling. Figures from NVIDIA's NVFP4 disclosures and the OCP MX specification; accuracy deltas are vs an FP8 baseline on the cited model.

The format you standardize on is a portability-vs-accuracy fork

NVFP4 is more accurate at 4 bits and is the native path on Blackwell and Rubin — but it is a single-vendor format, and a checkpoint quantized to NVFP4 is not a clean drop-in on AMD or hyperscaler silicon. MXFP4 is the open, OCP-standard format supported across AMD CDNA4, Intel, and the broader ecosystem, at a measurable accuracy cost. If you run a single-vendor NVIDIA fleet and chase the last point of quality, NVFP4 is the rational default. If you run a heterogeneous fleet (see Chapter 7.5) or care about checkpoint portability across vendors (Chapter 7.9), MXFP4 keeps you free at a quality tax you must measure per model. Decide this before you bake quantized weights into a serving pipeline — re-quantizing a deployed model family across formats is a re-validation project, not a config flag.

FP4 training: viable, but not free

For years the consensus was that FP4 was an inference-only format — fine for a frozen model, fatal for the delicate dynamics of a gradient-descent loop where small errors compound over trillions of steps. That consensus broke in 2025. Peer research established that native FP4 training can be Pareto-optimal at a fixed compute budget — for a given number of FLOPS, an FP4 run can reach better loss than a BF16 or FP8 run, because the throughput gain buys more tokens or more parameters per dollar (Castro et al., Quartet, arXiv 2025). NVIDIA then demonstrated it at scale: a 12B hybrid Mamba-Transformer pre-trained on 10 trillion tokens in NVFP4, the longest publicly documented 4-bit training run, with a validation-loss curve tracking the FP8 baseline and final MMLU-Pro of 62.58% vs 62.62% for FP8 — within noise (NVIDIA, Pretraining LLMs with NVFP4, 2025/2026).

The word doing the work in 'Pareto-optimal' is recipe. FP4 training does not work by flipping a precision flag; it works because of a stack of techniques that keep the gradient signal alive: master weights held in higher precision (BF16/FP32) while only the matmul operands are FP4; stochastic rounding so quantization error is unbiased rather than systematically truncating; selective high-precision layers (embeddings, the final projection, and the most sensitive blocks stay at FP8/BF16); and careful management of the backward pass, where gradients have the widest dynamic range and are the first thing to break. Skip any of these and the run either diverges or silently underperforms a cheaper FP8 run — the worst outcome, because you paid the FP4 risk premium and got nothing.

FP4 training narrows your stability margin — budget for it

An FP8 training run forgives a lot: a mistuned learning-rate warmup, a slightly hot batch, a single bad shard. An FP4 run forgives less. The dynamic range at 4 bits is so tight that the same perturbation that an FP8 run absorbs can push an FP4 run into a loss spike or a slow, invisible divergence that only shows up in downstream evals. The operational consequence is that FP4 training raises the value of everything in Chapter 9.4 (frequent checkpointing, fast restart) and Chapter 12.2 (goodput discipline): you will restart more, and you need the loss-curve and eval telemetry to catch a quiet divergence before you have burned a week of cluster time. The throughput is real; so is the babysitting. For a first frontier run on a new model family, holding training at FP8 and reserving FP4 for inference is still the conservative, defensible call.

Inference quantization: where the quality floor actually bites

Inference is where quantization pays off most and where the failure modes are most visible to users. The post-training quantization (PTQ) toolkit is mature: weight-only schemes (GPTQ, AWQ) that quantize the weights and keep activations at higher precision — ideal for memory-bound, small-batch decode where the bottleneck is reading weights from HBM; weight-and-activation schemes (the FP8/FP4 path) that quantize both and unlock the full tensor-core throughput, needed when you are compute-bound at large batch or in prefill; and SmoothQuant-style transforms that migrate activation outliers into the weights so both can be quantized cleanly. The decision is workload-shaped: a latency-bound chatbot serving one request at a time is memory-bound and wants weight-only INT4/FP4; a high-throughput batch pipeline is compute-bound and wants full FP8/FP4 matmuls.

The hard part is the accuracy gate, and the trap is measuring the wrong thing. A perplexity delta of a few hundredths looks reassuring and can hide a real regression on a downstream task that matters — code generation, multi-step math, instruction-following on a specific domain. The 2026 discipline is to gate quantization on a task-eval suite per model family, not a single perplexity number, because the same NVFP4 recipe that costs ≤1% on MMLU-Pro for one model can cost several points of pass@1 on a code benchmark for another. NVFP4's headline result — ≤1% degradation vs FP8 on DeepSeek-R1, with AIME actually improving in NVIDIA's measurement — is real, but it is a per-model result, not a guarantee. Reasoning models that emit long chains are especially exposed: a small per-token error compounds over a 10K-token decode, so a format that is fine for short answers can degrade a long reasoning trace.

Inference quantization decision — which scheme for which workload

Workload shape	Bound by	Recommended scheme	Quality risk	Why
Low-batch interactive decode	HBM bandwidth (weight reads)	Weight-only INT4/FP4 (AWQ/GPTQ)	Low–moderate	Activations stay high-precision; halves weight bytes
High-throughput batch / prefill	Tensor-core compute	Full FP8, then NVFP4/MXFP4	Moderate	Quantizing both operands unlocks ~2–4x FLOPS
Long-context reasoning decode	KV-cache capacity + bandwidth	KV-cache quant (FP8/INT8) + weight FP4	Moderate–high	KV memory, not weights, caps concurrency
Quality-critical / regulated	Accuracy, not cost	FP8 weight+activation; FP4 only if gated	Low	Keep the largest margin to the quality floor

Heuristic mapping, not a rule. 'Bound by' is the dominant bottleneck the scheme targets. Quality risk is relative and must be confirmed with a per-model eval gate.

KV-cache quantization: the memory tier that sets your concurrency

For long-context and reasoning inference, the binding resource on a decode node is no longer the weights — it is the KV-cache, the per-token key/value tensors that every attention step must re-read. Its size scales with batch size times sequence length, and for a frontier model at 100K+ context it can dwarf the model weights in a decode node's HBM budget. That makes KV-cache precision the lever that most directly governs how many concurrent users a node can hold: quantize the cache from BF16 to FP8 and you roughly double the number of sequences in flight; quantize to INT4 and you double it again. In an inference business, that is a direct multiplier on goodput per GPU and therefore on cost-per-token (Chapter 7.11).

The catch is that the KV-cache is more sensitive to quantization than the weights, because errors in the keys and values feed directly into the attention softmax, where they compound across every subsequent token. FP8 KV-cache is broadly safe in 2026 and widely deployed. INT4 KV-cache is where you start to see long-context recall degrade — the model loses the fine distinctions that let it retrieve a fact buried 80K tokens back — so it is typically applied selectively (recent tokens at higher precision, older tokens compressed) rather than uniformly. The mature pattern is a tiered KV hierarchy: hot KV in HBM at FP8, warm KV spilled to host DRAM or CXL, cold KV offloaded to NVMe-attached flash, with the precision dropping as the tier cools. This is the inference-side analog of the storage memory hierarchy, and it is where KV quantization, disaggregated prefill/decode, and the new Ethernet-attached context-memory tier converge (Chapter 9.7).

~2x / ~4x

tensor throughput per step down the ladder (BF16→FP8→FP4), same silicon

2026NVIDIA Blackwell architecture; domain research keyNumbers

88%

per-block MSE reduction from NVFP4's E4M3 scale vs MXFP4's E8M0 (0.72→0.08)

2025NVIDIA, Introducing NVFP4 for Low-Precision Inference

16 vs 32

NVFP4 block size (16 values, FP8 scale) vs MXFP4 (32 values, power-of-two scale)

2025NVIDIA NVFP4 disclosure; OCP MX specification

≤1%

NVFP4 inference accuracy drop vs FP8 on DeepSeek-R1 (MMLU-Pro 84% vs 85%)

2025NVIDIA, Introducing NVFP4

10T tokens

longest documented 4-bit pre-training run (12B Mamba-Transformer, NVFP4); 62.58% vs 62.62% MMLU-Pro vs FP8

2025–2026NVIDIA, Pretraining LLMs with NVFP4 (arXiv 2509.25149)

3.5x / 1.8x

NVFP4 memory reduction vs FP16 / vs FP8

2025NVIDIA, Introducing NVFP4

~50 PFLOPS

Vera Rubin per-GPU FP4 inference (Rubin Ultra ~100 PFLOPS NVFP4)

2026 (roadmap)NVIDIA Rubin platform; The Next Platform roadmap

~20 PFLOPS

AMD MI355X FP4 (sparse), CDNA4, with native FP6 support

2025AMD MI355X product page

Deep dive: why the same FP4 format passes one model and fails another

The uncomfortable truth of low-precision inference is that there is no universal answer to 'is FP4 safe.' Quantization sensitivity is a property of the weight and activation distributions a specific training run produced, and those differ by architecture, by training recipe, and even by random seed. Three things move the needle. First, outlier structure: models with persistent large-magnitude activation channels (common in some attention and MLP layers) force large block scales that crush the FP4 codes for everything else — these are the models that need SmoothQuant-style transforms or per-channel handling before FP4 is viable. Second, depth and decode length: a per-token quantization error that is invisible on a 200-token answer accumulates over a 20K-token reasoning trace, so reasoning models stress quantization far harder than chat models at the same nominal accuracy. Third, task brittleness: code and math have sharp pass/fail boundaries — one wrong token breaks a program — so they expose quantization damage that a fuzzy NLU benchmark averages away.

The consequence for an operator is a process, not a number. You cannot certify 'we run FP4' as a blanket policy; you certify it per model family, against a downstream eval suite that includes the brittle tasks, with a documented quality gate and a fallback to FP8 for the families that fail it. The NVFP4 advantage over MXFP4 — finer blocks, fractional scales — buys margin precisely on the hardest of these cases, which is why a heterogeneous fleet's MXFP4 quality tax is not a flat percentage but a per-model risk you have to measure. Mixed-precision serving (FP4 for the easy 80% of traffic, FP8 for the quality-critical 20%) is the pragmatic 2026 production pattern.

Deep dive: why the compute-memory tradeoff drives precision choices

It is tempting to read precision purely as a throughput knob, but the deeper reason it dominates 2026 accelerator design is the compute-memory tradeoff. Inference is overwhelmingly memory-bandwidth-bound: in low-batch decode, the GPU spends most of its time waiting to read weights and KV from HBM, and the FLOPS sit idle. In that regime, halving the bytes per parameter via FP4/INT4 is a near-linear speedup even though the FLOPS were never the constraint — you are buying bandwidth, not compute. This is why weight-only quantization wins for interactive serving and why HBM capacity and bandwidth (Chapter 7.6) are the binding constraint they are: precision is partly a way to stretch a scarce, expensive HBM budget across more tokens and more users.

Prefill and large-batch training invert this: they are compute-bound, the tensor cores are the bottleneck, and there the ~2x/~4x FLOPS multiplier from FP8/FP4 is the prize. The same model, on the same GPU, sits on opposite sides of this tradeoff depending on whether it is prefilling a long prompt or decoding one token at a time — which is exactly why disaggregated prefill/decode (Chapter 10.11) and per-phase precision choices co-evolved. The unifying lesson: precision is not one decision but a per-phase one, and the right ladder rung for the prefill matmul is frequently not the right rung for the decode loop or the KV-cache.

Anti-patterns

The recurring mistakes all come from treating precision as a single global flag instead of a per-phase, per-workload decision gated on the right metric.

Gating on perplexity instead of task evals. A reassuring perplexity delta hides a real regression on the brittle tasks (code, math, long-context retrieval) that users actually feel. Quantization must be gated per model family on a downstream eval suite, not a single intrinsic-metric number.
Flipping the precision flag and calling it done. FP4 training without master weights, stochastic rounding, and selective high-precision layers either diverges or silently underperforms a cheaper FP8 run — you pay the FP4 risk premium for negative return.
Uniform KV-cache quantization for long context. Pushing the entire KV-cache to INT4 to maximize concurrency, then losing long-context recall because the oldest tokens — the ones a retrieval depends on — lost their precision. Tier the cache; keep recent/hot KV at higher precision.
Picking a format for the headline FLOPS number. Buying silicon on its FP4-with-sparsity peak, then discovering your quality gate forces FP8 for most traffic and the FP4 peak is unreachable. Spec to the precision you can actually ship, not the datasheet's largest number (Chapter 7.1).

Precision is the connective tissue of Part 7. The FP4-with-sparsity peaks on every datasheet are read critically in Chapter 7.1; the NVIDIA and AMD FP4/FP6 silicon that implements these formats is specced in Chapter 7.2 and Chapter 7.3; the hyperscaler XPUs and their native formats (MXFP8 on Trainium, FP8 on TPU) are in Chapter 7.4. The HBM budget that quantization stretches is the binding constraint of Chapter 7.6; the format-portability dimension of lock-in is in Chapter 7.9; and the cost-per-token math that every precision decision ultimately serves is the governing metric of Chapter 7.11. On the systems side, FP4 training's stability cost raises the stakes of checkpointing in Chapter 9.4 and goodput discipline in Chapter 12.2, while KV-cache quantization is one half of the disaggregated-inference memory hierarchy built out in Chapter 9.7 and the serving-engineering tradeoffs of Chapter 10.11.