Chapter 7.10
Precision, Quantization & the Compute-Memory Tradeoff
Precision is the cheapest performance lever in the building and the most dangerous: every step down the ladder roughly doubles tensor-core throughput and halves the memory footprint, but it spends accuracy headroom you cannot always get back — so the engineering question is never 'how low can we go' but 'how low, with which scaling scheme, before this specific workload crosses its quality floor.'
What you'll decide here
- Where on the precision ladder (BF16 → FP8 → FP6 → FP4) each phase of your workload runs, recognizing that the answer differs between pre-training, the matmuls inside a forward pass, the master weights, and the KV-cache.
- Which microscaling format you standardize on for FP4 — NVFP4 (16-value blocks, FP8 scale) vs MXFP4 (32-value blocks, power-of-two scale) — because it is silently baked into the silicon you buy and the checkpoints you ship.
- Whether you commit to FP4 training now (Pareto-optimal at fixed compute, but with a recipe tax and a narrow stability margin) or hold the line at FP8 training and reserve FP4 for inference only.
- How aggressively you quantize the KV-cache — the memory tier that, not weights, now governs how many concurrent users a decode node can hold — and at what point KV quantization starts degrading long-context recall.
- Whether your accuracy gate is a perplexity delta or a downstream task-eval suite, signed off per model family, because the format that passes one can fail the other.
There is a number on every accelerator datasheet that is larger than all the others, and it is almost always an FP4 number. The GB200 NVL72 headline of ~1.44 ExaFLOPS, the Rubin NVL72's 3.6 ExaFLOPS, the MI355X's ~20 PFLOPS — these are FP4-with-sparsity peaks, and they exist because precision is the one lever that multiplies throughput without adding a single transistor of compute area or a single watt of power. A tensor core that does one BF16 multiply-accumulate per cycle does roughly two in FP8 and four in FP4, in the same silicon, at the same clock. Halving the bit-width also halves the bytes you move across HBM, NVLink, and the network — and in an inference world that is memory-bandwidth-bound, that second effect often matters more than the first.
So precision is free performance, except that it is not. Every bit you remove from a number narrows its dynamic range or its mantissa resolution, and somewhere down the ladder the model's outputs start to drift: a perplexity bump, a few points off MMLU, a long-context retrieval that quietly fails. This chapter walks that fork. We descend the precision ladder rung by rung (FP32 → BF16 → FP8 → FP6 → FP4), explain why microscaling, and the NVFP4-vs-MXFP4 split inside it, is the thing that made FP4 viable at all, weigh FP4 training against holding at FP8, and treat the KV-cache as the precision decision that most directly buys or burns inference goodput.
The precision ladder: what each rung actually buys
A floating-point format is a budget split between exponent bits (dynamic range — how large and small a number it can represent) and mantissa bits (precision — how finely it resolves values within that range). The history of AI numerics is the history of discovering that neural networks need a lot of range but tolerate surprisingly little precision, and then ruthlessly cutting the mantissa to cash that in.
FP32 (1-8-23) was the training default through ~2017 and is now a niche: a place to accumulate, and a fallback for the few operations that are numerically fragile (some normalizations, some loss scaling). TF32 (NVIDIA's 1-8-10 internal tensor-core format) was a transitional hack, FP32's range with a truncated mantissa, that bought a free speedup with no code change. BF16 (1-8-7) was the breakthrough that defined the modern era: it keeps FP32's full 8-bit exponent (so gradients keep FP32's dynamic range and loss scaling is generally unnecessary) while throwing away roughly two-thirds of the mantissa (16 of 23 bits). BF16 is still the safe, boring default for training stability and the reference everything else is measured against. FP16 (1-5-10) trades range for precision and is why the industry needed loss scaling; outside legacy inference paths it has largely lost to BF16.
The interesting rungs are below 16 bits, where a raw format no longer has enough range to survive on its own and must be paired with a scaling scheme. FP8 comes in two flavors — E4M3 (more precision) for forward-pass activations and weights, E5M2 (more range) for gradients — and by 2026 FP8 is the production default for inference and, increasingly, for the bulk of training matmuls. FP6 (typically E3M2/E2M3) is the awkward middle rung: AMD's CDNA4 and some inference stacks support it, and it occasionally wins as a quality-preserving step for weights that FP4 degrades too far, but it lacks FP4's clean 2x throughput story, so it remains a specialist tool rather than a tier. FP4 (E2M1 — two exponent bits, one mantissa bit, sixteen representable values) is the current frontier and the reason this chapter exists: it is where the throughput is, and where the accuracy cliff lives.
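To make "sixteen representable values" concrete, the sketch below decodes all sixteen 4-bit E2M1 codes. It assumes the element definition used by the OCP MX formats (exponent bias of 1, one subnormal step, no infinities or NaNs); the function name `decode_e2m1` is illustrative, not a library API.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only nonzero value is 0.5.
        return sign * man * 0.5
    # Normal: implicit leading 1 and an exponent bias of 1 (assumed).
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Collapse +/- pairs to see the distinct magnitudes the format can express.
magnitudes = sorted({abs(decode_e2m1(c)) for c in range(16)})
print(magnitudes)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The sixteen codes cover eight magnitudes and their negatives (including a signed zero), with a largest magnitude of 6.0. The spacing is coarse and non-uniform, which is why the choice of scale matters so much at this rung.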
| Format | Bits (S-E-M) | Rel. tensor throughput | Needs scaling | Primary 2026 role |
|---|---|---|---|---|
| FP32 | 1-8-23 | 0.25x | No | Accumulation; numerically fragile ops |
| TF32 | 1-8-10 | 0.5x | No | Legacy free speedup; mostly superseded |
| BF16 | 1-8-7 | 1x (reference) | No | Training stability default; master weights |
| FP8 (E4M3/E5M2) | 1-4-3 / 1-5-2 | ~2x | Per-tensor / per-block | Production inference; most training matmuls |
| FP6 (E3M2/E2M3) | 1-3-2 / 1-2-3 | ~2x | Per-block (microscaling) | Specialist quality step (CDNA4, some stacks) |
| FP4 (E2M1) | 1-2-1 | ~4x | Microscaling (mandatory) | Inference frontier; emerging FP4 training |
The table is a sequence of one-way savings. Each downward step roughly doubles tensor throughput and halves operand bytes, but the 'needs scaling' column is the catch. At FP8 and below, the raw format's dynamic range is too narrow to represent the spread of values inside a real tensor (activations have outliers; weights have long tails). The fix is to factor each tensor into a low-precision payload plus a higher-precision scale, and the granularity of that scale — per-tensor, per-channel, or per-small-block — is the single most important quality lever below 8 bits. That granularity question is what microscaling answers, and it is where NVFP4 and MXFP4 part ways.
Microscaling: the scheme that made FP4 usable
The problem with a single scale factor for a whole tensor is that one outlier value forces the scale large, which crushes every small value into the same few FP4 codes — you spend all sixteen representable values describing the outlier and have nothing left for the signal. Microscaling (MX) solves this by giving every small block of elements its own scale. A block of values that happen to be small gets a small scale and fine resolution; a block with an outlier gets a large scale locally, without contaminating the rest of the tensor. This is the idea, standardized by the Open Compute Project's MX formats, that turned FP4 from a research curiosity into a shippable inference format.
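The outlier problem is easy to reproduce. The sketch below quantizes a synthetic tensor containing one large outlier to the FP4 magnitudes, once with a single scale for the whole tensor and once with one scale per 32-element block. Everything here is an illustrative assumption: the data is synthetic, the scales are kept as plain floats, and absmax scaling stands in for whatever calibration a real stack uses.

```python
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` onto FP4's maximum (6.0)."""
    amax = max(abs(v) for v in values)
    return amax / 6.0 if amax > 0 else 1.0


def quantize_block(values, scale):
    """Round each value to the nearest FP4 magnitude times `scale`, keeping its sign."""
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def quantize(values, block_size):
    """Quantize with one absmax scale per block of `block_size` elements."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        out.extend(quantize_block(block, absmax_scale(block)))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
tensor = [rng.gauss(0.0, 1.0) for _ in range(1024)]
tensor[100] = 1000.0  # a single outlier in an otherwise well-behaved tensor

per_tensor_mse = mse(tensor, quantize(tensor, block_size=len(tensor)))
per_block_mse = mse(tensor, quantize(tensor, block_size=32))
print(f"per-tensor scale MSE: {per_tensor_mse:.4f}")
print(f"per-block  scale MSE: {per_block_mse:.4f}")
```

With a per-tensor scale, the outlier pushes every ordinary value to zero. With per-block scales, only the 31 neighbors that share the outlier's block pay that price.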
The fork inside microscaling is the one practitioners actually have to decide, because it is wired into the hardware and the checkpoint format. MXFP4 (the OCP open standard) uses 32-element blocks with an E8M0 scale — a pure power-of-two exponent, so every scale snaps to the nearest 2ⁿ. NVFP4 (NVIDIA's Blackwell-native format) uses 16-element blocks with an E4M3 (FP8) scale — half the block size for more localized adaptation, plus a real mantissa in the scale so it can land between powers of two, backed by a second-level FP32 per-tensor scale for global range. The consequence is concrete: NVIDIA's own data shows the E4M3 scale cuts per-block mean-squared quantization error from 0.72 (E8M0) to 0.08 — an 88% error reduction from the scale format alone, before counting the smaller block size (NVIDIA, Introducing NVFP4, 2025).
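The effect of the scale format can be sketched in isolation. The code below compares an MXFP4-like setup (32-element blocks, scale rounded up to a power of two) with an NVFP4-like one (16-element blocks, fractional scale) on synthetic Gaussian data. It is a simplification under stated assumptions, not either specification: the fractional scale is kept as an unrounded float rather than quantized to E4M3, the power-of-two scale is rounded up to avoid clipping, and the second-level FP32 scale is omitted. It does not attempt to reproduce NVIDIA's 0.72 and 0.08 figures.

```python
import math
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def fractional_scale(block):
    """NVFP4-like (assumed): scale lands exactly where absmax needs it."""
    amax = max(abs(v) for v in block)
    return amax / 6.0 if amax > 0 else 1.0


def power_of_two_scale(block):
    """MXFP4-like (assumed): scale snaps up to the next power of two."""
    return 2.0 ** math.ceil(math.log2(fractional_scale(block)))


def quantize_block(values, scale):
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def blockwise_mse(values, block_size, scale_fn):
    """Mean-squared quantization error with one scale per block."""
    err = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        quantized = quantize_block(block, scale_fn(block))
        err += sum((x - y) ** 2 for x, y in zip(block, quantized))
    return err / len(values)


rng = random.Random(0)
tensor = [rng.gauss(0.0, 1.0) for _ in range(4096)]

mx_like_mse = blockwise_mse(tensor, 32, power_of_two_scale)
nv_like_mse = blockwise_mse(tensor, 16, fractional_scale)
print(f"MXFP4-like MSE: {mx_like_mse:.5f}")
print(f"NVFP4-like MSE: {nv_like_mse:.5f}")
```

A power-of-two scale can overshoot the ideal scale by up to 2x, which stretches the whole FP4 grid and coarsens every value in the block. The fractional scale avoids that overshoot, and the smaller block keeps each scale closer to its local data.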
| Attribute | MXFP4 (OCP) | NVFP4 (NVIDIA) | Consequence of choosing |
|---|---|---|---|
| Block size | 32 elements | 16 elements | Smaller block = finer local adaptation, more scale overhead |
| Scale format | E8M0 (power-of-two) | E4M3 (FP8, fractional) | E4M3 cuts per-block MSE ~88% vs E8M0 |
| Second-level scale | None | Per-tensor FP32 | NVFP4 extends global dynamic range |
| Effective bits/value | ~4.25 | ~4.5 | NVFP4 pays slight extra memory for accuracy |
| Ecosystem | Open, vendor-neutral (AMD, Intel, OCP) | NVIDIA Blackwell-native | Portability vs best-on-NVIDIA accuracy |
| Inference accuracy | Larger drop at 4-bit | ≤1% drop vs FP8 on DeepSeek-R1 | Quality margin before re-tuning |
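The effective bits-per-value row follows directly from the block size and scale width: each block carries one 8-bit scale amortized over its elements. A minimal sketch of that arithmetic, ignoring NVFP4's per-tensor FP32 scale (which amortizes to nearly zero over a large tensor):

```python
def effective_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Payload bits plus the per-block scale amortized over the block."""
    return element_bits + scale_bits / block_size


mxfp4_bits = effective_bits(element_bits=4, scale_bits=8, block_size=32)
nvfp4_bits = effective_bits(element_bits=4, scale_bits=8, block_size=16)
print(mxfp4_bits)  # 4.25
print(nvfp4_bits)  # 4.5

# Relative memory cost of NVFP4's finer blocks for the same tensor.
print(f"{nvfp4_bits / mxfp4_bits - 1:.1%}")  # 5.9%
```

NVFP4's finer scaling therefore costs roughly 6% more memory than MXFP4 for the same tensor.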
FP4 training: viable, but not free
For years the consensus was that FP4 was an inference-only format: fine for a frozen model, fatal for the delicate dynamics of a gradient-descent loop where small errors compound over trillions of tokens. That consensus broke in 2025. Academic research established that native FP4 training can be Pareto-optimal at a fixed compute budget: for the same accelerator-hours, an FP4 run can reach better loss than a BF16 or FP8 run, because the throughput gain buys more tokens or more parameters per dollar (Castro et al., Quartet, arXiv 2025). NVIDIA then demonstrated it at scale: a 12B hybrid Mamba-Transformer pre-trained on 10 trillion tokens in NVFP4, the longest publicly documented 4-bit training run, with a validation-loss curve tracking the FP8 baseline and final MMLU-Pro of 62.58% vs 62.62% for FP8, which is within noise (NVIDIA, Pretraining LLMs with NVFP4, 2025/2026).
What makes FP4 training Pareto-optimal is the recipe. FP4 training does not work by flipping a precision flag; it works because of a stack of techniques that keep the gradient signal alive: master weights held in higher precision (BF16/FP32) while only the matmul operands are FP4; stochastic rounding so quantization error is unbiased rather than systematically truncating; selective high-precision layers (embeddings, the final projection, and the most sensitive blocks stay at FP8/BF16); and careful management of the backward pass, where gradients have the widest dynamic range and are the first thing to break. Skip any of these and the run either diverges or silently underperforms a cheaper FP8 run. That is the worst outcome, because you paid the FP4 risk premium and got nothing.
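Of the recipe's ingredients, stochastic rounding is the easiest to show in isolation. The sketch below rounds a value to a uniform grid, which is a simplification since the real FP4 grid is non-uniform. Round-to-nearest always lands on the same grid point, while stochastic rounding picks the upper or lower neighbor with probability proportional to proximity, so its average recovers the original value. The function names and the example value are illustrative.

```python
import random


def round_nearest(x: float, step: float) -> float:
    """Deterministic rounding: the same input always yields the same error."""
    return round(x / step) * step


def round_stochastic(x: float, step: float, rng: random.Random) -> float:
    """Round up with probability equal to the fractional distance past `lower`."""
    lower = (x // step) * step
    frac = (x - lower) / step  # in [0, 1)
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)
x, step, n = 0.7, 0.5, 20_000

nearest_mean = round_nearest(x, step)  # always 0.5, a persistent -0.2 bias
stochastic_mean = sum(round_stochastic(x, step, rng) for _ in range(n)) / n

print(f"round-to-nearest mean: {nearest_mean:.3f}")
print(f"stochastic mean:       {stochastic_mean:.3f}")
```

Each stochastic sample is noisier than the deterministic one, but the error averages out across many gradient updates instead of accumulating in one direction.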
Inference quantization: where the quality floor actually bites
Inference is where quantization pays off most and where the failure modes are most visible to users. The post-training quantization (PTQ) toolkit is mature: weight-only schemes (GPTQ, AWQ) that quantize the weights and keep activations at higher precision — ideal for memory-bound, small-batch decode where the bottleneck is reading weights from HBM; weight-and-activation schemes (the FP8/FP4 path) that quantize both and unlock the full tensor-core throughput, needed when you are compute-bound at large batch or in prefill; and SmoothQuant-style transforms that migrate activation outliers into the weights so both can be quantized cleanly. The decision is workload-shaped: a latency-bound chatbot serving one request at a time is memory-bound and wants weight-only INT4/FP4; a high-throughput batch pipeline is compute-bound and wants full FP8/FP4 matmuls.
The hard part is the accuracy gate, and the trap is measuring the wrong thing. A perplexity delta of a few hundredths looks reassuring and can hide a real regression on a downstream task that matters — code generation, multi-step math, instruction-following on a specific domain. The 2026 discipline is to gate quantization on a task-eval suite per model family, not a single perplexity number, because the same NVFP4 recipe that costs ≤1% on MMLU-Pro for one model can cost several points of pass@1 on a code benchmark for another. NVFP4's headline result — ≤1% degradation vs FP8 on DeepSeek-R1, with AIME actually improving in NVIDIA's measurement — is real, but it is a per-model result, not a guarantee. Reasoning models that emit long chains are especially exposed: a small per-token error compounds over a 10K-token decode, so a format that is fine for short answers can degrade a long reasoning trace.
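A gate of this kind can be sketched as a per-task comparison against the higher-precision baseline. The task names, scores, and thresholds below are made-up placeholders for illustration, not measurements of any model, and `gate_quantized_model` is a hypothetical helper rather than part of any evaluation framework.

```python
def gate_quantized_model(baseline, candidate, max_drop_points, per_task_limits=None):
    """Return (passed, failures), where failures maps task -> score drop in points."""
    per_task_limits = per_task_limits or {}
    failures = {}
    for task, base_score in baseline.items():
        drop = base_score - candidate[task]
        if drop > per_task_limits.get(task, max_drop_points):
            failures[task] = drop
    return len(failures) == 0, failures


# Hypothetical scores for one model family (higher is better).
baseline = {"knowledge_qa": 62.0, "code_pass_at_1": 55.0, "long_context_recall": 90.0}
candidate = {"knowledge_qa": 61.6, "code_pass_at_1": 51.0, "long_context_recall": 89.5}

# A perplexity-only gate (hypothetical numbers) would wave this candidate through.
perplexity_ok = (5.83 - 5.80) <= 0.05

passed, failures = gate_quantized_model(baseline, candidate, max_drop_points=1.0)
print(perplexity_ok)  # True
print(passed)         # False
print(failures)       # {'code_pass_at_1': 4.0}
```

In this made-up case the perplexity check and two of the three tasks look fine, and only the brittle code task exposes the regression. That is the pattern the per-family task suite exists to catch.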
| Workload shape | Bound by | Recommended scheme | Quality risk | Why |
|---|---|---|---|---|
| Low-batch interactive decode | HBM bandwidth (weight reads) | Weight-only INT4/FP4 (AWQ/GPTQ) | Low–moderate | Activations stay high-precision; weight bytes drop 2x vs FP8, 4x vs BF16 |
| High-throughput batch / prefill | Tensor-core compute | Full FP8, then NVFP4/MXFP4 | Moderate | Quantizing both operands unlocks ~2–4x FLOPS |
| Long-context reasoning decode | KV-cache capacity + bandwidth | KV-cache quant (FP8/INT8) + weight FP4 | Moderate–high | KV memory, not weights, caps concurrency |
| Quality-critical / regulated | Accuracy, not cost | FP8 weight+activation; FP4 only if gated | Low | Keep the largest margin to the quality floor |
KV-cache quantization: the memory tier that sets your concurrency
For long-context and reasoning inference, the binding resource on a decode node is no longer the weights — it is the KV-cache, the per-token key/value tensors that every attention step must re-read. Its size scales with batch size times sequence length, and for a frontier model at 100K+ context it can dwarf the model weights in a decode node's HBM budget. That makes KV-cache precision the lever that most directly governs how many concurrent users a node can hold: quantize the cache from BF16 to FP8 and you roughly double the number of sequences in flight; quantize to INT4 and you double it again. In an inference business, that is a direct multiplier on goodput per GPU and therefore on cost-per-token (Chapter 7.11).
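The concurrency arithmetic can be sketched with a back-of-envelope sizing function. The model shape and HBM budget below are hypothetical, chosen only to make the numbers readable, and the calculation ignores the scale metadata that a quantized cache also has to store.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_elem):
    """Bytes of KV-cache for one sequence: keys and values for every layer and token."""
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return elements * bits_per_elem // 8


# Hypothetical model shape and per-node budget left over for KV-cache.
shape = dict(layers=80, kv_heads=8, head_dim=128, seq_len=100_000)
kv_budget_bytes = 100 * 1024**3  # 100 GiB, assumed

per_sequence = {}
concurrent = {}
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    per_sequence[name] = kv_cache_bytes(bits_per_elem=bits, **shape)
    concurrent[name] = kv_budget_bytes // per_sequence[name]
    print(f"{name}: {per_sequence[name] / 1024**3:.1f} GiB/seq, "
          f"{concurrent[name]} sequences in flight")
```

Under these assumptions each halving of KV precision halves the bytes per sequence, so the number of 100K-token sequences that fit in the same budget at least doubles at each step.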
The catch is that the KV-cache is more sensitive to quantization than the weights: key errors perturb the attention softmax, value errors perturb its output, and because cached entries are re-read at every later decode step, the damage carries into every subsequent token. FP8 KV-cache is broadly safe in 2026 and widely deployed. INT4 KV-cache is where you start to see long-context recall degrade (the model loses the fine distinctions that let it retrieve a fact buried 80K tokens back), so it is typically applied selectively (recent tokens at higher precision, older tokens compressed) rather than uniformly. The mature pattern is a tiered KV hierarchy: hot KV in HBM at FP8, warm KV spilled to host DRAM or CXL, cold KV offloaded to NVMe-attached flash, with the precision dropping as the tier cools. This is the inference-side analog of the storage memory hierarchy, and it is where KV quantization, disaggregated prefill/decode, and the new Ethernet-attached context-memory tier converge (Chapter 9.7).
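A selective policy of the "recent tokens at higher precision, older tokens compressed" kind can be sketched as a function from token age to storage width. The window size and bit widths below are hypothetical, and keying the policy on age alone is an assumption made for simplicity.

```python
def kv_bits_for_token(age: int, recent_window: int) -> int:
    """Storage width for a cached token; age 0 is the most recent token."""
    return 8 if age < recent_window else 4  # FP8 for hot tokens, 4-bit for older ones


def average_kv_bits(seq_len: int, recent_window: int) -> float:
    """Average bits per cached element across the whole sequence."""
    total = sum(kv_bits_for_token(age, recent_window) for age in range(seq_len))
    return total / seq_len


avg_bits = average_kv_bits(seq_len=100_000, recent_window=4_096)
print(avg_bits)  # 4.16384
```

With these numbers the cache averages just over 4 bits per element, so it gets nearly all of INT4's capacity while the most recent 4K tokens stay at FP8. A policy keyed on something other than age would slot into the same function, and whichever rule is used still has to pass the long-context recall evals.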
Deep dive: why the same FP4 format passes one model and fails another
The uncomfortable truth of low-precision inference is that there is no universal answer to 'is FP4 safe.' Quantization sensitivity is a property of the weight and activation distributions a specific training run produced, and those differ by architecture, by training recipe, and even by random seed. Three things move the needle. First, outlier structure: models with persistent large-magnitude activation channels (common in some attention and MLP layers) force large block scales that crush the FP4 codes for everything else — these are the models that need SmoothQuant-style transforms or per-channel handling before FP4 is viable. Second, depth and decode length: a per-token quantization error that is invisible on a 200-token answer accumulates over a 20K-token reasoning trace, so reasoning models stress quantization far harder than chat models at the same nominal accuracy. Third, task brittleness: code and math have sharp pass/fail boundaries — one wrong token breaks a program — so they expose quantization damage that a fuzzy NLU benchmark averages away.
The consequence for an operator is a process, not a number. You cannot certify 'we run FP4' as a blanket policy; you certify it per model family, against a downstream eval suite that includes the brittle tasks, with a documented quality gate and a fallback to FP8 for the families that fail it. The NVFP4 advantage over MXFP4 — finer blocks, fractional scales — buys margin precisely on the hardest of these cases, which is why a heterogeneous fleet's MXFP4 quality tax is not a flat percentage but a per-model risk you have to measure. Mixed-precision serving (FP4 for the easy 80% of traffic, FP8 for the quality-critical 20%) is the pragmatic 2026 production pattern.
Deep dive: why the compute-memory tradeoff drives precision choices
It is tempting to read precision purely as a throughput knob, but the deeper reason it dominates 2026 accelerator design is the compute-memory tradeoff. Inference decode is overwhelmingly memory-bandwidth-bound: at low batch, the GPU spends most of its time waiting to read weights and KV from HBM, and the FLOPS sit idle. In that regime, halving the bytes per parameter (FP8 to FP4/INT4) is a near-linear speedup even though the FLOPS were never the constraint: you are buying bandwidth, not compute. This is why weight-only quantization wins for interactive serving and why HBM capacity and bandwidth (Chapter 7.6) are the binding constraint they are: precision is partly a way to stretch a scarce, expensive HBM budget across more tokens and more users.
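The bandwidth argument reduces to one division: in the memory-bound regime, a decode step cannot finish faster than the time it takes to stream the weights from HBM once. The parameter count and bandwidth below are hypothetical round numbers, and the estimate ignores KV-cache reads, activations, and any overlap or caching effects.

```python
def decode_step_floor_ms(params: float, bits_per_weight: int,
                         hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to read every weight from HBM once."""
    weight_bytes = params * bits_per_weight / 8
    return weight_bytes / hbm_bytes_per_s * 1e3


PARAMS = 70e9          # hypothetical 70B-parameter dense model
HBM_BANDWIDTH = 8e12   # hypothetical 8 TB/s of HBM bandwidth

floor_ms = {bits: decode_step_floor_ms(PARAMS, bits, HBM_BANDWIDTH)
            for bits in (16, 8, 4)}
for bits, ms in floor_ms.items():
    print(f"{bits}-bit weights: >= {ms:.3f} ms per token")
```

The floor scales linearly with bytes per weight, which is why a 4-bit weight format can speed up low-batch decode even when the matmul itself still runs in a higher precision.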
Prefill and large-batch training invert this: they are compute-bound, the tensor cores are the bottleneck, and there the ~2x/~4x FLOPS multiplier from FP8/FP4 is the prize. The same model, on the same GPU, sits on opposite sides of this tradeoff depending on whether it is prefilling a long prompt or decoding one token at a time — which is exactly why disaggregated prefill/decode (Chapter 10.11) and per-phase precision choices co-evolved. The unifying lesson: precision is not one decision but a per-phase one, and the right ladder rung for the prefill matmul is frequently not the right rung for the decode loop or the KV-cache.
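A roofline-style check makes the inversion visible. The sketch below compares the arithmetic intensity of a weight matmul (FLOPs per byte of weights read) with the hardware's ridge point (peak FLOPS divided by bandwidth). The hardware numbers are hypothetical, and the model ignores KV-cache and activation traffic.

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_weight: float) -> float:
    """FLOPs per weight byte: each weight does one multiply-add per token in the batch."""
    return 2 * batch_tokens / bytes_per_weight


def regime(batch_tokens: int, bytes_per_weight: float,
           peak_flops: float, hbm_bytes_per_s: float) -> str:
    """Classify a matmul as compute-bound or memory-bound against the ridge point."""
    ridge = peak_flops / hbm_bytes_per_s  # FLOPs per byte where the two limits meet
    intensity = arithmetic_intensity(batch_tokens, bytes_per_weight)
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 4 PFLOPS at FP8, 8 TB/s HBM, FP8 weights (1 byte each).
hw = dict(bytes_per_weight=1.0, peak_flops=4e15, hbm_bytes_per_s=8e12)

decode_regime = regime(batch_tokens=1, **hw)        # one token per step
prefill_regime = regime(batch_tokens=8_192, **hw)   # a long prompt in one pass
print(decode_regime)   # memory-bound
print(prefill_regime)  # compute-bound
```

The same weights on the same hardware land on opposite sides of the ridge depending only on how many tokens share each weight read, which is the reason precision has to be chosen per phase.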
Anti-patterns
The recurring mistakes all come from treating precision as a single global flag instead of a per-phase, per-workload decision gated on the right metric.
- Gating on perplexity instead of task evals. A reassuring perplexity delta hides a real regression on the brittle tasks (code, math, long-context retrieval) that users actually feel. Quantization must be gated per model family on a downstream eval suite, not a single intrinsic-metric number.
- Flipping the precision flag and calling it done. FP4 training without master weights, stochastic rounding, and selective high-precision layers either diverges or silently underperforms a cheaper FP8 run — you pay the FP4 risk premium for negative return.
- Uniform KV-cache quantization for long context. Pushing the entire KV-cache to INT4 to maximize concurrency, then losing long-context recall because the oldest tokens — the ones a retrieval depends on — lost their precision. Tier the cache; keep recent/hot KV at higher precision.
- Picking a format for the headline FLOPS number. Buying silicon on its FP4-with-sparsity peak, then discovering your quality gate forces FP8 for most traffic and the FP4 peak is unreachable. Spec to the precision you can actually ship, not the datasheet's largest number (Chapter 7.1).