Chapter 10.11
Inference Serving Engineering: SLOs, Batching, Disaggregation & Goodput-Optimal Scheduling
Inference serving is the constrained optimization of serving the most tokens that meet your SLO — not a throughput problem and not a latency problem on its own — and every lever (batching, chunking, disaggregation, speculation, routing) is a different bet on where that goodput-optimal point sits for your model, your traffic, and your fleet.
What you'll decide here
- Which of the two latency targets actually binds your SLO, TTFT (prefill-bound) or TPOT/ITL (decode-bound), because that single ranking decides whether you optimize the prefill path, the decode path, or pay to separate them.
- Whether to run aggregated serving with chunked prefill, or pay the disaggregation tax (a KV-cache transfer between prefill and decode tiers) to hit strict TTFT and TPOT simultaneously — the central fork of the modern stack.
- What you are actually maximizing: raw tokens/sec is the wrong objective; SLO-attainment-constrained goodput (tokens served that met the SLO) is the number that ties serving to revenue and to $/token.
- When speculative or parallel decoding pays for itself — it trades extra FLOPs for fewer sequential decode steps, which only wins when you are memory-bandwidth-bound and have spare compute, i.e. at low-to-moderate batch.
- Which serving engine and orchestration layer (vLLM, TensorRT-LLM, SGLang; Dynamo, llm-d, KServe) you standardize on, and whether your autoscaler scales against the SLO or against the easy-but-wrong signal of GPU utilization.
Every other layer of the stack in this guide exists to put a working accelerator under a token. This chapter is about the last hop: the software that turns a powered, healthy, scheduled GPU into served tokens that a user is willing to pay for. It is where the capital you spent on power, cooling, fabric, and silicon either earns its $/token or leaks away as idle bubbles, missed deadlines, and over-provisioned headroom. And it is the layer where the binding objective is almost universally mis-stated. The naive goal is throughput: tokens per second per GPU. The naive fix when latency complaints arrive is to chase p99 latency. Both are wrong on their own, because they optimize one axis while silently destroying the other. The correct objective is goodput: the rate of tokens served that also met the latency SLO. A server doing 50,000 tok/s where a third of requests blew their deadline delivers roughly 33,000 tok/s of goodput (assuming similar request sizes), less than one doing 38,000 tok/s where all of them landed, and only the goodput number tracks revenue.
We start with the latency taxonomy that defines the SLO (TTFT, TPOT, ITL), because the ranking of those targets deterministically selects every downstream lever. We work through the throughput levers — continuous/in-flight batching and chunked prefill — then the central architectural fork of the 2025-2026 stack: prefill/decode (P/D) disaggregation, and the KV-cache transfer that is its tax. We formalize SLO-constrained goodput, size a bursty always-on fleet with queueing theory, price out speculative and parallel decoding, route requests with prefix-cache awareness, and close on the engine landscape (vLLM, TensorRT-LLM, SGLang) and the disaggregated orchestrators (Dynamo, llm-d, KServe) — and on autoscaling that scales against the SLO rather than against a lie.
The latency taxonomy that governs the SLO
Autoregressive inference is two physically different workloads wearing one API. Prefill ingests the whole prompt in a single forward pass, computing the KV cache for every input token at once — it is compute-bound, a dense matmul that saturates the tensor cores and scales with prompt length. Decode then emits one token at a time, each step reading the entire model's weights and the growing KV cache from HBM to produce a single token — it is memory-bandwidth-bound, and the GPU's FLOPs sit mostly idle waiting on memory. These two phases have opposite bottlenecks, opposite batching behavior, and opposite scaling laws, and almost every serving decision in this chapter flows from refusing to treat them as one thing.
The user-facing SLO is expressed in three numbers, and you must know which one binds before you tune anything:
- TTFT (time-to-first-token): the wait from request arrival to the first streamed token. It is dominated by queueing delay plus the prefill forward pass, so it scales with prompt length and with how loaded the prefill path is. This is the number a user feels as "is it responding?"; for interactive chat it is typically targeted in the low hundreds of milliseconds, and for voice/agentic loops far tighter.
- TPOT (time-per-output-token): the average steady-state decode-step time once generation is underway. It sets the streaming speed and therefore the perceived "reading pace"; a TPOT of ~40-50 ms (20-25 tokens/s) already streams several times faster than typical human reading speed, and reasoning workloads that emit thousands of decode tokens make this the dominant contributor to total request latency.
- ITL (inter-token latency): the distribution of gaps between consecutive tokens, not just the mean. ITL is where batching interference shows up: a smooth TPOT average can hide ugly ITL stalls when a heavy prefill is admitted into a batch of in-flight decodes and freezes the stream for tens of milliseconds. SLOs that only specify mean TPOT and ignore p95/p99 ITL get gamed by exactly this.
The decision that cascades from here is the ranking of these targets. A batch-summarization product cares about end-to-end completion and tolerates a slow first token — TTFT is loose. A live voice agent cares about first-token responsiveness above all — TTFT binds hard. A long-form reasoning product lives and dies on TPOT/ITL because it emits a wall of tokens. Whichever binds tells you which physical phase to protect, and that is the fork that selects chunked prefill versus disaggregation below.
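As a minimal sketch of how the three numbers fall out of raw timestamps, each request needs only its arrival time and the emit time of every output token. The function names and the nearest-rank percentile are illustrative assumptions, not any engine's metrics API.

```python
import math
from statistics import mean

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]

def latency_metrics(arrival_ms, token_times_ms):
    """Derive TTFT, mean TPOT, and tail ITL for one request.

    arrival_ms: when the request arrived.
    token_times_ms: emit time of each output token, in order.
    """
    ttft = token_times_ms[0] - arrival_ms
    # Gaps between consecutive tokens are the ITL samples.
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    return {
        "ttft_ms": ttft,
        "tpot_ms": mean(gaps) if gaps else 0,
        "itl_p99_ms": percentile(gaps, 99) if gaps else 0,
    }

# One stream with a single stall between the 4th and 5th token.
metrics = latency_metrics(0, [200, 240, 280, 320, 500, 540])
print(metrics)
```

In this example the mean TPOT is 68 ms while the p99 gap is 180 ms, so the stall is visible only in the ITL tail.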
The throughput levers: continuous batching and chunked prefill
Static batching — wait for N requests, run them in lockstep, return when the longest finishes — is the obvious approach and it is catastrophic for generation, because requests finish at wildly different lengths and the whole batch is held hostage by its longest sequence while finished slots sit idle. Continuous (in-flight) batching is the fix that defines the modern era of serving: the scheduler operates at the granularity of a single decode step, evicting completed sequences and admitting waiting ones every iteration, so the GPU never waits for a batch to drain. Paired with paged KV-cache management (PagedAttention and its descendants), which allocates KV memory in non-contiguous blocks to eliminate the fragmentation that static pre-allocation causes, continuous batching is what lets a single GPU keep dozens to hundreds of sequences in flight at high utilization. It is table stakes in 2026; an engine that does not do it is not a serious contender.
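The idle-slot effect can be seen in a toy step-count model. This is a sketch under strong assumptions (every sequence costs one step per output token, prefill and memory limits are ignored, and both function names are hypothetical), not a model of any real engine.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each group of `slots` requests runs until its longest member finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the queue at every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Advance every sequence by one token and evict the finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 10, 10, 10, 10]  # output lengths in tokens
print(static_batching_steps(lengths, slots=2))
print(continuous_batching_steps(lengths, slots=2))
```

With two slots, the static schedule needs 130 steps because the first batch is held for 100 steps by its long sequence, while the continuous schedule finishes in 100 steps by cycling the short requests through the second slot.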
Continuous batching solves the decode-side bubble but reintroduces the prefill-versus-decode conflict described above: when do you admit a new prompt's prefill into a batch that is busy decoding? The two classical policies are both bad. Prefill-first (admit new prompts eagerly to minimize their TTFT) lets a long prefill freeze every in-flight decode stream: good TTFT, terrible ITL. Decode-first (drain decodes before admitting prefill) protects the streams but lets new requests queue, inflating TTFT. Chunked prefill is the technique that dissolves the dilemma: it splits a long prompt's prefill into bite-sized token chunks and interleaves those chunks with ongoing decode steps inside a fixed per-iteration token budget. No single iteration is dominated by a giant prefill, so decode streams keep flowing (ITL stays smooth) while new prompts make steady prefill progress (TTFT stays bounded). The cost is a modest hit to peak prefill throughput and a tuning knob, the per-step token budget, that trades TTFT against TPOT and must be set against your SLO ranking, not a vendor default.
| Policy | What it does | Protects | Sacrifices | Pick it when |
|---|---|---|---|---|
| Static batching | Fixed batch runs to completion in lockstep | Simplicity | Throughput & latency both — long-tail sequence stalls the batch | Never, for generative serving |
| Continuous batching | Per-step admit/evict; paged KV cache | Throughput & GPU utilization | Adds scheduler complexity; raw prefill/decode conflict remains | Always — baseline for any 2026 engine |
| Prefill-first | Eagerly admit new prompts | TTFT of new requests | ITL — long prefill freezes in-flight decode streams | TTFT binds and decode streams are short |
| Decode-first | Drain decodes before admitting prefill | ITL / TPOT of in-flight streams | TTFT — new requests queue behind decodes | TPOT binds and you can tolerate first-token wait |
| Chunked prefill | Split prefill into chunks, interleave with decode under a token budget | Both TTFT and TPOT, smoothly | Peak prefill throughput; adds a budget knob to tune | You need both targets met on one tier (the aggregated default) |
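The chunked-prefill row can be sketched as a per-iteration planner. This assumes one common policy (every decode stream gets its one token first, and the leftover budget goes to waiting prefills in FIFO order); the function and argument names are hypothetical and do not mirror any specific engine's scheduler.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Plan one iteration under a fixed token budget.

    decoding: list of request ids currently in decode (1 token each).
    prefilling: dict of request id -> remaining prompt tokens, FIFO order.
    Returns (decode ids to run, {request id: prefill chunk size}).
    """
    decode_ids = decoding[:token_budget]
    budget = token_budget - len(decode_ids)
    chunks = {}
    for rid, remaining in prefilling.items():
        if budget == 0:
            break
        take = min(remaining, budget)  # a chunk never exceeds what is left
        chunks[rid] = take
        budget -= take
    return decode_ids, chunks

# 48 decode streams share a 2,048-token budget with two waiting prompts.
decodes, chunks = schedule_step(list(range(48)), {"a": 8192, "b": 100}, 2048)
print(len(decodes), chunks)
```

Here the 8,192-token prompt receives a 2,000-token chunk per iteration, so it finishes prefill in five iterations while no decode stream ever skips a step. Raising the budget shortens TTFT for new prompts at the price of longer iterations, and therefore higher TPOT, for everyone already decoding.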
The central fork: prefill/decode disaggregation and its tax
Chunked prefill makes aggregated serving — one pool of GPUs doing both phases — hold both TTFT and TPOT reasonably well across a range of traffic. But there is a regime where it runs out of room: when both targets are strict and prompts are long, the interleaving still couples the two phases enough that you cannot independently meet them. The phases also want different hardware: prefill is compute-bound and rewards raw FLOPs and large tensor parallelism, while decode is memory-bound and rewards HBM capacity and bandwidth and high batch concurrency. Forcing both onto one homogeneous tier means you provision for the worse case on both axes and waste one of them.
Prefill/decode (P/D) disaggregation is the architectural answer: physically split the fleet into a prefill tier and a decode tier, each independently sized, parallelized, and scaled. A request runs prefill on the prefill tier, then its KV cache is handed to a decode worker, which streams the output. The wins are real and measured: the prefill tier is never interrupted by decode work, so TTFT drops sharply and stays stable under load; the decode tier runs uninterrupted at the big batch it wants, so TPOT/ITL stay smooth; and you can scale the two tiers independently (the runtime-reconfigurable xPyD pattern — x prefill workers feeding y decode workers — lets you re-balance as your traffic's prompt-to-output ratio shifts). For workloads with strict-on-both-axes SLOs, disaggregation is the better choice and the published goodput gains over chunked-prefill-on-one-tier are substantial.
The catch, and the reason this is a genuine fork rather than a free upgrade, is the disaggregation tax: the KV cache computed during prefill must be transferred to the decode worker before generation can begin, and that transfer sits on the critical path of TTFT. For a long prompt the KV cache is large (its size in bytes is 2 (for K and V) × prompt length × layers × KV heads × head dimension × bytes per element), so the transport matters enormously. Over fast intra-node NVLink the tax is small; over a node boundary it demands RDMA and a purpose-built transfer library (NVIDIA's NIXL is the de facto transport that llm-d and Dynamo both lean on, moving KV asynchronously across memory and storage tiers); over a slow link it can erase the entire TTFT win. This is why disaggregation only pays at sufficient scale and on a fast fabric, and why the KV-cache hierarchy is its own engineering problem. The transfer mechanics, the memory tiering, and the offload-to-flash economics are the subject of Chapter 9.7; this chapter governs when to pay the tax, not how to plumb it.
| Dimension | Aggregated + chunked prefill | Disaggregated (P/D) |
|---|---|---|
| TTFT under load | Good; bounded by chunk budget, degrades as load rises | Excellent & stable — prefill tier never blocked by decode |
| TPOT / ITL | Good; chunking smooths interference | Excellent — decode tier runs uninterrupted at large batch |
| Hardware efficiency | One homogeneous tier; over-provisions one axis | Each tier sized to its bottleneck (FLOPs vs HBM/BW) |
| Added cost | A tuning knob (token budget) | KV-cache transfer on TTFT critical path + RDMA/NIXL plumbing |
| Scaling | Scale one pool; phases coupled | xPyD — scale prefill and decode tiers independently |
| Best fit | Most workloads; small/medium fleets; mixed traffic | Strict-on-both SLOs, long prompts, large fleet, fast fabric |
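To get a feel for the size of the KV-transfer tax, a back-of-envelope estimate is often enough. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte elements) and both link bandwidths below are illustrative assumptions, and the estimate ignores protocol overhead and any overlap of transfer with compute.

```python
def kv_cache_bytes(prompt_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size for one request: K and V for every token in every layer."""
    return 2 * prompt_tokens * layers * kv_heads * head_dim * bytes_per_elem

def transfer_ms(num_bytes, link_gbytes_per_s):
    """Time to move num_bytes over a link of the given effective bandwidth."""
    return num_bytes / (link_gbytes_per_s * 1e9) * 1e3

# Hypothetical model shape and a 32K-token prompt.
size = kv_cache_bytes(prompt_tokens=32_768, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache: {size / 2**30:.1f} GiB")
for name, bandwidth in [("fast intra-node link", 450), ("cross-node RDMA", 50)]:
    print(f"{name}: {transfer_ms(size, bandwidth):.0f} ms added to TTFT")
```

Under these assumptions the cache is 10 GiB: about 24 ms on the fast link and about 215 ms on the slower one. The slower figure is on the order of an entire interactive TTFT budget, which is the quantitative form of the "fast fabric" condition in the table.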
SLO-attainment-constrained goodput: the real objective
With the levers on the table, we can state the objective they all serve precisely. Define goodput as the rate of tokens (or completed requests) served that satisfied the SLO — every token from a request that met its TTFT and TPOT targets counts; tokens from a request that breached either are badput, served but worthless or worse, because they consumed capacity while failing the customer. The serving problem is then a constrained optimization: maximize goodput subject to TTFT ≤ target and TPOT ≤ target at the required percentile. This reframing is not academic. It changes what you measure, what you alert on, and how you size the fleet.
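The definition translates directly into a metric. The sketch below uses a simplified per-request attainment rule (a request counts only if it met both targets) rather than a percentile constraint, and the `Request` record and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    output_tokens: int
    ttft_ms: float
    tpot_ms: float

def goodput_and_throughput(requests, window_s, ttft_slo_ms, tpot_slo_ms):
    """Return (goodput, raw throughput) in tokens/s over the window."""
    total = sum(r.output_tokens for r in requests)
    # Only tokens from requests that met BOTH targets count as goodput.
    good = sum(
        r.output_tokens
        for r in requests
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return good / window_s, total / window_s

served = [
    Request(1000, ttft_ms=150, tpot_ms=40),
    Request(1000, ttft_ms=900, tpot_ms=40),  # breached TTFT: badput
    Request(1000, ttft_ms=150, tpot_ms=40),
]
print(goodput_and_throughput(served, window_s=1.0, ttft_slo_ms=300, tpot_slo_ms=50))
```

The example reports 3,000 tok/s of throughput but only 2,000 tok/s of goodput, and the second number is the one worth alerting on.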
The consequence that practitioners get wrong: the throughput-maximizing operating point is past the goodput-maximizing one. As you push batch size and load up, raw tokens/sec keeps climbing toward saturation — but TTFT and TPOT degrade non-linearly as queues build, and at some load level the marginal tokens you add are all breaching the SLO. Goodput peaks before throughput and then falls, because beyond the knee you are converting good tokens into bad ones (every admitted request slows the ones already running). Running a serving fleet at maximum utilization is therefore a goodput error: you are proudly reporting a throughput number while shipping deadline misses. The right operating point sits at the goodput knee, which is deliberately below saturation — headroom that looks like waste on a utilization dashboard and is actually the SLO doing its job. This is the same idea as the reliability-overhead framing in training goodput, applied to latency rather than failures; the economic version of it governs $/token in Chapter 1.8 and Chapter 7.11.
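A toy queueing model makes the knee visible. Assume a single M/M/1 first-come-first-served server, where time in system is exponentially distributed with rate (service rate minus arrival rate), so the on-time fraction has a closed form. The service rate of 10 req/s and the 1 s deadline are made-up numbers, and a real engine's latency curve will differ in shape.

```python
import math

def goodput_rate(arrival_rate, service_rate, deadline_s):
    """Requests/s that finish within the deadline in an M/M/1 FCFS queue."""
    if arrival_rate >= service_rate:
        return 0.0  # unstable queue: latency grows without bound
    on_time = 1.0 - math.exp(-(service_rate - arrival_rate) * deadline_s)
    return arrival_rate * on_time

SERVICE_RATE = 10.0  # hypothetical capacity, requests/s
DEADLINE_S = 1.0     # hypothetical end-to-end deadline

loads = [0.5 * i for i in range(1, 20)]  # offered load: 0.5 .. 9.5 req/s
knee = max(loads, key=lambda lam: goodput_rate(lam, SERVICE_RATE, DEADLINE_S))
for lam in (5.0, knee, 9.5):
    print(lam, round(goodput_rate(lam, SERVICE_RATE, DEADLINE_S), 2))
```

On this grid goodput peaks at 8 req/s, or 80% utilization. Pushing to 9.5 req/s raises throughput by almost a fifth but cuts goodput to roughly half of its peak, because most added requests now miss the deadline and slow the rest.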
Queueing-theory fleet sizing for bursty, always-on demand
Online inference traffic is bursty and always-on: arrivals are roughly Poisson at short timescales with strong diurnal and event-driven swings, and there is no batch window to hide behind, because a user is waiting. Sizing the fleet is a queueing problem, and the queueing-theory result that matters is the one that surprises people: tail latency explodes as utilization approaches 1, and it does so non-linearly. In an M/M/c-style model the expected queueing delay scales roughly as 1/(1-ρ), where ρ is utilization. In the single-server (M/M/1) case the mean queue wait is ρ/(1-ρ) service times: 4x the service time at 80% utilization and 19x at 95%. A larger pool shrinks the constant but not the blow-up as ρ approaches 1. Because TTFT contains queue wait, an SLO on TTFT is really a cap on ρ: you cannot run a latency-bound fleet near saturation and meet a tail SLO, no matter how good your engine is.
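The utilization blow-up can be checked numerically with the standard Erlang C formula for an M/M/c queue. This is a sketch under textbook assumptions (Poisson arrivals, exponential service times, identical servers), which real inference traffic only approximates.

```python
import math

def erlang_c(servers, rho):
    """Probability that an arrival has to queue in M/M/c (rho < 1 per server)."""
    offered = servers * rho  # offered load in Erlangs
    below = sum(offered**k / math.factorial(k) for k in range(servers))
    tail = offered**servers / (math.factorial(servers) * (1 - rho))
    return tail / (below + tail)

def mean_wait_in_service_times(servers, rho):
    """Mean queue wait, expressed in multiples of the mean service time."""
    return erlang_c(servers, rho) / (servers * (1 - rho))

for rho in (0.5, 0.8, 0.95):
    single = mean_wait_in_service_times(1, rho)
    pooled = mean_wait_in_service_times(8, rho)
    print(f"rho={rho}: 1 server {single:.2f}x, pool of 8 {pooled:.2f}x")
```

For a single server the wait is 4 service times at 80% utilization and 19 at 95%. The pooled column is lower at every utilization but still climbs steeply toward saturation, which is why pooling helps the tail without removing the need for headroom.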
Three decisions fall out of this. (1) Size to the goodput knee, not to average load: provision enough replicas that peak diurnal demand still sits below the utilization ceiling your TTFT SLO implies, which means carrying idle headroom by design. (2) Exploit the c in M/M/c: one big pool of c servers has far better tail behavior than c isolated single-server queues, because a momentary burst can be absorbed by any free replica; this is the queueing-theoretic argument for fleet-wide load balancing over per-replica sharding, and it is why prefix-cache-aware fleet routing (below) beats local scheduling. (3) Shed or downgrade rather than breach: when a burst exceeds capacity, admission control that rejects the marginal request or routes it to a smaller model preserves goodput for everyone else, whereas blindly admitting it converts the whole fleet into badput. The SLA-contract framing of these targets (what you promise, what you measure, what a breach costs) is developed in Chapter 12.4.
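The shed-or-downgrade rule can be sketched as a small admission check. The TTFT estimator here is deliberately crude (queued requests times mean service time, spread across replicas, plus the request's own prefill time), and every name and threshold is a hypothetical placeholder for whatever predictor the fleet actually has.

```python
def admission_decision(queue_depth, replicas, service_ms, prefill_ms,
                       ttft_slo_ms, fallback_available):
    """Return 'admit', 'downgrade', or 'reject' for one arriving request."""
    # Crude estimate: requests ahead of us drain evenly across all replicas.
    est_wait_ms = queue_depth * service_ms / replicas
    est_ttft_ms = est_wait_ms + prefill_ms
    if est_ttft_ms <= ttft_slo_ms:
        return "admit"
    # Admitting would breach the SLO and slow requests already in flight.
    return "downgrade" if fallback_available else "reject"

print(admission_decision(8, 4, 100, 80, 300, fallback_available=True))
print(admission_decision(20, 4, 100, 80, 300, fallback_available=True))
print(admission_decision(20, 4, 100, 80, 300, fallback_available=False))
```

The three calls print `admit`, `downgrade`, and `reject` in that order: the same request is treated differently purely on the basis of whether it can still land inside the SLO.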
Speculative and parallel decoding: when the extra compute pays
Decode is sequential and memory-bound: each token requires a full pass over the weights, and the FLOPs sit idle waiting on HBM. Speculative decoding exploits exactly that idle compute. A cheap draft mechanism (a small draft model, a few extra prediction heads as in Medusa/EAGLE/MTP, or n-gram lookahead) proposes several tokens ahead; the full model then verifies all of them in a single parallel forward pass and accepts the longest correct prefix. When the draft is good, you produce multiple tokens per expensive model pass instead of one — fewer sequential decode steps, lower TPOT, same output distribution (verification is exact, so quality is unchanged). Parallel decoding schemes (lookahead, Jacobi-style) generalize the idea without a separate draft model.
The economics are a clean trade and they have a sharp boundary. Speculation spends FLOPs to buy back memory-bound stalls. It therefore pays precisely when you have spare FLOPs — i.e. at low-to-moderate batch sizes, where decode is firmly memory-bound and the tensor cores are starved. The win is real there: 2-3x effective decode speedup on amenable workloads, at high acceptance rates. But as batch size climbs, the decode step becomes compute-bound (you are now amortizing weight reads across many sequences), the spare FLOPs vanish, and the extra verification work of speculation becomes pure overhead — at high batch, speculation can reduce throughput. The decision is therefore not "turn it on"; it is "turn it on for the low-batch, latency-bound regime, and off (or adaptively) at high batch." Acceptance rate is the other knob: a draft that is too weak gets rejected often and wastes the verification pass, so the draft-model choice is itself an economic decision. For agentic and reasoning workloads that emit long, low-batch decode streams under tight TPOT, speculation is one of the highest-leverage levers available; for a saturated high-throughput batch endpoint it is often net-negative.
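The boundary can be made concrete with the usual expected-acceptance model: if each drafted token is accepted independently with probability alpha and the draft proposes k tokens, one verification pass yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The `verify_cost` parameter is an assumption added for this sketch. It is the cost of verifying k+1 tokens relative to one ordinary decode step: close to 1 when decode is memory-bound, and approaching k+1 when it is compute-bound.

```python
def expected_tokens_per_pass(alpha, k):
    """Mean tokens emitted per target-model pass with k drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speculative_speedup(alpha, k, draft_cost, verify_cost=1.0):
    """Speedup over plain decoding, in units of one target decode step.

    draft_cost: cost of one draft step relative to one target step.
    verify_cost: cost of the verification pass relative to one target step
                 (assumed ~1 when memory-bound, up to k+1 when compute-bound).
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + verify_cost)

# Hypothetical draft: 80% per-token acceptance, 4 drafted tokens, 5% of target cost.
print(round(speculative_speedup(0.8, 4, 0.05), 2))                   # low batch
print(round(speculative_speedup(0.8, 4, 0.05, verify_cost=5.0), 2))  # high batch
print(round(speculative_speedup(0.4, 4, 0.05), 2))                   # weak draft
```

Under these made-up numbers the memory-bound case gives about 2.8x, the compute-bound case falls below 1x (a net slowdown), and a weak draft recovers only about 1.4x even at low batch.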
Deep dive: prefix-cache-aware routing — turning request structure into goodput
A large fraction of real inference traffic shares prefixes: the same long system prompt across every request to an agent, a shared few-shot preamble, a multi-turn conversation where each turn re-sends the growing history, a RAG pipeline where many queries hit the same retrieved context. The KV cache for a shared prefix is identical and can be computed once and reused — prefix caching turns a long, expensive prefill into a cache hit, collapsing TTFT and freeing prefill capacity. SGLang's RadixAttention organizes the cache as a radix tree so any shared prefix among in-flight and recent requests is automatically reused; vLLM and TensorRT-LLM expose automatic prefix caching as well.
The fleet-level decision is routing. A request only benefits from a prefix cache if it lands on the replica that already holds that prefix's KV. Naive round-robin or least-loaded routing scatters requests and squanders the cache. Prefix-cache-aware (KV-aware) routing — the headline feature of disaggregated orchestrators like Dynamo and llm-d — inspects the request's prefix, looks up which replica holds the matching KV, and routes there, maximizing cache hits and minimizing recomputed prefill. The tension is load balance versus cache locality: always routing to the cache-holder can hot-spot a replica, so the router blends a cache-affinity score with a load score. Done well, KV-aware routing is one of the largest single goodput wins available on prefix-heavy traffic, because it converts repeated prefill work — pure badput-adjacent overhead — into near-free cache hits. The memory hierarchy that makes cross-replica and offloaded KV reuse possible (HBM → host DRAM → flash, moved over NIXL) is engineered in Chapter 9.7.
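The blend of cache affinity and load can be sketched as a scoring function. This toy version compares raw token lists and uses a made-up linear score. Production routers typically match hashed KV blocks rather than tokens, and the `load_weight` value is a hypothetical tuning knob, not a setting taken from Dynamo or llm-d.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt, replicas, load_weight=0.5):
    """Pick the replica with the best (cache hit fraction - weighted load).

    replicas: name -> {"cached_prefixes": list of token lists, "load": 0..1}
    """
    def score(name):
        info = replicas[name]
        hit = max((shared_prefix_len(prompt, p) for p in info["cached_prefixes"]),
                  default=0)
        return hit / max(len(prompt), 1) - load_weight * info["load"]
    return max(replicas, key=score)

prompt = list(range(100))
fleet = {
    "replica-a": {"cached_prefixes": [list(range(80))], "load": 0.9},
    "replica-b": {"cached_prefixes": [], "load": 0.2},
}
print(route(prompt, fleet))                   # cache affinity wins
print(route(prompt, fleet, load_weight=2.0))  # load penalty wins
```

With the default weight the request goes to `replica-a` despite its load, because 80% of the prefill is already cached there. With a heavier load penalty it goes to `replica-b`, which is the hot-spot protection described above.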
The serving-engine landscape
By 2026 the serving layer has consolidated around three engines and a layer of disaggregated orchestrators above them. The engine choice is a real fork with real consequences — portability and iteration speed versus peak hardware-locked performance versus programmable structured workloads — and it is reversible (you can re-platform an endpoint) but not cheap, because operational tooling, kernels, and tuning accrete around whatever you pick.
vLLM is the open, vendor-neutral default: it popularized PagedAttention and continuous batching, runs across NVIDIA, AMD, and other backends, has the broadest model coverage and the fastest community cadence, and is the reference implementation most new techniques land in first. TensorRT-LLM is NVIDIA's performance-maximizing engine: compiled, kernel-fused, hardware-locked to NVIDIA GPUs, and typically the throughput/latency leader on that hardware — at the cost of a heavier build/compile workflow and zero portability off NVIDIA. SGLang is the structured-generation specialist: RadixAttention-based prefix caching is first-class, and its programming model shines on agentic, multi-turn, branching, and tool-calling workloads where request structure can be exploited. Above the engines, the disaggregated orchestrators — NVIDIA Dynamo (KV-aware routing, xPyD disaggregation, NIXL transport, multi-engine backends including TensorRT-LLM, vLLM, and SGLang), the community llm-d (Kubernetes-native disaggregated serving on NIXL), and KServe (the broader CNCF model-serving CRD and autoscaling layer) — provide the fleet-level disaggregation, prefix-aware routing, and SLO-based autoscaling that no single-node engine can.
| Layer | Optimizes for | Portability | Best fit | Watch-out |
|---|---|---|---|---|
| vLLM | Breadth, iteration speed, openness | High — multi-vendor backends | Default for most fleets; heterogeneous hardware; newest models | Peak perf trails a compiled engine on a fixed NVIDIA target |
| TensorRT-LLM | Peak throughput/latency on NVIDIA | None — NVIDIA-locked | Max performance on a stable NVIDIA model set | Compile workflow; zero portability; slower to new models |
| SGLang | Structured/agentic generation, prefix reuse | Moderate | Multi-turn, branching, tool-calling, RAG-heavy traffic | Narrower than vLLM for plain single-shot serving |
| Dynamo | Disaggregated fleet: KV-aware routing, xPyD | Multi-engine backends | Large NVIDIA fleets needing P/D disaggregation | Operationally heavy; pays off only at scale |
| llm-d | K8s-native disaggregated serving on NIXL | Open, multi-vendor | Kubernetes shops wanting open disaggregation | Younger ecosystem; community-paced maturity |
| KServe | Model-serving CRD + SLO autoscaling | Open, engine-agnostic | Standardized serving/autoscaling control plane | A control plane, not an engine — pair with one above |
Deep dive: autoscaling against the SLO, end to end
SLO-based autoscaling ties the whole chapter together, because it is where the latency taxonomy, the goodput objective, and the queueing math become an operational control loop. The loop has three correct ingredients and one tempting wrong one. The wrong signal is raw GPU utilization: it conflates the memory-bound decode phase and the compute-bound prefill phase, which load different hardware units, so it correlates poorly with whether you are meeting the SLO. The right signals are: queue depth / pending-request count (a leading indicator: it rises before latency does, giving the autoscaler lead time to spin up replicas, which matters because cold-starting a model onto a GPU takes tens of seconds to minutes); measured TTFT and TPOT at the target percentile (the lagging ground truth: scale when the actual SLO percentile drifts toward its bound); and KV-cache occupancy (when the paged cache is near full, you cannot admit more sequences regardless of compute headroom, so it is a true admission ceiling).
The decisions inside the loop: set the scale-out trigger at the goodput knee (below saturation, with headroom for the queueing tail), not at the throughput ceiling; scale prefill and decode tiers independently in a disaggregated deployment, because a shift in prompt-to-output ratio loads them differently (a sudden influx of long prompts needs more prefill workers, not more decode); and combine autoscaling with admission control so that during the seconds-to-minutes it takes a new replica to become live, the marginal request is shed or downgraded rather than admitted into badput. KServe, llm-d, and Dynamo each expose these signals as first-class autoscaling inputs precisely so the fleet scales against what it promised. The broader scheduling and orchestration plane this rides on is Chapter 10.1; the multi-tenant isolation that keeps one noisy tenant from poisoning another's tail latency is Chapter 10.3.
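A minimal sketch of the control law follows, assuming a proportional rule in the style of the Kubernetes Horizontal Pod Autoscaler (scale by the ratio of observed to target, and take the most demanding signal). All targets and margins below are hypothetical, and GPU utilization is deliberately absent from the inputs.

```python
import math
from dataclasses import dataclass

@dataclass
class Signals:
    queue_depth_per_replica: float  # leading indicator
    ttft_p95_ms: float              # lagging ground truth
    tpot_p95_ms: float              # lagging ground truth
    kv_occupancy: float             # 0..1, admission ceiling

def desired_replicas(current, s, ttft_slo_ms, tpot_slo_ms,
                     queue_target=4.0, kv_ceiling=0.9, slo_margin=0.8):
    """Scale so the worst signal returns to its target.

    slo_margin < 1 places the trigger below the SLO bound, i.e. at the
    goodput knee rather than at the point of breach.
    """
    pressure = max(
        s.queue_depth_per_replica / queue_target,
        s.ttft_p95_ms / (ttft_slo_ms * slo_margin),
        s.tpot_p95_ms / (tpot_slo_ms * slo_margin),
        s.kv_occupancy / kv_ceiling,
    )
    return max(1, math.ceil(current * pressure))

# The queue is building while latency still looks healthy: scale out early.
now = Signals(queue_depth_per_replica=6.0, ttft_p95_ms=200.0,
              tpot_p95_ms=30.0, kv_occupancy=0.6)
print(desired_replicas(10, now, ttft_slo_ms=300.0, tpot_slo_ms=50.0))
```

In the example every latency percentile is still inside its bound, yet the queue-depth signal alone drives the fleet from 10 to 15 replicas, which is the lead time the paragraph above asks for. In a disaggregated deployment the same function would run once per tier with tier-specific signals.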
The economic tie-back: serving efficiency governs $/token
Every lever in this chapter resolves to one number that the business cares about: $/token at the SLO. Serving efficiency is the governor on inference unit economics because the denominator of $/token is goodput (tokens served that met the SLO) and the numerator is the amortized cost of the GPU, power, and fabric underneath. A 30% goodput improvement from disaggregation, prefix-aware routing, and speculative decoding at the right batch is a ~23% cut in $/token at constant SLO (cost divides by 1.3), which at fleet scale is the difference between a profitable inference product and a subsidized one. With inference now roughly two-thirds of AI compute and the dominant share of power draw at large operators, the serving layer is where the power-bound era is won or lost: more goodput per megawatt is more revenue per unit of the scarce input. The market has already priced this in: self-hosted inference fell from ~$10 to ~$2.50 per million tokens in a year, a collapse driven substantially by exactly these serving advances, not by cheaper hardware.
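The arithmetic is worth doing explicitly, because goodput sits in the denominator: a goodput gain of factor g multiplies $/token by 1/g. The $4 per GPU-hour and 1,000 tok/s figures below are placeholders for illustration, not market data.

```python
def cost_per_million_tokens(gpu_hour_usd, goodput_tok_per_s):
    """Amortized cost of one million SLO-meeting tokens on one GPU."""
    tokens_per_hour = goodput_tok_per_s * 3600
    return gpu_hour_usd / tokens_per_hour * 1e6

base = cost_per_million_tokens(gpu_hour_usd=4.0, goodput_tok_per_s=1000)
improved = cost_per_million_tokens(gpu_hour_usd=4.0, goodput_tok_per_s=1300)
print(f"baseline:     ${base:.3f} per million tokens")
print(f"+30% goodput: ${improved:.3f} per million tokens")
print(f"reduction:    {(1 - improved / base) * 100:.1f}%")
```

A 30% goodput gain therefore lowers $/token by about 23% (1 - 1/1.3), and it takes a doubling of goodput to halve it.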
The strategic posture that follows: treat serving efficiency as a first-class capital decision, not a software-team detail. The goodput knee determines how much hardware you must own to serve a given SLO-bound demand; the engine and disaggregation choices determine your peak goodput per GPU; the autoscaler determines how close to the knee you can safely run. Each is a lever on the $/token that ties this chapter to the economics in Chapter 1.8 and Chapter 7.11, and the SLA framing in Chapter 12.4.