
Chapter 10.11

Inference Serving Engineering: SLOs, Batching, Disaggregation & Goodput-Optimal Scheduling

Inference serving is the constrained optimization of serving the most tokens that meet your SLO — not a throughput problem and not a latency problem on its own — and every lever (batching, chunking, disaggregation, speculation, routing) is a different bet on where that goodput-optimal point sits for your model, your traffic, and your fleet.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Which of the two latency targets actually binds your SLO, TTFT (prefill-bound) or TPOT/ITL (decode-bound), because that single ranking decides whether you optimize the prefill path, the decode path, or pay to separate them.
Whether to run aggregated serving with chunked prefill, or pay the disaggregation tax (a KV-cache transfer between prefill and decode tiers) to hit strict TTFT and TPOT simultaneously — the central fork of the modern stack.
What you are actually maximizing: raw tokens/sec is the wrong objective; SLO-attainment-constrained goodput (tokens served that met the SLO) is the number that ties serving to revenue and to $/token.
When speculative or parallel decoding pays for itself — it trades extra FLOPs for fewer sequential decode steps, which only wins when you are memory-bandwidth-bound and have spare compute, i.e. at low-to-moderate batch.
Which serving engine and orchestration layer (vLLM, TensorRT-LLM, SGLang; Dynamo, llm-d, KServe) you standardize on, and whether your autoscaler scales against the SLO or against the easy-but-wrong signal of GPU utilization.

Prefill and decode stress different hardware, so large-scale inference deployments increasingly split them onto separate GPU pools and pay a KV-cache transfer tax to do it.

Every other layer of the stack in this guide exists to put a working accelerator under a token. This chapter is about the last hop — the software that turns a powered, healthy, scheduled GPU into served tokens that a user is willing to pay for. It is where the capital you spent on power, cooling, fabric, and silicon either earns its $/token or leaks away as idle bubbles, missed deadlines, and over-provisioned headroom. And it is the layer where the binding objective is almost universally mis-stated. The naive goal is throughput — tokens per second per GPU. The naive fix when latency complaints arrive is to chase p99 latency. Both are wrong on their own, because they optimize one axis while silently destroying the other. The correct objective is goodput: the rate of tokens served that also met the latency SLO. A server doing 50,000 tok/s where a third of requests blew their deadline has less useful goodput than one doing 38,000 tok/s where all of them landed — and only the goodput number tracks revenue.

We start with the latency taxonomy that defines the SLO (TTFT, TPOT, ITL), because the ranking of those targets deterministically selects every downstream lever. We work through the throughput levers — continuous/in-flight batching and chunked prefill — then the central architectural fork of the 2025-2026 stack: prefill/decode (P/D) disaggregation, and the KV-cache transfer that is its tax. We formalize SLO-constrained goodput, size a bursty always-on fleet with queueing theory, price out speculative and parallel decoding, route requests with prefix-cache awareness, and close on the engine landscape (vLLM, TensorRT-LLM, SGLang) and the disaggregated orchestrators (Dynamo, llm-d, KServe) — and on autoscaling that scales against the SLO rather than against a lie.

The latency taxonomy that governs the SLO

Autoregressive inference is two physically different workloads wearing one API. Prefill ingests the whole prompt in a single forward pass, computing the KV cache for every input token at once — it is compute-bound, a dense matmul that saturates the tensor cores and scales with prompt length. Decode then emits one token at a time, each step reading the entire model's weights and the growing KV cache from HBM to produce a single token — it is memory-bandwidth-bound, and the GPU's FLOPs sit mostly idle waiting on memory. These two phases have opposite bottlenecks, opposite batching behavior, and opposite scaling laws, and almost every serving decision in this chapter flows from refusing to treat them as one thing.

The user-facing SLO is expressed in three numbers, and you must know which one binds before you tune anything:

TTFT (time-to-first-token) — the wait from request arrival to the first streamed token. It is dominated by queueing delay plus the prefill forward pass, so it scales with prompt length and with how loaded the prefill path is. This is the number a user feels as "is it responding?"; for interactive chat it is typically targeted in the low hundreds of milliseconds, and for voice/agentic loops far tighter.
TPOT (time-per-output-token) — the average steady-state decode-step time once generation is underway. It sets the streaming speed and therefore the perceived "reading pace"; a TPOT of ~40-50 ms is roughly human reading speed, and reasoning workloads that emit thousands of decode tokens make this the dominant contributor to total request latency.
ITL (inter-token latency) — the distribution of gaps between consecutive tokens, not just the mean. ITL is where batching interference shows up: a smooth TPOT average can hide ugly ITL stalls when a heavy prefill is admitted into a batch of in-flight decodes and freezes the stream for tens of milliseconds. SLOs that only specify mean TPOT and ignore p95/p99 ITL get gamed by exactly this.
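The three numbers above can all be derived from per-token timestamps. The following is a minimal sketch, assuming you log the arrival time and each token's emission time in milliseconds; the function names and the sample stream are hypothetical, and the percentile uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
import math
from statistics import mean

def latency_metrics(arrival_ms, token_times_ms):
    """Return (TTFT, mean TPOT, ITL gaps) for one request, all in ms."""
    if not token_times_ms:
        raise ValueError("request produced no tokens")
    ttft = token_times_ms[0] - arrival_ms
    # ITL: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    # TPOT: mean decode-step time; the first token is excluded (that is TTFT).
    tpot = mean(itl) if itl else 0.0
    return ttft, tpot, itl

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * q / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical stream: steady 40 ms gaps with one 200 ms stall.
ttft, tpot, itl = latency_metrics(0, [300, 340, 380, 420, 620, 660])
p50_itl = percentile(itl, 50)
p99_itl = percentile(itl, 99)
```

In this made-up stream the median ITL is 40 ms and the mean TPOT is 72 ms, while the p99 ITL is 200 ms: the stall is only visible if the SLO names a tail percentile.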

The decision that cascades from here is the ranking of these targets. A batch-summarization product cares about end-to-end completion and tolerates a slow first token — TTFT is loose. A live voice agent cares about first-token responsiveness above all — TTFT binds hard. A long-form reasoning product lives and dies on TPOT/ITL because it emits a wall of tokens. Whichever binds tells you which physical phase to protect, and that is the fork that selects chunked prefill versus disaggregation below.

The throughput levers: continuous batching and chunked prefill

Static batching — wait for N requests, run them in lockstep, return when the longest finishes — is the obvious approach and it is catastrophic for generation, because requests finish at wildly different lengths and the whole batch is held hostage by its longest sequence while finished slots sit idle. Continuous (in-flight) batching is the fix that defines the modern era of serving: the scheduler operates at the granularity of a single decode step, evicting completed sequences and admitting waiting ones every iteration, so the GPU never waits for a batch to drain. Paired with paged KV-cache management (PagedAttention and its descendants), which allocates KV memory in non-contiguous blocks to eliminate the fragmentation that static pre-allocation causes, continuous batching is what lets a single GPU keep dozens to hundreds of sequences in flight at high utilization. It is table stakes in 2026; an engine that does not do it is not a serious contender.
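A toy step counter makes the gap between the two policies concrete. This is a sketch under strong assumptions: every decode step costs the same regardless of batch contents, prefill is ignored, and the request lengths are invented. It is not how any particular engine's scheduler is written.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest member finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """Per-step scheduling: finished sequences leave, waiting ones are admitted."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the step runs.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # Run one decode step; sequences on their last token drop out.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps

# Hypothetical workload: two long generations mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
```

On this workload the static policy needs 200 steps (each batch waits for its 100-token member) while the continuous policy needs 105, because short requests backfill the slots that open up.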

Continuous batching solves the decode-side bubble but reintroduces the prefill-versus-decode conflict from the callout above: when do you admit a new prompt's prefill into a batch that is busy decoding? The two classical policies are both bad. Prefill-first (admit new prompts eagerly to minimize their TTFT) lets a long prefill freeze every in-flight decode stream — good TTFT, terrible ITL. Decode-first (drain decodes before admitting prefill) protects the streams but lets new requests queue, inflating TTFT. Chunked prefill is the technique that dissolves the dilemma: it splits a long prompt's prefill into bite-sized token chunks and interleaves those chunks with ongoing decode steps inside a fixed per-iteration token budget. No single iteration is dominated by a giant prefill, so decode streams keep flowing (ITL stays smooth) while new prompts make steady prefill progress (TTFT stays bounded). The cost is a modest hit to peak prefill throughput and a tuning knob — the per-step token budget — that trades TTFT against TPOT and must be set against your SLO ranking, not a vendor default.
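The per-iteration token budget can be sketched as a small planning function. This is an illustrative, decode-first variant with hypothetical names (`plan_iteration`, `token_budget`); real engines add KV-memory checks, priorities, and preemption that are omitted here.

```python
def plan_iteration(decoding, prefilling, token_budget):
    """Build one iteration's batch under a fixed token budget.

    decoding:   list of ids for in-flight decode sequences (1 token each).
    prefilling: dict of id -> prompt tokens still to prefill, in arrival order.
    Returns (decode_ids, {id: prefill_chunk_tokens}).
    """
    # Decodes are scheduled first so every stream advances each iteration.
    decode_ids = decoding[:token_budget]
    remaining = token_budget - len(decode_ids)

    # Whatever budget is left is spent on prefill chunks, oldest prompt first.
    chunks = {}
    for req_id, tokens_left in prefilling.items():
        if remaining == 0:
            break
        take = min(tokens_left, remaining)
        chunks[req_id] = take
        remaining -= take
    return decode_ids, chunks

# Hypothetical state: 8 active streams and one 2,000-token prompt waiting.
decode_ids, chunks = plan_iteration(list(range(8)), {"a": 2000}, token_budget=512)
```

With a 512-token budget and 8 decodes in flight, the prompt advances by 504 tokens per iteration and finishes prefill in four iterations. A smaller budget shortens each iteration (better ITL) but stretches the prompt over more iterations (worse TTFT), which is the trade the knob encodes.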

Batching & scheduling policy → which SLO it protects, and what it costs

Policy	What it does	Protects	Sacrifices	Pick it when
Static batching	Fixed batch runs to completion in lockstep	Simplicity	Throughput & latency both — long-tail sequence stalls the batch	Never, for generative serving
Continuous batching	Per-step admit/evict; paged KV cache	Throughput & GPU utilization	Adds scheduler complexity; raw prefill/decode conflict remains	Always — baseline for any 2026 engine
Prefill-first	Eagerly admit new prompts	TTFT of new requests	ITL — long prefill freezes in-flight decode streams	TTFT binds and decode streams are short
Decode-first	Drain decodes before admitting prefill	ITL / TPOT of in-flight streams	TTFT — new requests queue behind decodes	TPOT binds and you can tolerate first-token wait
Chunked prefill	Split prefill into chunks, interleave with decode under a token budget	Both TTFT and TPOT, smoothly	Peak prefill throughput; adds a budget knob to tune	You need both targets met on one tier (the aggregated default)

The policy you pick is downstream of which latency target binds. There is no free option; each protects one axis by spending another. vLLM/SGLang/TensorRT-LLM all implement continuous batching and chunked prefill; the tuning is yours.

The central fork: prefill/decode disaggregation and its tax

Chunked prefill makes aggregated serving — one pool of GPUs doing both phases — hold both TTFT and TPOT reasonably well across a range of traffic. But there is a regime where it runs out of room: when both targets are strict and prompts are long, the interleaving still couples the two phases enough that you cannot independently meet them. The phases also want different hardware: prefill is compute-bound and rewards raw FLOPs and large tensor parallelism, while decode is memory-bound and rewards HBM capacity and bandwidth and high batch concurrency. Forcing both onto one homogeneous tier means you provision for the worse case on both axes and waste one of them.

Prefill/decode (P/D) disaggregation is the architectural answer: physically split the fleet into a prefill tier and a decode tier, each independently sized, parallelized, and scaled. A request runs prefill on the prefill tier, then its KV cache is handed to a decode worker, which streams the output. The wins are real and measured: the prefill tier is never interrupted by decode work, so TTFT drops sharply and stays stable under load; the decode tier runs uninterrupted at the big batch it wants, so TPOT/ITL stay smooth; and you can scale the two tiers independently (the runtime-reconfigurable xPyD pattern — x prefill workers feeding y decode workers — lets you re-balance as your traffic's prompt-to-output ratio shifts). For workloads with strict-on-both-axes SLOs, disaggregation is the better choice and the published goodput gains over chunked-prefill-on-one-tier are substantial.

The catch, and the reason this is a genuine fork rather than a free upgrade, is the disaggregation tax: the KV cache computed during prefill must be transferred to the decode worker before generation can begin, and that transfer sits on the critical path of TTFT. For a long prompt the KV cache is large (its size in bytes is 2 × prompt length × layers × KV heads × head-dim × bytes per element, the factor of 2 covering K and V), so the transport matters enormously. Over fast intra-node NVLink the tax is small; over a node boundary it demands RDMA and a purpose-built transfer library (NVIDIA's NIXL is the de facto transport that llm-d and Dynamo both lean on, moving KV asynchronously across memory and storage tiers); over a slow link it can erase the entire TTFT win. This is why disaggregation only pays at sufficient scale and on a fast fabric, and why the KV-cache hierarchy is its own engineering problem. The transfer mechanics, the memory tiering, and the offload-to-flash economics are the subject of Chapter 9.7; this chapter governs when to pay the tax, not how to plumb it.
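A back-of-envelope calculation shows how the tax scales. The sketch below assumes a Llama-70B-class shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16), and the two link speeds are illustrative effective bandwidths chosen for the example, not measurements of any fabric.

```python
def kv_cache_bytes(prompt_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size for one sequence; the leading 2 covers K and V."""
    return 2 * prompt_tokens * layers * kv_heads * head_dim * bytes_per_elem

def transfer_ms(num_bytes, link_gbytes_per_s):
    """Idealized transfer time: bytes over effective bandwidth, no protocol overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9) * 1e3

# Assumed model shape and a 32K-token prompt.
size = kv_cache_bytes(prompt_tokens=32_768, layers=80, kv_heads=8, head_dim=128)
size_gib = size / 2**30

# Assumed effective bandwidths (GB/s): a fast intra-node link vs. a cross-node link.
fast_ms = transfer_ms(size, link_gbytes_per_s=450)
slow_ms = transfer_ms(size, link_gbytes_per_s=50)
```

Under these assumptions the cache is 10 GiB. It moves in roughly 24 ms on the fast link and roughly 215 ms on the slow one, which is already comparable to a low-hundreds-of-milliseconds TTFT target before any prefill has been counted.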

Aggregated (chunked prefill) vs disaggregated (P/D) serving

Dimension	Aggregated + chunked prefill	Disaggregated (P/D)
TTFT under load	Good; bounded by chunk budget, degrades as load rises	Excellent & stable — prefill tier never blocked by decode
TPOT / ITL	Good; chunking smooths interference	Excellent — decode tier runs uninterrupted at large batch
Hardware efficiency	One homogeneous tier; over-provisions one axis	Each tier sized to its bottleneck (FLOPs vs HBM/BW)
Added cost	A tuning knob (token budget)	KV-cache transfer on TTFT critical path + RDMA/NIXL plumbing
Scaling	Scale one pool; phases coupled	xPyD — scale prefill and decode tiers independently
Best fit	Most workloads; small/medium fleets; mixed traffic	Strict-on-both SLOs, long prompts, large fleet, fast fabric

The decision turns on SLO strictness, prompt length, and fabric. Disaggregation is not strictly better — it adds a KV-transfer hop and operational complexity that only pay above a scale and bandwidth threshold.

SLO-attainment-constrained goodput: the real objective

With the levers on the table, we can state the objective they all serve precisely. Define goodput as the rate of tokens (or completed requests) served that satisfied the SLO — every token from a request that met its TTFT and TPOT targets counts; tokens from a request that breached either are badput, served but worthless or worse, because they consumed capacity while failing the customer. The serving problem is then a constrained optimization: maximize goodput subject to TTFT ≤ target and TPOT ≤ target at the required percentile. This reframing is not academic. It changes what you measure, what you alert on, and how you size the fleet.
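Measured from request logs, the definition reduces to a filtered sum. This is a minimal sketch assuming per-request TTFT and TPOT have already been computed; the `Request` record and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft_ms: float
    tpot_ms: float
    output_tokens: int

def goodput_and_throughput(requests, window_s, ttft_slo_ms, tpot_slo_ms):
    """Return (goodput, throughput) in tokens/s over a measurement window."""
    total = sum(r.output_tokens for r in requests)
    # Only tokens from requests that met BOTH targets count as goodput.
    good = sum(
        r.output_tokens
        for r in requests
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return good / window_s, total / window_s

# Hypothetical one-second window: the third request blew its TTFT target.
window = [
    Request(ttft_ms=180, tpot_ms=42, output_tokens=1000),
    Request(ttft_ms=220, tpot_ms=48, output_tokens=1000),
    Request(ttft_ms=900, tpot_ms=45, output_tokens=1000),
]
goodput, throughput = goodput_and_throughput(
    window, window_s=1.0, ttft_slo_ms=500, tpot_slo_ms=50
)
```

Here throughput is 3,000 tok/s but goodput is 2,000 tok/s; the difference is badput. Alerting on the second number rather than the first is the practical consequence of the reframing.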

The consequence that practitioners get wrong: the throughput-maximizing operating point is past the goodput-maximizing one. As you push batch size and load up, raw tokens/sec keeps climbing toward saturation — but TTFT and TPOT degrade non-linearly as queues build, and at some load level the marginal tokens you add are all breaching the SLO. Goodput peaks before throughput and then falls, because beyond the knee you are converting good tokens into bad ones (every admitted request slows the ones already running). Running a serving fleet at maximum utilization is therefore a goodput error: you are proudly reporting a throughput number while shipping deadline misses. The right operating point sits at the goodput knee, which is deliberately below saturation — headroom that looks like waste on a utilization dashboard and is actually the SLO doing its job. This is the same idea as the reliability-overhead framing in training goodput, applied to latency rather than failures; the economic version of it governs $/token in Chapter 1.8 and Chapter 7.11.
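A deliberately crude model shows why the two optima separate. The sketch assumes decode-step time grows linearly with batch size (a fixed weight-read cost plus a per-sequence cost); the coefficients and the 50 ms TPOT target are invented, and a deterministic model produces a cliff where real traffic produces a softer knee.

```python
def step_ms(batch, base_ms=20.0, per_seq_ms=0.25):
    """Hypothetical decode-step latency: fixed cost plus per-sequence cost."""
    return base_ms + per_seq_ms * batch

def throughput_tok_s(batch):
    """Tokens per second if every sequence emits one token per step."""
    return batch * 1000 / step_ms(batch)

def goodput_tok_s(batch, tpot_slo_ms=50.0):
    """Throughput counts as goodput only while the step time meets the TPOT SLO."""
    return throughput_tok_s(batch) if step_ms(batch) <= tpot_slo_ms else 0.0

batches = range(1, 257)
best_for_throughput = max(batches, key=throughput_tok_s)
best_for_goodput = max(batches, key=goodput_tok_s)
```

In this toy model raw throughput is still rising at batch 256, while goodput peaks at batch 120 (2,400 tok/s at exactly 50 ms per step) and is zero beyond it. The capacity between those two points is the headroom a utilization dashboard would flag as waste.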

Key figures

Figure	What it measures	Year	Source
~40-50 ms	TPOT roughly matching human reading speed (~20-25 tok/s); common interactive decode target	2025	Practitioner consensus; NVIDIA / vLLM serving guides
2/3 (~67%)	Inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80-90% of draw at large operators	2026	Deloitte TMT Predictions 2026; McKinsey
~$2.50/M tok	Market-average self-hosted inference cost, fell ~$10→~$2.50 in a year; worked example ~$1.90/M (8xH100, Llama-70B FP16)	2025	Introl / NVIDIA synthesis (via provenance)
up to ~77%	Goodput gain from hybrid aggregation/disaggregation over SOTA when both TTFT and TPOT bind	2025	TaiChi (arXiv 2508.01989); see also FlowKV/HexGen-2
2:1-3:1	Inference back-end fabric oversubscription (training is 1:1 non-blocking); 2:1 cuts back-end cost ~31% (contested; single-source)	2025	SemiAnalysis AI Neocloud Playbook
~96% / ~90%	Best-in-class vs industry-average goodput (training framing); reliability overhead 6-21% of TCO	2025	SemiAnalysis ClusterMAX 2.0 / CoreWeave
xPyD	Runtime-reconfigurable disaggregation: x prefill workers feeding y decode workers, re-balanced live	2026	NVIDIA Dynamo / TensorRT-LLM disaggregated serving docs

Queueing-theory fleet sizing for bursty, always-on demand

Online inference traffic is bursty and always-on: arrivals are roughly Poisson at short timescales with strong diurnal and event-driven swings, and there is no batch window to hide behind, because a user is waiting. Sizing the fleet is a queueing problem, and the queueing-theory result that matters is the one that surprises people: tail latency explodes as utilization approaches 1, and it does so non-linearly. In a single-server M/M/1 model the expected queueing delay is ρ/(1-ρ) service times, where ρ is utilization, and a pooled M/M/c fleet shows the same 1/(1-ρ) blow-up with a smaller constant. At 80% utilization the M/M/1 queue wait is already about four times the service time, and at 95% it is about nineteen times. Because TTFT contains queue wait, an SLO on TTFT is really a cap on ρ: you cannot run a latency-bound fleet near saturation and meet a tail SLO, no matter how good your engine is.
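For reference, the standard mean-wait results behind that claim can be written out. The notation is an assumption of this sketch: λ is the arrival rate, μ the per-server service rate, c the number of servers, and C(c, λ/μ) the Erlang C probability that an arrival has to queue.

```latex
W_q^{\mathrm{M/M/1}} = \frac{\rho}{1-\rho}\cdot\frac{1}{\mu},
\qquad
W_q^{\mathrm{M/M/c}} = \frac{C(c,\,\lambda/\mu)}{c\mu-\lambda}
                     = \frac{C(c,\,\lambda/\mu)}{c\mu\,(1-\rho)},
\qquad
\rho = \frac{\lambda}{c\mu}
```

Plugging in for a single server gives a mean wait of 4/μ at ρ = 0.8 and 19/μ at ρ = 0.95. Real serving times are not exponential, so these are indicative of the shape of the curve rather than a prediction of measured TTFT.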

The decisions that fall out of this: (1) Size to the goodput knee, not to average load. Provision enough replicas that peak diurnal demand still sits below the utilization ceiling your TTFT SLO implies, which means carrying idle headroom by design. (2) Exploit the c in M/M/c. One big pool of c servers has far better tail behavior than c isolated single-server queues, because a momentary burst can be absorbed by any free replica; this is the queueing-theoretic argument for fleet-wide load balancing over per-replica sharding, and it is why prefix-cache-aware fleet routing (below) beats local scheduling. (3) Shed or downgrade rather than breach. When a burst exceeds capacity, admission control that rejects the marginal request, or routes it to a smaller model, preserves goodput for everyone else, whereas blindly admitting it converts the whole fleet into badput. The SLA-contract framing of these targets (what you promise, what you measure, what a breach costs) is developed in Chapter 12.4.
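The pooling argument in point (2) can be checked numerically with the Erlang C formula. This is a sketch under textbook M/M/c assumptions (Poisson arrivals, exponential service, identical replicas); the rates are invented and the function names are hypothetical.

```python
def erlang_c(servers, offered_load):
    """Probability an arrival must queue in M/M/c (offered_load = lambda / mu)."""
    rho = offered_load / servers
    if rho >= 1:
        return 1.0
    # Erlang B via its numerically stable recurrence, then convert to Erlang C.
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b / (1 - rho * (1 - b))

def mean_wait(servers, arrival_rate, service_rate):
    """Mean queueing delay, in the same time unit as 1 / service_rate."""
    offered = arrival_rate / service_rate
    if offered >= servers:
        return float("inf")
    return erlang_c(servers, offered) / (servers * service_rate - arrival_rate)

# Same 80% utilization both ways; each replica serves 1 request/s.
isolated_wait = mean_wait(servers=1, arrival_rate=0.8, service_rate=1.0)
pooled_wait = mean_wait(servers=8, arrival_rate=6.4, service_rate=1.0)
```

At identical utilization, eight isolated queues each wait about 4 s on average, while one pool of eight waits under 0.3 s. The same function can be swept over `servers` to find the smallest fleet whose wait fits inside a TTFT budget.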

Autoscaling on GPU utilization is the classic goodput bug

The most common production mistake in inference serving is autoscaling against GPU utilization. It fails in both directions. Decode is memory-bandwidth-bound, so a GPU can be at 95% memory-bus saturation and report modest SM/compute utilization: the autoscaler sees "idle," refuses to scale, and TPOT quietly degrades. Conversely, a compute-heavy prefill burst can pin utilization at 100% while the SLO is being met fine, triggering a needless scale-out. Utilization is a proxy, and a broken one, because the two phases load different units. Scale against the thing you actually promised: queue depth, TTFT, and TPOT at the target percentile. Modern SLO-based autoscalers (and the KServe/llm-d/Dynamo control planes) expose these signals precisely so you can stop scaling on a lie. The orchestration plane is covered in Chapter 10.1.

Speculative and parallel decoding: when the extra compute pays

Decode is sequential and memory-bound: each token requires a full pass over the weights, and the FLOPs sit idle waiting on HBM. Speculative decoding exploits exactly that idle compute. A cheap draft mechanism (a small draft model, a few extra prediction heads as in Medusa/EAGLE/MTP, or n-gram lookahead) proposes several tokens ahead; the full model then verifies all of them in a single parallel forward pass and accepts the longest correct prefix. When the draft is good, you produce multiple tokens per expensive model pass instead of one — fewer sequential decode steps, lower TPOT, same output distribution (verification is exact, so quality is unchanged). Parallel decoding schemes (lookahead, Jacobi-style) generalize the idea without a separate draft model.
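The propose-then-verify loop can be sketched for the greedy case, where "correct" means the draft token equals the target model's own greedy choice. The two toy "models" are hypothetical stand-ins. Sampling-based decoding needs a rejection-sampling acceptance rule that this sketch does not implement, and a real engine scores all proposed positions in one batched forward pass rather than the Python loop used here.

```python
def speculative_step(target_next, draft_next, context, k):
    """One speculative step under greedy decoding.

    target_next / draft_next: callables mapping a token list to the next token.
    Returns the tokens emitted this step (between 1 and k + 1 of them).
    """
    # 1. The cheap draft proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position (one parallel pass in practice).
    emitted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Everything was accepted, so the same pass yields one bonus token.
    emitted.append(target_next(ctx))
    return emitted

# Toy models: the target counts upward mod 10; the draft errs after token 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

tokens = speculative_step(target_next, draft_next, [0], k=4)
```

Here the draft gets three tokens right and the fourth wrong, so one target pass emits `[1, 2, 3, 4]`: three accepted tokens plus the target's correction, identical to what plain greedy decoding would have produced in four passes.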

The economics are a clean trade and they have a sharp boundary. Speculation spends FLOPs to buy back memory-bound stalls. It therefore pays precisely when you have spare FLOPs — i.e. at low-to-moderate batch sizes, where decode is firmly memory-bound and the tensor cores are starved. The win is real there: 2-3x effective decode speedup on amenable workloads, at high acceptance rates. But as batch size climbs, the decode step becomes compute-bound (you are now amortizing weight reads across many sequences), the spare FLOPs vanish, and the extra verification work of speculation becomes pure overhead — at high batch, speculation can reduce throughput. The decision is therefore not "turn it on"; it is "turn it on for the low-batch, latency-bound regime, and off (or adaptively) at high batch." Acceptance rate is the other knob: a draft that is too weak gets rejected often and wastes the verification pass, so the draft-model choice is itself an economic decision. For agentic and reasoning workloads that emit long, low-batch decode streams under tight TPOT, speculation is one of the highest-leverage levers available; for a saturated high-throughput batch endpoint it is often net-negative.
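The boundary can be made quantitative with a simple cost model. The sketch assumes each drafted token is accepted independently with a fixed probability, which gives the usual geometric-series expression for tokens per verification pass. The `verify_cost` parameter is a hypothetical knob for how much more a verification pass costs than a normal decode step: about 1 when memory-bound, larger once the batch is compute-bound.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Mean tokens emitted per target pass under independent acceptance."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

def speedup(accept_rate, draft_len, draft_cost, verify_cost=1.0):
    """Speedup over plain decoding.

    Costs are relative to one normal decode step of the target model:
    draft_cost per drafted token, verify_cost for the verification pass.
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    return tokens / (draft_len * draft_cost + verify_cost)

# Assumed numbers: 80% acceptance, 4 drafted tokens, draft at 5% of a target step.
memory_bound = speedup(0.8, 4, draft_cost=0.05)                    # low batch
compute_bound = speedup(0.8, 4, draft_cost=0.05, verify_cost=4.0)  # high batch
```

With these assumed inputs the low-batch case lands at roughly 2.8x, while the compute-bound case drops below 1x, meaning speculation is slowing the endpoint down. Lowering `accept_rate` shows the second knob: a weak draft erodes the gain even when FLOPs are free.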

Deep dive: prefix-cache-aware routing — turning request structure into goodput

A large fraction of real inference traffic shares prefixes: the same long system prompt across every request to an agent, a shared few-shot preamble, a multi-turn conversation where each turn re-sends the growing history, a RAG pipeline where many queries hit the same retrieved context. The KV cache for a shared prefix is identical and can be computed once and reused — prefix caching turns a long, expensive prefill into a cache hit, collapsing TTFT and freeing prefill capacity. SGLang's RadixAttention organizes the cache as a radix tree so any shared prefix among in-flight and recent requests is automatically reused; vLLM and TensorRT-LLM expose automatic prefix caching as well.
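The reuse idea can be illustrated with a plain token-level trie. This is a simplified stand-in, not RadixAttention or any engine's implementation: production caches work on KV blocks, compress paths, and evict under memory pressure, all of which are omitted. The class and method names are hypothetical.

```python
class PrefixCache:
    """Token-level trie; every node marks a prefix whose KV is assumed cached."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that KV for this full token sequence is now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

cache = PrefixCache()
system_prompt = [11, 12, 13, 14]           # hypothetical shared preamble
cache.insert(system_prompt + [21, 22])     # first user's request

new_request = system_prompt + [31, 32, 33]
hit = cache.match_len(new_request)
to_prefill = len(new_request) - hit
```

The second request shares the four-token preamble, so only its three new tokens need prefill. With a multi-thousand-token system prompt the same ratio is what collapses TTFT.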

The fleet-level decision is routing. A request only benefits from a prefix cache if it lands on the replica that already holds that prefix's KV. Naive round-robin or least-loaded routing scatters requests and squanders the cache. Prefix-cache-aware (KV-aware) routing — the headline feature of disaggregated orchestrators like Dynamo and llm-d — inspects the request's prefix, looks up which replica holds the matching KV, and routes there, maximizing cache hits and minimizing recomputed prefill. The tension is load balance versus cache locality: always routing to the cache-holder can hot-spot a replica, so the router blends a cache-affinity score with a load score. Done well, KV-aware routing is one of the largest single goodput wins available on prefix-heavy traffic, because it converts repeated prefill work — pure badput-adjacent overhead — into near-free cache hits. The memory hierarchy that makes cross-replica and offloaded KV reuse possible (HBM → host DRAM → flash, moved over NIXL) is engineered in Chapter 9.7.
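The affinity-versus-load blend can be sketched as a single scoring function. The linear form and the `load_weight` parameter are assumptions made for illustration; they are not the scoring rule of Dynamo, llm-d, or any other router, and the replica figures are invented.

```python
def pick_replica(prompt_len, replicas, load_weight=0.5):
    """Choose a replica by blending cache affinity with load.

    replicas: dict of name -> (cached_prefix_tokens, load between 0 and 1).
    """
    def score(item):
        _, (cached_tokens, load) = item
        hit_ratio = cached_tokens / prompt_len if prompt_len else 0.0
        # Reward prefill work avoided, penalize sending traffic to a busy replica.
        return hit_ratio - load_weight * load

    return max(replicas.items(), key=score)[0]

# Hypothetical fleet state for a 1,000-token prompt.
fleet = {
    "a": (900, 0.95),  # best cache hit, but nearly saturated
    "b": (0, 0.20),    # idle, no cached prefix
    "c": (600, 0.30),  # decent hit, lightly loaded
}
blended = pick_replica(1000, fleet)
cache_only = pick_replica(1000, fleet, load_weight=0.0)
```

With the blend, the request goes to `c`; with load ignored it goes to `a` and deepens the hot spot. Tuning the weight against measured TTFT, rather than picking it by intuition, is the practical step.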

The serving-engine landscape

By 2026 the serving layer has consolidated around three engines and a layer of disaggregated orchestrators above them. The engine choice is a real fork with real consequences — portability and iteration speed versus peak hardware-locked performance versus programmable structured workloads — and it is reversible (you can re-platform an endpoint) but not cheap, because operational tooling, kernels, and tuning accrete around whatever you pick.

vLLM is the open, vendor-neutral default: it popularized PagedAttention and continuous batching, runs across NVIDIA, AMD, and other backends, has the broadest model coverage and the fastest community cadence, and is the reference implementation most new techniques land in first. TensorRT-LLM is NVIDIA's performance-maximizing engine: compiled, kernel-fused, hardware-locked to NVIDIA GPUs, and typically the throughput/latency leader on that hardware — at the cost of a heavier build/compile workflow and zero portability off NVIDIA. SGLang is the structured-generation specialist: RadixAttention-based prefix caching is first-class, and its programming model shines on agentic, multi-turn, branching, and tool-calling workloads where request structure can be exploited. Above the engines, the disaggregated orchestrators — NVIDIA Dynamo (KV-aware routing, xPyD disaggregation, NIXL transport, multi-engine backends including TensorRT-LLM, vLLM, and SGLang), the community llm-d (Kubernetes-native disaggregated serving on NIXL), and KServe (the broader CNCF model-serving CRD and autoscaling layer) — provide the fleet-level disaggregation, prefix-aware routing, and SLO-based autoscaling that no single-node engine can.

Serving engine & orchestrator selection

Layer	Optimizes for	Portability	Best fit	Watch-out
vLLM	Breadth, iteration speed, openness	High — multi-vendor backends	Default for most fleets; heterogeneous hardware; newest models	Peak perf trails a compiled engine on a fixed NVIDIA target
TensorRT-LLM	Peak throughput/latency on NVIDIA	None — NVIDIA-locked	Max performance on a stable NVIDIA model set	Compile workflow; zero portability; slower to new models
SGLang	Structured/agentic generation, prefix reuse	Moderate	Multi-turn, branching, tool-calling, RAG-heavy traffic	Narrower than vLLM for plain single-shot serving
Dynamo	Disaggregated fleet: KV-aware routing, xPyD	Multi-engine backends	Large NVIDIA fleets needing P/D disaggregation	Operationally heavy; pays off only at scale
llm-d	K8s-native disaggregated serving on NIXL	Open, multi-vendor	Kubernetes shops wanting open disaggregation	Younger ecosystem; community-paced maturity
KServe	Model-serving CRD + SLO autoscaling	Open, engine-agnostic	Standardized serving/autoscaling control plane	A control plane, not an engine — pair with one above

Vendor-neutral snapshot, current to 2026. The engines are not mutually exclusive — Dynamo and KServe orchestrate multiple engines as backends — so the practical choice is often 'which engine per workload, under which orchestrator.'

Deep dive: autoscaling against the SLO, end to end

SLO-based autoscaling ties the whole chapter together, because it is where the latency taxonomy, the goodput objective, and the queueing math become an operational control loop. The loop has three correct ingredients and one tempting wrong one. The wrong signal is raw GPU utilization, for the reason in the warning callout: it conflates the memory-bound decode phase and the compute-bound prefill phase, which load different units, so it is a poor predictor of whether you are meeting the SLO. The right signals are: queue depth / pending-request count (a leading indicator that rises before latency does, giving the autoscaler lead time to spin up replicas, which matters because cold-starting a model onto a GPU takes tens of seconds to minutes); measured TTFT and TPOT at the target percentile (the lagging ground truth; scale when the actual SLO percentile drifts toward its bound); and KV-cache occupancy (when the paged cache is near full, you cannot admit more sequences regardless of compute headroom, so it is a true admission ceiling).

The decisions inside the loop: set the scale-out trigger at the goodput knee (below saturation, with headroom for the queueing tail), not at the throughput ceiling; scale prefill and decode tiers independently in a disaggregated deployment, because a shift in prompt-to-output ratio loads them differently (a sudden influx of long prompts needs more prefill workers, not more decode); and combine autoscaling with admission control so that during the seconds-to-minutes it takes a new replica to become live, the marginal request is shed or downgraded rather than admitted into badput. KServe, llm-d, and Dynamo each expose these signals as first-class autoscaling inputs precisely so the fleet scales against what it promised. The broader scheduling and orchestration plane this rides on is Chapter 10.1; the multi-tenant isolation that keeps one noisy tenant from poisoning another's tail latency is Chapter 10.3.
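A decision function for such a loop might look like the following. It is a sketch only: the thresholds, the 80% headroom factor, and the mapping of signals to tiers are assumptions for illustration, and none of the names correspond to a KServe, llm-d, or Dynamo API.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    queue_depth: int      # pending requests (leading indicator)
    p99_ttft_ms: float    # measured, lagging
    p99_tpot_ms: float    # measured, lagging
    kv_occupancy: float   # fraction of paged KV cache in use, 0..1

def tiers_to_scale_out(s, ttft_slo_ms, tpot_slo_ms,
                       queue_limit=32, headroom=0.8, kv_ceiling=0.9):
    """Return the set of tiers that should add replicas.

    Triggers fire at `headroom` times the SLO, i.e. before the bound is reached.
    GPU utilization is deliberately not an input.
    """
    tiers = set()
    # Queue growth and TTFT drift point at the prefill path.
    if s.queue_depth > queue_limit or s.p99_ttft_ms > headroom * ttft_slo_ms:
        tiers.add("prefill")
    # TPOT drift and a nearly full KV cache point at the decode path.
    if s.p99_tpot_ms > headroom * tpot_slo_ms or s.kv_occupancy > kv_ceiling:
        tiers.add("decode")
    return tiers

# Hypothetical snapshot: a burst of long prompts is queueing up.
burst = Signals(queue_depth=40, p99_ttft_ms=250, p99_tpot_ms=30, kv_occupancy=0.5)
action = tiers_to_scale_out(burst, ttft_slo_ms=500, tpot_slo_ms=50)
```

In this snapshot only the prefill tier scales, because the queue is the signal that moved. In an aggregated deployment the same checks would collapse into a single scale-out decision, and admission control would cover the interval before the new replica is live.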

The economic tie-back: serving efficiency governs $/token

Every lever in this chapter resolves to one number that the business cares about: $/token at the SLO. Serving efficiency is the governor on inference unit economics because the denominator of $/token is goodput (tokens served that met the SLO) and the numerator is the amortized cost of the GPU, power, and fabric underneath. Because $/token is inversely proportional to goodput, a 30% goodput improvement from disaggregation, prefix-aware routing, and speculative decoding at the right batch is a ~23% cut in $/token at constant SLO (1 - 1/1.3), which at fleet scale is the difference between a profitable inference product and a subsidized one. With inference now roughly two-thirds of AI compute and the dominant share of power draw at large operators, the serving layer is where the power-bound era is won or lost: more goodput per megawatt is more revenue per unit of the scarce input. The market has already priced this in: self-hosted inference fell from ~$10 to ~$2.50 per million tokens in a year, a collapse driven substantially by exactly these serving advances, not by cheaper hardware.
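The relationship between goodput and unit cost is simple enough to compute directly. The hourly cost and goodput figures below are placeholders chosen for round arithmetic, not quotes or benchmarks, and the function name is hypothetical.

```python
def dollars_per_million_tokens(cost_per_hour, goodput_tok_s):
    """Amortized serving cost per million SLO-compliant tokens."""
    tokens_per_hour = goodput_tok_s * 3600
    return cost_per_hour / tokens_per_hour * 1e6

def cost_reduction(goodput_gain):
    """Fractional $/token cut from a fractional goodput gain at constant cost."""
    return 1 - 1 / (1 + goodput_gain)

# Placeholder inputs: a $20/hour node delivering 3,000 tok/s of goodput.
baseline = dollars_per_million_tokens(20.0, 3000)
improved = dollars_per_million_tokens(20.0, 3000 * 1.25)
```

Because cost per token varies with the reciprocal of goodput, a 25% goodput gain cuts $/token by 20%, and it takes a doubling of goodput to halve it. Feeding measured goodput, rather than raw throughput, into the denominator is what keeps the figure honest.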

The strategic posture that follows: treat serving efficiency as a first-class capital decision, not a software-team detail. The goodput knee determines how much hardware you must own to serve a given SLO-bound demand; the engine and disaggregation choices determine your peak goodput per GPU; the autoscaler determines how close to the knee you can safely run. Each is a lever on the $/token that ties this chapter to the economics in Chapter 1.8 and Chapter 7.11, and the SLA framing in Chapter 12.4.

The KV-cache hierarchy that disaggregation transfers across — and the offload-to-flash tiering that makes cross-replica prefix reuse economical — is engineered in Chapter 9.7. The back-end fabric oversubscription that inference tolerates (and the KV-transfer bandwidth disaggregation demands) is Chapter 8.5. The orchestration and scheduling plane this serving layer rides on is Chapter 10.1; topology-aware placement is Chapter 10.2; the multi-tenant isolation that protects tail latency across customers is Chapter 10.3. The goodput-versus-availability rethink for training is Chapter 12.2, and the SLA contract framing of the latency targets here is Chapter 12.4. The $/token economics that serving efficiency governs live in Chapter 1.8 and Chapter 7.11; the online-inference archetype that set these requirements upstream is Chapter 1.3.