Chapter 11.6
Multi-Tenant & Workload Isolation Security
Every shared accelerator is a decision about how much of someone else's blast radius you are willing to inherit — partitioning a GPU is a utilization win and a confidentiality liability at the same time, and the only honest question is which boundary you trust to hold.
What you'll decide here
- Where on the multi-tenancy spectrum each pod sits — bare-metal-per-tenant, VM-per-tenant, MIG-partitioned, vGPU-virtualized, or kernel-level time-sliced — and therefore which isolation failures you have signed up to own.
- Whether MIG, vGPU, or time-slicing is being treated as a performance-sharing convenience or as a security boundary — because only one of the three is a hardware-enforced boundary, and conflating them is how breaches happen.
- Your memory-hygiene policy on every reassignment: who scrubs GPU framebuffer, HBM, and local/shared memory between tenants, and whether that scrub is verified or assumed.
- Whether confidential multi-tenancy (per-tenant TEE + attestation) is required by the data class, or whether strong VM isolation plus disciplined hygiene is sufficient — the former costs goodput and operational complexity, the latter costs you nothing until the day it does.
- The control-plane isolation posture: the container runtime, the GPU operator, and the orchestrator are the real multi-tenant attack surface, and a three-line container escape there defeats any in-GPU partitioning you bought.
Multi-tenancy is the business model of the entire neocloud and most enterprise platform teams: buy one expensive accelerator, rent fractions of it to many workloads, and amortize a $30k+ GPU across tenants who would each waste 70% of it alone. The economics are irresistible and the security consequences are routinely underpriced. The instant two tenants share a physical GPU, a memory controller, an L2 cache, an NVLink domain, or merely a host kernel and a container runtime, you have created a channel — and the only question left is whether it is a strong boundary, a soft convenience dressed up as a boundary, or an outright leak. This chapter is about reading that distinction correctly before a tenant's KV-cache shows up in a neighbor's framebuffer.
This chapter lays out the multi-tenancy spectrum from bare-metal to time-slicing and the documented isolation failures at each rung; we resolve the question every platform team mishandles — is MIG / vGPU / time-slicing a security boundary? — with the hardware reality rather than the marketing; we make memory hygiene on reassignment an explicit, owned policy instead of an assumption; and we close on the isolation-versus-utilization economics that govern the whole thing, with confidential multi-tenancy treated as the strongest and most expensive rung. The canonical TEE and attestation machinery lives in Chapter 11.5; here we decide when you need it.
The multi-tenancy spectrum
Tenant isolation is not binary — it is a spectrum, and each step down trades a stronger boundary for higher utilization. Read it top to bottom as decreasing isolation strength and increasing density: the moment you stop giving each tenant a whole physical GPU, you start sharing silicon that was never designed as a confidentiality boundary.
Bare-metal-per-tenant is the strongest posture: a tenant gets whole nodes, whole GPUs, and ideally a whole rack or NVLink domain, with no hypervisor and no co-tenant on the same silicon. The blast radius is the tenant's own. This is what serious training customers and sovereign deployments demand, and it is what the strongest neocloud SLAs implicitly promise. The cost is utilization: a tenant who under-uses the hardware strands it, and you cannot backfill without reintroducing a co-tenant.
VM-per-tenant (whole-GPU passthrough) keeps the GPU undivided but interposes a hypervisor, giving each tenant a strong CPU-side boundary (SR-IOV / PCIe passthrough) while the GPU itself is still owned end-to-end by one VM at a time. The boundary you now trust is the hypervisor and the IOMMU, not the GPU. This is the workhorse of enterprise multi-tenant clouds and a genuinely defensible boundary — the residual risk is hypervisor escapes and the host-side GPU management stack.
MIG (Multi-Instance GPU) partitions a single physical GPU into up to seven hardware-isolated instances, each with dedicated streaming multiprocessors, dedicated L2 cache slices, dedicated memory controllers, and a dedicated slice of HBM, plus dedicated copy and decode engines. This is the only fractional-sharing mode with a hardware-enforced partition, which is why it is the one rung of fractional sharing that can credibly be called a security boundary. The cost is rigidity (fixed geometries, no oversubscription) and the residual uncore side-channels discussed below.
vGPU (mediated virtualization) time-multiplexes a GPU across guest VMs via a host-resident vGPU Manager that schedules and mediates access. The VM boundary is real, but the GPU is now shared through a software mediator running as privileged host code, and that mediator was the subject of several 2025 multi-tenant CVEs (covered in the next section). vGPU can be layered on MIG (MIG-backed vGPU) to combine hardware partitioning with the VM boundary, which is the strongest practical fractional mode.
Time-slicing (kernel-level sharing / MPS) is the bottom rung: multiple tenant processes share one GPU context-switched in time, with no memory isolation and no fault isolation. A fault, a hang, or a noisy kernel from one tenant degrades or crashes the others, and there is no architectural barrier between their allocations. Time-slicing is a utilization tool for trusted, co-operative workloads — it is not, and must never be sold as, a tenant boundary.
| Isolation mode | Boundary trusted | Memory isolation | Fault isolation | Security boundary? | Utilization | Documented failure class |
|---|---|---|---|---|---|---|
| Bare-metal-per-tenant | Physical separation | Total (no co-tenant) | Total | Yes — strongest | Lowest (strands idle) | Decommission / reassignment remanence |
| VM-per-tenant (passthrough) | Hypervisor + IOMMU | Total (whole GPU per VM) | Total | Yes | Low-moderate | Hypervisor / SR-IOV escapes |
| MIG (hardware partition) | GPU hardware partition | Hardware-enforced slice | Per-instance | Yes — strongest fractional | High (≤7 instances) | Uncore side-channels; fixed geometry |
| vGPU (mediated) | VM + host vGPU Manager | Per-VM via mediator | Per-VM | Conditional — see CVEs | High (oversubscribable) | vGPU Manager CVEs; cross-VM leakage |
| Time-slicing / MPS | None (shared context) | None | None | No | Highest (oversubscribed) | Cross-process memory leak; noisy-neighbor DoS |
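One way to make the table operational is to have the scheduler consult it before co-locating workloads. The following is a minimal sketch under stated assumptions: the names `IsolationMode`, `SECURITY_BOUNDARY`, and `may_colocate` are hypothetical, and the "Conditional" entry for vGPU is treated as a refusal unless the operator explicitly opts in after reviewing the mediator's patch status.

```python
from enum import Enum


class IsolationMode(Enum):
    BARE_METAL = "bare-metal-per-tenant"
    VM_PASSTHROUGH = "vm-per-tenant"
    MIG = "mig"
    VGPU = "vgpu"
    TIME_SLICING = "time-slicing"


# Modes in which one physical GPU can host more than one workload at a time.
FRACTIONAL_MODES = {IsolationMode.MIG, IsolationMode.VGPU, IsolationMode.TIME_SLICING}

# "Security boundary?" column of the table:
# True = yes, False = no, None = conditional.
SECURITY_BOUNDARY = {
    IsolationMode.BARE_METAL: True,
    IsolationMode.VM_PASSTHROUGH: True,
    IsolationMode.MIG: True,
    IsolationMode.VGPU: None,
    IsolationMode.TIME_SLICING: False,
}


def may_colocate(mode, tenant_ids, accept_conditional=False):
    """Return True if these workloads may share one physical GPU in this mode."""
    if len(set(tenant_ids)) <= 1:
        return True  # a single tenant sharing with itself crosses no trust boundary
    if mode not in FRACTIONAL_MODES:
        return False  # whole-GPU modes never host two tenants at once
    boundary = SECURITY_BOUNDARY[mode]
    if boundary is None:
        return accept_conditional  # conditional modes need an explicit opt-in
    return boundary
```

The design choice worth noting is the default: a conditional boundary is refused until someone signs for it, which keeps the decision owned rather than assumed.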
Documented isolation failures: the empirical record
This is not a theoretical threat model. The 2024–2025 disclosure record contains real, assigned CVEs at every layer of the stack — and reading them is the fastest way to calibrate which boundaries hold. The pattern is instructive: the catastrophic breaks are almost never in the GPU partitioning itself, but in the control plane around it.
The control-plane escape (the real catastrophe). NVIDIAScape (CVE-2025-23266, CVSS 9.0, disclosed by Wiz in July 2025) is a container escape in the NVIDIA Container Toolkit that fits in a three-line Dockerfile: a malicious image sets LD_PRELOAD so that the toolkit's privileged OCI createContainer hook loads an attacker-supplied library, which yields root on the host. From there the attacker can read, steal, or tamper with the models and data of every other tenant on the shared machine. It affected Container Toolkit versions up to and including 1.17.7 and GPU Operator versions up to and including 25.3.0. Multi-tenant security is a full-stack property, and the runtime is the weakest, most-exposed link.
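The version bounds above lend themselves to a fleet audit. This is a minimal sketch, assuming the bounds are inclusive, that versions are plain dotted integers, and that the inventory arrives as `(node, component, version)` tuples. The component labels and the helper names `parse_version`, `is_affected`, and `audit` are hypothetical, not part of any NVIDIA tool.

```python
def parse_version(text):
    """Turn '1.17.7' or 'v25.3.0' into a tuple of ints for ordered comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Last affected releases as stated in the text (assumed inclusive).
# The dictionary keys are illustrative labels, not official package names.
LAST_AFFECTED = {
    "container-toolkit": parse_version("1.17.7"),
    "gpu-operator": parse_version("25.3.0"),
}


def is_affected(component, version):
    """True if the version is at or below the last affected release."""
    return parse_version(version) <= LAST_AFFECTED[component]


def audit(inventory):
    """Return the (node, component, version) rows that still need patching."""
    return [row for row in inventory if is_affected(row[1], row[2])]


if __name__ == "__main__":
    fleet = [
        ("node-a", "container-toolkit", "1.17.7"),
        ("node-b", "container-toolkit", "1.17.8"),
        ("node-c", "gpu-operator", "v25.3.0"),
    ]
    for node, component, version in audit(fleet):
        print(f"{node}: {component} {version} is in the affected range")
```

Versions carrying build suffixes would need a more tolerant parser than this one.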
The mediator leak (vGPU). NVIDIA's July 2025 vGPU bulletin disclosed cross-VM issues in the vGPU Manager: CVE-2025-23290 (cross-VM information disclosure — a guest reading global GPU metrics influenced by neighboring VMs, the first publicly acknowledged leakage of co-tenant activity through the mediator) and CVE-2025-23285 (a guest consuming global resources to deny service to neighbors). The same bulletin patched stack buffer overflows in the vGPU Manager (CVE-2025-23283/23284, CVSS 7.8) enabling guest-to-host code execution. The mediated-virtualization boundary is real but it is privileged software, and privileged software has bugs.
The memory-remanence leak (any shared GPU). LeftoverLocals (CVE-2023-4969, disclosed by Trail of Bits in 2024) recovered another process's data from un-scrubbed GPU local memory across process and container boundaries on Apple, Qualcomm, AMD, and Imagination GPUs, enough to reconstruct an LLM's responses (≈181 MB recoverable per query against a 7B model on llama.cpp). NVIDIA and Arm were not impacted, but the structural lesson stands: GPU memory is not scrubbed for you by default, and the next un-zeroed region is a disclosure waiting to be read.
The uncore side-channels (even MIG). Academic work ("Spy in the GPU-box" and the uncore side-channel literature, arXiv 2203.15981) has demonstrated covert and side channels that bypass MPS and even MIG partitioning by observing contention on shared uncore resources — the very thing MIG was supposed to isolate. MIG's partition is genuinely hardware-enforced for memory and compute, but it is not a proof of side-channel resistance. For data where timing leakage is in-scope, MIG is necessary but not sufficient; confidential computing is the answer.
Memory hygiene on reassignment: the owned policy
The most overlooked multi-tenant control is also the most routine: scrubbing memory when a resource changes hands. Every time a GPU, a MIG instance, a vGPU slice, or even a container's pinned allocation is freed by one tenant and handed to the next, the prior tenant's HBM contents, framebuffer, L2 lines, and local/shared memory may persist as remanence. LeftoverLocals is exactly this failure at the local-memory layer. The decision you own is not whether to scrub — it is who scrubs, at which layer, and whether the scrub is verified or merely assumed.
The fork: driver/firmware-level scrub on free (the operator trusts the GPU stack to zero memory on instance teardown or VM release) versus orchestrator-enforced scrub (the platform explicitly zeros and verifies before re-allocating). The former is cheaper and faster but inherits whatever the vendor's default does — and defaults have historically been incomplete (LeftoverLocals shipped for years). The latter costs reassignment latency — a full HBM zero on a 192 GB B200 is non-trivial wall-clock time that shows up as scheduler churn and lower effective utilization — but it is the only posture you can attest to a tenant. For confidential workloads the question is moot: TEE teardown cryptographically erases keys so ciphertext remanence is meaningless without the key, which is one of the strongest arguments for confidential multi-tenancy (Chapter 11.5).
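The difference between a verified and an assumed scrub can be shown with a small simulation. This is a sketch only: a `bytearray` stands in for device memory, and `Slice`, `scrub_and_verify`, and `assign` are hypothetical names rather than any driver or orchestrator API. The point is the gate, in which a released slice cannot be handed to the next tenant until a read-back has confirmed the zero fill.

```python
class Slice:
    """Stand-in for a reassignable GPU memory slice (simulated in host memory)."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.owner = None
        self.scrub_verified = True  # a fresh slice holds nothing to leak

    def release(self):
        """Tenant gives the slice back; its contents are now untrusted remanence."""
        self.owner = None
        self.scrub_verified = False


def scrub_and_verify(sl):
    """Zero the slice, then read it back before marking it clean."""
    sl.memory[:] = bytes(len(sl.memory))    # zero fill
    sl.scrub_verified = not any(sl.memory)  # read-back check
    return sl.scrub_verified


def assign(sl, tenant):
    """Hand the slice to a tenant only if it is free and verified clean."""
    if sl.owner is not None:
        raise RuntimeError("slice is still owned")
    if not sl.scrub_verified:
        raise RuntimeError("refusing to assign: scrub not verified")
    sl.owner = tenant
```

In this model the driver-trusting posture corresponds to setting `scrub_verified` on release without the read-back, which is exactly the assumption the orchestrator-enforced posture refuses to make.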
Deep dive: the three layers of GPU memory remanence and who scrubs each
"Scrub the GPU memory" is too coarse to be actionable, because GPU memory is at least three distinct regions with three distinct owners, and a policy that covers one and not the others leaks through the gap.
Global / HBM (framebuffer and device allocations). The largest region and the one tenants think of as "GPU memory." On MIG teardown or VM release, the driver/firmware is responsible for zeroing the slice before reallocation. This is the layer where you most want orchestrator-level verification, because the consequence of a missed scrub is the previous tenant's weights, activations, or KV-cache becoming readable. A full-capacity zero is bandwidth-bound: the write pass itself is sub-second at multi-TB/s HBM bandwidth even on the largest parts, but instance teardown, reset, and a verification read-back add wall-clock time on every reassignment, which is why operators are tempted to skip or defer it. Resist the temptation, or make the deferral explicit and bounded.
Local / shared memory (per-SM scratchpad). The small, fast, software-managed region that LeftoverLocals exploited. It is not automatically cleared between kernel launches on affected stacks, so a reader kernel could dump whatever a prior victim kernel left behind. Mitigation is a vendor driver fix plus, defensively, kernels that clear their own local memory on exit — a measurable but small overhead that hardened inference runtimes now adopt.
Caches and registers (L2, register file). The hardest to reason about and the realm of side-channels rather than direct remanence: MIG dedicates L2 slices per instance, but contention on shared uncore paths is observable, which is the uncore side-channel result. You do not "scrub" this layer; you either accept the side-channel risk (acceptable for most commercial multi-tenancy) or you move to confidential computing where the threat model explicitly includes a malicious co-tenant. Naming which of those two you have chosen is the deliverable.
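For the global / HBM layer, the cost of a scrub can be bounded from below with capacity divided by bandwidth. The sketch below assumes decimal gigabytes and GB/s; the 192 GB capacity comes from the text, while the bandwidth figure is a placeholder assumption to be replaced with a measured sustained rate. The function names are hypothetical.

```python
def scrub_floor_seconds(capacity_gb, write_gb_per_s, verify=False, read_gb_per_s=None):
    """Lower bound on scrub time: one full write pass, plus an optional read-back."""
    if capacity_gb < 0 or write_gb_per_s <= 0:
        raise ValueError("capacity must be >= 0 and bandwidth must be > 0")
    seconds = capacity_gb / write_gb_per_s
    if verify:
        read_rate = write_gb_per_s if read_gb_per_s is None else read_gb_per_s
        if read_rate <= 0:
            raise ValueError("read bandwidth must be > 0")
        seconds += capacity_gb / read_rate
    return seconds


def reassignment_overhead(scrub_seconds, fixed_teardown_seconds, mean_tenancy_seconds):
    """Fraction of wall-clock time lost to teardown plus scrub per tenancy."""
    lost = scrub_seconds + fixed_teardown_seconds
    return lost / (lost + mean_tenancy_seconds)


if __name__ == "__main__":
    CAPACITY_GB = 192            # from the text
    ASSUMED_WRITE_GB_PER_S = 8000  # placeholder assumption, not a measurement
    floor = scrub_floor_seconds(CAPACITY_GB, ASSUMED_WRITE_GB_PER_S, verify=True)
    print(f"scrub floor with read-back: {floor:.3f} s")
```

The result is a floor, not a forecast: teardown, reset, and scheduler latency sit on top of it, and the utilization cost scales with how often slices change hands rather than with the scrub alone.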
Confidential multi-tenancy: the strongest, most expensive rung
When the data class genuinely cannot tolerate exposure to a co-tenant or to the operator (regulated health and financial data, sovereign workloads, frontier weights rented on infrastructure you do not own), the only sufficient answer is confidential computing: a per-tenant Trusted Execution Environment spanning the CPU (SEV-SNP / TDX) and the GPU (NVIDIA Hopper/Blackwell CC with a protected region of HBM, the BAR0 decoupler, and encrypted transfers), gated by attestation so a tenant releases keys only to a measured, verified configuration. The full machinery (the Compute Protected Region, attestation via NRAS/RIM, TEE-I/O across NVLink, the residual attack surface) is the canonical subject of Chapter 11.5. The decision here is whether you need it, because it is not free.
The fork is sharp. Strong VM isolation + disciplined hygiene defends against an honest-but-curious neighbor and an accidental leak; it does not defend against a malicious co-tenant exploiting a side-channel, nor against a compromised or coerced operator, nor against an insider with host access. Confidential multi-tenancy moves the operator out of the trust boundary and makes the attestation the gate — at the cost of a measurable performance tax (encrypted PCIe/NVLink transfers and TEE overhead, larger for small-transfer chatty workloads than for large compute-bound ones), reduced GPU-sharing flexibility (CC modes constrain partitioning), and real operational complexity in key brokering and attestation-policy management. You pay goodput and ops to remove the operator from the trust model. If your tenants do not require that removal, you are buying nines of confidentiality the data class does not value — the security analog of over-provisioned redundancy.
| Adversary you must defend against | Minimum sufficient posture | Boundary that must hold | Goodput / cost penalty | Residual risk you accept |
|---|---|---|---|---|
| Accidental cross-tenant leak | VM-per-tenant + verified scrub | Hypervisor + scrub policy | Low | Hypervisor escape; side-channels |
| Noisy neighbor / DoS | MIG (hardware partition) | GPU hardware partition | Low (fixed geometry overhead) | Uncore side-channels |
| Malicious co-tenant (data theft) | MIG-backed vGPU + hygiene | HW partition + VM mediator | Low-moderate | vGPU Manager CVEs; timing leakage |
| Malicious co-tenant (timing/side-channel) | Confidential computing (GPU TEE) | TEE + attestation | Moderate (encrypted transfers) | TEE implementation flaws |
| Compromised / coerced operator | Confidential computing + tenant-held keys | Attestation-gated key release | Moderate-high (ops complexity) | Supply-chain / firmware trust root |
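The table can be turned into a lookup that returns the minimum posture for a set of in-scope adversaries. This is a sketch resting on one loud assumption: the rows are ordered by increasing strength, so each posture is taken to also suffice for the adversaries listed above it. The table's ordering suggests this but does not state it, so treat the result as a starting point for review. The names `ROWS`, `RANK`, and `minimum_posture` are hypothetical.

```python
# Rows of the adversary table, in the order given:
# (adversary, minimum sufficient posture, goodput / cost penalty).
ROWS = [
    ("accidental cross-tenant leak", "VM-per-tenant + verified scrub", "Low"),
    ("noisy neighbor / DoS", "MIG (hardware partition)", "Low"),
    ("malicious co-tenant (data theft)", "MIG-backed vGPU + hygiene", "Low-moderate"),
    ("malicious co-tenant (timing/side-channel)", "Confidential computing (GPU TEE)", "Moderate"),
    ("compromised / coerced operator", "Confidential computing + tenant-held keys", "Moderate-high"),
]

# ASSUMPTION: a row's position is its strength rank.
RANK = {adversary: index for index, (adversary, _, _) in enumerate(ROWS)}


def minimum_posture(adversaries):
    """Return (posture, penalty) for the most demanding adversary in scope."""
    if not adversaries:
        raise ValueError("name at least one adversary; 'none' is not a threat model")
    worst = max(RANK[a] for a in adversaries)  # KeyError on an unknown adversary
    _, posture, penalty = ROWS[worst]
    return posture, penalty
```

Refusing an empty adversary list is deliberate: the section's argument is that the posture follows from a named adversary, not from a feature list.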
The isolation-versus-utilization economics
Every rung you climb for stronger isolation costs utilization, and utilization is the entire economic premise of multi-tenancy. This is the central tension and it is quantifiable. A bare-metal tenant who drives a B200 at 40% leaves ~60% of a $30k+ asset idle and unbackfillable. MIG recovers much of that by packing up to seven isolated tenants onto one die. Time-slicing recovers the most by oversubscribing — at the price of being no boundary at all. Confidential computing climbs back down the utilization ladder: it constrains partitioning, taxes transfers, and complicates scheduling, so the same hardware serves fewer effective tenant-hours.
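The arithmetic behind that tension is short enough to write down. A minimal sketch, using the $30k price and 40% utilization from the text, with hypothetical function names; the packing helper counts slices only and ignores MIG's fixed-geometry rules, so it is an optimistic bound rather than a placement algorithm.

```python
def stranded_value(asset_price, utilization):
    """Capital sitting idle on a dedicated GPU at the given utilization (0 to 1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return asset_price * (1.0 - utilization)


def gpus_dedicated(slices_per_tenant):
    """Bare-metal or whole-GPU passthrough: one GPU per tenant, whatever they use."""
    return len(slices_per_tenant)


def gpus_packed(slices_per_tenant, slices_per_gpu=7):
    """Optimistic GPU count when tenants are packed by slice count.

    SIMPLIFICATION: ignores fixed MIG geometries and the need to keep one
    tenant's slices on one GPU, so real deployments may need more.
    """
    if any(s < 1 or s > slices_per_gpu for s in slices_per_tenant):
        raise ValueError("each tenant needs between 1 and slices_per_gpu slices")
    total = sum(slices_per_tenant)
    return -(-total // slices_per_gpu)  # ceiling division on integers


if __name__ == "__main__":
    print(stranded_value(30_000, 0.40))  # idle capital on one dedicated GPU
    tenants = [1, 1, 2, 1, 1, 1]         # slices requested per tenant (illustrative)
    print(gpus_dedicated(tenants), gpus_packed(tenants))
```

The gap between the two GPU counts is the utilization that every stronger isolation rung gives back.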
The decision is therefore not "how secure can we be" but "what is the cheapest posture that is sufficient for this data class's adversary model" — which is why the table above is organized by adversary, not by feature. Over-isolating is a real cost: a commercial inference fleet serving public, non-sensitive content that runs everything under per-tenant confidential computing is burning goodput and capex to defend against a threat its data class does not face. Under-isolating is a breach: a regulated-data tenant time-sliced onto shared silicon with a stranger is one un-scrubbed allocation away from a disclosure. The ClusterMAX 2.0 rubric (SemiAnalysis) bakes this judgment into how the market grades neoclouds — tenant and fabric isolation are scored as first-class alongside goodput and health-checks, because buyers have learned to price the boundary, not just the FLOPs.
The pragmatic 2026 default for a serious multi-tenant operator is a tiered offering: bare-metal or VM-per-tenant for customers who pay for it and demand it; MIG-backed vGPU with verified hygiene as the standard fractional product; time-slicing reserved strictly for a single tenant's own co-operative workloads (never across trust boundaries); and confidential computing as a premium tier for regulated and sovereign demand. Selling time-slicing as a cross-tenant boundary, or selling "isolation" without naming which rung, is the misrepresentation that turns a CVE into a liability.
Deep dive: the control plane is the real multi-tenant attack surface
The in-GPU isolation mechanisms — MIG, vGPU, CC — get the attention, but the empirical breach record points relentlessly at the orchestration layer that sits above them. In a Kubernetes-based GPU platform the multi-tenant boundary is enforced by a stack of software components, each of which is a tenant-reachable attack surface: the container runtime and NVIDIA Container Toolkit (NVIDIAScape lived here), the GPU Operator and device plugin that advertise and allocate GPUs, the scheduler / fractional-GPU layer (e.g. Run:ai-style quota and policy) that decides which tenant lands on which slice, and the namespace / RBAC / network-policy fabric that is supposed to keep tenant A from reaching tenant B's pods and services.
The hard isolation design (Introl's taxonomy of hard / soft / hybrid) treats these as the primary controls: per-tenant namespaces with enforced RBAC and resource quotas, network policies that default-deny east-west traffic between tenants, admission controllers that block privileged containers and dangerous host mounts, and a patched, current Container Toolkit as table stakes. A neocloud that nails MIG partitioning but runs an unpatched Container Toolkit, permits privileged pods, or shares a flat tenant network has a strong GPU boundary wrapped in a soft control plane — and attackers, like water, find the soft part. The corollary for buyers: when you diligence a multi-tenant provider, audit the control plane and the patch cadence before you ask about MIG. The fabric-side enforcement (DPU-VPC, per-tenant PKeys vs shared VLANs) is detailed in Chapter 11.7.
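The admission-control item in that list can be illustrated with a check over a pod manifest. This is a sketch, not a substitute for a real admission controller: it assumes the manifest has been parsed into a plain dictionary, it covers only the privileged flag, host namespaces, and hostPath volumes, and `pod_violations` is a hypothetical helper name. In a real cluster these rules would more likely live in Pod Security Admission or a policy engine.

```python
def pod_violations(pod):
    """Return the reasons a pod manifest (as a dict) should be rejected."""
    spec = pod.get("spec", {})
    reasons = []

    # Host namespaces put the pod inside the node's own trust domain.
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(flag):
            reasons.append(f"{flag} is enabled")

    # hostPath volumes expose the node filesystem to the tenant.
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            reasons.append(f"hostPath volume: {volume.get('name', '?')}")

    # Privileged containers bypass most container-level confinement.
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged"):
            reasons.append(f"privileged container: {container.get('name', '?')}")

    return reasons


if __name__ == "__main__":
    risky = {
        "spec": {
            "hostPID": True,
            "volumes": [{"name": "root", "hostPath": {"path": "/"}}],
            "containers": [{"name": "job", "securityContext": {"privileged": True}}],
        }
    }
    for reason in pod_violations(risky):
        print("reject:", reason)
```

An empty list means the manifest passed these three checks, not that it is safe; the runtime patch level and network policy remain separate controls.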