Chapter 11.6

Multi-Tenant & Workload Isolation Security

Every shared accelerator is a decision about how much of someone else's blast radius you are willing to inherit — partitioning a GPU is a utilization win and a confidentiality liability at the same time, and the only honest question is which boundary you trust to hold.


What you'll decide here

Where on the multi-tenancy spectrum each pod sits — bare-metal-per-tenant, VM-per-tenant, MIG-partitioned, vGPU-virtualized, or kernel-level time-sliced — and therefore which isolation failures you have signed up to own.
Whether MIG, vGPU, or time-slicing is being treated as a performance-sharing convenience or as a security boundary — because only one of the three is a hardware-enforced boundary, and conflating them is how breaches happen.
Your memory-hygiene policy on every reassignment: who scrubs GPU framebuffer, HBM, and local/shared memory between tenants, and whether that scrub is verified or assumed.
Whether confidential multi-tenancy (per-tenant TEE + attestation) is required by the data class, or whether strong VM isolation plus disciplined hygiene is sufficient — the former costs goodput and operational complexity, the latter costs you nothing until the day it does.
The control-plane isolation posture: the container runtime, the GPU operator, and the orchestrator are the real multi-tenant attack surface, and a three-line container escape there defeats any in-GPU partitioning you bought.

Multi-tenancy is the business model of the entire neocloud and most enterprise platform teams: buy one expensive accelerator, rent fractions of it to many workloads, and amortize a $30k+ GPU across tenants who would each waste 70% of it alone. The economics are irresistible and the security consequences are routinely underpriced. The instant two tenants share a physical GPU, a memory controller, an L2 cache, an NVLink domain, or merely a host kernel and a container runtime, you have created a channel — and the only question left is whether it is a strong boundary, a soft convenience dressed up as a boundary, or an outright leak. This chapter is about reading that distinction correctly before a tenant's KV-cache shows up in a neighbor's framebuffer.

This chapter lays out the multi-tenancy spectrum from bare-metal to time-slicing and the documented isolation failures at each rung; we resolve the question every platform team mishandles — is MIG / vGPU / time-slicing a security boundary? — with the hardware reality rather than the marketing; we make memory hygiene on reassignment an explicit, owned policy instead of an assumption; and we close on the isolation-versus-utilization economics that govern the whole thing, with confidential multi-tenancy treated as the strongest and most expensive rung. The canonical TEE and attestation machinery lives in Chapter 11.5; here we decide when you need it.

The multi-tenancy spectrum

Tenant isolation is not binary — it is a spectrum, and each step down trades a stronger boundary for higher utilization. Read it top to bottom as decreasing isolation strength and increasing density: the moment you stop giving each tenant a whole physical GPU, you start sharing silicon that was never designed as a confidentiality boundary.

Bare-metal-per-tenant is the strongest posture: a tenant gets whole nodes, whole GPUs, and ideally a whole rack or NVLink domain, with no hypervisor and no co-tenant on the same silicon. The blast radius is the tenant's own. This is what serious training customers and sovereign deployments demand, and it is what the strongest neocloud SLAs implicitly promise. The cost is utilization: a tenant who under-uses the hardware strands it, and you cannot backfill without reintroducing a co-tenant.

VM-per-tenant (whole-GPU passthrough) keeps the GPU undivided but interposes a hypervisor, giving each tenant a strong CPU-side boundary (SR-IOV / PCIe passthrough) while the GPU itself is still owned end-to-end by one VM at a time. The boundary you now trust is the hypervisor and the IOMMU, not the GPU. This is the workhorse of enterprise multi-tenant clouds and a genuinely defensible boundary — the residual risk is hypervisor escapes and the host-side GPU management stack.

MIG (Multi-Instance GPU) partitions a single physical GPU into up to seven hardware-isolated instances, each with dedicated streaming multiprocessors, dedicated L2 cache slices, dedicated memory controllers, and a dedicated slice of HBM, plus dedicated copy and decode engines. This is the only fractional-sharing mode with a hardware-enforced partition, which is why it is the one rung of fractional sharing that can credibly be called a security boundary. The cost is rigidity (fixed geometries, no oversubscription) and the residual uncore side-channels discussed below.

vGPU (mediated virtualization) time-multiplexes a GPU across guest VMs via a host-resident vGPU Manager that schedules and mediates access. The VM boundary is real, but the GPU is now shared through a software mediator running privileged host code, and that mediator was the subject of the most consequential cross-VM CVEs of 2025. vGPU can be layered on MIG (MIG-backed vGPU) to combine hardware partitioning with the VM boundary, which is the strongest practical fractional mode.

Time-slicing (kernel-level sharing / MPS) is the bottom rung: multiple tenant processes share one GPU context-switched in time, with no memory isolation and no fault isolation. A fault, a hang, or a noisy kernel from one tenant degrades or crashes the others, and there is no architectural barrier between their allocations. Time-slicing is a utilization tool for trusted, co-operative workloads — it is not, and must never be sold as, a tenant boundary.

The boundary question: which of these is actually a security boundary?

The single most consequential mistake platform teams make is treating a performance-sharing feature as a confidentiality boundary. Resolve it explicitly. Bare-metal and VM-per-tenant are security boundaries (the boundary is the absence of a co-tenant, or the hypervisor + IOMMU). MIG is a hardware-enforced isolation boundary — the strongest fractional option — but it is not a side-channel-proof confidentiality boundary; co-located instances still share uncore resources that have demonstrated covert and side channels. vGPU's boundary is the VM, mediated by privileged host software that has a real CVE history. Time-slicing / MPS is NOT a security boundary at all — no memory isolation, no fault isolation. If your data class requires confidentiality against a malicious co-tenant, your floor is MIG-backed vGPU with hygiene, and your gold standard is confidential computing (Chapter 11.5). If it requires confidentiality against the operator, only the latter qualifies.

The multi-tenancy spectrum: isolation strength vs utilization

Isolation mode	Boundary trusted	Memory isolation	Fault isolation	Security boundary?	Utilization	Documented failure class
Bare-metal-per-tenant	Physical separation	Total (no co-tenant)	Total	Yes — strongest	Lowest (strands idle)	Decommission / reassignment remanence
VM-per-tenant (passthrough)	Hypervisor + IOMMU	Total (whole GPU per VM)	Total	Yes	Low-moderate	Hypervisor / SR-IOV escapes
MIG (hardware partition)	GPU hardware partition	Hardware-enforced slice	Per-instance	Yes — strongest fractional	High (≤7 instances)	Uncore side-channels; fixed geometry
vGPU (mediated)	VM + host vGPU Manager	Per-VM via mediator	Per-VM	Conditional — see CVEs	High (oversubscribable)	vGPU Manager CVEs; cross-VM leakage
Time-slicing / MPS	None (shared context)	None	None	No	Highest (oversubscribed)	Cross-process memory leak; noisy-neighbor DoS

Boundary classification reflects 2026 hardware reality, not vendor positioning. "Boundary trusted" names the component whose compromise breaks isolation. Utilization is qualitative — see economics section.
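The table above can be carried into a scheduler as data rather than as tribal knowledge. What follows is a minimal sketch, not a reference implementation: the names `IsolationMode`, `SPECTRUM`, and `densest_acceptable` are hypothetical, and the row order is assumed to encode density exactly as the table does (first row least dense, last row most dense).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationMode:
    name: str
    boundary_trusted: str
    memory_isolation: bool
    fault_isolation: bool
    security_boundary: str  # "yes", "conditional", or "no", as in the table


# Rows in table order: isolation weakens and density rises from first to last.
SPECTRUM = (
    IsolationMode("bare-metal", "physical separation", True, True, "yes"),
    IsolationMode("vm-passthrough", "hypervisor + IOMMU", True, True, "yes"),
    IsolationMode("mig", "GPU hardware partition", True, True, "yes"),
    IsolationMode("vgpu", "VM + host vGPU Manager", True, True, "conditional"),
    IsolationMode("time-slicing", "none (shared context)", False, False, "no"),
)


def densest_acceptable(need_boundary: bool,
                       accept_conditional: bool = False) -> IsolationMode:
    """Return the densest mode (latest row) that satisfies the requirement."""
    accepted = {"yes", "conditional"} if accept_conditional else {"yes"}
    for mode in reversed(SPECTRUM):
        if not need_boundary or mode.security_boundary in accepted:
            return mode
    raise LookupError("no mode in the table satisfies the requirement")


print(densest_acceptable(need_boundary=True).name)
print(densest_acceptable(need_boundary=False).name)
```

Walking the table from the dense end and stopping at the first acceptable row is the "cheapest sufficient posture" logic in miniature; the `accept_conditional` flag forces someone to decide explicitly whether the vGPU Manager's CVE history is tolerable.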

Documented isolation failures: the empirical record

This is not a theoretical threat model. The 2024–2025 disclosure record contains real, assigned CVEs at every layer of the stack — and reading them is the fastest way to calibrate which boundaries hold. The pattern is instructive: the catastrophic breaks are almost never in the GPU partitioning itself, but in the control plane around it.

The control-plane escape (the real catastrophe). NVIDIAScape (CVE-2025-23266, CVSS 9.0, disclosed by Wiz in July 2025) is a three-line container escape in the NVIDIA Container Toolkit: a malicious image sets LD_PRELOAD in its environment, the toolkit's privileged OCI createContainer hook inherits the variable and loads the attacker's shared library, and the attacker gains root on the host. From there it can read, steal, or tamper with the models and data of every other tenant on the shared machine. It affected Container Toolkit versions up to 1.17.7 and GPU Operator versions up to 25.3.0. Multi-tenant security is a full-stack property, and the runtime is the weakest, most-exposed link.
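A fleet audit for this CVE reduces to a version comparison against the affected ranges quoted above. The sketch below assumes version strings are dotted integers with an optional `v` prefix and `-suffix`; the function names are hypothetical, and how you collect the installed versions from each node is left out on purpose.

```python
from typing import Optional, Tuple

# Last affected versions as stated in the text above.
TOOLKIT_LAST_AFFECTED = (1, 17, 7)
OPERATOR_LAST_AFFECTED = (25, 3, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.17.7' or 'v1.17.7-1' into (1, 17, 7). Assumes dotted integers."""
    core = text.strip().lstrip("v").split("-")[0]
    return tuple(int(part) for part in core.split("."))


def nvidiascape_exposed(toolkit: str, operator: Optional[str] = None) -> bool:
    """True if either component is at or below its last affected version."""
    if parse_version(toolkit) <= TOOLKIT_LAST_AFFECTED:
        return True
    if operator is not None and parse_version(operator) <= OPERATOR_LAST_AFFECTED:
        return True
    return False


for node, toolkit in {"gpu-node-a": "1.17.7", "gpu-node-b": "1.17.8"}.items():
    print(node, "EXPOSED" if nvidiascape_exposed(toolkit) else "ok")
```

Tuple comparison treats a truncated version such as `1.17` as older than `1.17.7`, so incomplete inventory data fails toward "exposed", which is the safer default for an audit.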

The mediator leak (vGPU). NVIDIA's July 2025 vGPU bulletin disclosed cross-VM issues in the vGPU Manager: CVE-2025-23290 (cross-VM information disclosure — a guest reading global GPU metrics influenced by neighboring VMs, the first publicly acknowledged leakage of co-tenant activity through the mediator) and CVE-2025-23285 (a guest consuming global resources to deny service to neighbors). The same bulletin patched stack buffer overflows in the vGPU Manager (CVE-2025-23283/23284, CVSS 7.8) enabling guest-to-host code execution. The mediated-virtualization boundary is real but it is privileged software, and privileged software has bugs.

The memory-remanence leak (any shared GPU). LeftoverLocals (CVE-2023-4969, disclosed by Trail of Bits in 2024) recovered another process's data from un-scrubbed GPU local memory across process and container boundaries on Apple, Qualcomm, AMD, and Imagination GPUs, enough to reconstruct an LLM's responses (≈181 MB recoverable per query against a 7B model on llama.cpp). NVIDIA and Arm were not impacted, but the structural lesson stands: GPU memory is not scrubbed for you by default, and the next un-zeroed region is a direct leak waiting to be read.

The uncore side-channels (even MIG). Academic work ("Spy in the GPU-box" and the uncore side-channel literature, arXiv 2203.15981) has demonstrated covert and side channels that bypass MPS and even MIG partitioning by observing contention on shared uncore resources — the very thing MIG was supposed to isolate. MIG's partition is genuinely hardware-enforced for memory and compute, but it is not a proof of side-channel resistance. For data where timing leakage is in-scope, MIG is necessary but not sufficient; confidential computing is the answer.

Figure	What it measures	Date	Source
9.0	CVSS of NVIDIAScape (CVE-2025-23266): three-line container escape to host root in NVIDIA Container Toolkit	Jul 2025	Wiz Research; NVIDIA Security Bulletin
≤1.17.7	NVIDIA Container Toolkit versions vulnerable to NVIDIAScape (GPU Operator ≤25.3.0)	Jul 2025	Wiz; NVIDIA
CVE-2025-23290	First publicly acknowledged cross-VM co-tenant information disclosure via the vGPU Manager	Jul 2025	NVIDIA Security Bulletins
7	Max MIG instances per GPU: the only hardware-enforced fractional partition (dedicated SMs, L2 slice, memory controllers, HBM slice)	2025	NVIDIA Multi-Instance GPU
≈181 MB	LLM-response data recoverable per query via LeftoverLocals (CVE-2023-4969) from un-scrubbed GPU local memory	2024	Trail of Bits
None	Memory and fault isolation guarantees provided by time-slicing / MPS between tenants	2025	Introl; NVIDIA MPS docs
10-dimension	ClusterMAX 2.0 operator-maturity rubric grades tenant/fabric isolation, health-checks, and goodput as first-class	2025	SemiAnalysis ClusterMAX 2.0

Memory hygiene on reassignment: the owned policy

The most overlooked multi-tenant control is also the most routine: scrubbing memory when a resource changes hands. Every time a GPU, a MIG instance, a vGPU slice, or even a container's pinned allocation is freed by one tenant and handed to the next, the prior tenant's HBM contents, framebuffer, L2 lines, and local/shared memory may persist as remanence. LeftoverLocals is exactly this failure at the local-memory layer. The decision you own is not whether to scrub — it is who scrubs, at which layer, and whether the scrub is verified or merely assumed.
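"Verified rather than assumed" can be made concrete as a reassignment gate: a slice cannot be handed to a new tenant until it has been scrubbed and the scrub has been checked. The sketch below is a toy model under loud assumptions: `GpuSlice` is a hypothetical class, and a `bytearray` stands in for device memory that a real platform would zero and read back through the driver.

```python
class GpuSlice:
    """Toy stand-in for a reassignable slice; a bytearray plays the role of HBM."""

    def __init__(self, size: int):
        self.memory = bytearray(size)  # freshly created, known-zero
        self.state = "verified"
        self.owner = None

    def assign(self, tenant: str) -> None:
        # The gate: only a verified-clean slice may change hands.
        if self.state != "verified":
            raise RuntimeError(f"refusing to assign slice in state {self.state!r}")
        self.owner, self.state = tenant, "in-use"

    def release(self) -> None:
        # Contents are untrusted from the moment the tenant lets go.
        self.owner, self.state = None, "dirty"

    def scrub(self) -> None:
        self.memory[:] = bytes(len(self.memory))
        self.state = "scrubbed"

    def verify(self) -> bool:
        # Read back the whole region; promote only if it is actually zero.
        if self.state == "scrubbed" and not any(self.memory):
            self.state = "verified"
        return self.state == "verified"


slice_ = GpuSlice(64)
slice_.assign("tenant-a")
slice_.memory[:6] = b"secret"
slice_.release()
slice_.scrub()
print("safe to reassign:", slice_.verify())
```

Keeping `scrubbed` and `verified` as separate states is the design choice that matters: a scrub that was issued but never checked leaves the slice unassignable, so an incomplete vendor default surfaces as a stuck slice instead of a silent leak.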

The fork: driver/firmware-level scrub on free (the operator trusts the GPU stack to zero memory on instance teardown or VM release) versus orchestrator-enforced scrub (the platform explicitly zeros and verifies before re-allocating). The former is cheaper and faster but inherits whatever the vendor's default does — and defaults have historically been incomplete (LeftoverLocals shipped for years). The latter costs reassignment latency — a full HBM zero on a 192 GB B200 is non-trivial wall-clock time that shows up as scheduler churn and lower effective utilization — but it is the only posture you can attest to a tenant. For confidential workloads the question is moot: TEE teardown cryptographically erases keys so ciphertext remanence is meaningless without the key, which is one of the strongest arguments for confidential multi-tenancy (Chapter 11.5).
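The latency side of that fork is worth putting numbers on before arguing about it. The sketch below computes only a bandwidth floor and the utilization cost of a measured latency. The 8,000 GB/s figure is an assumed peak HBM bandwidth that the text does not state, the 5-second latency and 12 reassignments per hour are made-up inputs, and the function names are hypothetical. Real reassignment latency will sit above the floor by whatever the teardown path adds, so measure it rather than trusting this estimate.

```python
def scrub_floor_seconds(capacity_gb: float, bandwidth_gb_per_s: float,
                        verify: bool = True) -> float:
    """Bandwidth-only lower bound: one write pass, plus one read pass if verifying."""
    if capacity_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("capacity and bandwidth must be positive")
    passes = 2 if verify else 1
    return passes * capacity_gb / bandwidth_gb_per_s


def utilization_lost(reassign_latency_s: float, reassignments_per_hour: float) -> float:
    """Fraction of each hour a slice spends being reassigned instead of serving."""
    return min(1.0, reassign_latency_s * reassignments_per_hour / 3600.0)


# 192 GB capacity is from the text; the bandwidth is an assumption.
floor = scrub_floor_seconds(capacity_gb=192, bandwidth_gb_per_s=8000)
print(f"bandwidth-only floor with verification: {floor * 1000:.0f} ms")

# Hypothetical measured end-to-end latency and churn rate.
print(f"utilization lost: {utilization_lost(5.0, 12):.1%}")
```

The second function is the one that belongs in a capacity model: the cost of verified scrubbing scales with churn, so a platform with short-lived fractional tenants pays far more for the same policy than one with week-long reservations.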

Deep dive: the three layers of GPU memory remanence and who scrubs each

"Scrub the GPU memory" is too coarse to be actionable, because GPU memory is at least three distinct regions with three distinct owners, and a policy that covers one and not the others leaks through the gap.

Global / HBM (framebuffer and device allocations). The largest region and the one tenants think of as "GPU memory." On MIG teardown or VM release, the driver/firmware is responsible for zeroing the slice before reallocation. This is the layer where you most want orchestrator-level verification, because the consequence of a missed scrub is the previous tenant's weights, activations, or KV-cache becoming readable. A full-capacity zero is bandwidth-bound, and its cost (the write pass, any read-back verification, and the teardown around it) grows with capacity, which is why operators are tempted to skip or defer it on the largest parts; resist the temptation or make the deferral explicit and bounded.

Local / shared memory (per-SM scratchpad). The small, fast, software-managed region that LeftoverLocals exploited. It is not automatically cleared between kernel launches on affected stacks, so a reader kernel could dump whatever a prior victim kernel left behind. Mitigation is a vendor driver fix plus, defensively, kernels that clear their own local memory on exit — a measurable but small overhead that hardened inference runtimes now adopt.

Caches and registers (L2, register file). The hardest to reason about and the realm of side-channels rather than direct remanence: MIG dedicates L2 slices per instance, but contention on shared uncore paths is observable, which is the uncore side-channel result. You do not "scrub" this layer; you either accept the side-channel risk (acceptable for most commercial multi-tenancy) or you move to confidential computing where the threat model explicitly includes a malicious co-tenant. Naming which of those two you have chosen is the deliverable.
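The three layers above lend themselves to a coverage check: a hygiene policy is complete only if every layer has a named owner, the two scrubbable layers are verified, and the cache layer records which of the two choices was made. This is a sketch under assumptions; the layer keys, the policy dictionary shape, and `policy_gaps` are all hypothetical.

```python
LAYERS = ("hbm", "local", "cache")

# The cache layer is not scrubbed; the deliverable is a named decision.
CACHE_DECISIONS = {"accept-side-channel-risk", "confidential-computing"}


def policy_gaps(policy: dict) -> list:
    """Return a human-readable gap for every layer the policy fails to cover."""
    gaps = []
    for layer in LAYERS:
        entry = policy.get(layer) or {}
        if not entry.get("owner"):
            gaps.append(f"{layer}: no named owner")
        elif layer == "cache":
            if entry.get("decision") not in CACHE_DECISIONS:
                gaps.append("cache: side-channel decision not recorded")
        elif not entry.get("verified", False):
            gaps.append(f"{layer}: scrub is assumed, not verified")
    return gaps


policy = {
    "hbm": {"owner": "orchestrator", "verified": True},
    "local": {"owner": "inference-runtime", "verified": False},
    "cache": {"owner": "security-team"},
}
for gap in policy_gaps(policy):
    print(gap)
```

Running a check like this in CI against the platform's declared policy turns "who scrubs what" from a wiki page into something that fails a build when a layer is left unowned.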

Confidential multi-tenancy: the strongest, most expensive rung

When the data class genuinely cannot tolerate exposure to a co-tenant or to the operator (regulated health and financial data, sovereign workloads, frontier weights rented on infrastructure you do not own), the only sufficient answer is confidential computing: a per-tenant Trusted Execution Environment spanning the CPU (SEV-SNP / TDX) and the GPU (NVIDIA Hopper/Blackwell CC with encrypted HBM, the BAR0 decoupler, and encrypted transfers), gated by attestation so a tenant releases keys only to a measured, verified configuration. The full machinery (the Compute Protected Region, attestation via NRAS/RIM, TEE-I/O across NVLink, the residual attack surface) is the canonical subject of Chapter 11.5. The decision here is whether you need it, because it is not free.

The fork is sharp. Strong VM isolation + disciplined hygiene defends against an honest-but-curious neighbor and an accidental leak; it does not defend against a malicious co-tenant exploiting a side-channel, nor against a compromised or coerced operator, nor against an insider with host access. Confidential multi-tenancy moves the operator out of the trust boundary and makes the attestation the gate — at the cost of a measurable performance tax (encrypted PCIe/NVLink transfers and TEE overhead, larger for small-transfer chatty workloads than for large compute-bound ones), reduced GPU-sharing flexibility (CC modes constrain partitioning), and real operational complexity in key brokering and attestation-policy management. You pay goodput and ops to remove the operator from the trust model. If your tenants do not require that removal, you are buying nines of confidentiality the data class does not value — the security analog of over-provisioned redundancy.

Choosing a multi-tenant posture by adversary and data class

Adversary you must defend against	Minimum sufficient posture	Boundary that must hold	Goodput / cost penalty	Residual risk you accept
Accidental cross-tenant leak	VM-per-tenant + verified scrub	Hypervisor + scrub policy	Low	Hypervisor escape; side-channels
Noisy neighbor / DoS	MIG (hardware partition)	GPU hardware partition	Low (fixed geometry overhead)	Uncore side-channels
Malicious co-tenant (data theft)	MIG-backed vGPU + hygiene	HW partition + VM mediator	Low-moderate	vGPU Manager CVEs; timing leakage
Malicious co-tenant (timing/side-channel)	Confidential computing (GPU TEE)	TEE + attestation	Moderate (encrypted transfers)	TEE implementation flaws
Compromised / coerced operator	Confidential computing + tenant-held keys	Attestation-gated key release	Moderate-high (ops complexity)	Supply-chain / firmware trust root

Read top to bottom as strengthening adversary model. Each row is the minimum sufficient posture for that threat — over-buying wastes goodput, under-buying is a breach waiting to happen.
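Because the table is ordered by strengthening adversary, a tenant facing several adversaries needs the posture from the lowest applicable row. The sketch below encodes that reading. Treating the rows as a strict ladder is a simplifying assumption (for example, it does not carry the verified-scrub requirement forward from the first row), and the adversary keys and `minimum_posture` are hypothetical names.

```python
# (adversary key, minimum sufficient posture), in table order.
POSTURES = (
    ("accidental-leak", "VM-per-tenant + verified scrub"),
    ("noisy-neighbor", "MIG (hardware partition)"),
    ("malicious-cotenant-data-theft", "MIG-backed vGPU + hygiene"),
    ("malicious-cotenant-side-channel", "Confidential computing (GPU TEE)"),
    ("compromised-operator", "Confidential computing + tenant-held keys"),
)


def minimum_posture(adversaries) -> str:
    """Return the posture for the strongest adversary in the given set."""
    ranks = {name: index for index, (name, _) in enumerate(POSTURES)}
    wanted = set(adversaries)
    if not wanted:
        raise ValueError("name at least one adversary")
    unknown = wanted - set(ranks)
    if unknown:
        raise ValueError(f"unknown adversary: {sorted(unknown)}")
    return POSTURES[max(ranks[name] for name in wanted)][1]


print(minimum_posture({"accidental-leak", "malicious-cotenant-data-theft"}))
```

Rejecting an empty or unknown adversary set is deliberate: a request for "isolation" that cannot name who it is defending against should fail at intake, not default to the cheapest row.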

The isolation-versus-utilization economics

Every rung you climb for stronger isolation costs utilization, and utilization is the entire economic premise of multi-tenancy. This is the central tension and it is quantifiable. A bare-metal tenant who drives a B200 at 40% leaves ~60% of a $30k+ asset idle and unbackfillable. MIG recovers much of that by packing up to seven isolated tenants onto one die. Time-slicing recovers the most by oversubscribing — at the price of being no boundary at all. Confidential computing climbs back down the utilization ladder: it constrains partitioning, taxes transfers, and complicates scheduling, so the same hardware serves fewer effective tenant-hours.
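The stranded-asset arithmetic in that paragraph is simple enough to keep as a function next to the capacity plan. This sketch uses the $30k price and 40% utilization from the text as inputs; the function names are hypothetical, and the model ignores power, hosting, and depreciation, which a real cost model would include.

```python
def stranded_capex(price: float, utilization: float) -> float:
    """Capital tied up in the idle fraction of a dedicated accelerator."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price * (1 - utilization)


def cost_per_utilized_gpu(price: float, utilization: float) -> float:
    """Effective price of one fully used GPU's worth of work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price / utilization


price, utilization = 30_000, 0.40  # figures from the text
print(f"stranded: ${stranded_capex(price, utilization):,.0f}")
print(f"effective cost per utilized GPU: ${cost_per_utilized_gpu(price, utilization):,.0f}")
```

At 40% utilization the bare-metal tenant strands about $18k of a $30k part and is effectively paying about $75k for each fully used GPU's worth of work, which is the gap that fractional sharing exists to close.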

The decision is therefore not "how secure can we be" but "what is the cheapest posture that is sufficient for this data class's adversary model" — which is why the table above is organized by adversary, not by feature. Over-isolating is a real cost: a commercial inference fleet serving public, non-sensitive content that runs everything under per-tenant confidential computing is burning goodput and capex to defend against a threat its data class does not face. Under-isolating is a breach: a regulated-data tenant time-sliced onto shared silicon with a stranger is one un-scrubbed allocation away from a disclosure. The ClusterMAX 2.0 rubric (SemiAnalysis) bakes this judgment into how the market grades neoclouds — tenant and fabric isolation are scored as first-class alongside goodput and health-checks, because buyers have learned to price the boundary, not just the FLOPs.

The pragmatic 2026 default for a serious multi-tenant operator is a tiered offering: bare-metal or VM-per-tenant for customers who pay for it and demand it; MIG-backed vGPU with verified hygiene as the standard fractional product; time-slicing reserved strictly for a single tenant's own co-operative workloads (never across trust boundaries); and confidential computing as a premium tier for regulated and sovereign demand. Selling time-slicing as a cross-tenant boundary, or selling "isolation" without naming which rung, is the misrepresentation that turns a CVE into a liability.
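The tiering rules above are mechanical enough to enforce at placement time. The following is a minimal sketch with hypothetical names (`placement_problems` and the mode strings); it assumes the scheduler knows which tenant IDs would end up sharing one physical GPU.

```python
DEDICATED = {"bare-metal", "vm-passthrough"}            # one tenant per GPU
SHARED_OK = {"mig-backed-vgpu", "confidential-computing"}
SINGLE_TENANT_ONLY = {"time-slicing"}                   # co-operative workloads only
KNOWN_MODES = DEDICATED | SHARED_OK | SINGLE_TENANT_ONLY


def placement_problems(mode, tenants_on_gpu) -> list:
    """Return policy violations for placing these tenants on one GPU in this mode."""
    distinct = set(tenants_on_gpu)
    if mode not in KNOWN_MODES:
        return ["isolation rung not named"]
    problems = []
    if mode in SINGLE_TENANT_ONLY and len(distinct) > 1:
        problems.append("time-slicing across a trust boundary")
    if mode in DEDICATED and len(distinct) > 1:
        problems.append(f"{mode} sold as dedicated but GPU has co-tenants")
    return problems


print(placement_problems("time-slicing", ["acme", "acme"]))
print(placement_problems("time-slicing", ["acme", "globex"]))
print(placement_problems(None, ["acme"]))
```

An unnamed mode is rejected before anything else is checked, mirroring the point that selling "isolation" without naming the rung is itself the failure.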

Deep dive: the control plane is the real multi-tenant attack surface

The in-GPU isolation mechanisms — MIG, vGPU, CC — get the attention, but the empirical breach record points relentlessly at the orchestration layer that sits above them. In a Kubernetes-based GPU platform the multi-tenant boundary is enforced by a stack of software components, each of which is a tenant-reachable attack surface: the container runtime and NVIDIA Container Toolkit (NVIDIAScape lived here), the GPU Operator and device plugin that advertise and allocate GPUs, the scheduler / fractional-GPU layer (e.g. Run:ai-style quota and policy) that decides which tenant lands on which slice, and the namespace / RBAC / network-policy fabric that is supposed to keep tenant A from reaching tenant B's pods and services.

The hard isolation design (Introl's taxonomy of hard / soft / hybrid) treats these as the primary controls: per-tenant namespaces with enforced RBAC and resource quotas, network policies that default-deny east-west traffic between tenants, admission controllers that block privileged containers and dangerous host mounts, and a patched, current Container Toolkit as table stakes. A neocloud that nails MIG partitioning but runs an unpatched Container Toolkit, permits privileged pods, or shares a flat tenant network has a strong GPU boundary wrapped in a soft control plane — and attackers, like water, find the soft part. The corollary for buyers: when you diligence a multi-tenant provider, audit the control plane and the patch cadence before you ask about MIG. The fabric-side enforcement (DPU-VPC, per-tenant PKeys vs shared VLANs) is detailed in Chapter 11.7.
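The admission-control part of that list can be illustrated with a check over a pod manifest. This sketch inspects standard Kubernetes PodSpec fields (`hostPID`, `hostNetwork`, `hostIPC`, `securityContext.privileged`, `hostPath` volumes) in a plain dictionary; `pod_violations` is a hypothetical helper, and in production the same rules would normally live in an admission controller or policy engine rather than in ad hoc code.

```python
def pod_violations(pod: dict) -> list:
    """List the settings in a pod manifest that a hard-isolation policy would block."""
    spec = pod.get("spec", {})
    found = []

    # Host namespace sharing lets a tenant see or reach host-level resources.
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(flag):
            found.append(f"{flag} is enabled")

    # Privileged containers bypass most of the runtime's confinement.
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        security = container.get("securityContext") or {}
        if security.get("privileged"):
            found.append(f"container {container.get('name')!r} is privileged")

    # hostPath volumes expose the node filesystem to the tenant.
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            path = (volume["hostPath"] or {}).get("path")
            found.append(f"volume {volume.get('name')!r} mounts hostPath {path!r}")

    return found


pod = {
    "spec": {
        "hostPID": True,
        "containers": [{"name": "trainer", "securityContext": {"privileged": True}}],
        "volumes": [{"name": "sock", "hostPath": {"path": "/var/run"}}],
    }
}
for violation in pod_violations(pod):
    print(violation)
```

A check like this does not replace patching the Container Toolkit, since NVIDIAScape needed no privileged flag, but it removes the easier escapes so that the patch cadence is the only thing left to audit.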

Confidential computing — the GPU TEE, encrypted HBM, attestation via NRAS/RIM, TEE-I/O over NVLink, and the residual attack surface — is the canonical subject of Chapter 11.5; this chapter decides when its cost is warranted. The container-runtime and orchestration hardening that NVIDIAScape exposes connects to firmware and BMC integrity in Chapter 11.4 and to the network/microsegmentation boundary that carries east-west tenant traffic in Chapter 11.7. Memory remanence on reassignment is the in-GPU analog of media sanitization and data remanence at decommission in Chapter 11.3. The model and weight protection that a tenant boundary ultimately exists to guarantee is treated in Chapter 11.8. The multi-tenant scheduling, quota, and fair-share mechanics that decide who shares what are engineered in Chapter 10.3; the isolation-vs-utilization economics tie back to the procurement and goodput framing of Chapter 1.6.