Chapter 11.5
GPU Confidential Computing & Trusted Execution
Confidential computing moves the trust boundary from the operator to the silicon — a cryptographic guarantee that the cloud cannot read your weights or your prompts — but you pay for it in attestation plumbing, a narrower-than-advertised threat model, and a performance tax that is near-zero on Blackwell and severe on anything that crosses the PCIe boundary.
What you'll decide here
- Whether your workload actually needs a hardware TEE — a real third-party-trust problem (sovereign tenant, regulated data, untrusted operator) — or whether you are paying the attestation and ops tax to satisfy a checkbox that encryption-at-rest already covers.
- Which CPU TEE you anchor on (AMD SEV-SNP vs Intel TDX) and therefore which GPU confidential-VM path, attestation tooling, and cloud availability you inherit — this pairing is set by your host platform, not chosen freely per workload.
- Whether you need single-GPU confidential computing (Hopper-class, PCIe-bound, heavy small-transfer tax) or multi-GPU TEE-I/O across NVLink (Blackwell-class, near-line-rate); the generation gap here is the difference between a 2-3x slowdown in bad cases and an overhead of under ~3%.
- Who operates the attestation verifier and key broker — a vendor cloud service (NRAS) you must reach at boot, or a self-hosted relying party — and what happens to your fleet when that dependency is unreachable.
- What residual attack surface you are explicitly accepting: plaintext queue metadata and physical-address tables, uncore side-channels, a trusted-but-unverified hypervisor scheduler, and the firmware/RIM supply chain underneath the whole guarantee.
Every other security chapter in this Part defends an asset you control inside a perimeter you own. Confidential computing answers a harder question: how do you run on infrastructure you do not trust — and prove it? The tenant renting a GPU in a multi-tenant cloud, the sovereign government placing a frontier model in a hyperscaler region, the healthcare or finance customer whose data is the product — each of them faces the same fork. Either they trust the operator's word that no privileged insider, no compromised hypervisor, and no co-resident tenant can read their weights and their prompts, or they demand a cryptographic guarantee rooted in the silicon, enforced by hardware the operator itself cannot override, and attested end-to-end before a single byte of plaintext is released.
That guarantee is a Trusted Execution Environment (TEE). The choice is unforgiving, because confidential computing is the rare security control that is simultaneously a strong cryptographic primitive and a leaky abstraction. Turn it on and you genuinely remove the operator from your trust boundary for data-in-use — the hardest of the three states to protect. Turn it on without understanding what it does not cover, and you have bought a false sense of security: plaintext metadata still leaks, side-channels still exist, the hypervisor scheduler is still trusted, and the entire chain rests on a firmware root of trust that someone has to attest. This is the canonical home for TEE and attestation in the guide; Chapter 10.3 (the data-in-use protection model) and Chapter 11.6 (multi-tenant isolation) both point here.
The three states of data, and the one that is hard
Data exists in three states, and the security maturity of each is wildly different. At rest — weights on NVMe, checkpoints in object storage — is solved: AES-256 self-encrypting drives and KMS-wrapped keys are table stakes. In transit — across the fabric, between nodes — is solved: TLS, IPsec, and increasingly link-layer encryption (MACsec, NVLink encryption) close it. In use — the moment the weights are decrypted into GPU HBM and the prompts flow through the tensor cores — was, until recently, simply unprotected. A privileged host with root, a malicious hypervisor, a DMA-capable peripheral, or a cold-boot attacker could read plaintext model and data straight out of memory. Confidential computing exists to close that last gap, and only that gap. → the full at-rest / in-transit / in-use treatment for weights specifically lives in Chapter 11.8.
The mechanism is a hardware-enforced encrypted boundary. A CPU TEE (a confidential VM) encrypts and integrity-protects guest memory with a key held in the memory controller that no hypervisor or co-tenant can extract. A GPU TEE does the analogous thing for HBM and the on-die register/command interface. The two are chained: the confidential VM on the CPU and the confidential GPU context establish an encrypted, mutually-attested channel so that the bounce buffers used to move data across the (untrusted) PCIe bus are themselves encrypted. The fork that matters for an AI data center is that the GPU is where the model actually runs — so a CPU TEE alone protects nothing that matters. You need both, chained, attested, and that chaining is exactly where the cost and the caveats live.
CPU TEE foundations: SEV-SNP and TDX
The GPU TEE does not stand alone — it is anchored to a CPU confidential VM, and which one you get is set by your host platform, not chosen freely per workload. The two production options diverge in design philosophy, and that divergence cascades into attestation tooling and cloud availability.
AMD SEV-SNP (Secure Encrypted Virtualization — Secure Nested Paging) encrypts each VM's memory with a per-VM key managed by the on-die AMD Secure Processor, and adds Secure Nested Paging to defeat the hypervisor's ability to remap, replay, or alias guest pages — closing the integrity gap that earlier SEV generations left open. Its model is per-VM memory encryption with a relatively coarse, VM-granular boundary. Intel TDX (Trust Domain Extensions) builds a "trust domain" enforced by a dedicated, Intel-signed firmware module (the TDX Module) running in a new SEAM mode, mediating every transition between the untrusted hypervisor and the protected guest. TDX leans on a measured, attestable firmware mediator; SEV-SNP leans on the Secure Processor. Both produce a signed attestation report describing the launch measurement of the confidential VM. The practical consequence for an operator: your accelerator host CPU determines your TEE — an AMD EPYC host pairs with SEV-SNP, an Intel Xeon host with TDX — and the GPU confidential-computing stack, the attestation verifier, and the cloud regions where it is GA all follow from that one platform decision.
The GPU TEE: how the accelerator becomes confidential
Making a GPU confidential is a different engineering problem than making a CPU confidential, because a GPU is a wide-open device by design: hundreds of memory-mapped registers, DMA engines that move data autonomously, and a command interface the driver pokes directly. The teardown of NVIDIA's implementation (independent analysis in arXiv 2507.02770, alongside NVIDIA's own WP-12554) shows three load-bearing mechanisms.
The Compute Protected Region (CPR) carves off the large majority of GPU memory — roughly 90% of HBM — into an encrypted, integrity-protected region that only the confidential GPU context can read in plaintext. Weights and activations live here; the host sees ciphertext. The BAR0 decoupler addresses the register-exposure problem: in normal mode only about 8% of the GPU's memory-mapped registers are hidden from the host, but in confidential-compute mode the decoupler hides roughly 99.78% of them, collapsing the management interface the host can touch down to a tiny, audited surface. Encrypted transfers handle the data path: because PCIe itself is untrusted, every payload crossing it moves through AES-GCM-encrypted bounce/staging buffers, with the per-channel keys derived from an SPDM-negotiated master secret (the device negotiates a session and derives 44+ distinct keys across RPC, DMA, fault, and workload channels). The result is a GPU whose memory and management plane the operator cannot read, even with full host root.
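The per-channel key idea can be sketched with a standard key-derivation function. The example below uses HKDF-SHA256 (RFC 5869), built from Python's standard library, to expand one master secret into 44 independent keys. The choice of HKDF, the all-zero salt, the channel labels, and the even 11-per-class split are illustrative assumptions; the text above states only that 44+ keys are derived across RPC, DMA, fault, and workload channels, not how the device derives them.

```python
import hashlib
import hmac

# Channel classes named in the text; the label format is hypothetical.
CHANNEL_CLASSES = ("rpc", "dma", "fault", "workload")


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): condense input keying material into a PRK."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-Expand (RFC 5869): stretch the PRK into `length` bytes bound to `info`."""
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_channel_keys(master_secret: bytes, channels_per_class: int = 11) -> dict[str, bytes]:
    """Derive one 256-bit key per channel from a single negotiated master secret.

    4 classes x 11 channels = 44 keys; the 11-per-class split is an assumption
    chosen only to match the "44+" figure quoted in the text.
    """
    prk = hkdf_extract(salt=b"\x00" * 32, ikm=master_secret)
    keys: dict[str, bytes] = {}
    for cls in CHANNEL_CLASSES:
        for i in range(channels_per_class):
            label = f"{cls}/{i}"
            keys[label] = hkdf_expand(prk, info=label.encode())
    return keys


if __name__ == "__main__":
    demo_keys = derive_channel_keys(b"\x01" * 48)  # placeholder secret, not real SPDM output
    print(len(demo_keys), "keys; first label:", next(iter(demo_keys)))
```

The design point this illustrates is key separation: because each key is bound to a distinct label, recovering one channel's key does not yield another's, and a fresh session secret yields a fresh set.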
| Dimension | Hopper-class GPU TEE | Blackwell-class GPU TEE | Consequence of the choice |
|---|---|---|---|
| CPU anchor | SEV-SNP or TDX confidential VM | SEV-SNP or TDX confidential VM | Set by host silicon; AMD EPYC -> SNP, Intel Xeon -> TDX |
| GPU scope | Single-GPU only (one confidential GPU per VM) | Multi-GPU via TEE-I/O across NVLink / NVSwitch | Hopper cannot build a confidential NVL domain; Blackwell can |
| Memory protection | CPR encrypts ~90% of HBM; BAR0 decoupler hides ~99.78% of registers | Same CPR/decoupler model, hardware-accelerated encrypted HBM | Operator cannot read model/data from HBM in either case |
| Data-path tax | Heavy on small / PCIe-bound transfers (bounce buffers dominate) | Near line-rate; encrypted NVLink avoids the PCIe chokepoint | Hopper CC penalizes chatty inference; Blackwell largely erases it |
| Headline performance with CC on | Single-GPU; overhead is workload-dependent and can reach 2-3x in bad cases | HGX B200 retains ~2x training / ~2.5x inference throughput vs H200 with CC on | Multi-node confidential training only becomes practical on Blackwell |
| Attestation verifier | NRAS + RIM golden measurements; 5-cert identity chain | Same NRAS/RIM model, extended to the TEE-I/O topology | You inherit a vendor verifier dependency at boot either way |
Read that table as a generational discontinuity, not a spec sheet. The single most consequential line is GPU scope. Hopper-class confidential computing is single-GPU: you can protect one accelerator inside one confidential VM, which is fine for an isolated inference replica but cannot wrap a tightly-coupled multi-GPU job. A frontier training run or a large MoE inference deployment spans a scale-up domain of dozens of GPUs talking over NVLink, and if that traffic must drop to plaintext at the NVLink boundary, the whole confidentiality guarantee evaporates at exactly the link that carries the most sensitive data. Blackwell's TEE-I/O closes this by extending the encrypted, attested boundary across NVLink and NVSwitch, so an entire NVL domain becomes one confidential compute fabric. That is why "confidential multi-node training" was a research curiosity on Hopper and is a product on Blackwell, and why the density ramp matters: as scale-up domains grow from 8 to 72 to 576 GPUs, the only confidential-computing path that scales with them is one where the intra-domain fabric is itself inside the TEE.
Attestation: the part everyone underestimates
Encryption without attestation buys nothing. A TEE that you cannot prove is genuine, running the firmware you expect, in the configuration you expect, is just a black box asserting it is trustworthy. Attestation is the cryptographic proof — and it is the operationally hardest part of the whole system, because it inserts a hard dependency into your boot path that did not exist before.
The flow, in NVIDIA's implementation, runs roughly: the GPU presents a device identity chain (a 5-certificate chain rooted in a vendor-provisioned device identity key) and a set of measurements — 64 structured measurement records capturing firmware versions, configuration, and security state. A verifier (NVIDIA's Remote Attestation Service, NRAS, or a self-hosted equivalent) checks the identity chain's signatures and compares each measurement against the Reference Integrity Manifest (RIM) — the vendor-published golden measurements for that exact firmware build. Only if every record matches does the verifier issue an attestation token, which the relying party (the key broker) accepts as the precondition for releasing the decryption keys that let the workload start. The CPU TEE goes through the analogous SEV-SNP or TDX attestation against its own verifier. Both must pass before plaintext exists.
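The measurement-comparison step of that flow can be sketched as follows. This is a minimal illustration, not NRAS or any vendor SDK: the function and type names are hypothetical, the certificate-chain signature check is deliberately omitted, and records are modeled as raw digests keyed by index. It shows only the all-or-nothing rule that every one of the 64 records must match its golden value before a token may be issued.

```python
import hmac
from dataclasses import dataclass

EXPECTED_RECORDS = 64  # record count quoted in the text


@dataclass(frozen=True)
class Verdict:
    ok: bool                      # True only if every record matched
    mismatched: tuple[int, ...]   # indices that were missing or differed


def verify_measurements(evidence: dict[int, bytes], rim: dict[int, bytes]) -> Verdict:
    """Compare device-reported measurements against RIM golden values.

    A record fails if it is absent from the evidence, absent from the RIM
    (for example a stale manifest), or differs from the golden digest.
    """
    bad: list[int] = []
    for idx in range(EXPECTED_RECORDS):
        got = evidence.get(idx)
        want = rim.get(idx)
        # compare_digest avoids leaking the mismatch position through timing.
        if got is None or want is None or not hmac.compare_digest(got, want):
            bad.append(idx)
    return Verdict(ok=not bad, mismatched=tuple(bad))


if __name__ == "__main__":
    golden = {i: bytes([i]) * 48 for i in range(EXPECTED_RECORDS)}  # placeholder digests
    reported = dict(golden)
    reported[7] = b"\xff" * 48  # simulate one unexpected firmware measurement
    print(verify_measurements(reported, golden))
```

A verifier built this way would issue an attestation token only when `Verdict.ok` is true; the `mismatched` indices are useful for diagnosing whether the cause is a firmware update that outran the RIM or genuine tampering.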
Multi-GPU confidential computing: TEE-I/O and the scale-up boundary
The reason single-GPU TEEs were a dead end for serious AI is the same reason the rest of this guide spends so long on the scale-up fabric: the unit of work is not a GPU, it is a scale-up domain. A confidential boundary that stops at one GPU forces every NVLink hop to plaintext, which is both a confidentiality hole and a performance disaster — because the alternative, routing inter-GPU traffic back through encrypted PCIe bounce buffers, is exactly the chatty small-transfer pattern that the Hopper CC tax punishes hardest.
TEE-I/O resolves both at once. By bringing NVLink and NVSwitch inside the trusted boundary with hardware-accelerated link encryption, it lets a whole NVL domain operate as a single confidential context: inter-GPU collectives stay encrypted but never leave the TEE, never hit the PCIe chokepoint, and run at near line-rate. This is what collapses the confidential-computing performance tax from the 2-3x range seen on adversarial Hopper cases to under ~3% on Blackwell's large-tensor workloads, and it is the precondition for confidential multi-node training to exist at all. The cross-reference here is structural: the same NVLink scale-up domain that Chapter 11.6 analyzes as an isolation boundary between tenants is what TEE-I/O turns into a confidentiality boundary against the operator. The two chapters describe the same fabric under two different threat models.
Deep dive: why the Hopper-to-Blackwell performance gap is a PCIe-physics story, not a marketing one
The instinct on seeing "near-zero overhead" claims is to discount them as vendor optimism. The underlying physics, though, explains both the Hopper penalty and the Blackwell recovery, and it is worth understanding because it tells you which of your workloads will suffer if you are stuck on a single-GPU TEE.
Confidential computing's cost is almost entirely a function of how much data must cross the untrusted PCIe boundary through encrypted bounce buffers, and how small those transfers are. Encryption throughput on bulk transfers is cheap: modern AES-GCM engines run at many GB/s and the per-byte cost is negligible against HBM bandwidth. The killer is per-transfer overhead: the staging copy into the bounce buffer, the AES-GCM setup, and the round-trip latency. A workload dominated by large, sequential weight loads and big matmuls barely notices. A workload dominated by many small host-device transfers (frequent kernel launches with small arguments, chatty PCIe-bound inference, anything that ping-pongs little messages across the bus) pays the setup cost over and over. Independent Hopper CC benchmarks (arXiv 2409.03992) found exactly this signature: large-matrix and high-arithmetic-intensity workloads ran near baseline, while small or transfer-bound workloads degraded substantially, sometimes by multiples.
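A toy cost model makes the per-transfer argument concrete. Every constant below (link bandwidth, base latency, added confidential-compute setup cost) is an invented placeholder, not a measurement from any benchmark, and the per-byte encryption cost is assumed to be zero to isolate the fixed-cost effect. The function names are hypothetical.

```python
def transfer_time_s(total_bytes: float, n_transfers: int,
                    bandwidth_Bps: float, per_transfer_s: float) -> float:
    """Time to move `total_bytes` split into `n_transfers` host-device copies."""
    return n_transfers * per_transfer_s + total_bytes / bandwidth_Bps


def cc_slowdown(total_bytes: float, n_transfers: int, *,
                bandwidth_Bps: float = 50e9,    # assumed effective PCIe bandwidth
                base_latency_s: float = 5e-6,   # assumed fixed cost per plain transfer
                cc_setup_s: float = 8e-6) -> float:  # assumed extra staging + AES-GCM setup
    """Ratio of confidential to plain transfer time under the toy model."""
    plain = transfer_time_s(total_bytes, n_transfers, bandwidth_Bps, base_latency_s)
    cc = transfer_time_s(total_bytes, n_transfers, bandwidth_Bps,
                         base_latency_s + cc_setup_s)
    return cc / plain


if __name__ == "__main__":
    one_gib = 2 ** 30
    # Same payload, two shapes: a few large copies vs. many 4 KiB copies.
    print(f"bulk   (16 transfers):     {cc_slowdown(one_gib, 16):.3f}x")
    print(f"chatty (4 KiB transfers):  {cc_slowdown(one_gib, one_gib // 4096):.3f}x")
```

With these placeholder constants the bulk case stays within a few percent of baseline while the chatty case lands at a multiple of it. The absolute numbers mean nothing; the shape is the point, since the same bytes cost very different amounts depending on how many times the fixed setup cost is paid.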
Blackwell attacks the root cause two ways. First, hardware-accelerated encrypted HBM removes the in-memory encryption cost from the critical path. Second, and decisively, TEE-I/O over NVLink means the high-volume inter-GPU traffic never traverses the PCIe boundary at all — it stays on the fast, now-encrypted scale-up fabric. The practical decision: if your confidential workload is large-batch training or throughput inference, even Hopper CC is tolerable; if it is latency-sensitive, small-transfer, or multi-GPU, you wait for Blackwell-class TEE-I/O or you accept a tax that can dominate your goodput. → fabric context in Chapter 11.6.
Confidential containers and key brokers
Hardware TEEs are necessary but not sufficient — they protect memory, not the software lifecycle around it. Two pieces of plumbing turn a confidential VM into a usable confidential workload, and each carries its own decision.
Confidential containers (the Kata/CoCo lineage and its cloud equivalents) run each pod or container inside its own confidential VM, so the orchestrator — Kubernetes, the node agent, the cloud control plane — is moved outside the trust boundary. This matters because in a normal Kubernetes node, the kubelet and container runtime can read any container's memory; confidential containers break that. The cost is a fatter per-pod footprint (a VM per workload, not a namespace) and an attestation step injected into pod startup. The decision is granularity: confidential VM per node is cheaper but coarser; confidential container per workload is the strong isolation posture but heavier.
The key broker is the linchpin and the most common place to get the architecture subtly wrong. The pattern is attestation-gated key release: encrypted weights and data sit at rest; the workload boots into its TEE; it presents its attestation token to a key broker service; the broker validates the token (the right firmware, the right measurements, the right identity) and only then releases the decryption keys, into the attested enclave, where the operator cannot see them. The weights never exist in plaintext outside a proven-good TEE. Get this right and you have a genuinely strong story; get it wrong — release keys before validating attestation, or run the broker somewhere the operator can read its memory — and you have moved the crown jewels' keys into the very environment you were trying to defend against. This attestation-gated release is the canonical mechanism that Chapter 11.8 relies on for in-use weight protection, and the key-management discipline (HSM/KMS hierarchy, rotation, revocation) lives there.
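The validate-then-release ordering can be sketched in a few lines. This is an illustration under loud assumptions: the `KeyBroker` class, the claim names (`measurement`, `cpu_ok`, `gpu_ok`), and the token format are all hypothetical, and an HMAC tag stands in for the asymmetric signature a real verifier would put on its token. A production broker would also wrap the released key to a key bound to the attested environment rather than return it directly.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


class AttestationError(Exception):
    """Raised when a token fails validation; no key material is released."""


def sign_token(claims: dict, verifier_key: bytes) -> tuple[bytes, bytes]:
    """Stand-in for the verifier: serialize claims and authenticate them."""
    payload = json.dumps(claims, sort_keys=True).encode()
    return payload, hmac.new(verifier_key, payload, hashlib.sha256).digest()


@dataclass
class KeyBroker:
    verifier_key: bytes          # trust anchor for tokens (HMAC key in this sketch)
    expected_measurement: str    # golden launch measurement the policy requires
    keys: dict[str, bytes]       # weight-decryption keys held by the broker

    def release(self, payload: bytes, tag: bytes, key_id: str) -> bytes:
        # 1. Authenticate the token before trusting any claim inside it.
        expected = hmac.new(self.verifier_key, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise AttestationError("token signature invalid")
        claims = json.loads(payload)
        # 2. Both the CPU and the GPU attestation must have passed.
        if not (claims.get("cpu_ok") is True and claims.get("gpu_ok") is True):
            raise AttestationError("CPU or GPU attestation missing")
        # 3. The workload must be the one the policy expects.
        if claims.get("measurement") != self.expected_measurement:
            raise AttestationError("unexpected launch measurement")
        # 4. Only now is key material touched.
        return self.keys[key_id]


if __name__ == "__main__":
    vk = b"demo-verifier-key"
    broker = KeyBroker(vk, "abc123", {"weights-v1": b"K" * 32})
    token = sign_token({"measurement": "abc123", "cpu_ok": True, "gpu_ok": True}, vk)
    print(len(broker.release(*token, "weights-v1")), "byte key released")
```

The structural point is that the key lookup is the last statement in `release`, reachable only after every check has passed; any code path that reads `self.keys` earlier reproduces the "release before validate" failure described above.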
Residual attack surface: what confidential computing does NOT protect
Confidential computing is a strong primitive with a precisely-bounded threat model, and the failure mode in practice is almost never the cryptography. It is a defender who assumed the boundary covered more than it does. The literature on where confidential VMs fall short (arXiv 2503.08256, a systematization of CVM trust relationships) is required reading before you stake a sovereignty claim on a TEE.
- Metadata leakage. RPC payloads are encrypted, but queue headers, command-ring structures, and physical-address tables on the untrusted side remain plaintext. An operator who cannot read your data may still infer access patterns, memory layout, and operation sequencing from the metadata. For some workloads that is harmless; for others, access-pattern leakage is itself a meaningful side channel.
- Uncore and microarchitectural side-channels. The TEE protects the compute and memory of the protected context, but shared uncore blocks — media engines (NVENC/NVDEC/NVJPEG), DRAM frequency scaling, contention on shared caches and interconnects — can leak information across the boundary. These are the same class of channels that Chapter 11.6 documents bypassing MIG and MPS isolation; a TEE does not make them disappear.
- The trusted-but-unverified scheduler. The hypervisor is removed from the confidentiality boundary but not from the availability and scheduling path. It still decides when your VM runs, can starve it, can mount denial-of-service, and in some designs the firmware mediator (the TDX Module, the Secure Processor) is itself a large trusted-computing-base component you are attesting but not auditing.
- The firmware / RIM supply chain. The entire guarantee bottoms out in golden measurements published by the vendor and a firmware root of trust on the device. If the RIM is wrong, if the firmware signing is compromised, or if the silicon root of trust is subverted, attestation passes for a compromised platform. This is precisely why Chapter 11.4 (hardware root of trust, Caliptra/DICE, OCP S.A.F.E.) is the foundation underneath this chapter — the TEE is only as trustworthy as the firmware integrity story beneath it.
- Physical and supply-chain attacks below the TEE's model. Interposer attacks, advanced fault injection, and supply-chain tampering before provisioning are outside the standard CC threat model. A nation-state adversary with physical access is not stopped by a TEE designed to defend against a malicious-but-remote operator.
Deep dive: a worked confidential-inference boot sequence, end to end
To make the abstractions concrete, here is the full sequence a sovereign tenant's confidential-inference workload actually executes on a Blackwell-class platform, and where each thing can go wrong.
1. CPU TEE launch. The host launches a confidential VM under SEV-SNP or TDX. Guest memory is encrypted with a key in the memory controller; the platform produces a signed attestation report capturing the VM's launch measurement. Failure mode: a launch measurement that does not match the expected golden means the VM image was tampered with — attestation downstream will reject it.
2. GPU TEE establishment. The confidential VM brings up the GPU in confidential-compute mode. The CPR is enabled (~90% of HBM encrypted), the BAR0 decoupler hides the register surface (~99.78%), and an SPDM session negotiates the master secret from which the 44+ per-channel keys are derived. TEE-I/O brings the NVLink domain inside the boundary. Failure mode: if any peer GPU in the domain fails its own attestation, the confidential fabric will not form.
3. Dual attestation. The CPU report (SEV-SNP/TDX verifier) and the GPU evidence (the 5-cert identity chain plus 64 measurement records) go to their respective verifiers. The GPU evidence is checked by the verifier (NRAS or a self-hosted equivalent) against the RIM goldens for the exact firmware build. Failure mode: a stale or missing RIM, or an unreachable verifier, blocks the whole boot; this is the hard boot-path dependency that attestation introduces.
4. Attestation-gated key release. The combined attestation token is presented to the key broker. Only on full validation does the broker release the weight-decryption keys into the attested enclave. The encrypted weights are decrypted inside the TEE; they never exist in plaintext anywhere the operator can read. Failure mode: a broker that releases keys before validating attestation, or one whose own memory the operator can read, defeats the entire scheme.
5. Steady-state serving. Prompts arrive encrypted, are decrypted inside the TEE, run through inference, and the results are re-encrypted before leaving. Inter-GPU collectives stay encrypted on NVLink via TEE-I/O. The operator sees ciphertext in HBM, a near-empty register surface, and encrypted fabric traffic, but can still observe queue metadata and is still trusted for scheduling and availability. → the weight-side key hierarchy is detailed in Chapter 11.8.
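The ordering constraint across the five steps above can be sketched as a fail-closed pipeline. The step names and the `run_boot` function are hypothetical labels for this illustration; each check is a caller-supplied callable standing in for the real platform operation. The sketch shows only the control flow: a failed or unreachable step stops the sequence, so key release can never run unless both TEE launches and dual attestation have already succeeded.

```python
from typing import Callable

# Hypothetical labels for the five steps described in the sequence above.
STEPS = (
    "cpu_tee_launch",
    "gpu_tee_establish",
    "dual_attestation",
    "key_release",
    "serve",
)


def run_boot(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run the boot steps strictly in order and stop at the first failure.

    Returns (success, completed_steps). A missing check, a False result, or an
    exception (for example an unreachable verifier) all count as failure, so
    the pipeline fails closed rather than skipping ahead.
    """
    completed: list[str] = []
    for step in STEPS:
        check = checks.get(step)
        try:
            passed = check is not None and check() is True
        except Exception:
            passed = False
        if not passed:
            return False, completed
        completed.append(step)
    return True, completed


if __name__ == "__main__":
    def verifier_unreachable() -> bool:
        raise ConnectionError("attestation verifier unreachable")

    checks = {step: (lambda: True) for step in STEPS}
    checks["dual_attestation"] = verifier_unreachable
    print(run_boot(checks))  # stops before key_release
```

Treating an unreachable verifier the same as a failed attestation is what makes the verifier an availability dependency: the alternative, proceeding without a verdict, would trade the confidentiality guarantee for uptime.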
The decision, restated
Confidential computing is the right tool when — and the discipline is to insist on the "when" — you have a named, unavoidable, untrusted party in your data path and a workload whose plaintext exposure you cannot otherwise prevent. For those cases it is the only thing that works, and on Blackwell-class hardware with TEE-I/O the historical objection (the performance tax) has largely collapsed: under ~3% on large workloads, multi-GPU scope, near-line-rate encrypted NVLink. The cost has migrated from cycles to operations: you now run an attestation verifier and RIM pipeline as Tier-0 infrastructure, you architect a key broker that must never leak, and you carry a precisely-bounded residual surface — metadata, side-channels, a trusted scheduler, and a firmware supply chain — that you must defend with the rest of this Part rather than assume the TEE covered.