Chapter 11.7

Network Segmentation, Microsegmentation & Zero Trust

An AI cluster is one enormous flat east-west fabric built for collective bandwidth, not for containment — so the segmentation decision is really a decision about where the blast radius of a single compromised node ends, and the wrong answer is a perimeter firewall guarding a network that has no interior walls at all.

What you'll decide here

Where the trust boundary actually lives: a hardened north-south perimeter around a flat interior (cheap, fast, and indefensible once breached) versus DPU-enforced microsegmentation that puts a policy-enforcement point in front of every host (defensible, but a new control plane to operate at fleet scale).
Whether to segment the high-performance back-end fabric at all — and accept the goodput and latency cost of inline enforcement on collectives — or treat the GPU fabric as a single trust domain and contain the blast radius elsewhere.
How egress is controlled, because it is the anti-exfiltration linchpin: a default-deny egress posture with an allow-listed proxy is the single highest-leverage control for weight theft, and it is the one most often left as default-allow.
How the management plane (BMC/IPMI, out-of-band, orchestration, telemetry) is isolated from the data plane — the path that turns a single foothold into fleet-wide firmware compromise if it is reachable from tenant workloads.
Which segmentation you can retrofit cheaply (overlay policy, identity) versus which is baked into the physical fabric and VLAN/PKey design at build time (expensive to re-cut once racks are energized and cabled).

Every preceding chapter in Part 11 hardened a component: the supply chain (11.3), the root of trust and firmware (11.4), the GPU TEE (11.5), the tenant boundary (11.6). This chapter hardens the connective tissue — the network that lets thousands of those components behave as one machine. And here the AI data center inverts the assumption every enterprise network security product was built on. The classic model is a fortified perimeter (north-south) protecting a soft interior; the interior is assumed friendly because crossing the perimeter was supposed to be hard. An AI cluster is almost entirely interior. More than three-quarters of data-center traffic is now east-west — workload-to-workload, GPU-to-GPU, host-to-storage, service-to-service — and on a training fabric that share approaches unity (Akamai / Gigamon, 2024-2026). The perimeter you spent your budget on guards a building whose interior is a single open floor.

So the segmentation decision is not "do we have a firewall." It is: when an attacker gets one foothold — a compromised inference container, a poisoned dependency, a credential lifted from a CI runner — how far can they go before they hit a wall? On a flat fabric, the answer is "everywhere the routing table reaches," and they get there fast: CrowdStrike measured average eCrime breakout time — initial access to first lateral movement — at 29 minutes in 2025, down from 48 the year before, with the fastest observed at 27 seconds and exfiltration beginning within four minutes in one case (CrowdStrike 2026 Global Threat Report). The discipline that answers the question is segmentation: macro- (coarse zones), micro- (per-workload policy), and the zero-trust posture that refuses to treat network location as a proxy for trust at all. This chapter walks those layers — where each wall costs goodput, where it costs nothing, and where leaving it out costs you the weights.

The three fabrics, three different segmentation problems

The single most common mistake is to reason about "the network" as one thing. An AI cluster has at least three physically and logically distinct fabrics, and each one wants a different segmentation strategy because each carries different traffic with different performance constraints. Conflate them and you either strangle the workload (inline enforcement on the wrong fabric) or leave the crown jewels exposed (no enforcement on the fabric that needs it).

The back-end / scale-out fabric (InfiniBand or RoCE/Spectrum-X Ethernet, → 8.4, 8.5) carries the collectives — all-reduce, all-gather, all-to-all — that dominate every training step. It is engineered for sub-2-microsecond latency and 1:1 non-blocking bisection bandwidth. Putting a stateful L7 firewall inline here is a category error: you would add latency to the one path whose latency directly sets goodput, and a single straggler link drags the whole synchronous job. The front-end / north-south fabric carries API traffic, ingress, internet egress, and tenant access — this is where the classic perimeter and inference-endpoint controls live, and where inline inspection is affordable because the traffic is comparatively sparse and latency-tolerant. The management / out-of-band (OOB) fabric (→ 8.7) reaches BMCs, PDUs, CDU controllers, switches, and the orchestration plane. It carries the least traffic and the most privilege — and it is the fabric whose compromise is catastrophic, because it is the path to firmware (11.4) and to physical-systems control (11.10).

The three cluster fabrics → segmentation strategy

Fabric	Dominant traffic	Performance budget	Right enforcement point	Worst-case if flat
Back-end / scale-out (GPU)	Collectives (all-reduce/all-gather/all-to-all)	Sub-2 µs latency, 1:1 non-blocking	PKey/VRF partitioning + DPU at host edge; NOT inline L7	Lateral movement across the entire job pool; rollout-data theft
Front-end / north-south	API, ingress, internet egress, tenant access	Latency-tolerant, sparse	Perimeter + WAF/API gateway + default-deny egress proxy	Direct internet exfil path; endpoint abuse; ingress to interior
Management / out-of-band	BMC/IPMI, telemetry, orchestration, OT control	Tiny traffic, maximal privilege	Air-gap or strict L3 isolation; jump host + PAM only	Firmware implant fleet-wide; cyber-physical control (11.10)

Each fabric has a different traffic profile, performance budget, and threat exposure, so a single segmentation policy across all three is always wrong on at least one of them. Latency figures are taken from the Chapter 8 key numbers.

The master fork: perimeter-and-flat vs microsegmented-interior

This is the decision that sets everything downstream. Perimeter-and-flat hardens north-south, then treats the entire interior as one trust domain: cheap, simple, zero goodput cost, and the default almost every cluster ships with. Its failure mode is total — one interior foothold owns the fabric, and breakout takes minutes. Microsegmented-interior places a policy-enforcement point in front of every host (in practice, a DPU/SmartNIC, → 8.3) so that east-west flows are authorized by identity and policy, not by routability. It contains the blast radius to a single workload, but it is a new distributed control plane you must author policy for, operate, and keep from becoming the thing that breaks goodput. The fork is not binary in practice — most mature operators microsegment the management and front-end fabrics hard and apply coarse, offload-based segmentation to the back-end — but you must decide consciously where on the spectrum each fabric sits, before the racks are cabled. The interior walls are far cheaper to design in than to retrofit.

Macrosegmentation: the coarse zones that cost nothing to get right

Before microsegmentation there is macrosegmentation — the coarse partitioning of the cluster into a handful of trust zones with controlled choke points between them. This is the cheap, high-leverage layer, and it is mostly a build-time design decision: VRFs/VLANs on the front-end, PKeys (InfiniBand partition keys) or VXLAN/EVPN segments on the back-end, and a hard separation of the management network. The zones that matter for an AI cluster are recognizable: ingress/DMZ, inference-serving, training/compute, storage, management/OOB, and — where weights live — a high-value-asset enclave with the tightest egress of all.

The consequence of skipping macrosegmentation is the one that recurs across every breach postmortem: an attacker who compromises a low-value, internet-adjacent service (an inference container, a monitoring agent, a CI runner) finds nothing between them and the training fabric where the weights are loaded. The single most important coarse wall is between the workload that talks to the internet and the workload that holds the model. On the back-end fabric, InfiniBand partition keys are the native tool — but treat them as a segmentation control, not a strong security boundary: PKeys are enforced by the subnet manager and adapters, and like VLANs they are a configuration boundary an attacker who reaches the subnet manager or a misconfigured adapter can cross. That caveat — config boundaries are not cryptographic boundaries — is the bridge to why the industry moved toward DPU-enforced, identity-based policy.
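The coarse zone model can be pictured as a default-deny matrix of zone-to-zone flows. The sketch below is illustrative only: the zone names follow the list above, but the allow-list entries and the function name `zone_flow_allowed` are assumptions made for the example, not a recommended policy.

```python
from typing import FrozenSet, Tuple

# Zone names follow the zones named in the text; the short labels are hypothetical.
ZONES = {"dmz", "inference", "training", "storage", "mgmt", "weights"}

# Hypothetical allow-list of (source zone, destination zone) pairs.
# Any pair not listed is denied, including inference -> training.
ALLOWED_ZONE_FLOWS: FrozenSet[Tuple[str, str]] = frozenset({
    ("dmz", "inference"),
    ("inference", "storage"),
    ("training", "storage"),
    ("training", "weights"),
})


def zone_flow_allowed(src_zone: str, dst_zone: str) -> bool:
    """Default-deny check for traffic that crosses a zone boundary."""
    if src_zone not in ZONES or dst_zone not in ZONES:
        raise ValueError(f"unknown zone: {src_zone!r} -> {dst_zone!r}")
    if src_zone == dst_zone:
        # The macro layer does not decide intra-zone traffic;
        # that is left to per-workload policy.
        return True
    return (src_zone, dst_zone) in ALLOWED_ZONE_FLOWS
```

With this matrix, an internet-facing serving workload has no permitted path to the training zone or the weights enclave unless someone adds one deliberately.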

Microsegmentation and the DPU as the enforcement point

Microsegmentation means the unit of policy is the individual workload, not the subnet — a flow from container A to container B is permitted only if an explicit identity-based rule says so, regardless of whether A and B share a VLAN. In a traditional enterprise this is done with host agents or hypervisor firewalls; both consume the very CPU cycles an AI host wants for data movement, and host agents live in the same trust domain as the workload they are supposed to police. The architectural answer that defines the 2026 AI data center is to move enforcement off the host and onto the DPU / SmartNIC at the network edge of every server (→ 8.3).
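A minimal sketch of what "identity, not subnet" means for a single flow decision. The `Flow` type, the SPIFFE-style identity strings, and the rule set are hypothetical; the sketch assumes the identities were already proven by some mechanism such as mTLS, which is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src_identity: str  # assumed to be already verified (for example via mTLS)
    dst_identity: str
    dst_port: int
    src_ip: str        # kept for logging only; never consulted for authorization
    dst_ip: str


# Hypothetical explicit rules: (source identity, destination identity, port).
RULES = {
    ("spiffe://cluster.example/serving/frontend",
     "spiffe://cluster.example/serving/model-a",
     8443),
}


def authorize(flow: Flow) -> bool:
    """Permit a flow only if an explicit identity rule matches (default deny)."""
    return (flow.src_identity, flow.dst_identity, flow.dst_port) in RULES
```

Two workloads on the same VLAN are denied unless a rule names them, and two workloads on different subnets are permitted if one does; the IP fields play no part in the decision.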

The DPU is the right enforcement point for three reasons. It sits on the wire between the host and the fabric, so every packet to and from the GPU server crosses it — there is no bypass. It runs its own isolated OS and Arm cores, so a fully compromised host (root on the x86 side, even compromised firmware) cannot disable or rewrite the policy executing on the DPU — the enforcement plane and the workload are in different trust domains. And it offloads the work to dedicated silicon, so stateful L4 filtering, microsegmentation, encryption, and flow telemetry run at line rate without stealing host cycles. NVIDIA's BlueField-4, launching with the Vera Rubin platform in 2026, lands this thesis at 800 Gb/s with 64 Arm cores and roughly 6x the compute of BlueField-3, and ships with an ecosystem (Check Point, Cisco, Palo Alto Networks, F5, Forescout, Armis, Trend Micro) building zero-trust, inline east-west enforcement on it (NVIDIA / HPCwire, Oct 2025). The trade here is sharp: choose host-agent microsegmentation and you pay in GPU-host CPU and accept a co-located enforcement plane; choose DPU enforcement and you pay in DPU cost and a new control plane to operate, but you get an enforcement point a compromised host cannot reach.

Enforcement-point comparison for east-west microsegmentation

Enforcement point	GPU-host CPU cost	Survives host compromise?	Line-rate at 400-800G?	Operational burden
Host agent / eBPF	High (steals data-movement cycles)	No — same trust domain as workload	No — software path bottlenecks	Low to deploy, high to trust
Hypervisor / vSwitch firewall	Medium	Partial — only if hypervisor intact	Marginal	Medium
Top-of-rack switch ACLs	None	Yes (separate plane)	Yes, but coarse (L3/L4, no per-workload identity)	Low; limited granularity
DPU / SmartNIC offload	None — runs on DPU cores	Yes — isolated OS, separate trust domain	Yes — purpose-built silicon	Higher — new distributed control plane

The choice is where the policy executes relative to the workload's trust domain. DPU offload is the 2026 default for AI hosts precisely because it removes both the CPU tax and the co-located-trust problem.

Do not put a stateful firewall inline on the back-end collective fabric

The instinct to "inspect everything" meets physics on the GPU fabric. Collectives are latency-bound and run every training step; the job moves at the speed of its slowest link. Inserting a stateful L7 firewall inline on the back-end raises tail latency, perturbs congestion control, and converts directly into lost goodput — and goodput, not availability, is what training economics optimize (→ 8.6 for the congestion-control coupling). The correct pattern is to segment the back-end with offloaded, stateless or lightweight controls — DPU-enforced microsegmentation, PKey/VRF partitioning, encryption-in-transit handled by the NIC — and to put the heavyweight inspection where the traffic is sparse and latency-tolerant: the front-end and the egress path. Segment the back-end for blast-radius containment; do not try to deep-packet-inspect an all-reduce.
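A back-of-envelope model shows why small inline delays compound. It assumes a ring all-reduce, which takes 2(N-1) sequential steps across N ranks, and it assumes every step pays the added inline delay with no overlap with compute. The function names and every number used with them are hypothetical, chosen only to illustrate the arithmetic.

```python
def added_allreduce_latency_s(ranks: int, inline_delay_s: float) -> float:
    """Extra time per ring all-reduce when each step pays inline_delay_s.

    A ring all-reduce has (N-1) reduce-scatter steps plus (N-1) all-gather steps.
    """
    if ranks < 2:
        raise ValueError("a collective needs at least two ranks")
    return 2 * (ranks - 1) * inline_delay_s


def throughput_retained(step_time_s: float, allreduces_per_step: int,
                        ranks: int, inline_delay_s: float) -> float:
    """Fraction of baseline step rate kept, assuming no compute/communication overlap."""
    added = allreduces_per_step * added_allreduce_latency_s(ranks, inline_delay_s)
    return step_time_s / (step_time_s + added)


# Hypothetical inputs: 512 ranks, 20 microseconds of added inline delay.
extra = added_allreduce_latency_s(512, 20e-6)   # 2 * 511 * 20 us = 20.44 ms
kept = throughput_retained(1.0, 10, 512, 20e-6)  # 1.0 / 1.2044
```

Under these assumed inputs, a delay that looks negligible per packet adds about 20 ms to each all-reduce. Real jobs overlap communication with compute and may use tree or hierarchical algorithms, so treat this as an upper-bound intuition rather than a prediction.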

Zero Trust: the posture, not a product

Zero Trust is the principle underneath all of the above, and it is widely mis-sold as a product you buy. The canonical definition is NIST SP 800-207 (Aug 2020): no implicit trust is granted to an asset based on its network location — "never trust, always verify." Every access request is authenticated, authorized, and (ideally) encrypted, on the working assumption that an attacker may already be inside the environment. For an AI cluster, that principle has three concrete consequences. First, workload identity replaces IP address as the basis of authorization: a flow is allowed because the calling workload proves who it is (mTLS, SPIFFE/SVID, signed tokens), not because it sits on the right subnet. Second, the management plane is never implicitly trusted from the data plane — reaching a BMC requires going through a policy-enforcement point and authenticating, even from inside the building. Third — and this is where zero trust and confidential computing converge — attestation becomes an access gate: a node is admitted to the trusted compute pool, and released a key, only after it proves its firmware and TEE state are measured and unmodified (the attestation flow is canonical in 11.5; key release in 11.8). Zero trust without attestation trusts a node's self-asserted identity; zero trust with attestation trusts a hardware-rooted proof of what the node actually is.
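The attestation gate can be sketched as a function that releases a key only when a node's reported measurement matches an approved value. This is a simplified illustration: `AttestationReport`, `APPROVED`, and `admit_and_release_key` are hypothetical names, and the `signature_valid` flag stands in for verifying a hardware-rooted signature chain, which a real verifier must do and which is not shown here.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


def measure(blob: bytes) -> str:
    """SHA-256 digest used here as a stand-in for a firmware measurement."""
    return hashlib.sha256(blob).hexdigest()


# Hypothetical golden measurements for approved firmware builds.
APPROVED = {measure(b"firmware-build-1.2.3")}


@dataclass(frozen=True)
class AttestationReport:
    node_id: str
    firmware_measurement: str
    tee_enabled: bool
    signature_valid: bool  # placeholder for real signature-chain verification


def admit_and_release_key(report: AttestationReport,
                          wrapped_key: bytes) -> Optional[bytes]:
    """Release the key only to a node whose measured state is approved."""
    if not (report.signature_valid and report.tee_enabled):
        return None
    approved = any(hmac.compare_digest(report.firmware_measurement, m)
                   for m in APPROVED)
    return wrapped_key if approved else None
```

The point of the sketch is the ordering: the measurement check comes before key release, so a node that merely asserts an identity gets nothing.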

Egress control: the anti-exfiltration linchpin

If you do one thing in this chapter, control egress. Every other control limits how far an attacker can move; egress control limits whether they can get the prize out. The crown jewel of a frontier cluster is the weights — hundreds of gigabytes to terabytes of model parameters whose theft is the entire point of most state-level intrusions (the asset framing is in 11.1; the protection regime in 11.8). The asymmetry of exfiltration is that the network was built for bandwidth: a cluster with terabit egress can ship a multi-hundred-GB weight set off-site in seconds if egress is default-allow. The single highest-leverage anti-exfiltration control is therefore a default-deny egress posture: the high-value enclave has no route to the internet except through an allow-listed, logged forward proxy, and large or anomalous outbound transfers from the weights enclave are blocked and alerted by policy rather than merely observed.
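The "in seconds" claim is simple arithmetic. The helper below is a sketch with hypothetical inputs (a 500 GB weight set, decimal gigabytes, and an idealized link with no protocol overhead unless `efficiency` is lowered).

```python
def transfer_time_s(size_gb: float, link_gbps: float,
                    efficiency: float = 1.0) -> float:
    """Seconds to move size_gb gigabytes over a link of link_gbps gigabits/s."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link rate must be positive and efficiency in (0, 1]")
    return (size_gb * 8) / (link_gbps * efficiency)


# Hypothetical 500 GB weight set:
at_terabit = transfer_time_s(500, 1000)  # 4000 Gb / 1000 Gb/s = 4.0 s
at_10g = transfer_time_s(500, 10)        # 4000 Gb / 10 Gb/s = 400.0 s
```

Even a rate cap two orders of magnitude below line rate leaves only minutes of detection time, which is why the control has to block by default rather than observe.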

The trade is uncomfortable because default-deny egress breaks things: package installs, model-hub pulls, telemetry to SaaS, license checks. The cost of doing it right is an allow-list and a proxy you must maintain; the cost of not doing it is that your most expensive asset has an unmonitored exit. The mature pattern segments egress by zone — permissive for build/dev, default-deny with allow-list for production serving, and effectively air-gapped (data-diode-like, human-in-the-loop for any egress) for the weights enclave — and treats the egress proxy logs as a primary detection surface, not an afterthought. Egress control is where network segmentation, weight protection (11.8), and insider-threat defense (11.9) all meet at the same choke point.
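The zone-by-zone egress posture described above might be expressed as follows. The zone labels, allow-listed hostnames, and the 50 MiB threshold are all assumptions for illustration, and `decide` is a hypothetical function, not the interface of any particular proxy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    DENY_AND_ALERT = "deny_and_alert"


@dataclass(frozen=True)
class EgressPolicy:
    default_allow: bool
    allowed_hosts: frozenset
    max_bytes: Optional[int]  # None means no size limit


# Hypothetical policies: permissive dev, allow-listed serving, sealed weights enclave.
POLICIES = {
    "dev": EgressPolicy(True, frozenset(), None),
    "serving": EgressPolicy(False,
                            frozenset({"pypi.org", "telemetry.example.com"}),
                            50 * 1024 ** 2),
    "weights": EgressPolicy(False, frozenset(), 0),
}


def decide(zone: str, dest_host: str, nbytes: int) -> Verdict:
    """Return the proxy verdict for one outbound transfer."""
    policy = POLICIES[zone]
    if not policy.default_allow and dest_host not in policy.allowed_hosts:
        # Any automated egress attempt from the weights enclave is an alert.
        return Verdict.DENY_AND_ALERT if zone == "weights" else Verdict.DENY
    if policy.max_bytes is not None and nbytes > policy.max_bytes:
        return Verdict.DENY_AND_ALERT
    return Verdict.ALLOW
```

Every verdict, including the allows, would be logged in practice, since the proxy log is the detection surface the text describes.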

Key numbers

Figure	What it measures	Period	Source
76-80%	Share of data-center traffic that is east-west (interior); approaches 100% on a training back-end fabric	2024-2026	Akamai / Gigamon
29 min	Average eCrime breakout time (initial access to first lateral movement) in 2025, down from 48 min in 2024; fastest 27 s	2025	CrowdStrike 2026 Global Threat Report
800 Gb/s	BlueField-4 DPU throughput; 64 Arm cores, ~6x BlueField-3 compute; zero-trust east-west enforcement at line rate	2026 (Vera Rubin platform)	NVIDIA / HPCwire
SP 800-207	NIST Zero Trust Architecture: 'never trust, always verify'; no trust from network location	Aug 2020 (current)	NIST
1:1 non-blocking	Training back-end fabric design; sub-2 µs latency, which is why inline L7 inspection is a goodput tax there	2025	SemiAnalysis / NVIDIA
~90% / ~96%	Industry-average vs best-in-class goodput; inline enforcement on collectives erodes exactly this metric	2025	SemiAnalysis ClusterMAX / CoreWeave
VLAN/PKey	Configuration boundaries (subnet-manager / adapter enforced): segmentation, not cryptographic isolation	2025	NVIDIA InfiniBand / SemiAnalysis ClusterMAX
default-deny	Egress posture for the weights enclave: allow-listed proxy plus blocked/alerted bulk transfers; the anti-exfiltration linchpin	2025	RAND RRA2849-1 (weight-security egress controls)

Management-plane isolation: the privileged path attackers actually want

The management and out-of-band fabric carries the least traffic and the most consequence. It reaches every BMC/IPMI controller, every switch management port, the orchestration plane (Slurm/Kubernetes control nodes), the telemetry pipeline (10.6), and — in a converged facility — the OT controllers for cooling and power (11.10). A foothold here is more than lateral movement to another workload; it is a path to firmware implantation (rewrite a BMC, persist below the OS, → 11.4) and, at the extreme, to physical-systems sabotage. The control is conceptually simple and operationally demanding: the management network is physically or strictly logically separated from every data-plane fabric, reachable only through hardened jump hosts under privileged-access management (PAM, → 11.9), with no path from a tenant workload to a BMC. The recurring failure is the convenience shortcut — a management VLAN that is routable from the production network "so the automation can reach the BMCs" — which collapses the most important air gap in the building for an operational nicety. The control-plane secrets that ride this fabric (BMC credentials, IPMI keys, orchestration tokens) are themselves a segmentation problem: they belong in an HSM/KMS-backed store (→ 11.8) reachable only from the management enclave, never embedded in images or reachable from the data plane.
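One way to catch the "routable management VLAN" shortcut is to audit the production routing table for any prefix that overlaps the management address space. The sketch below uses Python's standard `ipaddress` module; the management CIDR and the route list are hypothetical, and a real audit would also need to account for ACLs and VRF leaking, which this does not model.

```python
import ipaddress
from typing import Iterable, List

# Hypothetical management/OOB address space.
MGMT_NETS = [ipaddress.ip_network("10.250.0.0/16")]


def routes_reaching_mgmt(data_plane_routes: Iterable[str]) -> List[str]:
    """Return data-plane route prefixes that overlap any management network."""
    hits = []
    for prefix in data_plane_routes:
        net = ipaddress.ip_network(prefix)
        if any(net.overlaps(mgmt) for mgmt in MGMT_NETS):
            hits.append(prefix)
    return hits


# A default route is flagged too: unless filtered elsewhere, it covers the BMCs.
findings = routes_reaching_mgmt(["10.10.0.0/16", "10.250.4.0/24", "0.0.0.0/0"])
```

An empty result is the target state for every data-plane VRF; any hit is a path from a tenant workload toward a BMC that someone must justify or remove.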

API & inference-endpoint security: the front door that earns the revenue

The inference endpoint is the one surface deliberately exposed to the world, and it is where segmentation meets application security. The front-end fabric terminates ingress, so this is where the perimeter controls live: API gateway, authentication and rate-limiting per tenant, a WAF, and — increasingly — model-aware guardrails for prompt-injection and jailbreak attempts that the network layer cannot see. The segmentation discipline is to treat the serving tier as a semi-trusted DMZ: it talks to the internet, so it must be assumed reachable by attackers, and it must therefore be the most tightly segmented zone away from the weights. A compromised inference container should be able to load the model it serves and nothing else — no route to the training fabric, no route to other tenants' KV-caches, and default-deny egress so that a hijacked endpoint cannot become an exfiltration relay. This is the same blast-radius logic as the rest of the chapter, applied at the one place the blast is most likely to start.
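Per-tenant rate limiting at the gateway is commonly built on a token bucket. The sketch below is a generic illustration, not the mechanism of any specific API gateway; the class name, the rates, and the choice to pass the clock in explicitly (so behavior is deterministic) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBucket:
    rate_per_s: float   # tokens refilled per second
    burst: float        # bucket capacity
    tokens: float = field(init=False, default=0.0)
    last_ts: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start full

    def allow(self, now_s: float, cost: float = 1.0) -> bool:
        """Refill for elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now_s - self.last_ts)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.last_ts = now_s
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant, so a noisy or hostile tenant only exhausts its own budget.
buckets: Dict[str, TokenBucket] = {}


def allow_request(tenant_id: str, now_s: float) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate_per_s=1.0, burst=2.0))
    return bucket.allow(now_s)
```

Keying the bucket on the authenticated tenant identity rather than the source IP keeps this control consistent with the identity-based approach used elsewhere in the chapter.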

Deep dive: monitoring east-west — you cannot contain what you cannot see

Segmentation policy is only as good as the visibility behind it, and east-west visibility is the historically neglected half of network monitoring precisely because the traffic never crossed a perimeter tap. On an AI cluster the problem is acute: the back-end fabric is RDMA, which bypasses the host kernel entirely (GPUDirect, → 8.4), so traditional host-based flow logging never sees the GPU-to-GPU traffic at all. The volume is also extreme — terabits of collectives — so full packet capture is infeasible and pointless. The 2026 answer pushes telemetry to the same place enforcement went: the DPU and the switch ASIC generate flow records, anomaly signals, and policy-violation events at line rate, exporting metadata (who talked to whom, how much, against which rule) rather than payload. That metadata stream is the detection surface for lateral movement and for the anomalous bulk transfer that signals exfiltration.

The decision here is what to instrument and where. Instrument the egress proxy and the front-end exhaustively — that is where exfil and abuse show. Instrument the management plane exhaustively — that is where privilege escalation shows. On the back-end, instrument for aggregate anomaly (a node suddenly talking to hosts outside its job's allocation, a flow pattern that does not match the collective topology) rather than per-packet inspection, because the cost of the latter is goodput and the value is near zero against an attacker who looks like an authorized GPU. The east-west monitoring pipeline feeds the SOC and the IR playbooks in 11.12; the goodput-vs-visibility tradeoff it embodies is the network analogue of the isolation-vs-utilization economics in 11.6.
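The "node talking to hosts outside its job's allocation" signal can be computed from flow metadata alone. This sketch assumes flow records of the form (source host, destination host, bytes) and a scheduler-supplied host-to-job map; both shapes, and the function name, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

FlowRecord = Tuple[str, str, int]  # (src_host, dst_host, bytes)


def out_of_allocation_bytes(flows: Iterable[FlowRecord],
                            job_of_host: Dict[str, str]) -> Dict[str, int]:
    """Sum bytes per source host sent to peers outside that host's job."""
    suspicious: Dict[str, int] = defaultdict(int)
    for src, dst, nbytes in flows:
        src_job = job_of_host.get(src)
        # Unknown sources and cross-job destinations both count as anomalies.
        if src_job is None or job_of_host.get(dst) != src_job:
            suspicious[src] += nbytes
    return dict(suspicious)
```

Because it works on metadata exported by the DPU or switch, a check like this costs nothing on the collective path; thresholds and alert routing are left out of the sketch.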

Deep dive: what segmentation you can retrofit, and what is poured in concrete

Like the rest of the facility, segmentation decisions sort by the cost of changing your mind — and the sort is not intuitive, because some of the most important walls are the cheapest to add late while others are baked into the cabling. Cheap to retrofit (overlay / identity layer): microsegmentation policy on DPUs that are already deployed, workload-identity (mTLS/SPIFFE) on the front-end, egress allow-lists and the proxy in front of them, and east-west flow telemetry. These live in software and policy; you can tighten them on a running cluster, and the right move is to ship permissive and ratchet toward default-deny as you learn the legitimate flows.

Expensive or impossible to retrofit (physical / fabric layer): the physical separation of the management network (if you built it routable from production, un-building that touches every rack); the presence of a DPU at every host at all (no DPU, no enforcement point — this is a procurement decision made at server-spec time, not a config you toggle later); the PKey/VRF segment design on the back-end once the subnet manager and cabling are committed; and the high-value-asset enclave's physical and power boundary if weights need a hardware-isolated zone. The planning consequence mirrors the density-ramp logic of 1.1: spend your build-time budget on the substrate you cannot change — DPUs at every host, a genuinely separate management fabric, an enclave you can lock down — and keep the policy layer soft and tightenable. A cluster shipped without DPUs is not one that microsegments later; it is one that bolts on host agents and pays the CPU and trust tax forever.

Anti-patterns

The same mis-segmentations recur, each from treating the AI cluster like an enterprise LAN or from optimizing the wrong fabric:

Perimeter firewall, flat interior. Spending the security budget on a hardened north-south edge while the more than three-quarters of traffic that is east-west crosses no wall at all. One interior foothold owns the fabric in minutes. The fix is interior segmentation, not a bigger perimeter.
Inline L7 inspection on the collective fabric. Deep-packet-inspecting all-reduce traffic to "secure east-west" — and paying for it in tail latency and lost goodput on the one path that sets training economics. Segment the back-end with offloaded, lightweight controls; inspect at the egress and front-end.
Default-allow egress from the weights enclave. Leaving the most valuable asset in the building with a terabit, unmonitored path to the internet. Default-deny with an allow-listed proxy is the single highest-leverage control and the one most often skipped because it is inconvenient.
Routable management plane. Letting the BMC/OOB network be reachable from production "for the automation," collapsing the air gap that stands between a foothold and fleet-wide firmware compromise. The management plane is the attacker's actual objective; isolate it like it.
Treating VLAN/PKey as a security boundary. Relying on configuration boundaries as if they were cryptographic isolation, when an attacker who reaches the subnet manager or a misconfigured adapter crosses them. Use them for segmentation; layer identity-based, DPU-enforced policy for actual containment.

Segmentation is the connective layer across Part 11: the assets it protects and the threat model are in Chapter 11.1; the firmware/BMC compromise that a routable management plane enables is in Chapter 11.4; attestation as the zero-trust admission gate is canonical in Chapter 11.5; the tenant-isolation boundary this chapter assumes is in Chapter 11.6; egress as the anti-exfiltration choke point is deepened in Chapter 11.8 (with the key hierarchy and control-plane secrets); the insider who abuses an over-trusting interior is in Chapter 11.9; OT/management-plane sabotage in Chapter 11.10; and detection/IR on the east-west telemetry in Chapter 11.12. The fabrics themselves — scale-out transport and topology, congestion control, and the management/OOB network — are engineered in Chapter 8.4, Chapter 8.5, Chapter 8.6, and Chapter 8.7; the DPU enforcement point in Chapter 8.3; observability telemetry in Chapter 10.6; and the build-time-vs-retrofit logic that governs which walls you must pour now in Chapter 1.1.