Chapter 7.9
Software Ecosystems & Lock-In
The accelerator you buy is also a software contract: CUDA, ROCm, XLA, and Neuron are not interchangeable runtimes but distinct lock-in regimes, and the price of switching is paid not in the datasheet FLOPS you compared but in the realized-MFU gap you discover after the cluster is live.
What you'll decide here
- Which software regime you are committing to — CUDA, ROCm, XLA/JAX, or Neuron — and therefore which workloads you can run on day one versus which require a porting project before they earn revenue.
- Whether you buy at paper FLOPS or at realized MFU/goodput — because the gap between them is the real, workload-specific delivered performance, and it can erase a 15–30% hardware-price advantage or, on the right workload, amplify it.
- How much of your stack you write against a portability layer (Triton, MLIR, torch.compile, OpenAI-compatible serving APIs) versus a vendor-native one (CUDA C++, cuDNN, TensorRT-LLM) — the choice that sets your switching cost years before you exercise it.
- Whether the lock-in that binds you is the kernel language, the collective-communication library, the serving framework, or the scale-up fabric — each has a different exit cost and a different rate of erosion.
- Whether to run a single-vendor fleet for operational simplicity or a heterogeneous fleet for supply leverage and price discipline — and who on your team owns the second toolchain if you do.
Every accelerator decision in Chapter 7.1 through Chapter 7.8 was framed as a hardware comparison: HBM capacity, FP4 PFLOPS, NVLink bandwidth, dollars per GPU. That framing is incomplete in a way that costs real money. When you buy an accelerator you are also signing a software contract — a runtime, a kernel language, a collective library, a compiler, a serving stack, and a fabric — and that contract is what actually determines how much of the silicon's peak you will ever see, how fast a new model lands on the box, and how much it costs to leave. The four regimes that matter in 2026 are CUDA (NVIDIA), ROCm (AMD), XLA (Google TPU, via JAX/TensorFlow), and Neuron (AWS Trainium/Inferentia). They are not interchangeable. Choosing one is choosing a lock-in regime.
This chapter works through three forks — which regime, which portability posture, which fleet composition — and their downstream costs: the porting project that delays revenue, the realized-MFU gap that erases a price advantage, the kernel you have to write yourself because no library shipped it, the model that lands on NVIDIA on day zero and on everyone else on day forty. The hardware buyer who ignores the software contract does not save money. They defer the bill to the day the cluster is live and the MFU is half of what the datasheet promised.
The four regimes, and what each one actually locks
It is tempting to model the four ecosystems as "CUDA, plus three things trying to be CUDA." That is wrong in a way that hides the decision. Each regime locks a different layer of the stack, erodes at a different rate, and exacts a different exit cost. Sorting them by what they lock is the first analytical move.
CUDA is the deepest and broadest moat, and it is mostly a library and tooling moat, not merely a language one. The language (CUDA C++) is replaceable; what is hard to replace is two decades of accreted libraries — cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT-LLM — plus the profilers, debuggers, and the simple fact that every new model, every new attention variant, every new quantization scheme ships and is tuned on NVIDIA first. The lock-in is not that you cannot leave; it is that on NVIDIA the kernel you need already exists and is tuned, and everywhere else you may have to write or wait for it.
ROCm is the open challenger, and as of 2026 it has closed the gap from "unusable for production" to "feature-matched on the workloads AMD chooses to fund." ROCm mirrors CUDA layer-for-layer — HIP for the language, MIOpen/rocBLAS for the math, RCCL for collectives — and the porting tax is genuinely low for code written against PyTorch rather than raw CUDA. The residual lock-in is coverage: the long tail of custom kernels and bleeding-edge model architectures that are tuned on CUDA months before ROCm catches up. ROCm's exit cost is asymmetric — coming from CUDA-portable PyTorch is cheap; coming from hand-written CUDA C++ is not.
XLA (the compiler behind JAX and TensorFlow on TPU) inverts the model entirely. There is no kernel language to lock; XLA is a graph compiler that takes a whole-program traced computation and emits TPU machine code, and the lock-in is that your code is written in JAX against XLA's compilation model, not in a portable imperative GPU style. The upside is that XLA is genuinely multi-target — it compiles to GPU too — so the regime locks the programming model more than the silicon. The downside: the ecosystem of pretrained checkpoints, third-party kernels, and community tooling around JAX/TPU is narrower than CUDA's, and TPUs are rentable only from Google.
Neuron (AWS's SDK for Trainium and Inferentia) is the most vertically integrated and the most single-tenant. The Neuron compiler ingests PyTorch/JAX and targets the Trainium ISA; the lock-in is the whole stack plus the fact that Trainium exists only inside AWS. The trade is explicit: external adopters cite roughly 50% cost savings versus comparable NVIDIA capacity, paid for with a Neuron porting project and the loss of any exit to another cloud. → Chapter 7.4.
| Regime | Vendor / silicon | Primary lock-in layer | Day-0 model availability | Exit cost (dominant term) | Erosion vector |
|---|---|---|---|---|---|
| CUDA | NVIDIA GPUs | Libraries + tooling (cuDNN, NCCL, TensorRT-LLM, CUTLASS) | Day 0 — models land and are tuned here first | Rewriting hand-tuned CUDA C++ kernels with no portable equivalent | Triton/MLIR/torch.compile abstract the kernel layer away |
| ROCm | AMD Instinct GPUs | Kernel coverage (the custom-kernel long tail) | Days-to-weeks on funded workloads; longer on the tail | Re-tuning + closing the realized-MFU gap on your specific model | HIP source-compat + PyTorch upstreaming + InferenceMAX-style benchmarking |
| XLA / JAX | Google TPU (also compiles to GPU) | Programming model (whole-program graph compilation) | Strong for Google/JAX-native models; thinner third-party tail | Rewriting imperative PyTorch into JAX/XLA tracing semantics | JAX-on-GPU + PyTorch/XLA bridge widen the target set |
| Neuron | AWS Trainium / Inferentia | Full vertical stack + single-cloud availability | Anchor-tenant models day 0; general tail lags | Neuron port + total loss of cross-cloud portability | PyTorch/JAX front-ends + managed model hubs reduce port friction |
The realized-MFU gap: where paper FLOPS go to die
A costly misconception in accelerator procurement is that two chips with similar peak FLOPS deliver similar work. They do not, and the gap is a software phenomenon. Model-FLOPS-utilization is the ratio of the useful floating-point work your model actually performs to the chip's theoretical peak over the same wall-clock. On a mature stack running a well-tuned dense transformer, training MFU lands in the 35–55% range; on an immature stack, or an exotic architecture, or a workload nobody has optimized yet, it can be half that — and the silicon is identical. The difference is entirely the kernels, the fused attention path, the collective-communication overlap, the compiler's ability to keep the tensor cores fed.
This is where ROCm's story turned in 2026, and it is worth stating precisely because the direction of travel matters more than any single snapshot. The historical knock on AMD was a severe realized gap: the MI300X carried roughly 1.5x the paper FLOPS of an H100 but, on real inference, delivered something like 37–66% of H100/H200 throughput because the kernels and serving stack were immature (SemiAnalysis AMD-vs-NVIDIA benchmarking, 2025). By mid-2026 that gap had narrowed dramatically on the workloads AMD and the community chose to fund. On specific recent models, the MI355X running ROCm on SGLang reached feature parity with B200 running CUDA — same FP8 KV-cache path, same MLA kernels — and undercut the B200 on cost per million tokens across much of the single-node Pareto frontier (SemiAnalysis InferenceX, 2026). The lesson is not "AMD won" or "NVIDIA won." The lesson is that the realized-MFU gap is a moving target set by software investment, workload by workload, and a procurement decision made on last quarter's gap may be wrong this quarter.
The consequence for the buyer is structural. A 15–30% hardware-price advantage (AMD's typical discount to NVIDIA) is real money only if the realized-MFU gap on your workload is smaller than that discount. If the chip is 20% cheaper but delivers 40% less goodput on your model, you have bought a more expensive cluster measured the only way that matters — cost per token or cost per useful FLOP. Flip it around and the same logic favors the challenger: on a workload where ROCm has reached parity and the chip is cheaper, the cheaper silicon now wins on cost-per-token. The number you must produce before signing is not on any datasheet; it is the realized cost-per-token of your workload on the software you will run.
Portability layers: paying the option premium up front
If lock-in is a contract, portability layers are the clause that lets you exit — but you pay for the clause whether or not you exercise it. The strategic decision is how much of your stack to write against a portable abstraction versus a vendor-native one, and it is made years before you would ever switch. Three layers matter in 2026, at three different altitudes.
The framework layer (PyTorch / JAX). The single most important portability decision most teams make is simply to write models in PyTorch (or JAX) rather than in raw CUDA. PyTorch is the great equalizer: a model written in idiomatic PyTorch runs on CUDA, ROCm, and — via the PyTorch/XLA bridge — on TPU, with the framework absorbing most of the vendor difference. This is why ROCm's porting tax is low for PyTorch code and high for hand-written CUDA: the abstraction already exists, and AMD invests heavily in keeping the AMD backend upstreamed. The lock-in you create by dropping into vendor-native kernels for the last 10% of performance is the lock-in you will pay to undo.
The kernel layer (Triton / MLIR). Below the framework sits the kernel, and this is where CUDA's deepest moat is being abstracted away. Triton — the kernel language built on MLIR — lets you write a fused attention or quantization kernel once and compile it to NVIDIA PTX, AMD AMDGCN, or Intel GPU code, and since PyTorch 2.0 it has been the default code generator inside torch.compile's TorchInductor. Its governance moved from OpenAI to a community triton-lang project with contributions from NVIDIA, AMD, Intel, Meta, IBM and Red Hat — a deliberately multi-vendor structure. A custom kernel written in Triton is portable; the same kernel written in CUDA C++ with wgmma and tensor-memory-accelerator intrinsics is not. This is the layer where the CUDA moat is most actively eroding, and where your kernel-authoring policy directly sets your future switching cost.
The serving layer (OpenAI-compatible APIs). For inference, the cheapest portability you can buy is an OpenAI-compatible HTTP surface in front of whatever engine you run. vLLM, SGLang, TensorRT-LLM, and their ROCm equivalents all expose a near-identical request API, so the application above never learns which silicon it is talking to. This makes the inference fleet the easiest place to be heterogeneous: route traffic to whichever accelerator delivers the best cost-per-token this quarter, behind a stable API. The lock-in at this layer is shallow by design.
| Layer | Portable choice | Vendor-native choice | Option premium of portable | When native is right |
|---|---|---|---|---|
| Framework | PyTorch / JAX (multi-backend) | CUDA C++ application code | Low — near-zero for most models | Almost never; reserve for a profiled hot path |
| Kernel | Triton / MLIR (PTX + AMDGCN + Intel) | CUDA C++ + cuDNN / CUTLASS | Often single-digit % on the hottest kernels | Frontier kernels where the last 10% pays for itself at scale |
| Collectives | NCCL-API-compatible (RCCL, oneCCL) | NCCL tuned to one fabric | Low at the API; tuning differs per fabric | When fabric-specific tuning is the bottleneck |
| Serving | OpenAI-compatible API over vLLM/SGLang | Vendor-locked inference microservice | Negligible | Rarely — the abstraction is essentially free |
| Scale-up fabric | UALink / Ethernet-based open fabric | NVLink (proprietary, NVIDIA-only) | Bandwidth/latency gap vs the leader, today | When you want the densest, fastest scale-up domain now |
Quantifying the switching cost
"Lock-in" is a slogan until you put a number on it, and the number is the total cost of switching regimes — which is not one cost but four, and they are paid at different times.
The porting cost is engineering time to make your stack run at all on the new regime. For PyTorch-native training and inference it is genuinely modest — days to weeks, mostly dependency and container work. For a codebase laced with hand-written CUDA kernels, custom NCCL tuning, or TensorRT-LLM-specific serving paths, it is a quarters-long project staffed by scarce specialists. This cost scales with how much vendor-native code you wrote in the first place, which is why the kernel-authoring policy of two years ago is the switching cost of today.
The realized-MFU recovery cost is the larger and more often ignored term. Getting your model to run on the new regime is not getting it to run well. The gap between first-light and tuned-throughput is the period where you are paying for silicon at a fraction of its peak while engineers chase the missing 20–40% of MFU — re-tuning kernels, fixing collective overlap, matching the serving framework's batching to the new memory system. On an immature stack this recovery can take longer than the port itself, and during it your cost-per-token is underwater.
The day-0 coverage cost is the opportunity cost of not being on the regime where new models land first. New architectures, attention variants, and quantization schemes ship and are tuned on CUDA first; on other regimes they arrive days to weeks later, after community or vendor porting. For a frontier lab that must serve the newest model the hour it drops, that lag is lost revenue and a competitive gap — and it is a recurring tax, not a one-time switching cost.
The operational-complexity cost is the standing overhead of running a heterogeneous fleet: two toolchains, two driver/firmware matrices, two sets of failure modes, two on-call playbooks. It is paid every day, not once, and it is the reason many operators stay single-vendor despite a price advantage elsewhere — the second toolchain needs an owner, and that owner is a headcount.
Deep dive: why the CUDA moat is libraries, not the language — and what that means for erosion
The popular framing — "CUDA is a programming language and that is the moat" — gets the mechanism wrong and therefore mis-predicts the erosion. The CUDA language (a C++ dialect with a launch syntax) is the least defensible part of the stack; HIP source-translates it almost mechanically, and most production code never touches it directly because it lives one or two layers up in PyTorch. The actual moat is the library and tooling estate: cuDNN's hand-tuned convolution and attention kernels, cuBLAS and CUTLASS for the matrix math, NCCL for collectives that overlap with compute on NVLink, TensorRT-LLM for serving, plus Nsight profilers and a debugger ecosystem — and, above all, the network effect that every researcher prototypes on NVIDIA, so every new technique is born CUDA-tuned.
This diagnosis predicts where the moat erodes. It does not erode by someone cloning the CUDA language; it erodes by abstractions that make the library estate irrelevant. Two are doing exactly that. PyTorch hoists the application above the library layer, so a model is portable even when the kernels beneath it are not — and AMD/Intel invest to keep their backends upstreamed. Triton/MLIR attacks the kernel layer itself: a fused kernel written once compiles to PTX, AMDGCN, and Intel GPU code, and because Triton is the default inside torch.compile, ordinary PyTorch users generate portable kernels without knowing it. The moat does not fall; it gets routed around, one layer at a time, on the workloads the ecosystem chooses to fund. The practical reading for a buyer: the moat is strongest exactly where you are still writing vendor-native kernels, and weakest where you have already moved up to PyTorch + Triton. Your switching cost is, to first order, a measure of how much vendor-native code you let accumulate. → precision and quantization kernels in Chapter 7.10.
Single-vendor vs heterogeneous fleets
The final fork is fleet composition, a strategy decision that arrives as a procurement one. A single-vendor fleet buys operational simplicity: one toolchain, one driver matrix, one set of failure modes, one on-call playbook, and the deepest day-0 model coverage if that vendor is NVIDIA. It pays for that simplicity with price exposure — you are a captive buyer with no credible alternative to walk to — and with supply exposure, because a single allocation queue gates your entire build (→ Chapter 7.6 on the HBM/CoWoS upstream gate).
A heterogeneous fleet inverts both terms. Running NVIDIA for the bleeding edge and AMD, TPU, or Trainium for the steady-state workloads where they have reached parity buys price discipline — a real outside option in every negotiation — and supply diversity across multiple allocation queues. It pays for that leverage with the operational-complexity cost above: two toolchains that each need an owner, and a routing layer (best built at the serving API) that sends each workload to the silicon with the best current cost-per-token. The heterogeneous strategy is most defensible exactly at the inference layer, where the OpenAI-compatible API makes the silicon swap nearly free, and least defensible at the frontier-training layer, where day-0 coverage and the deepest scale-up domain still favor a single vendor.
The 2026 reality that forces this question onto every roadmap is the rise of custom ASICs to roughly a quarter of AI-server shipments. Every Maia, MTIA, TPU, and Trainium is, by construction, a non-CUDA software regime — and the hyperscalers building them have already paid the heterogeneity tax internally because the cost-per-token and supply-control upside justified it at their scale. The strategic question for everyone smaller is whether their workload mix and engineering bench can capture the same upside, or whether the operational-complexity cost eats it. → the merchant-silicon disruption in Chapter 7.5; TCO that scores all of this in Chapter 7.11 and Chapter 1.8.
Deep dive: the collective-communication library as a hidden lock-in (NCCL vs RCCL)
Buyers obsess over the kernel and the framework and forget the layer that most directly gates multi-node training MFU: the collective-communication library. NCCL (NVIDIA) implements all-reduce, all-gather, and reduce-scatter tuned to NVLink and the InfiniBand/Spectrum-X fabric, overlapping communication with compute so the tensor cores never stall waiting on a gradient sync. On a large synchronous training run, a poorly-overlapped collective is the difference between 50% MFU and 30% — and it is invisible on the datasheet.
The portability story here is real but incomplete. RCCL (AMD) is API-compatible with NCCL, so PyTorch's distributed layer calls it transparently — the application does not change. But tuning is fabric-specific: the topology-aware algorithms, the buffer sizes, the ring-vs-tree selection that NCCL has accreted over years on NVLink are not automatically optimal on AMD's Infinity Fabric or a UALink topology. So the API ports for free and the realized collective bandwidth does not, which folds straight back into the realized-MFU recovery cost. This is why a multi-node training benchmark — not a single-GPU one — is the only honest way to compare regimes for training: the lock-in lives in the spaces between the GPUs as much as on them. → fabric topology and collectives in Chapter 8.2.