Guide › Software, Orchestration & Service Delivery › 10.8

Chapter 10.8

MLOps & Training Frameworks

The training framework and the control plane around it convert raw FLOPS into trained weights — pick the parallelism strategy and the orchestrator wrong and you do not get a slower run, you get a fleet that bills full-price for half its arithmetic.

GOODPUTPOWER-BOUNDDENSITY-RAMP

What you'll decide here

Which parallelism framework — FSDP2, DeepSpeed/ZeRO, or Megatron-Core — your stack standardizes on, and therefore which 3D/4D/5D parallelism layout you can express without rewriting the model.
How you split a model across the four-to-five parallelism dimensions (data, tensor, pipeline, expert, context) given your scale-up domain size and back-end fabric — the layout that maximizes MFU is hardware-specific, not a default.
Which orchestration layer — Kubernetes-native (Kubeflow/Volcano/Kueue), Ray, or SkyPilot over Slurm/cloud — owns gang scheduling, topology awareness, and elastic recovery, and where the training control plane (experiment tracking, lineage, sweeps) lives relative to it.
Your checkpoint cadence and tier (the goodput governor): the Young/Daly-optimal interval against your measured failure rate, async multi-tier drain, and whether a node failure restarts thousands of GPUs or a fault-domain of them.
What 'reproducible' means for your runs in practice — pinned framework/CUDA/NCCL versions, deterministic kernels, seed and data-order capture — and the cost in MFU you are willing to pay for it.

Everything else in Part 10 gets electrons to the GPU, keeps the node healthy, and recovers it when it dies. This chapter is about what runs on that fleet to turn its arithmetic into a trained model — and it is where a power-bound, depreciation-clocked data center either earns its megawatts or wastes them. A cluster can be perfectly provisioned (Chapter 10.5), perfectly scheduled (Chapter 10.1), and perfectly observed (Chapter 10.6), and still run at 30% model-FLOPS-utilization because someone chose the wrong parallelism layout or let checkpoint overhead eat a fifth of every run. The framework and the control plane are not glue — they are the conversion efficiency between the capex you bought and the weights you ship.

The forks are three. Which training framework you standardize on (FSDP2 vs DeepSpeed/ZeRO vs Megatron-Core) determines which parallelism strategies you can express cheaply and how much of your team's time goes to plumbing instead of research. Which orchestration plane owns the job lifecycle (Kubernetes-native vs Ray vs SkyPilot-over-Slurm) determines how gang scheduling, topology placement, and elastic recovery behave when — not if — a node fails. And how you handle reproducibility and checkpointing determines whether a run is a scientific artifact you can re-derive or a one-off you can never explain. Inference serving — vLLM, TensorRT-LLM, SGLang, disaggregation, goodput-optimal scheduling — is a different discipline with different SLOs and is owned end-to-end in Chapter 10.11; this chapter stops at the trained checkpoint.

The framework fork: FSDP2 vs DeepSpeed/ZeRO vs Megatron-Core

A single accelerator cannot hold a frontier model. A dense 70B model in BF16 needs ~140 GB just for weights, and training state — optimizer moments, gradients, master weights — pushes the real footprint to roughly 16–18 bytes per parameter with Adam (4 bytes weight + 4 bytes gradient + 8–12 bytes optimizer state). Even an 80 GB or 192 GB GPU cannot hold a 70B+ training job, let alone a 400B+ one. The framework's first job is to shard that state across the fleet; its second is to do so while keeping the GPUs busy. Three families dominate production in 2026, and they sit on a spectrum from 'least code, least control' to 'most code, most control.'

PyTorch FSDP2 (Fully Sharded Data Parallel, the rewritten v2 that replaced the FlatParameter model with per-parameter DTensor sharding) is the path of least friction for PyTorch-native shops. It shards parameters, gradients, and optimizer state across the data-parallel group and reconstitutes each layer's full weights on demand via an all-gather that prefetches the next shard while computing on the current one — the same idea as DeepSpeed ZeRO-3, expressed natively in the framework. FSDP2 is the right default for mid-scale training (single-digit-thousands of GPUs) and for teams that want minimal abstraction over their model code. Its weakness is that pure parameter-sharding is a data-parallel technique: once you need tensor or pipeline parallelism to break a single layer across GPUs, FSDP2 alone is not enough, and you compose it with PyTorch's device-mesh primitives or reach for Megatron.

DeepSpeed/ZeRO (Microsoft) pioneered the partition-the-redundancy idea: ZeRO-1 shards optimizer state, ZeRO-2 adds gradient sharding, ZeRO-3 adds parameter sharding (the FSDP-equivalent). Its durable niche in 2026 is the offload-heavy lane — ZeRO-Infinity streams parameters and optimizer state to CPU DRAM and NVMe, letting a memory-starved cluster train a model that would not otherwise fit. That is a goodput-for-fit trade: you accept lower MFU (the offload traffic competes with compute) in exchange for running at all on the hardware you have. DeepSpeed also retains strong MoE support and a large installed base, so 'we already have DeepSpeed expertise' is itself a legitimate reason to stay.

Megatron-Core (NVIDIA) is the frontier-scale framework: it is where you go when every percent of MFU on a 16k–100k-GPU run is worth seven figures of electricity. Megatron implements true model parallelism — tensor parallelism (splitting individual matmuls across GPUs), pipeline parallelism (splitting layers into stages), and now expert and context parallelism — with hand-tuned communication-overlap kernels. It is the most code and the most operational burden, and it rewards that burden with the highest realized utilization at the largest scales. The 2026 reality is that these families increasingly compose rather than compete: Megatron-FSDP and TorchTitan let you layer FSDP-style sharding under Megatron's tensor/pipeline/expert dimensions, and most frontier labs run a hybrid rather than a single pure framework.

Training framework selection — the fork and its consequences

Framework	Sharding model	Parallelism it expresses natively	Scale sweet-spot	Pick it when	The cost you pay
PyTorch FSDP2	Per-parameter DTensor sharding (DP group)	Data + sharded data; TP/PP via device-mesh composition	Hundreds to low thousands of GPUs	PyTorch-native shop; mid-scale; minimal abstraction wanted	Not enough alone past where a layer must be split; you compose
DeepSpeed / ZeRO	ZeRO-1/2/3 staged sharding + CPU/NVMe offload	Data + sharded data; MoE; offload via ZeRO-Infinity	Hundreds to thousands; offload-bound jobs	Memory-starved hardware; MoE; existing DeepSpeed expertise	Offload traffic depresses MFU; less frontier-tuned than Megatron
Megatron-Core	Tensor + pipeline + expert + context, composable with FSDP	Full 3D/4D/5D parallelism with overlap-tuned kernels	Thousands to 100k+ GPUs (frontier)	Frontier scale where every % of MFU is seven figures of power	Highest engineering and ops burden; layout tuning is a project

Vendor-neutral synthesis of 2025–2026 practitioner guidance (PyTorch docs, NVIDIA Megatron-Core docs, DeepSpeed docs, independent framework surveys). 'Scale sweet-spot' is the GPU count where the framework's complexity is justified, not a hard limit.

3D parallelism — and why the layout is the real decision

Once you have a framework, the decisive choice is the parallelism layout: how you map the model and the batch onto the physical fleet across what used to be three dimensions and is now four or five. Each dimension trades communication for memory along a different axis, and the optimal mapping is a function of your scale-up domain size, your back-end fabric, and the model's shape. There is no portable default; a layout tuned for 8-GPU HGX nodes is wrong on a 72-GPU NVL72 domain.

Data parallelism (DP): replicate the model, split the batch, all-reduce gradients each step. Cheapest to reason about, but every replica needs the whole model — which is why it is paired with sharding (FSDP/ZeRO) to remove the redundancy.
Tensor parallelism (TP): split an individual matmul (and its layer) across GPUs. Communication is heavy and latency-sensitive (all-reduce inside every layer), so TP must stay inside the scale-up domain — NVLink, not the scale-out fabric. This is the dimension that most directly couples the layout to hardware: a 72-GPU NVLink domain lets you run wider TP than an 8-GPU node, which is why bigger scale-up domains matter (Chapter 8.5).
Pipeline parallelism (PP): split the layers into stages, each on a different GPU group, and stream micro-batches through. Cheap communication (only activations cross stage boundaries) but introduces the pipeline 'bubble' — idle time at fill and drain — which interleaved schedules (1F1B, zero-bubble variants) exist to minimize.
Expert parallelism (EP): for Mixture-of-Experts models, place different experts on different GPUs and route tokens to them via all-to-all. The all-to-all is the bottleneck, and a wide NVLink domain (EP32 on NVL72 vs EP8 on a node) is what makes wide expert parallelism throughput-positive rather than a network disaster (Chapter 8.5).
Context parallelism (CP): split along the sequence-length dimension so a single very-long-context example fits. The fifth dimension, and increasingly necessary as context windows grow.

The consequence of getting the layout wrong is not subtle. TP that spills out of the NVLink domain onto the scale-out fabric collapses MFU because the latency-sensitive in-layer all-reduce now traverses a slower hop. PP with too few micro-batches spends a large fraction of each step in the bubble. EP that is too wide for the fabric drowns in all-to-all. The layout is where the abstract 'parallelism strategy' meets the concrete topology of the building — and it is the single highest-leverage tuning decision a training team makes.

~30–50%

typical model-FLOPS-utilization (MFU) for large LLM training; best-in-class >50% on Hopper

2025SemiAnalysis; provenance.js (domain economics)

34% → 54%

BF16 MFU gain on GB200 NVL72 from software/kernel maturation over ~12 months (≈57% throughput from software alone)

2025SemiAnalysis (H100 vs GB200 NVL72 training benchmarks)

~41%

BF16 MFU achieved pre-training Llama 3 on 16k H100s (frontier-scale reference point)

2024Meta, The Llama 3 Herd of Models

~16–18 B/param

training-state footprint with Adam (4 weight + 4 grad + 8–12 optimizer); the number the framework must sleep across the fleet

2025Standard mixed-precision Adam accounting; DeepSpeed/ZeRO docs

~14 B/param

rule-of-thumb checkpoint size on disk (weights + optimizer state); sets async-drain bandwidth need

2025VAST Data (checkpoint bandwidth analysis)

~90% / ~96%

training goodput: industry average vs best-in-class effective-training-time fraction

2025SemiAnalysis ClusterMAX / CoreWeave; provenance.js

~43.4%

large-LLM-job failure rate in a production fleet (~37% hardware-attributed; ~73% restart-recoverable) — why elastic orchestration matters

2024Alibaba Unicron; provenance.js

~7 days

MTBF per 512 GPUs at a top-tier operator; one failure restarts a synchronous job from its last checkpoint

2025SemiAnalysis (100k H100 clusters); provenance.js

The orchestration fork: Kubernetes-native vs Ray vs SkyPilot-over-Slurm

A framework runs one job. The orchestration plane runs the fleet of jobs: it places them topology-aware, gang-schedules every rank so a job either gets all its GPUs or none (a half-placed synchronous job is a deadlock, not a slow start), preempts and re-queues, and re-forms the job when ranks die. This is the boundary between this chapter and the scheduling plane of Chapter 10.1 and Chapter 10.2: the scheduler decides where a job lands; the training-orchestration layer above it decides how the job's lifecycle is driven and how the pipeline of pre-processing, training, eval, and checkpointing is wired. Three patterns dominate.

Kubernetes-native (Kubeflow Trainer on top of a gang-capable scheduler — Volcano, Kueue, or NVIDIA's KAI — with topology-aware placement and increasingly Dynamic Resource Allocation for GPUs) is the path for operators who already run Kubernetes and want training to be one more workload on the same control plane. Kubeflow Trainer's 2026 line added distributed data cache and topology-aware scheduling via Kueue/Volcano. The strength is operational unification — one cluster, one RBAC model, one observability stack. The weakness is that vanilla Kubernetes was never built for gang-scheduled, topology-sensitive, all-or-nothing jobs, which is exactly why Volcano/Kueue/KAI exist as a layer to retrofit that behavior.

Ray is the application-framework answer: an actor/task runtime that expresses distributed training, RL rollouts, and data pipelines as Python and handles fine-grained orchestration within a job. The 2026 norm is hybrid — Kubernetes (or KubeRay) for coarse-grained resource management and Ray for the fine-grained application orchestration inside the allocation. Ray is the natural home for RL and post-training, where a run is a choreography of inference-heavy rollout generation feeding a smaller trainer (Chapter 1.4) rather than a single monolithic SGD loop.

SkyPilot over Slurm/cloud targets the operators whose problem is cost arbitrage and burst, not cluster ops: it abstracts a job over heterogeneous capacity (multiple clouds, multiple neoclouds, spot and on-demand) and chases the cheapest available GPUs, handling provisioning and recovery. For a tenant renting from neoclouds (Chapter 10.9) rather than running a Kubernetes estate, SkyPilot is often the right top layer. Slurm itself remains the HPC-heritage batch scheduler underneath many of these clusters and is entirely viable on its own for a homogeneous, single-tenant training estate.

Gang scheduling and topology awareness are the non-negotiables

Whatever you pick, two properties are load-bearing for synchronous training and absent from naive container schedulers. Gang (all-or-nothing) scheduling: a 1,024-GPU job must receive all 1,024 ranks atomically; placing 1,000 and waiting on 24 wastes 1,000 GPUs in a hung collective. Topology awareness: the ranks that talk most (the TP group) must land inside one NVLink domain, and the job as a whole should occupy contiguous racks/pods on the same fabric leaf to keep the all-reduce off oversubscribed up-links. An orchestrator that ignores topology can place a job that 'fits' but runs at half the MFU because its tensor-parallel traffic is crossing the spine. This is why Volcano/Kueue/KAI/Grove exist on top of Kubernetes — they add the gang and topology semantics the base layer lacks (Chapter 10.2).

The training control plane: experiment tracking, lineage, and sweeps

Above the orchestrator sits the layer that makes training reproducible: the training control plane. It is three capabilities. Experiment tracking (the open-source and commercial lineage of MLflow, Weights & Biases, and their kin) records every run's hyperparameters, metrics, hardware, code version, and artifacts so that 'which config produced this checkpoint?' is answerable months later. Lineage and artifact management connect a shipped model back to the exact dataset snapshot, tokenizer, and base checkpoint it descended from — the same provenance discipline that the data-governance regime of Chapter 10.10 requires for legal defensibility. Sweep orchestration drives hyperparameter search: grid, random, and Bayesian/bandit methods (Hyperband, ASHA, population-based training) that adaptively kill underperforming trials to spend the GPU budget where it earns the most signal.

The hard question is where the budget goes. At frontier scale, you cannot sweep the full hyperparameter space at the target model size — a single run is the budget. The discipline is scaling-law-guided search: sweep cheaply at small scale, fit the scaling laws (and the maximal-update-parameterization / muP transfer that lets a learning rate found at small width transfer to large), and extrapolate the few choices that matter to the full run. An operator that sweeps blindly at scale burns megawatts on trials a $50k small-scale sweep would have eliminated; an operator that skips small-scale search entirely gambles the full-run budget on untuned hyperparameters. The control plane is what makes the cheap-search-then-extrapolate workflow auditable.

Checkpointing: the goodput governor

Checkpointing is the single mechanism that most directly converts the fleet's raw availability into goodput — the fraction of GPU-time that advances the model rather than being lost to faults and recovery. The arithmetic is unforgiving because of synchronous coupling: at a top-tier operator, MTBF is roughly 7 days per 512 GPUs, so a 16,000-GPU run sees a failure roughly every few hours, and every failure restarts the synchronous job from its last checkpoint. Two costs bracket the cadence decision. Checkpoint too rarely and a failure throws away hours of progress (lost-work cost). Checkpoint too often and the write stalls (or the async drain bandwidth) eats into the step time (overhead cost). The optimum is the classic Young/Daly interval: checkpoint period ≈ √(2 · checkpoint-cost · MTBF), which falls as the fleet grows and MTBF shrinks.

The modern answer is asynchronous, multi-tier checkpointing, and it is the difference between a 90% and a 96% goodput cluster. Tier the writes: snapshot to local node memory/NVMe first (microseconds-to-seconds, survives a job restart), replicate peer-to-peer across the fabric (survives a node loss), and drain asynchronously to the parallel filesystem or object store in the background (survives a cluster loss) — overlapping the slow drain with continued computation so the GPUs never wait on storage. Google and AWS report multi-tier checkpointing cutting recovery time from 15–30 minutes to under 2 minutes and lifting goodput materially; the target is to keep checkpoint overhead under ~10% of step time with the drain fully overlapped. The full engineering of checkpoint storage, bandwidth sizing (the ~14 bytes/param rule), and elastic restart lives in Chapter 9.4; the KV-cache side of the memory hierarchy is Chapter 9.7. The point for this chapter: your framework and orchestrator must support async/distributed/elastic checkpointing, or the most expensive part of your fleet is idle during every recovery.

Synchronous coupling makes one bad node a fleet-wide tax

The reliability number that governs training economics is blast radius, not MTBF. A synchronous data-/tensor-/pipeline-parallel job moves at the speed of its slowest rank and dies with its first dead rank. A single failed GPU, a silent-data-corruption (SDC) fault that quietly poisons gradients, or one straggler node throttling on a hot cold-plate, and the whole job stalls or restarts. This is why the fleet-reliability and autonomous-recovery machinery of Chapter 10.7 — lemon-node detection, SDC scanners, hot-spare swap, elastic re-formation — is inseparable from the training framework. The framework defines the checkpoint and restart semantics; the fleet plane detects and evicts the bad node; the orchestrator re-forms the gang. Treat them as one system, because a failure in any one of them shows up as the same symptom: idle, billing megawatts.

Reproducibility: what it costs and what it buys

Reproducibility in large-scale training is a spectrum, not a binary, and where you sit on it is a deliberate cost decision. Bit-exact reproducibility — re-run the job and get byte-identical weights — requires pinning the entire stack (framework, CUDA, cuDNN, NCCL, driver), forcing deterministic kernels (which are slower; non-deterministic reductions are faster but non-reproducible), fixing every RNG seed, and capturing the exact data order. It costs MFU and it is rarely worth it at frontier scale. Statistical reproducibility — re-run and get a model within noise on every eval — is the pragmatic target: it tolerates non-deterministic kernels but still requires pinned versions, captured seeds and data order, and recorded hyperparameters.

The consequence of skimping is concrete and recurring. A run that drifts because someone bumped the NCCL version mid-sweep produces results you cannot compare; a model you cannot trace back to its dataset snapshot is a model you cannot defend in a copyright dispute (Chapter 10.10) or reproduce after the person who trained it leaves. The minimum reproducibility contract for a serious operator is: pinned and recorded software versions, captured seeds, captured data-loader order, recorded full config, and immutable artifact lineage from dataset → checkpoint → eval. That contract is cheap to maintain if the control plane enforces it from day one and ruinously expensive to reconstruct after the fact.

Deep dive: the parallelism-layout search as an MFU optimization

Finding the layout that maximizes MFU on a given fleet is a constrained search, and it is worth treating as one rather than copying a published config. The objective is to maximize useful FLOPS per GPU-second subject to two hard constraints — per-GPU memory must hold the layer's working set plus activations, and each parallelism dimension's communication must fit the bandwidth and latency of the link it traverses. The levers interact, so they cannot be tuned independently.

Tensor parallelism degree is bounded above by the scale-up domain: TP communication is a latency-sensitive in-layer all-reduce, so TP should never exceed the NVLink domain size (8 on HGX, 72 on NVL72). Push TP onto the scale-out fabric and MFU falls off a cliff. Pipeline parallelism degree is bounded by the bubble: PP=N introduces a fill/drain bubble of roughly (N−1)/(N−1+micro-batches), so you need enough micro-batches to amortize it — and interleaved/zero-bubble schedules to shrink it. Data-parallel degree absorbs whatever GPUs remain and is bounded by the all-reduce gradient cost and by diminishing returns on the global batch size (too large a batch hurts convergence). Expert parallelism for MoE is bounded by all-to-all capacity and, like TP, rewards a wide NVLink domain. The recurring 2026 pattern: TP inside the node/NVLink domain, PP across a few nodes, DP (sharded, FSDP/ZeRO-style) across everything else, with EP and CP layered in for MoE and long-context. Memory-consumption estimators and auto-tuners now exist to search this space, but the binding intuition is unchanged — keep heavy, latency-sensitive communication inside fast links and push cheap, bandwidth-tolerant communication onto the scale-out fabric. The fabric topology this layout assumes is engineered in Chapter 8.5.

Deep dive: why RL/post-training breaks the single-orchestrator assumption

Classic pre-training is one monolithic synchronous loop, and a Kubernetes-or-Slurm gang scheduler driving a Megatron job is a clean fit. Reinforcement learning for reasoning is not that shape, and treating it as if it were is a recurring mis-scope (Chapter 1.4). An RL run alternates a rollout phase — the current policy generates long trajectories (10K–100K+ tokens each), pure autoregressive inference, embarrassingly parallel, loosely coupled — with a policy-update phase: a comparatively small synchronous gradient step over the collected experience.

That structure wants a different control plane. The rollout fleet is best driven by an inference-serving stack (the engines of Chapter 10.11) and an actor runtime that can fan out thousands of generations; the trainer is a small Megatron/FSDP job; and the two are coupled asynchronously with bounded staleness so the rollout fleet never stalls waiting on the trainer. This is precisely the workload where Ray's actor model earns its place — it expresses the rollout-generate-then-update choreography natively, where a pure gang scheduler would force you to bolt the two phases together awkwardly. The infrastructure consequence: a post-training cluster's orchestration plane looks like a disaggregated inference fleet plus a trainer, not a single SGD job, and its framework choice (a serving engine for rollouts, a training framework for the update) is genuinely hybrid.

Putting it together: the training stack as a goodput chain

The five layers of this chapter form a chain, and the realized goodput of the fleet is the product of each link, not the strongest one. The framework sets the ceiling on MFU; the parallelism layout determines how close you get to that ceiling on your specific topology; the orchestrator determines how much of the wall-clock is spent training versus queued, mis-placed, or re-forming; checkpointing determines how much of a failure you lose; and reproducibility determines whether any of it is a repeatable result. A best-in-class framework at 54% MFU running on a topology-blind orchestrator that places tensor-parallel traffic across the spine, with synchronous checkpoints stalling every step, can deliver worse goodput than a simpler stack that gets every link merely competent. The operator's job is not to maximize any single link — it is to find the weakest link and fix it, because that is the one billing megawatts for nothing.

The economic tie-back is direct. Reliability and goodput overhead runs 6–21% of TCO, and the gap between 90% and 96% goodput on a gigawatt training estate is hundreds of millions of dollars of effective compute per year. That gap is won and lost in this chapter — in the parallelism layout, the gang scheduler's topology awareness, the checkpoint tier, and the discipline of pinned, traceable runs. The same goodput-vs-availability reframing that governs facility redundancy (Chapter 12.2) governs the software stack: spend on the things that protect useful training time, not on nines the synchronous workload does not value.

The orchestration plane this chapter sits on is built in Chapter 10.1 (scheduling architecture) and Chapter 10.2 (topology-aware, rack-scale scheduling); multi-tenancy and isolation in Chapter 10.3; the node software stack (drivers, CUDA/ROCm, NCCL, firmware) that the framework binds to in Chapter 10.4; provisioning and IaC in Chapter 10.5; observability in Chapter 10.6; and the fleet-reliability, fault-tolerance and autonomous-recovery machinery that makes synchronous training survivable in Chapter 10.7. Checkpoint storage and bandwidth sizing are engineered in Chapter 9.4; the scale-out fabric and oversubscription the parallelism layout assumes in Chapter 8.5; the RL/post-training workload shape in Chapter 1.4. Inference serving — the other half of the framework world — is owned in Chapter 10.11; data provenance and lineage as a compliance artifact in Chapter 10.10; and the goodput-vs-availability economics in Chapter 12.2 and the $/GPU-hr accounting in Chapter 1.8.