Chapter 9.2

Parallel & Distributed File Systems

The parallel file system is the throughput organ that keeps a $30-40k GPU saturated — and the decision that determines whether it succeeds is not peak sequential bandwidth, it is how the metadata plane survives billions of small files and how the namespace converges with object so the data path stops being a copy farm.

GOODPUTDENSITY-RAMP

What you'll decide here

Whether your primary store is a POSIX parallel file system (Lustre/EXAScaler, GPFS/Storage Scale, WekaFS, VAST) or an object-native namespace (Ceph, DAOS, JuiceFS) — and what that fork costs you in metadata performance, multiprotocol reach, and operational burden.
Centralized vs distributed metadata: whether your metadata plane is a single-server bottleneck (classic Lustre MDS) or sharded across the cluster (Weka, VAST, GPFS distributed-metadata) — because the LOSF (lots-of-small-files) problem is where most AI storage deployments actually fail, not on sequential throughput.
Erasure coding vs replication for the durable tier: the capacity-efficiency-vs-rebuild-cost-vs-CPU tradeoff, and how it interacts with all-flash QLC and the failure domain you can tolerate losing.
Build-your-own open-source (Lustre, Ceph, DAOS) vs a certified appliance (DDN, Weka, VAST, IBM) — and whether the certification against your GPU vendor's reference architecture is worth the price premium and the lock-in.
Multiprotocol single-namespace (file + object + block from one platform) vs a tiered file-over-object design — the decision that determines whether the data-prep, training, and inference paths share one namespace or pay a copy tax between three.

Chapter 9.1 established the economic argument: a parallel file system exists to keep accelerators saturated, and the moment storage cannot feed the GPUs, you are burning the most expensive depreciating asset in the building at idle. This chapter is about the engine that does the feeding — the parallel and distributed file system that fronts the NVMe flash, fans a single read out across hundreds of storage targets, and presents thousands of GPU clients with one coherent namespace. It is the layer where a wrong decision does not show up as a missing feature; it shows up as a fleet running at 50% MFU because the data loader is starving, or as a checkpoint write that stalls a 16k-GPU synchronous job because the write tier cannot absorb the incast.

The forks in this chapter are POSIX vs object-native, centralized vs distributed metadata, erasure coding vs replication, open-source build vs certified appliance, and multiprotocol convergence vs tiered file-over-object. The unifying insight, and the one most operators learn the hard way, is that AI storage is won or lost on the metadata plane, not on sequential bandwidth: every modern all-flash platform can saturate a few hundred GB/s per scalable unit. The binding constraint is the lots-of-small-files (LOSF) workload that dominates preprocessing, multimodal training, and dataset curation. The platform that strides through 5 billion 50 KB files at high IOPS is a different beast from the one that strides through a 10 GB sequential shard, and the same datasheet number describes both.

The master fork: POSIX parallel FS vs object-native primary store

Match each tier to its substrate before you buy hardware. A POSIX parallel file system (Lustre/EXAScaler, GPFS/Storage Scale, WekaFS, VAST) gives you a strict-consistency, byte-addressable namespace that every training framework, data loader, and checkpoint library expects out of the box — at the cost of running a metadata service that is the single hardest thing to scale in the system. An object-native store (Ceph RADOS, DAOS, or a file-over-object layer like JuiceFS) gives you near-infinite flat scale, cheap erasure-coded durability, and a natural fit for the cloud data lake — at the cost of weaker POSIX semantics, a metadata-engine bolt-on for the file view, and a small-file/metadata story that ranges from adequate to painful. The training tier almost always wants POSIX-parallel for performance and compatibility; the capacity and data-lake tiers almost always want object for economics and scale. The expensive mistake is forcing one workload onto the wrong substrate — running preprocessing's metadata storm against an object store, or trying to host a petabyte cold tier on an all-flash POSIX appliance. Decide which tier each personality lives on before you buy hardware. → Chapter 9.1 (the four I/O personalities); object economics in Chapter 9.6.

How a parallel file system actually feeds a GPU

The defining trick of a parallel file system is striping: a single file is split into stripes spread across many storage targets, so one client read pulls bandwidth from dozens of NVMe drives at once and aggregate throughput scales with the number of targets, not the speed of any single drive. The data path and the metadata path are architecturally separate, and that separation governs how the system scales. The data path is easy to scale: add flash, add targets, add network, and sequential bandwidth grows roughly linearly. The metadata path is hard, because every open(), stat(), readdir(), and create() hits a service that must maintain a consistent view of the namespace across thousands of concurrent clients.

This is why the LOSF workload is the real benchmark. A vision or multimodal dataset of billions of small files turns training into a metadata storm: the data loader is not asking for bandwidth, it is asking for millions of metadata operations per second. A platform sized to 250 GB/s of sequential read per scalable unit can still collapse to single-digit GPU utilization if its metadata server tops out at a few hundred thousand operations per second while the loader needs millions. The published gap is stark: in 2025 IO500 metadata measurements, Weka demonstrated on the order of 2.75 million metadata IOPS against roughly 520,000 for a comparable DDN/Lustre configuration — a ~5x advantage that is invisible on a bandwidth datasheet but decisive for an AI training pipeline (Blocks & Files / StorageMath, IO500 SC25). When you evaluate a platform, you are evaluating its metadata plane first and its bandwidth second.

The five primary-store contenders — architecture and AI fit

Platform	Lineage / model	Metadata architecture	Native protocol	Durability default	Best-fit AI tier
Lustre / DDN EXAScaler	Open-source HPC parallel FS; DDN-hardened appliance	Classic single MDS (now scalable via DNE multi-MDT); historically the bottleneck	POSIX; object/S3 via gateways	RAID / declustered RAID; replication on MDT	Large sequential training throughput; HPC heritage, ~41% PFS share
IBM Storage Scale (GPFS)	Mature distributed FS; enterprise + HPC	Distributed metadata across nodes; no single MDS	POSIX; native S3, NFS, SMB, CSI	Declustered RAID (GNR); replication options	Mixed enterprise/AI; multiprotocol breadth, ~17% PFS share
WekaFS / NeuralMesh	Software-only, NVMe-native, container-native	Fully distributed metadata across all nodes; auto-balanced	POSIX, NFS, SMB, S3, CSI from one namespace	Distributed data protection (network-distributed coding)	Metadata-heavy LOSF + low-latency mixed AI; KV/inference
VAST Data (DASE)	Disaggregated Shared-Everything; stateless CNodes + flash DNodes	Global transactional element store (ACID); no per-node metadata cache	NFS, S3, SMB, block; POSIX via DASE	Locally-decodable erasure codes on QLC + similarity reduction	All-flash capacity-at-scale; convergence of lake + training tier
Ceph / DAOS / JuiceFS	Object-native (RADOS / SCM+NVMe / file-over-object)	Ceph: MDS cluster; DAOS: distributed KV; JuiceFS: pluggable metadata engine	Object-first; POSIX via CephFS/DAOS-FUSE/JuiceFS client	Erasure coding default (e.g. EC 4+2 = 1.5x vs 3x replication)	Open-source capacity tier, sovereign/on-prem, cloud-native pipelines

2026-current, vendor-neutral. "Metadata" column describes the architectural approach, not a single benchmark. Market-share figures are HPC-survey approximations and drift; treat as directional.

The contenders differ in consequence, not rank. Lustre/EXAScaler is the HPC incumbent — roughly 41% of parallel-FS deployments in HPC surveys — and it is unmatched at raw sequential throughput, which is why DDN systems dominate the IO500 bandwidth lists. Its historical weakness is the single metadata server; modern Lustre mitigates this with Distributed Namespace (DNE) and multiple metadata targets, but you are explicitly engineering around a centralized design rather than getting distribution for free. GPFS/Storage Scale (~17% share) distributes metadata natively and brings the broadest multiprotocol and enterprise-feature set, which is why it shows up in hyperscaler-validated GDS integrations (OCI + IBM Storage Scale). WekaFS, now packaged as the NeuralMesh microservices architecture, was built software-first for exactly the metadata-heavy, low-latency mixed I/O that AI generates — it distributes both data and metadata across every node and is the platform most often cited for the LOSF win. VAST's DASE attacks a different axis: it disaggregates stateless compute (CNodes) from flash enclosures (DNodes) over NVMe-oF and puts a single global ACID element store underneath, which lets it scale capacity and compute independently and collapse the data-lake and training tiers into one all-flash namespace. Ceph, DAOS, and JuiceFS are the object-native answers — Ceph for unified on-prem block/object/file, DAOS for benchmark-leading raw HPC numbers, JuiceFS for file semantics over any S3 bucket in a cloud-native pipeline.

Centralized vs distributed metadata — where AI storage actually fails

This is the fork that separates a platform that works at scale from one that wins a bake-off and then falls over in production. Centralized metadata (the classic Lustre MDS model) routes every namespace operation through one logical service. It is simpler to reason about, gives clean POSIX consistency, and is fine for a workload of large files read sequentially — the HPC pattern Lustre was born for. Distributed metadata (Weka, VAST, GPFS, and Lustre's DNE extension) shards the namespace across many nodes so metadata throughput scales with the cluster instead of bottlenecking on one server.

The downstream cost of getting this wrong is specific and measurable. Preprocessing and dataset curation are metadata-bound by nature — they walk directories, stat millions of files, dedup, shard, and rewrite. Multimodal and vision training read billions of small image/audio/clip files. When the metadata plane saturates, the data loader cannot fill its prefetch queue, the GPUs drain their input buffers, and MFU falls off a cliff — the naive small-file path commonly runs under 50% GPU utilization versus over 90% with a well-fed loader. The fix is partly a data-format decision (pack small files into large sequential shards — WebDataset, TFRecord, MDS — covered in Chapter 9.5), but the format only papers over a metadata plane that cannot keep up. If your dominant workload is LOSF, distributed metadata is not a nice-to-have; it is the entry ticket, and you should benchmark it on your file-size distribution, not the vendor's hero sequential number.

The LOSF trap: the datasheet number you are shown is the one that does not matter

Vendors lead with peak sequential bandwidth because it is the largest, cleanest number they have. But an AI training pipeline's binding constraint is almost always the metadata operations per second and small random read IOPS under its real file-size distribution — and those can differ by 5x or more across platforms that quote identical sequential throughput. Insist on MLPerf Storage results or an IO500-style metadata score (mdtest: create/stat/delete) on a file-size mix that matches your corpus. A platform that posts a stunning GB/s and an unremarkable metadata number is telling you it was tuned for the benchmark, not for your data-prep storm. The cost of discovering this in production is a fleet running at half its MFU on data it is paying full price to host. → benchmarking discipline in Chapter 9.8.

~41% / ~17% / ~6%

parallel-FS deployment share: Lustre / IBM Storage Scale (GPFS) / WekaFS in HPC surveys

2025WWT High-Performance File Systems for AI/ML; HPC survey synthesis

~2.75M vs ~520k

metadata IOPS: Weka vs a comparable DDN/Lustre config in IO500 measurement — the LOSF gap behind the same bandwidth

2025Blocks & Files / StorageMath, IO500 SC25

250 / 124 GB/s

per scalable unit read / write — DGX B300 SuperPOD 'Enhanced' storage tier (write ≥ ½ read rule)

2025NVIDIA DGX SuperPOD B300 Storage Architecture

720 GB/s + 18M IOPS

WEKApod feeding 768 H100s from 8 nodes at sub-150 µs latency — distributed-metadata reference point

2025Introl AI-Optimized Storage; WEKA

1.5x vs 3x

storage overhead: erasure coding (EC 4+2) vs triple replication — capacity efficiency at the cost of CPU/RAM and rebuild

2025Red Hat Ceph Storage Strategies Guide

<50% → >90%

GPU utilization: naive small-file path vs packed sequential shards + fed loader (the LOSF penalty)

2025Introl; data-loader best-practice synthesis

~10 TB/s

single-namespace aggregate at the top of the all-flash range (Pure FlashBlade//EXA, 1k-10k GPU clusters)

2025Introl; vendor reference architectures

~$36B / ~24%

AI storage market size and CAGR — the consolidation pressure behind certified-appliance lock-in

2025domain-research synthesis (market trackers)

The LOSF problem and multiprotocol access

The 2026 architectural trend that resolves much of this tension is multiprotocol single-namespace convergence. Historically the data-prep pipeline lived on object (S3/HDFS), the training tier lived on a POSIX parallel FS, and inference pulled from somewhere else again — which meant data was copied between namespaces at every stage, paying a tax in time, capacity, and operational complexity. Weka and VAST now present file (POSIX/NFS/SMB), object (S3), and Kubernetes (CSI) views over a single namespace; GPFS has long offered native S3 alongside POSIX; and high-throughput object-over-flash (S3 Express-class) is fast enough to serve some training reads directly. The consequence is that the data path can stop being a copy farm: ingest as object, prep in place, train over POSIX against the same bytes, and serve inference from the same namespace.

This convergence is enabled by the second 2026 trend — all-flash everywhere. QLC NVMe plus data reduction (dedup, compression, and VAST's similarity reduction) has collapsed the cost gap that historically forced the capacity tier onto HDD. When the capacity tier is also flash, the economic argument for keeping the lake and the training tier on separate media weakens, and a single converged platform becomes viable. The fork is now genuine: pay for one multiprotocol all-flash platform that converges the tiers, or run a cheaper tiered file-over-object design and accept the copy tax and the operational seam between systems. → object economics and lifecycle tiering in Chapter 9.6; the data-loader path in Chapter 9.5.

Erasure coding vs replication

How the durable tier protects data is a three-way tradeoff between capacity efficiency, rebuild cost, and CPU/RAM overhead. Replication keeps N full copies (Ceph defaults to 3x) — simple, fast to recover, trivial on CPU, but it triples your raw capacity for a given usable petabyte. Erasure coding splits each object into k data and m parity chunks (e.g. EC 4+2 stores 1.5x the data for two-failure tolerance instead of 3x), which is dramatically more capacity-efficient but spends more RAM and CPU on every write and, critically, on every recovery — a rebuild must read across many drives and recompute, so wide EC stripes lengthen rebuild windows and raise the load on a degraded system.

For AI, the decision interacts with the all-flash and failure-domain story. On a hot training tier where you want minimum rebuild impact and maximum throughput under failure, replication or a narrow declustered-RAID scheme is often the right call. On a capacity or object tier holding petabytes of training data and retained checkpoints, the 2x capacity savings of erasure coding is too large to ignore, and the platforms that win here (VAST's locally-decodable codes, Ceph EC pools) are specifically engineered to cap the rebuild penalty. The wrong default — wide erasure coding on a latency-sensitive hot tier, or 3x replication on a 50 PB cold lake — either starves your GPUs during a rebuild or doubles your capacity bill for no resilience the workload values.

Deep dive: why VAST's DASE and the disaggregated model matter for the density ramp

The classic parallel-FS scaling problem is that compute and capacity are bolted together: every storage server owns its drives, so to add capacity you add servers (and their CPUs and their metadata responsibilities), and to add throughput you also add servers. This is fine until the ratios you need diverge — a checkpoint-heavy run wants write bandwidth without more capacity; a cold data lake wants capacity without more compute. VAST's Disaggregated Shared-Everything (DASE) architecture breaks the coupling: stateless front-end CNodes run the storage logic, capacity lives in flash DNodes reached over NVMe-over-Fabrics, and every CNode can see every DNode's media — including all system metadata — through a single global transactional (ACID) element store with no per-node metadata cache to become a bottleneck.

The consequence for an AI build is that you scale the two axes independently: add CNodes when you need more throughput or metadata operations, add DNodes when you need more petabytes, and never pay for one to get the other. Combined with locally-decodable erasure codes and similarity-based data reduction on QLC, this is what makes an all-flash capacity tier economically credible and lets the lake and the training tier converge into one namespace. Weka's NeuralMesh reaches a similar end by a different route — a software-only microservices mesh that distributes data and metadata across all nodes and grows more efficient as it scales. Both designs are reactions to the same pressure: the GB200-to-Rubin density ramp means the GPU side of the ratio keeps getting hungrier, and a storage architecture that cannot scale throughput independently of capacity will either strand flash or starve accelerators. → the storage:compute ratio math in Chapter 9.8; the CPU-bypass data path in Chapter 9.3.

Build vs buy: open-source vs certified appliance

The last fork is commercial, and it is where the parallel-FS market's ~24% CAGR and ongoing consolidation bite. Open-source build — running Lustre, Ceph, or DAOS on your own hardware — gives you the lowest licensing cost, full control, and no per-terabyte vendor tax, at the price of owning the operational burden: tuning the metadata servers, managing rebuilds, chasing kernel and client compatibility, and being your own support desk when a 16k-GPU job hangs on a storage stall at 3 a.m. DAOS posts spectacular IO500 numbers (Argonne and LRZ systems led the SC25 production list) but is candidly a niche HPC system that lacks much of the enterprise feature set and support an AI cloud operator expects. Certified appliances (DDN, Weka, VAST, IBM) cost more and carry lock-in, but they ship certified against the GPU vendor's reference architecture — validated per-scalable-unit bandwidth tiers, supported GPUDirect Storage paths, and a vendor on the hook when something breaks.

Certification is what you are actually paying for. NVIDIA's DGX SuperPOD reference architectures specify exact storage bandwidth tiers per scalable unit (the B300 'Enhanced' tier of 250 GB/s read / 124 GB/s write per SU), and a certified platform is one that has been measured to hit them with a supported GDS data path. Building your own means you also own proving you meet those targets — and the cost of being wrong is a fleet that under-feeds. The honest decision rule: at frontier scale with a long-lived, well-staffed storage team, open-source can be the right economic call; for most operators buying a certified appliance and treating the premium as insurance against GPU-idle risk is rational. Either way, the question to answer before signing is which specific reference architecture and which GDS path you are certifying against — vague "AI-ready" claims are not certification. → GPUDirect Storage and the data path in Chapter 9.3.

Open-source build vs certified appliance — the commercial fork

Dimension	Open-source build (Lustre/Ceph/DAOS)	Certified appliance (DDN/Weka/VAST/IBM)
Up-front cost	Lowest — hardware + your labor; no per-TB license	Highest — appliance premium + support contract
Operational burden	Yours: tuning, rebuilds, client/kernel compat, on-call	Vendor-shared: supported stack, break/fix SLA
GPU-vendor certification	You prove SU bandwidth + GDS path yourself	Pre-validated against DGX/HGX reference SU tiers
Lock-in / portability	Maximal portability; commodity hardware	Vendor lock-in; possible single-DPU (BlueField) dependence
GPU-idle risk	Higher if storage team is thin — stalls are yours	Lower — vendor on the hook for feed targets
Best fit	Frontier scale, deep storage team, cost-driven	Most operators; insurance against idle-GPU TCO

Directional 2026 practitioner framing, not absolute. The right answer is a function of team depth, scale, and tolerance for GPU-idle risk.

Anti-patterns

The recurring mis-scopes in parallel-FS selection all come from optimizing the visible number instead of the binding constraint:

Buying on sequential bandwidth for a metadata-bound workload. The classic error: a platform with a stunning GB/s number and a mediocre metadata score, bought for a preprocessing or multimodal pipeline that is LOSF-dominated. The fleet runs at half its MFU on data it pays full price to host. Benchmark metadata on your file-size distribution first.
One namespace for every personality. Forcing the metadata storm of data-prep, the sequential throughput of training, and the latency of KV-cache onto a single tier because convergence sounded cheaper. The four I/O personalities pull in opposite directions; converge where it pays (file+object on the training/lake tier) but do not pretend one tier serves all four well. → Chapter 9.1.
Wide erasure coding on the hot tier. Capacity-efficient until a drive fails mid-run, when the rebuild read-amplification starves the very GPUs the tier exists to feed. Keep wide EC for the cold/object tier; use replication or narrow declustered RAID where rebuild impact must stay bounded.
Uncertified "AI-ready" storage behind a non-blocking fabric. Pairing a platform that has never been measured against your GPU vendor's SU bandwidth tiers with a multi-million-dollar GPU cluster, then discovering at scale that it cannot sustain the GDS path. Certification against a specific reference architecture is the cheap insurance.

This chapter is the file-system layer of Part 9's storage stack. The economic argument and the four I/O personalities that justify it are in Chapter 9.1; the NVMe media tiers, GPUDirect Storage, and the CPU-bypass data path that these file systems ride on are in Chapter 9.3; the checkpoint write workload that stresses the write tier is engineered in Chapter 9.4; the data-loader and small-file packing that determine whether the metadata plane is even exercised are in Chapter 9.5; the object and capacity tier these platforms front (or converge with) is in Chapter 9.6; the KV-cache and inference memory hierarchy is in Chapter 9.7; and the storage:compute ratios, bandwidth budgets, and MLPerf-Storage acceptance benchmarking that size and validate the choice are in Chapter 9.8. The fabric-placement decision for the storage network — dedicated rail vs converged onto the back-end — sits in Chapter 8.5; and the workload archetype that sets which I/O personality dominates is the master variable from Chapter 1.1.