Chapter 9.5

Data Ingestion, Preprocessing & the Data-Loader Path

The data-loader path is the one storage subsystem that sits in the critical loop of every training step — get its format, sharding, and CPU budget wrong and you do not have a slow pipeline, you have idle GPUs burning depreciation while they wait to be fed.

GOODPUT · POWER-BOUND

What you'll decide here

Which on-disk format you commit the corpus to — sequential shards (WebDataset/MDS) vs columnar (Parquet) vs raw files — because that choice is written into petabytes and is expensive to re-encode mid-program.
Whether you stream samples directly from object storage or stage them onto a node-local/parallel-FS cache first — the fork that sets your egress bill, your startup latency, and your tolerance for object-store tail latency.
How many CPU cores and host-DRAM gigabytes per GPU you provision for decode, augmentation, and prefetch — the GPU:CPU ratio that decides whether the loader keeps a 132 kW rack fed or starves it.
Whether preprocessing runs on the CPU host, is offloaded to the GPU (DALI), or is baked once into a pre-decoded format (FFCV/MDS) — and what that does to per-epoch repeatability and storage footprint.
How you make shuffling and resumption deterministic across an elastic GPU count, so a restart after a failure resumes mid-epoch in seconds rather than re-reading from the top of the corpus.

Every chapter in Part 9 so far has been about storage that sits beside the training loop — the parallel file system that holds the corpus (Chapter 9.2), the NVMe tier and GPUDirect path that move bytes fast (Chapter 9.3), the checkpoint stream that protects against failure (Chapter 9.4). This chapter is about the storage path that sits inside the loop. The data loader is the only I/O subsystem whose latency is on the critical path of every single optimizer step: the GPU cannot start step N until the batch for step N is decoded, augmented, collated, and resident in device memory. If that pipeline cannot sustain the consumption rate, the accelerator stalls and burns depreciation while it waits.

The consequences here are unusually direct: the loader path converts almost every wrong choice into GPU idle time, and GPU idle time converts into wasted megawatts and stranded depreciation. We trace the runtime pipeline stage by stage through the format fork (sequential vs columnar vs raw), the placement fork (stream vs stage vs cache), and the offload fork (CPU vs GPU vs pre-decoded), and close on the two things people get wrong most often: under-provisioning CPU and DRAM relative to GPUs, and treating multimodal pipelines as if they were the text pipeline with pictures bolted on. The offline corpus-construction work — dedup, filtering, tokenization at corpus scale — is a different machine entirely and lives in Chapter 9.9; the legal regime that governs what may enter the corpus is Chapter 10.10.

The runtime pipeline: what happens between storage and the GPU

A training data loader is a producer-consumer pipeline with the GPU as the consumer. Reading left to right, every sample traverses six stages on its way to becoming part of a batch: (1) fetch a shard or object from the storage tier; (2) decode the container and the payload (JPEG/WebP, audio, video frames, or token records); (3) transform/augment (resize, crop, normalize, tokenize, pad); (4) shuffle across a buffer so consecutive batches are decorrelated; (5) collate samples into a tensor batch; and (6) transfer the batch host-to-device over PCIe/NVLink. Stages 1–5 run on the CPU host (unless explicitly offloaded); stage 6 crosses the bus into HBM.

The governing equation is a throughput inequality, not a clever algorithm. Let the GPUs consume batches at rate C (samples/s) and the loader produce them at rate P. If P >= C, and you have enough prefetch buffering to absorb jitter, the GPU never waits and the loader is invisible. If P < C, the GPU stalls for (1 - P/C) of every step, and no amount of fabric bandwidth or HBM capacity recovers it — you are bottlenecked on the slowest stage of stages 1–6. The entire engineering of the loader path is the project of keeping P >= C as C climbs with each GPU generation, while the corpus grows into the petabytes and the per-sample decode cost (especially for images and video) refuses to fall.
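As a minimal sketch of the inequality above, the snippet below treats the loader's production rate P as the rate of its slowest stage and computes the stall fraction 1 - P/C. The stage names follow the six stages listed earlier; the per-stage rates and the consumption rate are made-up illustrative numbers, not measurements.

```python
def loader_rate(stage_rates):
    """P is capped by the slowest stage of the pipeline (samples/s)."""
    return min(stage_rates.values())


def stall_fraction(p, c):
    """Fraction of wall-clock time the GPU waits: 0 if P >= C, else 1 - P/C."""
    return 0.0 if p >= c else 1.0 - p / c


# Hypothetical per-stage throughput in samples/s (assumed values).
stages = {
    "fetch": 40_000,
    "decode": 18_000,
    "augment": 25_000,
    "shuffle": 200_000,
    "collate": 90_000,
    "h2d_transfer": 120_000,
}
C = 24_000  # hypothetical GPU consumption rate, samples/s

P = loader_rate(stages)
bottleneck = min(stages, key=stages.get)
print(f"P={P}, bottleneck={bottleneck}, stall={stall_fraction(P, C):.0%}")
```

With these assumed numbers decode is the slowest stage and the GPU idles a quarter of the time; speeding up any other stage changes nothing until decode is fixed.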

This is why the industry's canonical storage benchmark is framed not as bandwidth but as accelerator utilization (AU): MLPerf Storage defines a passing result as keeping every simulated GPU at 90% utilization or higher (70% for the looser CosmoFlow workload), and reports the maximum accelerator count a storage system can feed while holding that line (MLCommons, MLPerf Storage v2.0, Aug 2025). The benchmark exists precisely because storage that cannot hold the line starves the most expensive asset in the data center.

The loader is a goodput problem, not a storage line item

It is tempting to file the data loader under storage and hand it to the storage team. That mis-frames it. The loader's output metric is not IOPS or GB/s — it is GPU goodput, the fraction of wall-clock time the accelerators spend doing useful math. A loader that delivers 8 GB/s but stalls the GPU 15% of the time is worse, in the only currency that matters, than one that delivers 6 GB/s and stalls it 0%. Treat the loader as a co-design between storage, CPU host, and the training framework — owned by whoever owns goodput — not as a storage line item. The cross-reference to keep in view is Chapter 9.1, which makes the general case that storage determines GPU efficiency; this chapter is that thesis applied to the one path that touches every step.

The format fork: sequential shards vs columnar vs raw files

The first irreversible-ish decision is the on-disk format the corpus is encoded into, because it is written across petabytes and re-encoding a multi-PB corpus is a multi-day, multi-thousand-core job in its own right (→ Chapter 9.9). Three families dominate, and they trade the same three things against each other: sequential read efficiency, random-access / column-projection ability, and metadata/small-file pressure on the storage tier.

Sequential shards — WebDataset (POSIX tar) and MosaicML MDS — are the LLM and large-multimodal default. The corpus is packed into shards of typically 50 MB–1 GB (MDS defaults to ~67 MB; 50–100 MB shards work well across modalities per Databricks/Mosaic guidance), each holding thousands of samples in record order. The win is that the loader issues large sequential reads against the storage tier instead of millions of tiny random reads — which is the difference between a parallel file system running near its streaming bandwidth and the same file system collapsing under metadata-server load from a small-file storm (the small-file problem from Chapter 9.2 is exactly what sharding exists to defeat). The cost: random access within a shard is awkward, and you cannot cheaply read one column of a multi-field record without decoding the whole sample.

Columnar — Apache Parquet — is the data-lake-native and tabular/structured default. Parquet stores data by column with embedded compression and statistics, so a loader can project just the fields it needs and push down filters, and it integrates natively with the lakehouse engines that produced the data (→ Chapter 9.6). For text/token corpora and structured multimodal metadata this is excellent; for raw image/video bytes it is less natural, and naive row-group sizing can reintroduce small-read patterns. Parquet is where the corpus lives during curation; many shops convert hot training splits from Parquet into WebDataset/MDS shards for the run itself.

Raw files (one image/clip per object) are the path of least resistance and the most common scaling mistake. A directory of 2 billion JPEGs is trivial to produce and catastrophic to train from: it is 2 billion metadata operations and 2 billion tiny random reads, a load that saturates the file system's metadata plane long before its bandwidth and turns object-store request-rate limits into the binding constraint. Raw files are fine for small datasets and prototyping; at corpus scale they are the canonical way to convert a fast storage tier into a slow one.
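A rough back-of-envelope comparison shows the scale of the difference between raw files and shards. The 2 billion sample count and the ~67 MB shard size come from the text; the 100 KB average JPEG size is an assumption chosen only for illustration.

```python
import math

NUM_SAMPLES = 2_000_000_000      # from the text: 2 billion JPEGs
AVG_SAMPLE_BYTES = 100 * 1000    # assumption: ~100 KB per JPEG
SHARD_BYTES = 67 * 1000 * 1000   # ~67 MB, the MDS default cited above

corpus_bytes = NUM_SAMPLES * AVG_SAMPLE_BYTES
samples_per_shard = SHARD_BYTES // AVG_SAMPLE_BYTES
num_shards = math.ceil(NUM_SAMPLES / samples_per_shard)

print(f"corpus size:        {corpus_bytes / 1e12:.0f} TB")
print(f"raw-file opens:     {NUM_SAMPLES:,} per epoch")
print(f"shard opens:        {num_shards:,} per epoch")
print(f"reduction in opens: {NUM_SAMPLES / num_shards:.0f}x")
```

Under these assumptions the corpus is about 200 TB, and the number of objects opened per epoch falls from 2 billion to roughly 3 million. Larger shards push the count down further at the price of coarser shuffling granularity.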

Training-data format fork → consequences

Format	Read pattern	Random access	Metadata pressure	Best fit	Main downside
WebDataset (tar shards)	Large sequential reads of 50 MB–1 GB shards	Weak (scan within shard)	Low — millions of samples, thousands of shards	LLM & large multimodal pre-training on POSIX/parallel FS or object	No column projection; re-sharding the corpus is costly
MosaicML MDS	Sequential, ~67 MB default shards; random sample seek supported	Moderate — indexed sample offsets	Low	Streaming from cloud object with deterministic, mid-epoch-resumable shuffle	Framework-coupled tooling; another encode pass
Apache Parquet (columnar)	Row-group reads; column projection + predicate pushdown	Good for fields; poor for raw blobs	Low–moderate (row-group dependent)	Text/token corpora, structured + tabular, lakehouse-native curation	Awkward for raw image/video bytes; row-group tuning matters
Raw files (1 object/sample)	Millions of tiny random reads	Native (one file = one sample)	Severe — metadata storm, request-rate limited	Small datasets, prototyping, ad-hoc inspection	Collapses storage metadata plane at corpus scale

Shard-size and default figures are 2025 practitioner/vendor guidance (Databricks Mosaic Streaming docs; WebDataset/MLCommons references). "Random access" means cheap access to an individual sample without scanning its shard.

Sharding and shuffling: decorrelation without a small-file storm

Sharding solves the read-pattern problem but creates a statistics problem: if you read shards in order and emit samples in shard order, every batch is drawn from a narrow slice of the corpus, gradients are correlated, and training quality suffers. The fix is to shuffle — but you cannot hold a 50 TB corpus in memory to do a global shuffle, so the standard technique is a two-level shuffle: shuffle the order of shards globally (cheap — there are only thousands of them), then maintain a shuffle buffer of K samples in host DRAM from which each batch is drawn at random as new samples stream in. A buffer of a few thousand to tens of thousands of samples approximates a global shuffle well enough for almost all workloads; the buffer size is a memory-vs-decorrelation knob, not a correctness one.
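The two-level shuffle can be sketched in a few lines. This is an illustrative toy, not the implementation used by WebDataset or MDS: shards are plain Python lists standing in for files, and the function name and parameters are hypothetical.

```python
import random


def two_level_shuffle(shards, buffer_size, seed):
    """Yield samples using a shard-order shuffle plus a bounded shuffle buffer.

    `shards` is a list of lists standing in for shard files on storage.
    """
    rng = random.Random(seed)
    shard_order = list(range(len(shards)))
    rng.shuffle(shard_order)              # level 1: shuffle shard order globally

    buffer = []
    for shard_id in shard_order:
        for sample in shards[shard_id]:   # sequential read within a shard
            if len(buffer) < buffer_size:
                buffer.append(sample)     # fill phase
                continue
            # level 2: emit a random buffered sample, replace it with the new one
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = sample
    rng.shuffle(buffer)                   # drain what is left at end of epoch
    yield from buffer


# Toy corpus: 8 shards of 100 samples each (sample = integer id).
shards = [list(range(s * 100, (s + 1) * 100)) for s in range(8)]
epoch = list(two_level_shuffle(shards, buffer_size=64, seed=0))
```

Every sample is emitted exactly once per epoch, memory is bounded by `buffer_size`, and storage still sees whole-shard sequential reads. A larger buffer mixes samples from more shards at the cost of more host DRAM.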

The decision that quietly costs the most is determinism under elastic scale. A 10,000-GPU run will be interrupted — Meta's Llama 3 405B saw an unplanned interruption roughly every three hours over a 54-day window on 16,384 H100s (Meta, Llama 3 paper, 2024). When it restarts, you want it to resume mid-epoch, in seconds, having already consumed the samples it processed and not re-reading from the top — and you want the sample order to be identical regardless of whether the restart happens on 16,384 GPUs or 12,288. Streaming loaders built for this (MosaicML StreamingDataset is the canonical example) make the shuffle elastically deterministic: sample order is a function of a seed and a global step, independent of the device count, so resumption is exact and a loss spike is reproducible for debugging. A loader that lacks this turns every failure into either a full-epoch replay (egress and idle-GPU cost) or a silently non-deterministic data order (irreproducible training). The checkpoint/restart math this rides on is Chapter 9.4.
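One way to picture elastic determinism is to derive the sample order from the seed and epoch alone, then carve each global batch across however many ranks happen to exist. The sketch below is a simplified illustration under that assumption; it is not how StreamingDataset is implemented, it materializes the full permutation in memory (which a real loader would avoid), and all function names are hypothetical.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Global sample order: depends only on (seed, epoch), never on GPU count."""
    order = list(range(num_samples))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def rank_batch(order, global_batch, step, rank, world_size):
    """Samples that `rank` processes at `step`; world_size must divide global_batch."""
    assert global_batch % world_size == 0
    start = step * global_batch
    batch = order[start:start + global_batch]
    per_rank = global_batch // world_size
    return batch[rank * per_rank:(rank + 1) * per_rank]


def step_samples(order, global_batch, step, world_size):
    """All samples consumed across ranks at a given step, in rank order."""
    out = []
    for rank in range(world_size):
        out.extend(rank_batch(order, global_batch, step, rank, world_size))
    return out


order = epoch_order(num_samples=10_000, seed=1234, epoch=0)
# Resume at step 7 on 16 ranks or on 12 ranks: same samples, same order.
before = step_samples(order, global_batch=48, step=7, world_size=16)
after = step_samples(order, global_batch=48, step=7, world_size=12)
```

Because the global batch for a step is fixed by `(seed, epoch, step)`, a restart only needs the step counter from the checkpoint to know where to continue, and shrinking the job changes which rank sees a sample but not which samples the step consumes.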

Data loaders and the CPU bottleneck: the GPU:CPU:DRAM ratio

Here is the failure mode that strands the most compute, because it is invisible until the GPUs are installed. Stages 2–5 of the pipeline — decode, augment, shuffle, collate — run on the CPU host by default, and they do not get faster when you buy faster GPUs. As accelerator throughput climbs generation over generation (H100 to B200 to Rubin), the consumption rate C climbs with it, but the host CPU's ability to produce decoded, augmented batches P stays flat unless you provision for it. The result is a pipeline that was perfectly balanced on the previous GPU generation and is CPU-starved on the next. NVIDIA, AWS, and the FFCV authors all describe the same root cause: "these pipelines — currently executed on the CPU — have become a bottleneck, limiting the performance and scalability of training."

The lever is the GPU:CPU-core and GPU:host-DRAM ratio, decided when you spec the server, and it is workload-dependent in a way that catches people out. A pure-text LLM pipeline is cheap per sample — tokens are small, decode is light, augmentation is trivial — so a modest core count per GPU and a few dozen prefetch workers keep P ahead of C. An image or, far worse, a video pipeline is the opposite: JPEG/WebP decode and especially video frame decode are heavy, and the per-sample CPU cost can be 10–100x the text case. The same chassis that comfortably feeds eight GPUs on text will starve them on video. Under-provision cores and you cap goodput at whatever the CPU can produce; over-provision and you pay for idle cores — but idle cores are cheap insurance next to idle GPUs.
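The sizing logic reduces to simple arithmetic: cores needed per GPU is the consumption rate times the CPU seconds each sample costs, plus a margin. The sketch below uses assumed per-sample costs that sit inside the 10–100x spread described above; none of the numbers are measurements, and the 25% headroom is an arbitrary illustrative choice.

```python
import math


def cores_per_gpu(gpu_samples_per_s, cpu_ms_per_sample, headroom=1.25):
    """Cores needed so P >= C: consumption rate x CPU seconds per sample x margin."""
    busy_cores = gpu_samples_per_s * cpu_ms_per_sample / 1000.0
    return math.ceil(busy_cores * headroom)


# Assumed single-core CPU cost per sample (ms); same consumption rate for all
# three so that only the per-sample cost differs.
RATE = 2_000  # hypothetical samples/s consumed by one GPU
cpu_ms = {"text": 0.2, "image": 4.0, "video": 20.0}

for modality, ms in cpu_ms.items():
    print(f"{modality:5s} -> {cores_per_gpu(RATE, ms)} cores per GPU")
```

With these assumed inputs the text pipeline needs a single core per GPU, the image pipeline 10, and the video pipeline 50, which is 400 cores for an eight-GPU chassis. That is the arithmetic behind a server that is balanced for text and starved for video.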

Host DRAM is the second half of the ratio, and it serves three simultaneous masters: the shuffle buffer, the prefetch queue (you want several batches staged ahead so a slow read never reaches the GPU), and — if you stage rather than stream — the node-local dataset cache. Under-size DRAM and the shuffle buffer shrinks (worse decorrelation), the prefetch depth shrinks (jitter reaches the GPU), or the cache thrashes back to the storage tier. This is why AI training nodes carry host-DRAM-to-HBM ratios that look extravagant by HPC standards — the DRAM is not for the model, it is for the loader.
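The three consumers of host DRAM can be added up directly. The function below is a hypothetical budgeting helper, and it assumes the shuffle buffer and prefetch queue are held per GPU while the dataset cache is shared per node; the sample size, buffer depth, and cache size in the example are illustrative assumptions.

```python
def loader_dram_gb(sample_mb, shuffle_buffer_samples, batch_size,
                   prefetch_batches, gpus_per_node, cache_gb=0.0):
    """Per-node host DRAM the loader needs, in decimal GB."""
    shuffle_gb = shuffle_buffer_samples * sample_mb / 1000.0
    prefetch_gb = prefetch_batches * batch_size * sample_mb / 1000.0
    # Assumption: shuffle buffer and prefetch queue exist once per GPU;
    # the node-local dataset cache is shared by all GPUs on the node.
    return gpus_per_node * (shuffle_gb + prefetch_gb) + cache_gb


# Assumed workload: ~0.6 MB decoded samples, 10,000-sample shuffle buffer,
# batch of 256 with 4 batches prefetched, 8 GPUs, 256 GB page-cache target.
need = loader_dram_gb(sample_mb=0.6, shuffle_buffer_samples=10_000,
                      batch_size=256, prefetch_batches=4,
                      gpus_per_node=8, cache_gb=256.0)
print(f"loader DRAM budget: {need:.0f} GB per node")
```

The point of the exercise is that none of these terms involve the model: shrink any one of them to fit a smaller DRAM budget and the corresponding symptom from the paragraph above appears.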

The offload fork: CPU vs GPU (DALI) vs pre-decoded (FFCV/MDS)

When the CPU cannot keep up, there are exactly three escapes, and they trade CPU cost, GPU cost, and storage footprint against each other.

Scale out the CPU. Add cores, add prefetch workers, add nodes — the brute-force path. It works until you hit the chassis core ceiling or the point where adding CPU sockets to feed GPUs is worse economics than the alternatives. It keeps the data path simple and the format unchanged, which is its real virtue.

Offload decode/augment to the GPU (NVIDIA DALI). DALI moves JPEG/video decode and augmentation onto the GPU (using the nvJPEG library and the NVDEC hardware video decoder) and runs its own execution engine to overlap I/O with compute, "offloading data preprocessing to the GPU to eliminate the CPU bottleneck." This is the dominant answer for image and video pipelines where decode is the wall. The tradeoff is honest: you are spending GPU cycles and a slice of HBM bandwidth on preprocessing instead of on the model. That is justified precisely when the CPU alternative would otherwise leave the GPU idle, and a poor trade when the GPU is already compute-bound.

Pre-decode once into a training-optimized format (FFCV, or baking transforms into MDS). FFCV "combines efficient file formats with asynchronous transfers to maximize GPU utilization" — the idea is to pay the decode and resize cost once, offline, and store samples in a layout the loader can stream with near-zero per-sample CPU. This collapses the runtime CPU cost to almost nothing and is the fastest path for fixed-preprocessing workloads (the classic result is large epoch-time reductions on ImageNet-class training). The cost is paid in two coins: a larger or differently-shaped storage footprint (decoded data is bigger than compressed), and a loss of flexibility — augmentations baked at encode time cannot vary per epoch, so heavy randomized augmentation either stays on the runtime path or must be designed into the format.
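The storage side of the pre-decode trade can be estimated before committing to it. The sketch below compares a compressed corpus with the same corpus stored as raw decoded pixels at a fixed resolution; the image count, average JPEG size, and resolution are all assumptions, and real pre-decoded formats may store data differently than raw uint8 pixels.

```python
def decoded_bytes(height, width, channels=3, bytes_per_channel=1):
    """Size of one decoded image stored as raw uint8 pixels."""
    return height * width * channels * bytes_per_channel


NUM_IMAGES = 1_000_000           # assumption: corpus size for illustration
AVG_JPEG_BYTES = 110_000         # assumption: average compressed size
RESOLUTION = (256, 256)          # assumption: fixed training resolution

per_image = decoded_bytes(*RESOLUTION)
compressed_tb = NUM_IMAGES * AVG_JPEG_BYTES / 1e12
decoded_tb = NUM_IMAGES * per_image / 1e12
print(f"decoded bytes per image: {per_image:,}")
print(f"compressed corpus: {compressed_tb:.2f} TB, decoded: {decoded_tb:.2f} TB")
print(f"footprint growth:  {decoded_tb / compressed_tb:.1f}x")
```

The growth factor depends entirely on the assumed compressed size and target resolution, so it is worth running with the corpus's own numbers: a higher training resolution or more aggressively compressed sources widen the gap quickly.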

Preprocessing-offload fork → where the cost lands

Strategy	Runtime CPU cost	GPU cost	Storage footprint	Per-epoch augmentation flexibility	Best fit
Scale out CPU + workers	High (the whole pipeline)	None added	Unchanged	Full — anything the CPU can compute	Text/light pipelines; simplest data path
GPU offload (DALI)	Low — fetch only	Adds decode/augment to GPU	Unchanged (compressed)	Full — runs on GPU each epoch	Image/video where decode is the wall and GPU has headroom
Pre-decoded (FFCV / baked MDS)	Near zero	None added	Larger (decoded) or re-encoded	Limited — fixed transforms baked at encode time	Fixed-preprocessing, decode-bound, throughput-critical runs

Speedups are workload-dependent; image/video pipelines see the largest gains from GPU offload and pre-decoding, text pipelines the least. Sources: NVIDIA DALI technical blogs; FFCV (CVPR 2023); 2025 practitioner reports.

Streaming vs staging vs caching tiers

Independent of format and offload is the placement question: where do the bytes live relative to the GPU when the run starts? There are three postures, and the fork sets your startup latency, your egress bill, and your exposure to storage tail latency.

Stream from object storage. The loader pulls shards directly from S3/GCS/on-prem object as it needs them, holding only the shuffle and prefetch buffers locally (this is what MosaicML StreamingDataset and WebDataset-over-S3 are built for; AWS publishes loader best practices for exactly this — parallel connections, sharding, and prefetch to keep GPUs fed). The win is no staging step and instant start; the data can be larger than any local disk. The exposure is object-store tail latency and request-rate limits — a single slow GET, if it reaches the GPU, is a stall — and recurring egress cost if compute and storage sit in different zones (the data-gravity economics of Chapter 9.8). Deep prefetch and many parallel connections are what hide the tail.
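A minimal sketch of how deep prefetch and parallel connections hide object-store latency: keep a bounded number of requests in flight on a thread pool and hand payloads to the consumer in order. `fake_get` is a hypothetical stand-in for a real object-store GET, and the connection count and depth are illustrative defaults rather than recommendations.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fake_get(key):
    """Hypothetical stand-in for an object-store GET; sleeps to mimic latency."""
    time.sleep(0.002)
    return f"payload-of-{key}".encode()


def prefetching_reader(shard_keys, fetch, connections=8, depth=16):
    """Yield shard payloads in key order with up to `depth` requests in flight."""
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=connections) as pool:
        for key in shard_keys:
            in_flight.append(pool.submit(fetch, key))
            if len(in_flight) >= depth:
                yield in_flight.popleft().result()   # blocks only if not ready
        while in_flight:
            yield in_flight.popleft().result()


keys = [f"shard-{i:05d}.tar" for i in range(32)]
payloads = list(prefetching_reader(keys, fake_get))
```

The consumer only waits when the oldest outstanding request has not finished, so a slow GET is absorbed as long as the shards queued ahead of it take longer to consume than the GET takes to complete. That is the sense in which prefetch depth sets how much tail latency the loader can hide.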

Stage to a node-local NVMe or parallel-FS cache, then train. Copy the working set onto fast local storage up front and read from there for the run. This eliminates per-step object-store dependence and tail latency, and it amortizes egress over many epochs — the right call for a corpus read many times (Meta's Research SuperCluster famously fronted training with a multi-tens-of-PB flash cache; Introl, 2025). The cost is the staging delay before the first step and the local capacity ceiling: if the working set exceeds local NVMe, you cannot stage it whole and must fall back to streaming or a hybrid.

Cache hot, stream cold (the hybrid that wins in practice). Most real pipelines stream from object but interpose a node-local or rack-local cache that holds recently-used shards, so the first epoch pays object-store cost and subsequent epochs are served from flash. This is the placement analogue of the storage hierarchy that runs through all of Part 9, and it is why the parallel-FS and NVMe tiers of Chapter 9.2 and Chapter 9.3 exist between object storage and the GPU.
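The hybrid posture amounts to a read-through cache in front of the object store. The class below is a hypothetical in-memory sketch with least-recently-used eviction; a real deployment would keep shards on NVMe rather than in a Python dict, and LRU is only one possible eviction policy.

```python
from collections import OrderedDict


class ShardCache:
    """Byte-capped LRU read-through cache in front of a slower shard source."""

    def __init__(self, fetch, capacity_bytes):
        self.fetch = fetch                  # callable: key -> bytes (cold path)
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()        # key -> bytes, least recent first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        data = self.fetch(key)              # stream cold from object storage
        self.entries[key] = data
        self.used_bytes += len(data)
        while self.used_bytes > self.capacity_bytes and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)   # evict least recent
            self.used_bytes -= len(evicted)
        return data


# Toy run: 4 shards of 10 bytes each, cache big enough for all of them.
cache = ShardCache(fetch=lambda key: key.encode().ljust(10, b"."), capacity_bytes=40)
for epoch in range(2):
    for key in ["s0", "s1", "s2", "s3"]:
        cache.get(key)
print(cache.hits, cache.misses)
```

In the toy run the first epoch is all misses and the second all hits. When the working set exceeds the cache and shards are revisited in a similar order each epoch, plain LRU can evict shards just before they are needed again, which is one reason cache sizing and shard ordering have to be considered together.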

Figure	What it shows	Source
>=90%	Accelerator-utilization threshold for a passing MLPerf Storage result (70% for CosmoFlow); the loader's whole job is holding this line	MLCommons, MLPerf Storage v2.0 (Aug 2025)
200+ / 26	200+ performance results from 26 organizations in MLPerf Storage v2.0; systems now feed ~2x the accelerators of v1.0	MLCommons; StorageNewsletter (Aug 2025)
~1.08 TB/s	Peak single-submission throughput at 92.2% GPU utilization and 9.9M IO/s (AWS) in MLPerf Storage v2.0	StorageNewsletter / MLCommons (Aug 2025)
~67 MB	Default MDS shard size; 50–100 MB shards work well across modalities (large sequential reads, low metadata load)	Databricks Mosaic Streaming docs (2025)
~1 / 3 hr	Unplanned interruptions on Llama 3 405B (16,384 H100s, 54 days): roughly one every three hours, which is why mid-epoch deterministic resumption matters	Meta, Llama 3 paper (2024)
10–100x	Per-sample CPU decode cost of image/video vs text; the reason the GPU:CPU ratio must be set per modality	NVIDIA DALI / FFCV synthesis (2025)
~46 PB	Flash cache fronting training at Meta's Research SuperCluster; the stage-and-cache posture at hyperscale	Introl, petabyte-scale AI pipelines (2025)

Multimodal pipelines: where the loader stops being uniform

A text pipeline is close to uniform: every sample is a token sequence of bounded size, decode is trivial, and the batch is a clean rectangle. A multimodal pipeline — interleaved image-text, audio, or video — breaks all three assumptions at once, and treating it as "the text pipeline plus pictures" is the recurring mis-design.

First, per-sample cost variance explodes. A caption is bytes; the image it describes is a megabyte of JPEG that must be decoded and resized; the video clip is hundreds of frames through a hardware decoder. Within one batch, samples differ in decode cost by orders of magnitude, so the slowest sample in a batch sets the batch latency — and a few heavy video samples can stall an otherwise-fed pipeline. This is the strongest case for GPU-offloaded decode (DALI's hardware NVDEC path) and for length-aware or cost-aware batching that groups similar-cost samples so the straggler problem is bounded.
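Cost-aware batching can be sketched as: collect a window of samples from the shuffled stream, sort the window by estimated decode cost, cut it into batches, then shuffle the batch order so training does not see costs in sorted order. The function names, the window size, and the use of frame count as the cost estimate are all illustrative assumptions.

```python
import random


def _emit(pool, batch_size, cost, rng):
    """Sort one window by cost, cut into batches, shuffle the batch order."""
    ordered = sorted(pool, key=cost)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)
    return batches


def cost_aware_batches(stream, batch_size, window, cost, seed=0):
    """Yield batches whose samples have similar estimated decode cost."""
    rng = random.Random(seed)
    pool = []
    for sample in stream:
        pool.append(sample)
        if len(pool) == window:
            yield from _emit(pool, batch_size, cost, rng)
            pool = []
    if pool:                                  # flush the final partial window
        yield from _emit(pool, batch_size, cost, rng)


# Toy stream: (sample_id, frames_to_decode); every other sample is a heavy clip.
stream = [(i, 300 if i % 2 else 1) for i in range(64)]
batches = list(cost_aware_batches(stream, batch_size=8, window=32,
                                  cost=lambda s: s[1]))
```

In the toy stream every naive batch of eight would contain four heavy clips and so run at heavy-clip speed; after grouping, half the batches contain only light samples. The window bounds how far a sample can move from its shuffled position, which keeps the trade against decorrelation explicit, and a fixed seed keeps the grouping reproducible for resumption.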

Second, modalities arrive at different rates and sizes, so the shard often packs heterogeneous records and the collate step must pad and mask across variable-length sequences and variable-resolution tensors — work that is itself CPU-heavy and a frequent hidden bottleneck. Third, alignment and sampling logic (which frames of a video pair with which text span, how to sample within long clips) lives in the loader and must be deterministic for resumption to work. The net engineering consequence: multimodal runs need a higher CPU:GPU ratio, more aggressive GPU offload, and cost-aware batching — and they expose loader weaknesses that a text-only run would never surface. This is also why the offline prep for multimodal corpora (decoding, frame extraction, re-encoding to uniform shards) is so much heavier than for text, and why that work belongs in the offline supercomputer of Chapter 9.9 rather than the runtime path.
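The pad-and-mask work in the collate step looks like this in its simplest form. The sketch uses plain Python lists so it stays self-contained; a real loader would produce tensors, and the helper names are hypothetical.

```python
def collate_pad(sequences, pad_id=0):
    """Pad variable-length sequences to the batch maximum and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    tokens, mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        tokens.append(list(seq) + [pad_id] * pad)
        mask.append([1] * len(seq) + [0] * pad)   # 1 = real token, 0 = padding
    return tokens, mask


def padding_waste(mask):
    """Fraction of the padded batch that is padding rather than data."""
    total = sum(len(row) for row in mask)
    real = sum(sum(row) for row in mask)
    return 1.0 - real / total


batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13, 14]]
tokens, mask = collate_pad(batch)
print(f"padding waste: {padding_waste(mask):.0%}")
```

One long sequence forces every other row to its length, so the waste grows with the spread of lengths inside a batch. The same reasoning applies to variable-resolution images and variable-length clips, and it is another argument for grouping similar samples before collation.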

Deep dive: how to actually diagnose a starved GPU (and why people blame the wrong subsystem)

The symptom is always the same: GPU utilization sawtooths or sits below target, MFU is disappointing, and the instinct is to blame the fabric or the storage bandwidth. Usually it is the loader, and the diagnosis is a short decision tree.

Step 1: is the GPU actually waiting on data? Profile the step. If the GPU shows idle gaps that line up with batch boundaries (a stall at the start of each step that disappears mid-step), data starvation is confirmed; if the GPU is busy but slow, the problem is compute or collectives, not the loader, and you are in Chapter 8.5 territory instead.

Step 2: which stage is the bottleneck? Watch CPU utilization on the data-loader workers. If the CPUs are pinned at 100% while the storage tier is loafing, you are decode/augment-bound, and the fix is more cores, GPU offload (DALI), or pre-decoding (FFCV). If the CPUs are idle while the GPU starves, you are I/O-bound, and the fix is bigger sequential reads (re-shard), deeper prefetch, more parallel connections to object storage, or a local cache. If neither is saturated but the GPU still stalls, you are jitter-bound: a tail-latency read or a single heavy sample is reaching the GPU because the prefetch buffer is too shallow, so deepen it.

Step 3: confirm with a synthetic loader. Replace the real dataset with a loader that yields pre-made random tensors at infinite rate; if GPU utilization jumps to target, the loader was the bottleneck, full stop.

This three-step routine resolves the overwhelming majority of "the GPUs aren't busy" tickets without touching the fabric or buying more storage, and it is the empirical backstop behind the whole goodput thesis of Chapter 9.1.
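The decision tree can be written down as a small function, which is a convenient way to turn it into a runbook check. The inputs are observations an operator would collect from a profiler and host metrics; the function name and the boolean framing are illustrative assumptions, and treating "I/O-bound" as "the storage path is saturated while the CPUs are not" is one reading of the routine, not the only one.

```python
def diagnose(idle_at_batch_boundaries, cpu_saturated, storage_saturated,
             synthetic_loader_hits_target=None):
    """Map the three-step routine onto a verdict string."""
    # Step 1: is the GPU waiting on data at all?
    if not idle_at_batch_boundaries:
        return "not data-starved: look at compute or collectives"
    # Step 3, if the experiment was run: synthetic data that does not help
    # means the loader is not the cause after all.
    if synthetic_loader_hits_target is False:
        return "loader not confirmed: stall persists with synthetic data"
    # Step 2: which stage of the loader is the bottleneck?
    if cpu_saturated:
        return "decode/augment-bound: add cores, offload to GPU, or pre-decode"
    if storage_saturated:
        return "I/O-bound: re-shard, deepen prefetch, add connections, or cache"
    return "jitter-bound: deepen the prefetch buffer"


print(diagnose(idle_at_batch_boundaries=True, cpu_saturated=True,
               storage_saturated=False, synthetic_loader_hits_target=True))
```

Encoding the routine this way also makes the ordering explicit: the CPU check comes before the storage check because a CPU-bound loader often leaves storage looking idle, which is exactly the situation that gets misread as a storage problem.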

Data versioning, lineage, and governance

The loader path also carries an obligation that has moved from nice-to-have to mandatory: knowing, and being able to prove, exactly which data went into a given model. Three capabilities make that possible, and they are increasingly built into the format and loader rather than bolted on after.

Versioning pins a corpus to an immutable, content-addressed snapshot so that "model X was trained on dataset version Y" is a precise, reproducible statement — and so that re-running training reads byte-identical inputs. Lineage records the transformation graph from raw source through dedup, filtering, and tokenization (→ Chapter 9.9) to the shards the loader consumed, so any sample in the corpus can be traced back to its origin. Governance enforces what may enter the corpus and proves what did: license and consent status, PII handling, opt-outs, and regional restrictions. None of this is optional at frontier scale, because the legal regime now treats training-data provenance as a first-class liability surface — copyright, privacy, and data-residency exposure all turn on being able to answer "what was in the corpus?" with evidence. That regime, and the compliance architecture it demands, is the subject of Chapter 10.10; the loader's job is to make the answer cheap to produce by carrying version and lineage metadata in the shards it streams.
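A content-addressed snapshot can be as simple as a manifest whose identifier is a hash over the shard digests and the lineage record. The sketch below is a hypothetical minimal form of that idea using SHA-256; the manifest fields and the lineage keys are illustrative, not a standard schema.

```python
import hashlib
import json


def shard_digest(data: bytes) -> str:
    """Content address of one shard."""
    return hashlib.sha256(data).hexdigest()


def dataset_version(shards: dict, lineage: dict) -> dict:
    """Build a manifest whose id is a hash of shard digests plus lineage."""
    entries = {name: shard_digest(data) for name, data in sorted(shards.items())}
    body = {"shards": entries, "lineage": lineage}
    # Canonical JSON so the same content always hashes to the same id.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return {"version_id": hashlib.sha256(canonical).hexdigest(), **body}


# Hypothetical two-shard corpus with a short lineage record.
lineage = {"source": "raw-crawl", "steps": ["dedup", "filter", "tokenize"]}
v1 = dataset_version({"shard-0": b"alpha", "shard-1": b"beta"}, lineage)
v2 = dataset_version({"shard-0": b"alpha", "shard-1": b"betb"}, lineage)
print(v1["version_id"] != v2["version_id"])
```

Changing a single byte in any shard, or any step in the lineage record, changes the version id, so "model X was trained on version Y" can be checked by recomputing the hash. Recording that id in each checkpoint ties the model to the exact inputs it saw.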

The loader sits on top of the whole Part 9 stack: the goodput thesis it serves is Chapter 9.1; the parallel/distributed file systems it reads from (and the small-file problem sharding defeats) are Chapter 9.2; the NVMe tier and GPUDirect path that move bytes are Chapter 9.3; the checkpoint/restart math behind deterministic resumption is Chapter 9.4; the object-storage capacity tier the corpus streams from is Chapter 9.6; and the storage:compute ratios, egress economics, and data gravity that the streaming-vs-staging fork turns on are Chapter 9.8. The offline corpus-construction machine — dedup, filtering, decontamination, tokenization at corpus scale — is a distinct workload in Chapter 9.9; the legal regime governing what may enter the corpus is Chapter 10.10; and when the loader is not the bottleneck and the GPUs still stall, the fabric and collectives in Chapter 8.5 are where you look next.