Chapter 9.9

The Data-Prep Supercomputer: Offline Data Processing

Before a single GPU sees a token, a second supercomputer — CPU-bound, storage-heavy, and almost always undersized — must dedupe, filter, decontaminate, and tokenize trillions of tokens; treat data prep as an afterthought and you either starve the training fleet of clean tokens or burn frontier-GPU hours doing string processing the wrong silicon should never touch.

GOODPUT · POWER-BOUND

What you'll decide here

Whether data prep runs as a first-class, separately-sized cluster (its own CPU/storage/network profile) or is bolted onto idle GPU nodes — and what that does to your GPU goodput and your prep wall-clock.
Where on the dedup spectrum you sit: exact-match only, fuzzy near-dedup (MinHash/LSH), or embedding-based semantic dedup — and how aggressive a similarity threshold you can afford before you delete the high-quality long tail.
Whether quality and safety filtering is heuristic (cheap, CPU, interpretable) or classifier-based (expensive, often GPU, higher ceiling) — and who owns the false-positive rate that silently shapes your model.
How rigorous your decontamination is against every eval and benchmark you will report — because the leakage you fail to remove is the leaderboard number you cannot trust.
Whether tokenization and the rest of preprocessing is CPU-bound at your token volume, and therefore whether the prep cluster's bottleneck is cores and memory bandwidth, not accelerators.

There are two supercomputers in a frontier training program, and only one of them has GPUs. The first — the one this part of the guide has been about — keeps thousands of accelerators saturated with a clean, shuffled, tokenized stream. The second runs before that, often weeks before, and its job is to manufacture the stream in the first place: ingest raw web crawls and licensed corpora, strip boilerplate, deduplicate at corpus scale, filter for quality and safety, scrub out anything that overlaps your evaluations, and tokenize the survivors into the exact binary format the loader will memory-map. This is the data-prep supercomputer, and it is the most consistently undersized, under-instrumented, and under-respected machine in the building.

The reason it gets disrespected is that it looks like ETL — and ETL is something every org thinks it already knows how to do. But web-scale data prep for a frontier run is not a nightly Spark job. It is a multi-petabyte, trillion-token batch pipeline whose stages have opposite hardware appetites — some embarrassingly parallel and CPU-bound, some metadata-storm small-file workloads, some GPU-accelerated classifier sweeps, some all-to-all shuffles that look like a network benchmark. Size it as one thing and you bottleneck on whichever stage you got wrong. This chapter walks the pipeline stage by stage and ends by sizing the cluster that runs it, which has a power, network, and storage profile deliberately unlike the GPU fleet next door.

Three machines, not one: prep, loader, legal

Keep three concerns distinct or they will contaminate each other. Offline data prep (this chapter) is the one-time-per-corpus batch job that produces the tokenized dataset — measured in petabytes processed and tokens emitted, run on a CPU/storage cluster. The runtime data loader (Chapter 9.5) is the online, per-step path that streams already-prepared tokens into the GPUs without stalling them — measured in GB/s per GPU. The legal regime (Chapter 10.10) governs what you are allowed to put in the corpus — provenance, licensing, opt-outs, PII. Prep is where the legal regime is mechanically enforced (filters and exclusions) and where the loader's input is born. Conflating prep with the loader leads people to spec a fast parallel file system and call it done; conflating it with the legal regime leads to a pipeline that is compliant but technically incapable of processing the volume.

Why prep is a goodput problem, not a side quest

The strategic case for taking prep seriously is a goodput case, and it cuts two ways. First, prep is on the critical path to the run. If your corpus takes six weeks to prepare and your GPU cluster is energized and idle for four of them, you have paid frontier-cluster lease or depreciation for a month of string processing. The prep cluster's wall-clock is a direct deduction from the training program's schedule, and the GPUs cannot start until the first tokens are ready. Second, the quality of prep sets the ceiling on what the GPUs can achieve. The dominant lesson of the open-data era — FineWeb, RefinedWeb, DataComp-LM, Dolma, Nemotron-CC — is that careful filtering and deduplication beats raw scale: a smaller, cleaner corpus trains a better model than a larger, dirtier one at the same token budget. Every duplicate you fail to remove is a GPU-hour spent memorizing boilerplate; every benchmark sample you leak is a leaderboard number you have to caveat.

So the fork is rarely "do prep or skip it." It is how much engineering to invest in prep, and on what hardware. The expensive failure mode is the one nobody plans: discovering, two weeks before a launch, that the prep pipeline cannot tokenize the licensed corpus fast enough on the CPU pool you have, and either delaying the run or — far worse — borrowing GPU nodes to brute-force tokenization, which works and is also one of the most expensive ways imaginable to run a Rust regex engine.

The pipeline, stage by stage

A web-scale prep pipeline is a directed sequence of stages, each with a distinct hardware personality and a distinct fork. The canonical order — text extraction → language ID → quality/heuristic filtering → exact dedup → fuzzy dedup → classifier filtering → decontamination → tokenization → shuffle/shard — is not arbitrary: you filter cheaply before you filter expensively, and you dedup before you tokenize so you never tokenize a duplicate. The table below maps each stage to what binds it, so you can see where the prep cluster's bottleneck actually lives.

Prep pipeline stages → what binds each one

Stage	What it does	Binds on	Hardware fit	Consequence of underspeccing
Text extraction	WARC/HTML to clean text; strip boilerplate	CPU + sequential read bandwidth	CPU pool, streaming from object store	Throughput floor for the whole pipeline
Language ID + heuristic filters	fastText langID, C4-style rules, repetition/perplexity gates	CPU; cheap per-doc	CPU pool	Cheap to run; skipping it makes later stages pay
Exact dedup	Hash whole docs/lines; drop identical	Memory + hash-table I/O	CPU, large RAM, fast scratch	Leaves trivial dupes for fuzzy stage to find
Fuzzy near-dedup	MinHash signatures + LSH banding; drop near-dupes	All-to-all shuffle + memory	CPU at scale, or GPU (RAPIDS/NeMo Curator)	Either the slowest CPU stage or the GPU win
Classifier filtering	Model-scored quality/safety/domain curation	GPU inference throughput	GPU or accelerated CPU	Quality ceiling; false-positive rate set here
Decontamination	n-gram/substring overlap vs eval sets	CPU + index lookups	CPU pool, in-memory eval index	Benchmark leakage; untrustworthy evals
Tokenization	BPE/Unigram encode to token IDs	CPU cores + memory bandwidth	CPU pool (Rust tokenizers)	The CPU-bound wall; serial within a doc
Shuffle + shard	Global shuffle, pack to loader format	Network + storage write bandwidth	Fast scratch + back-end network	Loader stalls or poor mixing at train time

Order is the canonical FineWeb/Nemotron-CC-style pipeline. "Binds on" is the resource that saturates first at trillion-token scale; it is what you size that stage's hardware against.
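The "filter cheaply before you filter expensively" rule can be checked with a little arithmetic. The sketch below is illustrative only: the per-document costs and keep rates are hypothetical placeholders rather than measurements, and it assumes each stage's keep rate does not depend on where it sits in the order.

```python
def pipeline_cost(stages, n_docs):
    """Total cost of running stages in order; each stage only sees survivors.

    stages: list of (name, cost_per_doc, keep_fraction) tuples.
    Returns (total_cost, surviving_docs).
    """
    total, surviving = 0.0, float(n_docs)
    for _name, cost_per_doc, keep_fraction in stages:
        total += surviving * cost_per_doc   # pay for every doc that reaches this stage
        surviving *= keep_fraction          # only the kept docs move on
    return total, surviving


# Hypothetical numbers: arbitrary cost units per document, fraction kept.
cheap_heuristics = ("heuristic filters", 1.0, 0.5)
costly_classifier = ("classifier filtering", 50.0, 0.8)

cheap_first, kept_a = pipeline_cost([cheap_heuristics, costly_classifier], 1_000_000)
costly_first, kept_b = pipeline_cost([costly_classifier, cheap_heuristics], 1_000_000)
```

Both orderings keep the same documents under these assumptions, but the cheap-first ordering pays the expensive per-document cost only on the survivors of the cheap stage.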

Deduplication: the highest-leverage and most dangerous stage

Deduplication is where prep earns its keep and where it most easily destroys value. Common Crawl is roughly half to two-thirds duplicate content by the time you have a few snapshots; the web mirrors, syndicates, and templates itself relentlessly. Removing those duplicates is the single biggest lever on token efficiency, because a model wastes capacity memorizing anything it sees too many times, and over-represented duplicates skew the distribution. But dedup is also a deletion operation with no undo, and the aggressiveness knob — how similar is "the same" — is the most consequential single number in the pipeline.

There are three tiers, and they compose. Exact dedup hashes whole documents or lines and drops identical copies; it is cheap, CPU-bound, embarrassingly parallel, and catches the trivial cases. Fuzzy / near-duplicate dedup is the real work: compute a MinHash signature per document (a compact sketch of its n-gram set), then use Locality-Sensitive Hashing (LSH) to bucket documents whose signatures collide in any band, so you only compare plausibly-similar pairs instead of the quadratic all-pairs explosion. The similarity threshold (FineWeb used 5-grams at a 75% Jaccard threshold) decides what counts as a near-dupe. Semantic dedup goes further: embed every document, cluster in embedding space, and drop near-neighbors that are paraphrases or translations a MinHash would miss — at the cost of an embedding-model inference pass over the entire corpus.
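The MinHash-plus-LSH idea can be sketched in a few lines of standard-library Python. This is a minimal single-machine illustration, not a production implementation: the whitespace word shingling, the `blake2b`-based hash family, and the layout of 112 hashes split into 14 bands of 8 rows are all assumptions chosen for the example; the text above only fixes 5-grams and a 0.75 Jaccard target.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=5):
    """Set of word n-grams (whitespace tokenization is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(shingle_set, num_hashes=112):
    """One minimum per seeded hash function; similar sets share many minima."""
    encoded = [s.encode("utf-8") for s in shingle_set]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(e, digest_size=8, salt=salt).digest(), "little")
            for e in encoded
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=14, rows=8):
    """Bucket documents by each band of their signature; colliding docs are candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidates against the threshold."""
    return len(a & b) / len(a | b)
```

In use, you would compute a signature per document, call `lsh_candidate_pairs`, and then confirm each candidate pair with `jaccard` against the chosen threshold before dropping anything. At corpus scale the bucketing step is the all-to-all shuffle, since every band key must be routed to whichever worker owns that bucket.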

The deduplication fork: exact vs fuzzy vs semantic

Tier	Method	Kills	Cost	When it is the right last stage
Exact	SHA/xxHash on doc or line; drop identical	Byte-identical copies and mirrored pages	Lowest (CPU, near-linear)	Tiny budgets; a pre-pass before fuzzy
Fuzzy	MinHash signatures + LSH banding (e.g. 5-grams, 0.75 Jaccard)	Near-dupes: templated, lightly-edited, reformatted	Moderate; dominated by the all-to-all shuffle	Default for web-scale pretraining corpora
Semantic	Embed corpus, cluster, drop near-neighbors	Paraphrases, translations, semantic restatements	Highest (an embedding inference pass over everything)	When budget is token-constrained and quality is paramount

Compose them in order. The 'kills' column is what each tier removes that the prior tier cannot. Cost is per-token relative, at trillion-token scale.
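How the similarity threshold interacts with LSH banding follows from a standard result: with b bands of r rows each, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve; the 14-band, 8-row layout used in the example is an assumption for illustration, not a configuration stated in this chapter.

```python
def lsh_candidate_probability(similarity, bands, rows):
    """Probability that two docs collide in at least one band.

    A single band matches with probability similarity**rows (all rows agree),
    so the pair is missed only if every one of the bands fails to match.
    """
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands, rows):
    """Rough location of the steep part of the S-curve: (1/bands)**(1/rows)."""
    return (1.0 / bands) ** (1.0 / rows)


# Assumed layout for illustration: 14 bands of 8 rows.
curve = {s: lsh_candidate_probability(s, bands=14, rows=8) for s in (0.5, 0.75, 0.9)}
knee = approximate_threshold(bands=14, rows=8)
```

The curve is an S-shape rather than a hard cutoff, which is why pairs somewhat below the nominal threshold are still sometimes flagged and pairs somewhat above it are sometimes missed. More bands make the filter more aggressive; more rows per band make it more conservative.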

The hardware fork inside dedup is the one with the biggest 2026 swing. Fuzzy dedup at trillion-token scale is an all-to-all shuffle of billions of signatures, and on a CPU cluster it is routinely the slowest stage in the entire pipeline — days of wall-clock on a large cluster. GPU-accelerated dedup (NVIDIA's NeMo Curator on RAPIDS, and the FED-class GPU dedup frameworks) collapses that. NVIDIA reports deduplicating 1.96 trillion tokens (RedPajama-V2 scale) in roughly half an hour on 32 H100s, and a ~16x speedup on fuzzy dedup versus CPU baselines with near-linear scaling across nodes. That is the rare case where the prep cluster legitimately wants a few GPUs — not to train, but to make the dedup stage stop being the long pole. The decision: pay for a large CPU pool to grind the shuffle, or borrow a small GPU partition to do dedup in hours and free the CPUs for tokenization.

Quality and safety filtering: heuristics vs classifiers

After dedup, you decide what is worth keeping. The fork is heuristic filtering vs classifier-based curation, and most serious pipelines use both — heuristics first because they are cheap and interpretable, classifiers second because they have a higher quality ceiling.

Heuristic filters are the C4/Gopher/FineWeb lineage: drop documents by language-ID confidence, mean line length, fraction of alphabetic characters, repetition ratios, presence of boilerplate or blocklisted terms, perplexity under a reference model. They are CPU-cheap, fully auditable, and every threshold is a knob you can explain to a regulator or a model-behavior reviewer. Classifier filters score each document with a trained model — a fastText or transformer quality classifier (often trained to recognize "educational" or instruction-like text, as in FineWeb-Edu and the DataComp-LM curation), plus safety classifiers for toxicity, CSAM signals, and policy categories. These have a much higher ceiling — they capture quality signals no regex can — but they are GPU-bound to run at corpus scale, they are opaque, and their false-positive rate silently shapes the model: a quality classifier that down-weights a dialect, a domain, or a non-English register is making a model-behavior decision disguised as a data-cleaning step.
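A heuristic filter of the kind described above can be sketched as a handful of per-document statistics checked against named thresholds. Everything specific here is an assumption for illustration: the four statistics are a small subset of what real pipelines compute, and the threshold values are hypothetical, not the C4, Gopher, or FineWeb settings.

```python
def heuristic_stats(text):
    """Cheap, interpretable per-document statistics."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    chars = [c for c in text if not c.isspace()]
    return {
        "mean_line_length": sum(len(ln) for ln in lines) / len(lines) if lines else 0.0,
        "alpha_fraction": sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0,
        "duplicate_line_fraction": 1 - len(set(lines)) / len(lines) if lines else 0.0,
        "word_count": len(text.split()),
    }


# Hypothetical thresholds; each one is a knob to be reviewed and ablated.
THRESHOLDS = {
    "min_mean_line_length": 20,
    "min_alpha_fraction": 0.6,
    "max_duplicate_line_fraction": 0.3,
    "min_word_count": 10,
}


def passes_heuristics(text, thresholds=THRESHOLDS):
    """Return (keep, reasons); reasons name every rule the document failed."""
    s = heuristic_stats(text)
    reasons = []
    if s["mean_line_length"] < thresholds["min_mean_line_length"]:
        reasons.append("mean_line_length")
    if s["alpha_fraction"] < thresholds["min_alpha_fraction"]:
        reasons.append("alpha_fraction")
    if s["duplicate_line_fraction"] > thresholds["max_duplicate_line_fraction"]:
        reasons.append("duplicate_line_fraction")
    if s["word_count"] < thresholds["min_word_count"]:
        reasons.append("word_count")
    return (not reasons), reasons
```

Returning the list of failed rules alongside the decision is a deliberate choice in this sketch: it keeps every rejection explainable, which is the property that distinguishes heuristics from classifier scores.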

Own the false-positive rate, or it owns the model

The most under-governed decision in the whole pipeline is the operating point of the safety and quality classifiers. Set the threshold too loose and toxic or low-quality content survives into pretraining; set it too tight and you silently delete entire registers — minority dialects, technical jargon, non-English content, legitimate adult or medical material — because the classifier's training distribution did not represent them. Nobody sees this happen; it shows up months later as a capability gap or a bias the model "mysteriously" has. Make the classifier operating point an explicit, reviewed, version-pinned decision with a measured false-positive rate per category, treat it as a model-shaping hyperparameter (because it is), and keep the filtered-out samples auditable rather than discarded. The cheapest insurance is an ablation: train a small model on filtered vs unfiltered slices and measure the delta before you bet a frontier run on the threshold.
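Measuring a false-positive rate per category is simple once you have a human-labeled audit sample. The sketch below assumes a hypothetical input format of `(category, classifier_score, is_actually_bad)` tuples and a convention that documents scoring at or above the threshold are removed; neither comes from a specific tool.

```python
from collections import defaultdict


def false_positive_rate_by_category(samples, threshold):
    """Fraction of genuinely clean documents removed, per category.

    samples: iterable of (category, score, is_actually_bad) from a labeled audit set.
    A document is removed when score >= threshold (assumed convention).
    """
    clean_total = defaultdict(int)
    clean_removed = defaultdict(int)
    for category, score, is_actually_bad in samples:
        if is_actually_bad:
            continue  # true positives and false negatives do not affect the FPR
        clean_total[category] += 1
        if score >= threshold:
            clean_removed[category] += 1
    return {cat: clean_removed[cat] / n for cat, n in clean_total.items()}


# Hypothetical audit sample: categories and scores are made up for illustration.
audit = [
    ("medical", 0.9, False),
    ("medical", 0.2, False),
    ("medical", 0.95, True),
    ("news", 0.1, False),
    ("news", 0.3, False),
]
rates = false_positive_rate_by_category(audit, threshold=0.5)
```

Reporting the rate per category rather than in aggregate is the point: an acceptable overall false-positive rate can hide a category in which half the clean documents are being deleted.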

Decontamination: the eval you can trust

Decontamination is the stage that protects the meaning of every benchmark you will ever report. If your pretraining corpus contains the test items from GSM8K, MMLU, HumanEval, or whatever you benchmark on, your model can memorize the answers and your reported scores are fiction. This is not hypothetical: independent audits have found MMLU meaningfully contaminated across public corpora, and models have shown high-single-digit to low-double-digit accuracy drops when re-tested on clean mirrors of benchmarks they had leaked. The fork is how aggressively you scrub, and against which eval sets.

The standard mechanism is n-gram / substring overlap: index every eval set you care about, then drop or flag any training document that shares a long enough contiguous span (commonly an 8- to 13-gram match) with a test item. It is CPU-bound, index-lookup-heavy, and conceptually simple. The trap is that it is necessary but not sufficient: paraphrase, translation, and reformatting slip straight past a literal n-gram match, so a corpus can pass decontamination and still be semantically contaminated. The discipline that actually works is procedural, not algorithmic: maintain a frozen, versioned registry of every benchmark, decontaminate against the entire registry (not just the headline three), and, critically, re-run decontamination whenever you add a new eval, because a benchmark you adopt after training was never scrubbed from the corpus the model was trained on. The cost of getting this wrong is not a bug; it is a credibility event when an outside party reproduces your eval on a clean split and your numbers collapse.
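The n-gram overlap mechanism can be sketched as an in-memory index over the eval registry plus a lookup per training document. This is a minimal illustration under stated assumptions: whitespace tokenization, lowercasing as the only normalization, and a registry represented as a plain dict from benchmark name to a list of test items are all choices made for the example.

```python
def ngrams(text, n=13):
    """Set of contiguous word n-grams; empty if the text is shorter than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_registry, n=13):
    """Map every eval n-gram to the set of benchmarks it appears in.

    eval_registry: dict of benchmark name -> list of test-item strings.
    Note: items shorter than n words contribute nothing and need separate handling.
    """
    index = {}
    for benchmark, items in eval_registry.items():
        for item in items:
            for gram in ngrams(item, n):
                index.setdefault(gram, set()).add(benchmark)
    return index


def contaminated_benchmarks(document, index, n=13):
    """Benchmarks that share at least one n-gram with this training document."""
    hits = set()
    for gram in ngrams(document, n):
        hits |= index.get(gram, set())
    return hits
```

Returning the set of benchmarks hit, rather than a bare boolean, lets you flag documents for review and report leakage per benchmark. The sketch also makes the false-negative problem visible: a paraphrased test item shares no long n-gram with the original and passes straight through.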

Deep dive: why n-gram decontamination quietly fails, and what to add

The n-gram overlap method has two opposite failure modes, and a serious pipeline mitigates both. False negatives (contamination it misses): any transformation that breaks the literal token sequence — paraphrasing a question, translating it, swapping multiple-choice option order, reformatting a code prompt — defeats exact substring matching while preserving the information the model can memorize. Recent work shows literal decontamination removing only a fraction of the true leakage; inference-time and embedding-based methods recover further drops of tens of percent on contaminated benchmarks. False positives (clean data it wrongly drops): on multiple-choice benchmarks, unrelated questions share option boilerplate ("A) B) C) D)", common stems), so a naive n-gram match flags and deletes legitimate training documents — a quiet quality tax.

The defensible posture layers three things. (1) Long-n-gram exact as the cheap floor, tuned long enough (13-gram is a common choice) to avoid the false-positive flood from short shared spans. (2) Embedding / semantic overlap against the eval registry to catch paraphrase and translation — the same embedding infrastructure you already stood up for semantic dedup. (3) A held-out clean mirror for the benchmarks you most care about, so you can measure the contamination delta directly rather than trusting that the scrub worked. Decontamination is the one stage where "we ran the standard script" is not a defensible answer — the standard script is the floor, not the ceiling. → quality and provenance governance in Chapter 10.10.

Tokenization and the CPU-bound preprocessing wall

Tokenization is where the prep cluster's true hardware profile reveals itself, and where the most common sizing surprise lives. Encoding text to token IDs with a BPE or Unigram tokenizer is, within a single document, an inherently sequential computation: BPE applies its learned merges in priority order, Unigram runs a Viterbi search over candidate segmentations, and in both cases each decision depends on earlier ones. You cannot GPU-parallelize within a document the way you can a matrix multiply. You parallelize across documents instead, which means tokenization throughput is a function of CPU core count and memory bandwidth, not accelerators. The fast production tokenizers (HuggingFace's Rust tokenizers, and the newer parallel-BPE engines) are 10–100x faster than pure-Python, encoding on the order of a gigabyte of text in tens of seconds per server. But "per server" times trillions of tokens is still a wall, and it is a CPU wall.
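The shape of the problem, serial inside a document and parallel across documents, can be shown with a toy encoder. The greedy longest-match scan below is a deliberately simplified stand-in for a real BPE or Unigram tokenizer, and the vocabulary and function names are hypothetical; the point is only where the loop-carried dependency sits.

```python
from functools import partial


def encode_document(text, vocab, unk_id=0):
    """Toy greedy longest-match encoder (a stand-in, not real BPE or Unigram).

    The next match cannot start until the previous one has been consumed,
    so this loop is serial within a single document.
    """
    max_len = max(len(token) for token in vocab)
    ids, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            ids.append(unk_id)  # no vocabulary entry matched: emit unknown, advance one char
            pos += 1
    return ids


def encode_corpus(docs, vocab, map_fn=map):
    """Documents are independent, so any parallel map can be dropped in.

    For example, pass the .map method of a concurrent.futures.ProcessPoolExecutor
    as map_fn to spread documents across CPU cores.
    """
    return list(map_fn(partial(encode_document, vocab=vocab), docs))
```

Because the only parallelism is across documents, throughput in this model scales with the number of cores you can keep fed, which is the sizing consequence the paragraph above describes.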

This is the CPU-bound preprocessing wall, and it is why the prep cluster is not a GPU cluster. At a frontier corpus of, say, 15 trillion tokens (FineWeb's scale, ~44 TB on disk as GPT-2 tokens), the tokenization pass alone, before counting extraction, filtering, and dedup, is a large CPU-core-hour bill. The decision is whether to size a dedicated CPU pool for it, lean on a managed distributed framework (Spark, Ray, Dask, or NeMo Curator's streaming executor that overlaps CPU and GPU stages), or fall into the anti-pattern of tokenizing on GPU nodes. That last option works, but the accelerators sit idle while their host CPUs do the encoding.

Sizing the prep cluster: a deliberately different machine

The prep supercomputer is sized against the opposite constraints from the GPU fleet, and the headline is that it is CPU-dense, RAM-heavy, storage-bandwidth-bound, and power-light per rack — the inverse of a training hall. Its racks draw a fraction of a liquid-cooled GPU rack's power, it lives happily on air cooling, and its scarce resources are CPU cores, DRAM capacity (dedup hash tables and shuffle buffers are memory-hungry), fast local scratch (NVMe for intermediate shuffle spill), and sustained sequential read bandwidth from the object/capacity tier where the raw corpus lives (→ Chapter 9.6). The network it stresses is the data-center fabric for all-to-all shuffles, not a non-blocking GPU back-end.

The two structural decisions are dedicated vs shared and co-located vs remote. A dedicated prep cluster (or a large pool of general CPU compute) decouples prep wall-clock from GPU availability, so you can prepare the next corpus while the current run trains, at the cost of standing up and powering a second cluster. Sharing idle GPU-node CPUs is capital-efficient but couples your prep schedule to GPU idleness and wastes accelerators. On location: prep wants to run where the data already is, because raw corpora are multi-petabyte and data gravity makes them economically immovable. Cloud egress at $0.05–$0.09/GB puts a single petabyte at roughly $50,000–$90,000 to move, and a ten-petabyte corpus at $500,000–$900,000. Prep is therefore the textbook case of move compute to data, the same gravity logic developed in Chapter 9.8.
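The egress arithmetic is worth doing explicitly for your own corpus size. The sketch below uses the $0.05–$0.09/GB range quoted above and decimal units (1 PB = 1,000,000 GB); the ten-petabyte corpus is a hypothetical example, and real tariffs vary by provider, tier, and negotiated discount.

```python
GB_PER_PB = 1_000_000  # decimal units, as cloud tariffs are usually quoted


def egress_cost_usd(petabytes, usd_per_gb):
    """Cost to move a corpus out of a cloud at a flat per-GB egress rate."""
    return petabytes * GB_PER_PB * usd_per_gb


def egress_cost_range_usd(petabytes, low_rate=0.05, high_rate=0.09):
    """Low and high estimates across the quoted tariff range."""
    return egress_cost_usd(petabytes, low_rate), egress_cost_usd(petabytes, high_rate)


one_pb = egress_cost_range_usd(1)    # roughly $50,000 to $90,000
ten_pb = egress_cost_range_usd(10)   # roughly $500,000 to $900,000 (hypothetical corpus size)
```

The cost scales linearly with corpus size and recurs every time the data moves, which is why the siting decision tends to follow the data rather than the compute.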

Prep cluster vs GPU training fleet: the inverse profile

Axis	GPU training fleet	Data-prep cluster
Binding resource	GPU FLOPs + scale-up bandwidth	CPU cores + DRAM + read bandwidth
Accelerators	The entire point	A small partition for dedup/classifier only, if any
Rack power	120–600+ kW, liquid-cooled	Single-digit to low-tens kW, air-cooled
Network	Non-blocking back-end (IB/RoCE)	DC fabric for all-to-all shuffle; oversubscription tolerable
Storage emphasis	Hot parallel FS, checkpoint write bursts	Capacity/object read bandwidth + fast NVMe scratch
Job shape	Synchronous, restart-from-checkpoint	Batch, idempotent, restartable per-shard
Siting driver	Cheap firm power + cold climate	Where the corpus already lives (data gravity)

The point of the table is that almost every axis is opposite. Sizing the prep cluster with GPU-fleet instincts (liquid cooling, non-blocking fabric, accelerator-first) overspends on the wrong things and underspends on cores, RAM, and read bandwidth.

15T tokens

FineWeb corpus from 96 Common Crawl snapshots (~44 TB on disk as GPT-2 tokens)

Source (2024): FineWeb paper (Penedo et al.), HuggingFace

5-grams @ 0.75

FineWeb fuzzy-dedup config: MinHash over 5-grams, 75% Jaccard, deduped per-snapshot (global dedup regressed quality)

Source (2024): FineWeb paper / HuggingFace blog

~16x

GPU fuzzy-dedup speedup vs CPU baseline; ~1.96T tokens deduped in ~0.5 hr on 32 H100s

Source (2025): NVIDIA NeMo Curator

~50–67%

duplicate share of multi-snapshot Common Crawl removed by dedup before quality filtering

Source (2024): RefinedWeb / FineWeb / Dolma syntheses

8–13-gram

typical contiguous-span threshold for n-gram benchmark decontamination overlap matching

Source (2025): OpenAI/Llama/Dolma decontamination practice

~29%

estimated MMLU contamination across public web corpora; clean-mirror retests drop scores high-single to low-double digits

Source (2025): benchmark-contamination audits (arXiv)

10–100x

Rust tokenizer speedup vs pure-Python; ~1 GB text encoded in tens of seconds per server, still CPU-bound at corpus scale

Source (2025): HuggingFace tokenizers; BlockBPE/fastokens (arXiv)

$0.05–0.09/GB

cloud egress making multi-PB raw corpora economically immovable — prep is move-compute-to-data

Source (2025): cloud provider egress tariffs; SemiAnalysis

Deep dive: a worked prep budget, and where it goes wrong

Walk a frontier-scale corpus through the cluster to see where the hours actually accumulate. Start with raw input on the order of tens of petabytes of crawl plus licensed data sitting in object storage. Extraction + language ID + heuristic filtering stream that through a CPU pool; this stage is bound by sequential read bandwidth from the capacity tier and by core count, and it sets the throughput floor for everything downstream — under-provision the read path here and the whole pipeline waits on object-store bandwidth. Exact dedup is cheap. Fuzzy dedup is the danger: on CPU it is the slowest stage by far (an all-to-all shuffle of billions of MinHash signatures, days of wall-clock, memory-bound on the LSH buckets); a small GPU partition collapses it to hours. Classifier filtering wants GPUs for the inference pass over the survivors. Decontamination is CPU + a large in-memory eval index. Tokenization is the CPU wall — trillions of serial-within-doc encodes, parallelized only across documents, sized purely on core-hours and memory bandwidth. Shuffle + shard stresses scratch NVMe and the fabric.

The three recurring sizing mistakes: (1) provisioning the prep cluster like a small GPU cluster, paying for accelerators that sit idle through the CPU-bound stages; (2) under-sizing DRAM and NVMe scratch, so the fuzzy-dedup shuffle spills to slow storage and the stage that should take hours takes days; (3) under-sizing the read path from the object tier, so extraction — the throughput floor — starves. None of these show up in a small-scale test; they only bite at full corpus volume, which is exactly when the GPU fleet is energized and waiting. The fix is to model the prep cluster as its own machine with its own bottleneck analysis, not as a leftover. → capacity tier in Chapter 9.6; loader hand-off in Chapter 9.5.
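A first-pass bottleneck analysis needs nothing more than work divided by throughput, per stage. The sketch below is a back-of-envelope model under loud assumptions: perfect scaling across nodes, a single throughput figure per stage, and entirely hypothetical work and throughput numbers that you would replace with measurements from a representative sample of your own corpus.

```python
def stage_hours(work, throughput_per_node_hour, nodes):
    """Wall-clock hours for one stage, assuming perfect scaling across nodes."""
    return work / (throughput_per_node_hour * nodes)


def analyze(plan):
    """Return (bottleneck_stage, hours_per_stage) for a dict of stage specs.

    plan: stage name -> (work, throughput_per_node_hour, nodes).
    """
    hours = {name: stage_hours(*spec) for name, spec in plan.items()}
    bottleneck = max(hours, key=hours.get)
    return bottleneck, hours


# Hypothetical plan: work in TB, throughput in TB per node-hour, node count.
plan = {
    "extraction": (20_000, 2.0, 200),
    "fuzzy_dedup": (5_000, 0.05, 200),
    "tokenization": (3_000, 0.5, 200),
}
bottleneck, hours = analyze(plan)
sequential_total = sum(hours.values())  # wall-clock if the stages run back to back
```

Rerunning the model with a degraded throughput for one stage, for example a dedup stage that spills to slow storage, shows quickly how a single under-provisioned resource moves the long pole of the whole pipeline.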

Anti-patterns

The same prep failures recur, because each comes from treating prep as ETL rather than as a sized supercomputer:

Tokenizing on GPU nodes. Borrowing the training fleet to brute-force the CPU-bound tokenization wall when the dedicated CPU pool was undersized. It works, and the accelerators sit idle while their host CPUs do the encoding.
Decontaminating against the headline benchmarks only. Scrubbing GSM8K and MMLU, shipping, then adopting a new eval after training — which was never removed from the corpus. The leaderboard number is contaminated and you will not find out until someone outside reproduces it on a clean split.
Treating dedup threshold and classifier operating point as hygiene, not hyperparameters. Picking a similarity threshold and a quality-classifier cutoff because they sounded reasonable, never ablating them, and discovering a quality regression or a silent capability gap after a frontier run. The canonical case is global dedup across snapshots, which FineWeb found regressed quality relative to per-snapshot dedup because it deletes good content that legitimately recurs across the web.
Sizing the prep cluster with GPU-fleet instincts. Liquid cooling, a non-blocking back-end, accelerator-first — overspending on what prep does not need while starving the cores, DRAM, scratch, and read bandwidth it does. → Chapter 9.8.

Offline prep produces what the runtime loader streams — the online data path is engineered in Chapter 9.5, and the capacity/object tier the corpus lives on is Chapter 9.6. The data-gravity and move-compute-to-data logic that sites the prep cluster is developed in Chapter 9.8; the checkpoint write path that shares the prep cluster's restartable, idempotent batch discipline is Chapter 9.4. What you are legally permitted to put in the corpus — provenance, licensing, opt-outs, PII — is the regime enforced mechanically by these filters, treated in Chapter 10.10. The economics of running this second cluster as a distinct power and capital line item connect to the on-site generation strategy in Chapter 3.5.