The Definitive Guide to AI Data Centers

Chapter 9.9

The Data-Prep Supercomputer: Offline Data Processing

Before a single GPU sees a token, a second supercomputer — CPU-bound, storage-heavy, and almost always undersized — must dedupe, filter, decontaminate, and tokenize trillions of tokens; treat data prep as an afterthought and you either starve the training fleet of clean tokens or burn frontier-GPU hours doing string processing the wrong silicon should never touch.

GOODPUT · POWER-BOUND

What you'll decide here

  1. Whether data prep runs as a first-class, separately-sized cluster (its own CPU/storage/network profile) or is bolted onto idle GPU nodes — and what that does to your GPU goodput and your prep wall-clock.
  2. Where on the dedup spectrum you sit: exact-match only, fuzzy near-dedup (MinHash/LSH), or embedding-based semantic dedup — and how aggressive a similarity threshold you can afford before you delete the high-quality long tail.
  3. Whether quality and safety filtering is heuristic (cheap, CPU, interpretable) or classifier-based (expensive, often GPU, higher ceiling) — and who owns the false-positive rate that silently shapes your model.
  4. How rigorous your decontamination is against every eval and benchmark you will report — because the leakage you fail to remove is the leaderboard number you cannot trust.
  5. Whether tokenization and the rest of preprocessing is CPU-bound at your token volume, and therefore whether the prep cluster's bottleneck is cores and memory bandwidth, not accelerators.

There are two supercomputers in a frontier training program, and only one of them has GPUs. The first — the one this part of the guide has been about — keeps thousands of accelerators saturated with a clean, shuffled, tokenized stream. The second runs before that, often weeks before, and its job is to manufacture the stream in the first place: ingest raw web crawls and licensed corpora, strip boilerplate, deduplicate at corpus scale, filter for quality and safety, scrub out anything that overlaps your evaluations, and tokenize the survivors into the exact binary format the loader will memory-map. This is the data-prep supercomputer, and it is the most consistently undersized, under-instrumented, and under-respected machine in the building.

The reason it gets disrespected is that it looks like ETL — and ETL is something every org thinks it already knows how to do. But web-scale data prep for a frontier run is not a nightly Spark job. It is a multi-petabyte, trillion-token batch pipeline whose stages have opposite hardware appetites — some embarrassingly parallel and CPU-bound, some metadata-storm small-file workloads, some GPU-accelerated classifier sweeps, some all-to-all shuffles that look like a network benchmark. Size it as one thing and you bottleneck on whichever stage you got wrong. This chapter walks the pipeline stage by stage and ends by sizing the cluster that runs it, which has a power, network, and storage profile deliberately unlike the GPU fleet next door.

Why prep is a goodput problem, not a side quest

The strategic case for taking prep seriously is a goodput case, and it cuts two ways. First, prep is on the critical path to the run. If your corpus takes six weeks to prepare and your GPU cluster is energized and idle for four of them, you have paid frontier-cluster lease or depreciation for a month of string processing. The prep cluster's wall-clock is a direct deduction from the training program's schedule, and the GPUs cannot start until the first tokens are ready. Second, the quality of prep sets the ceiling on what the GPUs can achieve. The dominant lesson of the open-data era — FineWeb, RefinedWeb, DataComp-LM, Dolma, Nemotron-CC — is that careful filtering and deduplication beats raw scale: a smaller, cleaner corpus trains a better model than a larger, dirtier one at the same token budget. Every duplicate you fail to remove is a GPU-hour spent memorizing boilerplate; every benchmark sample you leak is a leaderboard number you have to caveat.

So the fork is rarely "do prep or skip it." It is how much engineering to invest in prep, and on what hardware. The expensive failure mode is the one nobody plans: discovering, two weeks before a launch, that the prep pipeline cannot tokenize the licensed corpus fast enough on the CPU pool you have, and either delaying the run or — far worse — borrowing GPU nodes to brute-force tokenization, which works and is also one of the most expensive ways imaginable to run a Rust regex engine.

The pipeline, stage by stage

A web-scale prep pipeline is a directed sequence of stages, each with a distinct hardware personality and a distinct fork. The canonical order — text extraction → language ID → quality/heuristic filtering → exact dedup → fuzzy dedup → classifier filtering → decontamination → tokenization → shuffle/shard — is not arbitrary: you filter cheaply before you filter expensively, and you dedup before you tokenize so you never tokenize a duplicate. The table below maps each stage to what binds it, so you can see where the prep cluster's bottleneck actually lives.

Prep pipeline stages → what binds each one
| Stage | What it does | Binds on | Hardware fit | Consequence of underspeccing |
|---|---|---|---|---|
| Text extraction | WARC/HTML to clean text; strip boilerplate | CPU + sequential read bandwidth | CPU pool, streaming from object store | Throughput floor for the whole pipeline |
| Language ID + heuristic filters | fastText langID, C4-style rules, repetition/perplexity gates | CPU; cheap per-doc | CPU pool | Cheap to run; skipping it makes later stages pay |
| Exact dedup | Hash whole docs/lines; drop identical | Memory + hash-table I/O | CPU, large RAM, fast scratch | Leaves trivial dupes for fuzzy stage to find |
| Fuzzy near-dedup | MinHash signatures + LSH banding; drop near-dupes | All-to-all shuffle + memory | CPU at scale, or GPU (RAPIDS/NeMo Curator) | Either the slowest CPU stage or the GPU win |
| Classifier filtering | Model-scored quality/safety/domain curation | GPU inference throughput | GPU or accelerated CPU | Quality ceiling; false-positive rate set here |
| Decontamination | n-gram/substring overlap vs eval sets | CPU + index lookups | CPU pool, in-memory eval index | Benchmark leakage; untrustworthy evals |
| Tokenization | BPE/Unigram encode to token IDs | CPU cores + memory bandwidth | CPU pool (Rust tokenizers) | The CPU-bound wall; serial within a doc |
| Shuffle + shard | Global shuffle, pack to loader format | Network + storage write bandwidth | Fast scratch + back-end network | Loader stalls or poor mixing at train time |
Order is the canonical FineWeb/Nemotron-CC-style pipeline. "Binds on" is the resource that saturates first at trillion-token scale; it is what you size that stage's hardware against.

Deduplication: the highest-leverage and most dangerous stage

Deduplication is where prep earns its keep and where it most easily destroys value. Common Crawl is roughly half to two-thirds duplicate content by the time you have a few snapshots; the web mirrors, syndicates, and templates itself relentlessly. Removing those duplicates is the single biggest lever on token efficiency, because a model wastes capacity memorizing anything it sees too many times, and over-represented duplicates skew the distribution. But dedup is also a deletion operation with no undo, and the aggressiveness knob — how similar is "the same" — is the most consequential single number in the pipeline.

There are three tiers, and they compose. Exact dedup hashes whole documents or lines and drops identical copies; it is cheap, CPU-bound, embarrassingly parallel, and catches the trivial cases. Fuzzy / near-duplicate dedup is the real work: compute a MinHash signature per document (a compact sketch of its n-gram set), then use Locality-Sensitive Hashing (LSH) to bucket documents whose signatures collide in any band, so you only compare plausibly-similar pairs instead of the quadratic all-pairs explosion. The similarity threshold (FineWeb used 5-grams at a 75% Jaccard threshold) decides what counts as a near-dupe. Semantic dedup goes further: embed every document, cluster in embedding space, and drop near-neighbors that are paraphrases or translations a MinHash would miss — at the cost of an embedding-model inference pass over the entire corpus.
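The signature-and-banding mechanism can be sketched in plain Python. This is a minimal illustration, not any production pipeline's code: the 112-hash signature split into 14 bands of 8 rows is an assumed configuration chosen for the example (the chapter specifies only 5-grams and a 0.75 threshold), whitespace-split word shingles stand in for whatever tokenization a real pipeline uses, and `minhash_signature`, `candidate_pairs`, and `collision_probability` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 112             # assumption: signature length
BANDS = 14                   # assumption: number of LSH bands
ROWS = NUM_HASHES // BANDS   # 8 signature rows per band


def shingles(text, n=5):
    """Set of word n-grams; documents shorter than n words become one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(shingle, seed):
    # One seeded 64-bit hash per simulated permutation (stdlib blake2b with a salt).
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """Compact sketch of the shingle set: the minimum hash under each seed."""
    grams = shingles(text)
    return tuple(min(_seeded_hash(g, seed) for g in grams) for seed in range(NUM_HASHES))


def candidate_pairs(signatures):
    """signatures: dict doc_id -> signature. Returns pairs that collide in any band."""
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[band * ROWS:(band + 1) * ROWS]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def collision_probability(s, rows=ROWS, bands=BANDS):
    """Chance that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands
```

With these assumed settings, `collision_probability(0.75)` evaluates to about 0.77. The band/row split turns the threshold into a steep S-curve rather than a hard cutoff, so some pairs just above 0.75 are missed and some just below are caught. Candidate pairs would normally be verified or clustered before anything is dropped.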

The deduplication fork: exact vs fuzzy vs semantic
| Tier | Method | Kills | Cost | When it is the right last stage |
|---|---|---|---|---|
| Exact | SHA/xxHash on doc or line; drop identical | Byte-identical copies and mirrored pages | Lowest (CPU, near-linear) | Tiny budgets; a pre-pass before fuzzy |
| Fuzzy | MinHash signatures + LSH banding (e.g. 5-grams, 0.75 Jaccard) | Near-dupes: templated, lightly-edited, reformatted | Moderate; dominated by the all-to-all shuffle | Default for web-scale pretraining corpora |
| Semantic | Embed corpus, cluster, drop near-neighbors | Paraphrases, translations, semantic restatements | Highest (an embedding inference pass over everything) | When budget is token-constrained and quality is paramount |
Compose them in order. The 'kills' column is what each tier removes that the prior tier cannot. Cost is per-token relative, at trillion-token scale.
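As a minimal sketch of the exact tier used as a pre-pass, the function below keeps the first copy of each document and drops later identical ones. It uses SHA-256 from the standard library where the table mentions SHA/xxHash. The name `exact_dedup` and its optional `normalize` flag are hypothetical, and the whitespace-and-case normalization is an assumption that goes slightly beyond strict byte-identity.

```python
import hashlib


def exact_dedup(docs, normalize=False):
    """Yield (doc_id, text) for the first occurrence of each distinct document.

    docs: iterable of (doc_id, text) pairs.
    normalize: if True, collapse whitespace and lowercase before hashing
               (an assumption; the default hashes the text exactly as given).
    """
    seen = set()  # digests of documents already emitted; this is the RAM cost
    for doc_id, text in docs:
        key = " ".join(text.lower().split()) if normalize else text
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        if digest in seen:
            continue  # identical copy of something already kept
        seen.add(digest)
        yield doc_id, text
```

The `seen` set is why this stage binds on memory: it grows with the number of distinct documents. At corpus scale it would not fit in one process, and partitioning the work by hash value is one common way to spread it out (not shown here).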

The hardware fork inside dedup is the one with the biggest 2026 swing. Fuzzy dedup at trillion-token scale is an all-to-all shuffle of billions of signatures, and on a CPU cluster it is routinely the slowest stage in the entire pipeline — days of wall-clock on a large cluster. GPU-accelerated dedup (NVIDIA's NeMo Curator on RAPIDS, and the FED-class GPU dedup frameworks) collapses that. NVIDIA reports deduplicating 1.96 trillion tokens (RedPajama-V2 scale) in roughly half an hour on 32 H100s, and a ~16x speedup on fuzzy dedup versus CPU baselines with near-linear scaling across nodes. That is the rare case where the prep cluster legitimately wants a few GPUs — not to train, but to make the dedup stage stop being the long pole. The decision: pay for a large CPU pool to grind the shuffle, or borrow a small GPU partition to do dedup in hours and free the CPUs for tokenization.

Quality and safety filtering: heuristics vs classifiers

Dedup removes what is repeated; filtering decides what is worth keeping. The fork is heuristic filtering vs classifier-based curation, and most serious pipelines use both: heuristics first (ahead of dedup in the canonical order above, because they are cheap and interpretable), classifiers second (after dedup, because they cost more per document and have a higher quality ceiling).

Heuristic filters are the C4/Gopher/FineWeb lineage: drop documents by language-ID confidence, mean line length, fraction of alphabetic characters, repetition ratios, presence of boilerplate or blocklisted terms, perplexity under a reference model. They are CPU-cheap, fully auditable, and every threshold is a knob you can explain to a regulator or a model-behavior reviewer. Classifier filters score each document with a trained model — a fastText or transformer quality classifier (often trained to recognize "educational" or instruction-like text, as in FineWeb-Edu and the DataComp-LM curation), plus safety classifiers for toxicity, CSAM signals, and policy categories. These have a much higher ceiling — they capture quality signals no regex can — but they are GPU-bound to run at corpus scale, they are opaque, and their false-positive rate silently shapes the model: a quality classifier that down-weights a dialect, a domain, or a non-English register is making a model-behavior decision disguised as a data-cleaning step.
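A few of the heuristic signals named above can be sketched as follows. The thresholds are hypothetical placeholders chosen for illustration, not values taken from C4, Gopher, or FineWeb, and `heuristic_signals` and `keep` are made-up helper names. Returning the reasons alongside the decision is what keeps the filter auditable.

```python
from collections import Counter

# Hypothetical thresholds for illustration only; real pipelines tune and ablate these.
MIN_MEAN_LINE_LEN = 20
MIN_ALPHA_FRACTION = 0.6
MAX_DUP_LINE_FRACTION = 0.3


def heuristic_signals(text):
    """Compute cheap per-document statistics on non-empty lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {"mean_line_len": 0.0, "alpha_fraction": 0.0, "dup_line_fraction": 1.0}
    chars = [c for c in text if not c.isspace()]
    counts = Counter(lines)
    repeated = sum(n - 1 for n in counts.values())  # copies beyond the first
    return {
        "mean_line_len": sum(len(ln) for ln in lines) / len(lines),
        "alpha_fraction": sum(c.isalpha() for c in chars) / len(chars),
        "dup_line_fraction": repeated / len(lines),
    }


def keep(text):
    """Return (keep_document, reasons); reasons lists every rule that fired."""
    s = heuristic_signals(text)
    reasons = []
    if s["mean_line_len"] < MIN_MEAN_LINE_LEN:
        reasons.append("short_lines")
    if s["alpha_fraction"] < MIN_ALPHA_FRACTION:
        reasons.append("low_alpha")
    if s["dup_line_fraction"] > MAX_DUP_LINE_FRACTION:
        reasons.append("repeated_lines")
    return (not reasons), reasons
```

Logging the `reasons` list per dropped document gives a per-rule rejection count, which is the kind of evidence a threshold review can actually use.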

Decontamination: the eval you can trust

Decontamination is the stage that protects the meaning of every benchmark you will ever report. If your pretraining corpus contains the test items from GSM8K, MMLU, HumanEval, or whatever you benchmark on, your model can memorize the answers and your reported scores are fiction. This is not hypothetical: independent audits have found MMLU meaningfully contaminated across public corpora, and models have shown high-single-digit to low-double-digit accuracy drops when re-tested on clean mirrors of benchmarks they had leaked. The fork is how aggressively you scrub, and against which eval sets.

The standard mechanism is n-gram / substring overlap: index every eval set you care about, then drop or flag any training document that shares a long enough contiguous span (commonly an 8- to 13-gram match) with a test item. It is CPU-bound, index-lookup-heavy, and conceptually simple. The trap is that it is necessary but not sufficient: paraphrase, translation, and reformatting slip straight past a literal n-gram match, so a corpus can pass decontamination and still be semantically contaminated. The discipline that actually works is procedural, not algorithmic: maintain a frozen, versioned registry of every benchmark, decontaminate against the entire registry (not just the headline three), and — critically — re-run decontamination whenever you add a new eval, because a benchmark you adopt after training was never scrubbed from the corpus it was trained on. The cost of getting this wrong is not a bug; it is a credibility event when an outside party reproduces your eval on a clean split and your numbers collapse.
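The overlap check, and its blind spot, can be sketched in a few lines. This is an illustrative version under stated assumptions: whitespace tokenization, a 13-gram window, and a hypothetical registry shape (benchmark name mapped to a list of test-item strings). `build_eval_index` and `contaminated_by` are made-up names, not a real tool's API.

```python
def ngrams(text, n):
    """Set of contiguous word n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_registry, n=13):
    """eval_registry: dict benchmark_name -> list of test-item strings.

    Returns dict n-gram -> set of benchmark names containing it.
    """
    index = {}
    for name, items in eval_registry.items():
        for item in items:
            for gram in ngrams(item, n):
                index.setdefault(gram, set()).add(name)
    return index


def contaminated_by(doc, index, n=13):
    """Names of every benchmark sharing at least one n-gram with the document."""
    hits = set()
    for gram in ngrams(doc, n):
        hits |= index.get(gram, set())
    return hits


# Usage: a verbatim leak is flagged, a paraphrase of the same item is not.
registry = {"toy_math": [
    "a train leaves the station at nine and travels sixty miles per hour toward the coast"
]}
index = build_eval_index(registry)
leak = "forum post: " + registry["toy_math"][0] + " what time does it arrive"
paraphrase = "departing at nine, a train heads for the coast at sixty miles an hour"
flagged_leak = contaminated_by(leak, index)
flagged_paraphrase = contaminated_by(paraphrase, index)
```

Here `flagged_leak` names the benchmark and `flagged_paraphrase` is empty, which is the necessary-but-not-sufficient gap in miniature. Note also that test items shorter than the window contribute nothing to the index, so very short items need separate handling.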

Deep dive: why n-gram decontamination quietly fails, and what to add

The n-gram overlap method has two opposite failure modes, and a serious pipeline mitigates both. False negatives (contamination it misses): any transformation that breaks the literal token sequence — paraphrasing a question, translating it, swapping multiple-choice option order, reformatting a code prompt — defeats exact substring matching while preserving the information the model can memorize. Recent work shows literal decontamination removing only a fraction of the true leakage; inference-time and embedding-based methods recover further drops of tens of percent on contaminated benchmarks. False positives (clean data it wrongly drops): on multiple-choice benchmarks, unrelated questions share option boilerplate ("A) B) C) D)", common stems), so a naive n-gram match flags and deletes legitimate training documents — a quiet quality tax.

The defensible posture layers three things. (1) Long-n-gram exact as the cheap floor, tuned long enough (13-gram is a common choice) to avoid the false-positive flood from short shared spans. (2) Embedding / semantic overlap against the eval registry to catch paraphrase and translation — the same embedding infrastructure you already stood up for semantic dedup. (3) A held-out clean mirror for the benchmarks you most care about, so you can measure the contamination delta directly rather than trusting that the scrub worked. Decontamination is the one stage where "we ran the standard script" is not a defensible answer — the standard script is the floor, not the ceiling. → quality and provenance governance in Chapter 10.10.

Tokenization and the CPU-bound preprocessing wall

Tokenization is where the prep cluster's true hardware profile reveals itself, and where the most common sizing surprise lives. Encoding text to token IDs with a BPE or Unigram tokenizer is, within a single document, an inherently sequential computation. The simplest mental model is a longest-match scan: you read left to right, match the longest token, consume it, and start the next match where the last one ended. Production BPE applies its learned merges in rank order and Unigram runs a Viterbi search rather than a literal longest-match, but the property that matters is the same: each step depends on the result of the previous one. You cannot GPU-parallelize within a document the way you can a matrix multiply. You parallelize across documents instead, which means tokenization throughput is a function of CPU core count and memory bandwidth, not accelerators. The fast production tokenizers (HuggingFace's Rust tokenizers, and the newer parallel-BPE engines) are 10–100x faster than pure-Python, encoding on the order of a gigabyte of text in tens of seconds per server. But "per server" times trillions of tokens is still a wall, and it is a CPU wall.
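The within-document dependency is easiest to see in a toy encoder. The sketch below uses the ranked-merge formulation of BPE (repeatedly merge the adjacent pair with the lowest learned rank) rather than a literal longest-match scan. The merge table is a made-up three-entry example, and `bpe_encode`, `encode_document`, and `encode_corpus` are hypothetical names, not the API of any tokenizer library.

```python
from functools import partial


def bpe_encode(chunk, merge_ranks):
    """Apply ranked merges to one pre-tokenized chunk; returns a list of symbols.

    Each pass picks the lowest-rank adjacent pair in the current symbol list,
    so pass k cannot start until pass k-1 has finished: the loop is sequential.
    """
    symbols = list(chunk)
    while len(symbols) > 1:
        best = None  # (rank, index) of the best mergeable pair seen so far
        for i in range(len(symbols) - 1):
            rank = merge_ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no learned merge applies any more
        i = best[1]
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def encode_document(text, merge_ranks):
    """Encode one document: whitespace pre-tokenization, then BPE per chunk."""
    out = []
    for chunk in text.split():
        out.extend(bpe_encode(chunk, merge_ranks))
    return out


def encode_corpus(docs, merge_ranks, map_fn=map):
    """Documents are independent, so map_fn is where parallelism goes.

    Assumption: in a real job map_fn would be something like a process
    pool's map; the built-in map keeps this sketch self-contained.
    """
    return list(map_fn(partial(encode_document, merge_ranks=merge_ranks), docs))


# Toy merge table: lower rank means the merge was learned earlier.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
encoded = encode_corpus(["lower", "low er"], MERGES)
```

The only seam for parallelism is `map_fn`, across documents; nothing inside `bpe_encode` can be split up without changing the result, which is why throughput tracks core count.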

This is the CPU-bound preprocessing wall, and it is why the prep cluster is not a GPU cluster. At a frontier corpus of, say, 15 trillion tokens (FineWeb's scale, ~44 TB on disk as GPT-2 tokens), the tokenization pass alone, before counting extraction, filtering, and dedup, is a large CPU-core-hour bill. The decision the table below frames is whether to size a dedicated CPU pool for it, lean on a managed distributed framework (Spark, Ray, Dask, or NeMo Curator's streaming executor that overlaps CPU and GPU stages), or, as the anti-pattern, burn GPU nodes whose accelerators sit idle while their host CPUs do the tokenizing. The last option works, but you pay frontier-GPU rates for CPU work.

Sizing the prep cluster: a deliberately different machine

The prep supercomputer is sized against the opposite constraints from the GPU fleet, and the headline is that it is CPU-dense, RAM-heavy, storage-bandwidth-bound, and power-light per rack — the inverse of a training hall. Its racks draw a fraction of a liquid-cooled GPU rack's power, it lives happily on air cooling, and its scarce resources are CPU cores, DRAM capacity (dedup hash tables and shuffle buffers are memory-hungry), fast local scratch (NVMe for intermediate shuffle spill), and sustained sequential read bandwidth from the object/capacity tier where the raw corpus lives (→ Chapter 9.6). The network it stresses is the data-center fabric for all-to-all shuffles, not a non-blocking GPU back-end.

The two structural decisions are dedicated vs shared and co-located vs remote. A dedicated prep cluster (or a large pool of general CPU compute) decouples prep wall-clock from GPU availability, so you can prepare the next corpus while the current run trains, at the cost of standing up and powering a second cluster. Sharing idle GPU-node CPUs is capital-efficient but couples your prep schedule to GPU idleness and wastes accelerators. On location: prep wants to run where the data already is, because raw corpora are multi-petabyte and data gravity makes them economically immovable. Cloud egress at $0.05–$0.09/GB makes moving a single petabyte a $50,000–$90,000 line item, and a multi-petabyte corpus a six-figure one. Prep is therefore the textbook case of move compute to data, the same gravity logic developed in Chapter 9.8.
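The egress arithmetic is simple enough to check directly. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and flat per-GB pricing with no tiering or committed-use discounts; `egress_cost_usd` and `egress_range_usd` are hypothetical helper names.

```python
def egress_cost_usd(petabytes, usd_per_gb):
    """Cost of moving a corpus out of a cloud at a flat per-GB egress rate."""
    gigabytes = petabytes * 1_000_000  # decimal units: 1 PB = 1,000,000 GB
    return gigabytes * usd_per_gb


def egress_range_usd(petabytes, low=0.05, high=0.09):
    """(low, high) cost bounds using the per-GB range quoted in the text."""
    return egress_cost_usd(petabytes, low), egress_cost_usd(petabytes, high)
```

At the quoted rates, `egress_range_usd(1)` gives roughly $50,000 to $90,000 and `egress_range_usd(10)` gives roughly $500,000 to $900,000, which is where the five-to-six-figure range comes from.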

Prep cluster vs GPU training fleet: the inverse profile
| Axis | GPU training fleet | Data-prep cluster |
|---|---|---|
| Binding resource | GPU FLOPs + scale-up bandwidth | CPU cores + DRAM + read bandwidth |
| Accelerators | The entire point | A small partition for dedup/classifier only, if any |
| Rack power | 120–600+ kW, liquid-cooled | Single-digit to low-tens kW, air-cooled |
| Network | Non-blocking back-end (IB/RoCE) | DC fabric for all-to-all shuffle; oversubscription tolerable |
| Storage emphasis | Hot parallel FS, checkpoint write bursts | Capacity/object read bandwidth + fast NVMe scratch |
| Job shape | Synchronous, restart-from-checkpoint | Batch, idempotent, restartable per-shard |
| Siting driver | Cheap firm power + cold climate | Where the corpus already lives (data gravity) |
The point of the table is that almost every axis is opposite. Sizing the prep cluster with GPU-fleet instincts (liquid cooling, non-blocking fabric, accelerator-first) overspends on the wrong things and underspends on cores, RAM, and read bandwidth.
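The "batch, idempotent, restartable per-shard" job shape in the table can be illustrated with a small sketch. It assumes a POSIX-style filesystem where a rename within one directory is atomic (object stores need a different commit mechanism), and `process_shard` with its `.out` and `.tmp` naming is a hypothetical convention, not any framework's API.

```python
import os
import tempfile


def process_shard(in_path, out_dir, transform):
    """Process one shard at most once; safe to re-run after a crash.

    transform: function applied to each input line, returning the output line.
    Returns "skipped" if a committed output already exists, else "done".
    """
    out_path = os.path.join(out_dir, os.path.basename(in_path) + ".out")
    if os.path.exists(out_path):
        return "skipped"  # an earlier attempt already committed this shard

    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
                open(in_path, encoding="utf-8") as src:
            for line in src:
                dst.write(transform(line))
        os.replace(tmp_path, out_path)  # the rename is the commit point
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a partial output behind
        raise
    return "done"
```

Because a shard's output either exists in full or not at all, a scheduler can simply resubmit every shard after a failure and let completed ones return "skipped". That is a much cheaper recovery model than the synchronous restart-from-checkpoint shape of the training fleet.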

  • 15T tokens: FineWeb corpus from 96 Common Crawl snapshots (~44 TB on disk as GPT-2 tokens). Source (2024): FineWeb paper (Penedo et al.), HuggingFace.
  • 5-grams @ 0.75: FineWeb fuzzy-dedup config: MinHash over 5-grams, 75% Jaccard, deduped per-snapshot (global dedup regressed quality). Source (2024): FineWeb paper / HuggingFace blog.
  • ~16x: GPU fuzzy-dedup speedup vs CPU baseline; ~1.96T tokens deduped in ~0.5 hr on 32 H100s. Source (2025): NVIDIA NeMo Curator.
  • ~50–67%: duplicate share of multi-snapshot Common Crawl removed by dedup before quality filtering. Source (2024): RefinedWeb / FineWeb / Dolma syntheses.
  • 8–13-gram: typical contiguous-span threshold for n-gram benchmark decontamination overlap matching. Source (2025): OpenAI/Llama/Dolma decontamination practice.
  • ~29%: estimated MMLU contamination across public web corpora; clean-mirror retests drop scores high-single to low-double digits. Source (2025): benchmark-contamination audits (arXiv).
  • 10–100x: Rust tokenizer speedup vs pure-Python; ~1 GB text encoded in tens of seconds per server, still CPU-bound at corpus scale. Source (2025): HuggingFace tokenizers; BlockBPE/fastokens (arXiv).
  • $0.05–0.09/GB: cloud egress making multi-PB raw corpora economically immovable; prep is move-compute-to-data. Source (2025): cloud provider egress tariffs; SemiAnalysis.

Deep dive: a worked prep budget, and where it goes wrong

Walk a frontier-scale corpus through the cluster to see where the hours actually accumulate. Start with raw input on the order of tens of petabytes of crawl plus licensed data sitting in object storage. Extraction + language ID + heuristic filtering stream that through a CPU pool; this stage is bound by sequential read bandwidth from the capacity tier and by core count, and it sets the throughput floor for everything downstream — under-provision the read path here and the whole pipeline waits on object-store bandwidth. Exact dedup is cheap. Fuzzy dedup is the danger: on CPU it is the slowest stage by far (an all-to-all shuffle of billions of MinHash signatures, days of wall-clock, memory-bound on the LSH buckets); a small GPU partition collapses it to hours. Classifier filtering wants GPUs for the inference pass over the survivors. Decontamination is CPU + a large in-memory eval index. Tokenization is the CPU wall — trillions of serial-within-doc encodes, parallelized only across documents, sized purely on core-hours and memory bandwidth. Shuffle + shard stresses scratch NVMe and the fabric.

The three recurring sizing mistakes: (1) provisioning the prep cluster like a small GPU cluster, paying for accelerators that sit idle through the CPU-bound stages; (2) under-sizing DRAM and NVMe scratch, so the fuzzy-dedup shuffle spills to slow storage and the stage that should take hours takes days; (3) under-sizing the read path from the object tier, so extraction — the throughput floor — starves. None of these show up in a small-scale test; they only bite at full corpus volume, which is exactly when the GPU fleet is energized and waiting. The fix is to model the prep cluster as its own machine with its own bottleneck analysis, not as a leftover. → capacity tier in Chapter 9.6; loader hand-off in Chapter 9.5.
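A first-pass bottleneck analysis can be as simple as dividing each stage's input volume by its sustained throughput and seeing which stage is the long pole. The sketch below is a deliberately crude model under stated assumptions: stages run one after another, throughput is constant, and every number in the example is a hypothetical placeholder rather than a measurement. `stage_hours` and `long_pole` are made-up helper names.

```python
def stage_hours(input_tb, throughput_tb_per_hour):
    """Wall-clock hours for one stage at a constant sustained throughput."""
    return input_tb / throughput_tb_per_hour


def long_pole(stages):
    """stages: list of (name, input_tb, throughput_tb_per_hour).

    Returns (hours_by_stage, name_of_slowest_stage, total_hours),
    assuming the stages run sequentially with no overlap.
    """
    hours = {name: stage_hours(tb, tput) for name, tb, tput in stages}
    slowest = max(hours, key=hours.get)
    return hours, slowest, sum(hours.values())


# Hypothetical placeholder inputs; replace with measurements from a pilot run.
EXAMPLE_STAGES = [
    ("extraction", 1000.0, 50.0),
    ("fuzzy_dedup", 400.0, 5.0),
    ("tokenization", 200.0, 10.0),
]
hours, slowest, total = long_pole(EXAMPLE_STAGES)
```

The value of even this crude model is that it forces a throughput number for every stage. The three sizing mistakes above each show up as one input that was never measured at full corpus volume.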

Anti-patterns

The same prep failures recur, because each comes from treating prep as ETL rather than as a sized supercomputer:

  • Tokenizing on GPU nodes. Borrowing the training fleet to brute-force the CPU-bound tokenization wall when the dedicated CPU pool was undersized. It works, and the accelerators sit idle while their host CPUs do the encoding.
  • Decontaminating against the headline benchmarks only. Scrubbing GSM8K and MMLU, shipping, then adopting a new eval after training — which was never removed from the corpus. The leaderboard number is contaminated and you will not find out until someone outside reproduces it on a clean split.
  • Treating dedup threshold and classifier operating point as hygiene, not hyperparameters. Picking a similarity threshold and a quality-classifier cutoff because they sounded reasonable, never ablating them, and discovering a quality regression or a silent capability gap after a frontier run. Global dedup that deletes the good recurring web is the canonical case.
  • Sizing the prep cluster with GPU-fleet instincts. Liquid cooling, a non-blocking back-end, accelerator-first — overspending on what prep does not need while starving the cores, DRAM, scratch, and read bandwidth it does. → Chapter 9.8.
Offline prep produces what the runtime loader streams — the online data path is engineered in Chapter 9.5, and the capacity/object tier the corpus lives on is Chapter 9.6. The data-gravity and move-compute-to-data logic that sites the prep cluster is developed in Chapter 9.8; the checkpoint write path that shares the prep cluster's restartable, idempotent batch discipline is Chapter 9.4. What you are legally permitted to put in the corpus — provenance, licensing, opt-outs, PII — is the regime enforced mechanically by these filters, treated in Chapter 10.10. The economics of running this second cluster as a distinct power and capital line item connect to the on-site generation strategy in Chapter 3.5.