Guide › Compute, Silicon & System Integration › 7.15

Chapter 7.15

Deployment Velocity & Cabling at Scale

Time-to-goodput is set on the floor, not in the design package: the rate at which racks land and links light — and the discipline that keeps mis-cabling off the acceptance critical path — is the velocity metric that converts an energized shell into a productive cluster, and pre-terminated cabling plus off-line optics screening are the two levers that move it most.

DENSITY-RAMPGOODPUTPOWER-BOUND

What you'll decide here

Whether to field-terminate fiber in place or buy pre-terminated MPO trunks and modular cassettes — the single largest lever on install rate, and the one that converts skilled-labor scarcity from a schedule risk into a factory problem.
Whether to burn-in and screen optics off the critical path (a staging tent or a vendor screening line) or accept that marginal transceivers surface as link flaps during acceptance and, later, as goodput loss in production.
How much you invest in mis-cabling prevention — labeling schemes, polarity discipline, automated topology verification — versus paying for it as the #1 acceptance failure mode that gates the reference training run.
What install-rate target (racks/day and links/day) the program is actually staffed and staged for, and therefore whether your time-to-power advantage survives contact with the physical-layer buildout or is squandered waiting on cable crews.
Where the L11/L12 handoff boundary sits — how much cabling and optics integration happens in the factory versus the field — because that boundary, fixed in Chapter 7.14, sets how much of this chapter's work is even on your floor.

You have fought for an interconnection slot, energized a hall, plumbed it for liquid, and taken delivery of racks that cost more than the building. None of it earns a dollar until the cluster passes its reference training run — and between the loading dock and that milestone sits a physical-layer buildout that almost nobody scopes as a first-class engineering problem until it is the thing holding up go-live. This chapter is about install velocity: the rate at which racks are set, busways are tapped, manifolds are coupled, and — above all — links are cabled and lit. In a power-bound era where the scarce input is energized megawatts against a running depreciation clock, the install rate is what converts a date on the interconnection agreement into time-to-goodput. Two clusters with identical bills of materials and identical power can differ by months in time-to-revenue purely on how fast and how cleanly they cabled.

This chapter works through three forks in a domain usually treated as commodity labor — field-terminate vs pre-terminate, screen optics off-line vs accept in-line, prevent mis-cabling vs debug it at acceptance — and the cost of choosing wrong, measured in days of stranded capital and percentage points of lost goodput. The install act is owned here; the acceptance and validation that proves the cluster works — fabric BER soak, NCCL all-reduce, the reference run — lives in Part 13. See Chapter 7.14 for where the rack actually gets built (the L1–L12 model) and Chapter 8.10 for the structured-cabling fiber plant whose components we deploy here.

Install rate is the velocity metric that gates time-to-goodput

The honest unit of program velocity for an AI cluster is racks set per day and links lit per day — sustained, on the floor, with first-time-right quality — rather than megawatts-per-month or racks-shipped. Everything upstream — the master schedule, the long-lead procurement, the factory integration — exists to feed a floor that can only absorb work at some finite rate, and everything downstream — acceptance, the reference run, first revenue — cannot begin until that rate has carried the buildout to completion. The install rate is the throttle on the whole pipeline.

The arithmetic is unforgiving because of fan-out. A single GB200 NVL72 rack is one set-in-place event but on the order of 5,000+ physical connections: ~5,184 in-rack copper twin-ax NVLink cables forming the scale-up spine, ~72 cooling-tube couplings, and — the part this chapter cares about most — hundreds of external scale-out links per rack times tens of fibers each. The fiber plant of an AI rack carries roughly an order of magnitude more strands than a conventional CPU rack (industry estimates put it near 36x more fiber per rack), and a flagship 100k–200k-GPU build runs to millions of individual fiber connections. Set the rack in an hour; the rack does not produce goodput until every one of its external links is cabled to the correct port, with correct polarity, inside its loss budget, and verified. The link is the unit of work, and the link count is what sets the schedule.

This is why a velocity-engineered program inverts the naive plan. The naive plan treats cabling as a trailing activity that happens after racks land. The engineered plan treats cabling capacity as the binding constraint and works backward: it stages pre-terminated trunks before racks arrive, screens optics before they reach the floor, and verifies topology continuously rather than in one terminal sweep. The reward is a floor that never stalls waiting on a cable crew; the penalty for skipping it is the most common AI-buildout failure mode — a fully-energized, fully-rack-loaded hall that cannot pass acceptance because thousands of links are mis-mapped, marginal, or simply not yet run.

The fork that sets your install rate: pre-terminate or field-terminate

This is the highest-leverage decision in the chapter, and it is largely irreversible once the fiber plant is procured. Field termination — splicing and connectorizing fiber in place — gives maximal flexibility in exact lengths and routing, but it puts your schedule directly on the supply of scarce, certified fiber technicians, makes quality a function of who happened to be on shift, and turns every link into a potential rework event. Pre-terminated MPO/MTP trunks and modular cassettes — factory-built, factory-tested, plug-and-play assemblies delivered to length — convert termination from a field-labor problem into a procurement problem you can solve months ahead, collapse install time per link by an order of magnitude, and ship with a tested loss budget so the link works the first time. The cost is paying a premium per assembly, committing to lengths and a topology earlier, and depending on a supply chain that is itself constrained. For any cluster on a velocity timeline, pre-termination is the default; field termination survives only at the margins (odd lengths, last-mile patches, repairs).

Pre-terminated trunk/MPO cabling: the primary velocity lever

If install rate is the metric, pre-terminated structured cabling is the lever that moves it most. The mechanism is simple: it moves termination — the slow, skill-intensive, error-prone step — off the critical path and into a factory, where it is done under controlled conditions, tested, and labeled before it ever reaches the floor. On site, a crew is no longer splicing; they are routing trunks and snapping cassettes, an activity that is fast, repeatable, and far less dependent on rare certified-tech hours. In a market where the binding constraint on the whole industry's buildout is increasingly skilled cabling labor, this is the difference between a floor that absorbs racks as fast as they arrive and one that backs up.

The density argument compounds the velocity argument. A high-fiber MPO trunk consolidates many discrete cables into one routed assembly — a single 144-fiber MTP/MPO trunk can carry the equivalent of eighteen MPO-8 runs — so a cluster that would otherwise route thousands of individual jumpers routes tens of trunks broken out at the rack by cassettes. Industry analyses put the physical cable-count reduction near 98% for a large GPU cluster cabled this way, with a corresponding collapse in pathway congestion, weight in the tray, and airflow obstruction. Fewer physical objects to route is fewer chances to mis-route; the density lever and the quality lever are the same lever.

The staging discipline is what makes this real. Velocity-engineered programs build a structured-cabling staging operation — a kitting and pre-lay workflow, often a literal tent or warehouse adjacent to the hall — where trunks are received, inventoried against the port map, labeled to the as-designed topology, pre-laid into trays or coiled at the rack positions, and held ready before the racks they serve arrive. When the rack lands, its links are already waiting at the cabinet, labeled and tested; the crew connects, not constructs. The decision to invest in staging is a decision to decouple cabling progress from rack-delivery timing, and it is the practical core of every fast buildout on record.

The forward-proofing decision rides alongside: standardize the fiber plant on OS2 single-mode with base-8/base-16 MPO trunking and the plant survives the 800G→1.6T→3.2T lane-rate climb as a transceiver-only swap, rather than forcing a re-cable each generation. Re-cabling a live hall is itself a velocity event — and the worst kind, because it competes with production. The single-mode-vs-multimode and MPO-polarity engineering is canonical in Chapter 8.10; the copper-inside/optics-outside reach frontier that decides which links are even fiber is in Chapter 8.9.

Field-terminate vs pre-terminate vs hybrid: the velocity fork

Approach	Install rate / link	Quality at install	Critical-path exposure	Cost posture	Best fit
Field termination (splice/connectorize in place)	Slowest — minutes to tens of minutes per termination, skill-gated	Variable — depends on technician and conditions; rework common	High — schedule rides on certified-tech availability	Low material, high labor; rework erodes the saving	Odd lengths, last-mile patches, repairs, true one-offs
Pre-terminated MPO trunks + cassettes	Fastest — route-and-snap; ~10x fewer field-labor hours	High — factory-tested loss budget, works first time	Low — termination moved to factory, off the floor	Premium per assembly; commits to lengths/topology early	Velocity-timeline clusters; the 2026 default
Hybrid (pre-term backbone + field patch)	Fast on the trunked plant, slow on the patched fraction	High on backbone; variable on the field portion	Moderate — bounded by the field-terminated minority	Balanced — premium where it pays, field where it must	Brownfield retrofits; mixed-generation halls

Practitioner ranges for AI-scale fiber plants, 2025–2026. "Critical-path exposure" is the schedule risk the choice places on scarce certified-fiber labor.

Optics burn-in: screening transceivers off the critical path

The cruelest property of an optical transceiver is that it can pass at the bench and fail under load. The dominant failure mechanism is thermal: a module characterized on a 25 °C bench drifts as its case rises toward the ~65 °C of a hot AI rack under sustained traffic, and its TDECQ margin erodes by roughly 0.5–1.5 dB. A link that comes up clean during a quiet install can therefore start flapping the moment the cluster is loaded — exactly when it is most expensive, because in a synchronous training fabric one flapping link stalls the thousands of GPUs barriered behind it. Optics are simultaneously the single largest source of "gray failures" in an AI cluster and a supply-constrained line item that gates deployment in the first place.

The decision this forces: screen and burn-in transceivers off the critical path, or accept marginal optics in-line. Off-line screening means soaking modules at temperature, under traffic, against a BER/pre-FEC threshold — on a vendor screening line or in your own staging operation — and binning out the marginal population before they ever reach a switch cage on the floor. The infant-mortality and thermally-marginal units fail in the tent, on your schedule, where a failure costs a swap and a note, not a stalled acceptance run or, worse, a production goodput hole discovered weeks later. In-line acceptance — plug everything in, run the fabric soak, and chase whatever flaps — is cheaper to set up and devastating to live with, because every marginal module surfaces as a debugging excursion on the acceptance critical path, and the ones that pass install-day but fail under production thermal load become the operations team's problem indefinitely.

There is no settled industry standard yet for soak duration or BER thresholds — it remains an open question how much screening is economically optimal — but the structural argument is one-directional: the cost of catching a bad optic before install is a fixed, schedulable expense; the cost of catching it after is a variable, unbounded one paid in GPU-hours. On a velocity timeline, screening pays for itself the first time it keeps the acceptance run from stalling. The reliability physics, the pre-FEC/DDM telemetry that forecasts a flap, and per-link MTBF math are in Chapter 8.9; fabric BER-soak as an acceptance gate is in Chapter 13.7.

Mis-cabling: the #1 acceptance failure, engineered out

Ask anyone who has commissioned a large AI fabric what gated the green tag and the answer is almost always the same: mis-cabling. In a rail-optimized topology, a single link plugged into the wrong port — right cable, wrong destination — does not merely lose one connection; it can collapse the collective bandwidth of an entire rail and silently degrade the all-reduce that the whole training job depends on. At a scale of millions of links, even a very low per-link error rate yields thousands of mistakes, and each one is a needle the acceptance team must find in a haystack of correctly-cabled links that look identical. Mis-cabling is not a quality nuisance; it is the dominant schedule risk of the acceptance phase, and the velocity-engineered answer is to prevent it rather than debug it.

Prevention is three disciplines, applied together. First, labeling and port-mapping: every trunk, jumper, and port carries an unambiguous as-designed label tied to a port map and a digital twin, so the crew connects to a named destination rather than a guessed one — this is most of why pre-terminated, pre-labeled assemblies cut errors as much as they cut time. Second, polarity discipline: MPO systems have a polarity scheme (Method A/B/C) that must be consistent end-to-end across trunks, cassettes, and jumpers, or the link is dark or crossed; choosing one polarity method and enforcing it through the whole plant is a design decision that, made wrong or made loosely, produces a class of failures that are maddening to chase because the cable is physically fine. Third, automated topology verification: rather than a human tracing links, the fabric's own LLDP/management plane is queried to confirm that every observed neighbor relationship matches the intended topology, flagging the mis-maps continuously as the buildout proceeds instead of in one terminal sweep at acceptance.

The tradeoff is stark. Spend on labeling rigor, a single enforced polarity method, and automated verification tooling, and mis-cabling becomes a continuously-cleared backlog that never reaches the acceptance gate. Skip it, and mis-cabling becomes the acceptance gate — a multi-week manual hunt across a fully-built hall, with the depreciation clock running and the reference run blocked. The cheapest place to find a mis-cabled link is the moment it is plugged; the most expensive place is the cluster-scale NCCL benchmark. The L5 integrated systems test and the fabric validation that surface these failures if they slip through are in Chapter 13.6 and Chapter 13.7; the rail-optimized topology whose geometry makes a single mis-map so destructive is in Chapter 8.5.

122 days

abandoned factory shell to 100k-GPU training cluster (xAI Colossus, Memphis); doubled to 200k in a further 92 days

2024–2025NVIDIA Newsroom (Spectrum-X / xAI Colossus)

19 days

from first rack on the floor to start of training at Colossus — the install-velocity headline number

2024NVIDIA Newsroom; RCR/RDWorld coverage

~1,500 racks

GPU racks deployed at Colossus Phase 1 — implies a sustained ~12 racks/day set-and-cable rate over 122 days

2025Introl (xAI Colossus infrastructure)

~36x

more fiber per AI rack than a conventional CPU rack; a 100k–200k-GPU build runs to millions of fiber connections

2025–2026Industry fiber-demand analyses (Communications Today; fiber vendors)

~98%

reduction in physical cable count from high-density pre-terminated MPO trunking vs discrete jumpers at cluster scale

2026Structured-cabling vendor analyses (AMPCOM; holight)

~70%

share of AI-datacenter connections expected to use MTP / MTP-LC hybrid pre-terminated systems by 2027

2026 (projected)Structured-cabling market analyses

0.5–1.5 dB

TDECQ margin erosion from 25°C bench to ~65°C AI-rack case temp — why links pass at install and flap under load

2025Optics & Cabling domain keyNumbers; OFC vendor data

~95%

sustained fabric throughput at Colossus with zero flow-collision packet loss — the goodput that clean cabling protects

2024–2025NVIDIA Newsroom (Spectrum-X / xAI Colossus)

Worked exemplar: the xAI Colossus 122-day buildout

Colossus is the public reference point for install-velocity engineering, and it is worth reading not as a stunt but as a case study in every lever this chapter names. xAI and its partners took an abandoned Electrolux factory shell in Memphis to a live, training 100,000-GPU cluster in 122 days, then doubled it to 200,000 in a further 92 — against an industry norm measured in many months to years for systems of that size. The number that should focus a velocity engineer's attention, though, is the smaller one: 19 days from the first rack rolling onto the floor to the start of training. That gap between rack-arrival and training-start is the install act this chapter owns, compressed to nearly nothing.

The compression came from doing exactly what the decision forks above prescribe. Roughly 1,500 GPU racks set in 122 days is a sustained pace near a dozen racks per day, every day — a rate that is only achievable if the links those racks need are not being constructed in place as the racks land. The buildout ran parallel workstreams against pre-staged, factory-terminated, modular cabling, so that rack-setting and link-lighting proceeded concurrently rather than in series. The reward shows in the network result: across all three tiers of the scale-out fabric, the cluster sustained roughly 95% throughput with zero packet loss from flow collisions — the goodput outcome that clean, correctly-mapped, in-budget cabling exists to protect.

The honest part of the story is the chaos. Assembling 200,000 interconnected GPUs at that speed surfaced mismatched BIOS firmware, cable snarls, and even cosmic-ray bit flips — the predictable tax of compressing a year of work into months. The lesson is not that velocity is free; it is that velocity is bought by moving the slow, error-prone work (termination, optics screening, topology verification) off the floor and ahead of the racks, so that the floor itself only ever does the fast, parallelizable work. Colossus is the existence proof that the install rate is an engineerable quantity, not a fixed property of physics. The integrated master schedule and critical-path discipline that makes parallel workstreams possible is in Chapter 2.1; the long-lead procurement that must land the pre-terminated cabling and screened optics in time is in Chapter 2.3.

Deep dive: deriving an install-rate budget (racks/day and links/day)

An install-rate budget is the velocity equivalent of a power or thermal budget, and a program that does not write one is flying blind on the constraint most likely to slip its schedule. Build it backward from the goodput date. Suppose a target of N racks live by a fixed reference-run date D; the required set-and-cable rate is roughly N/(working days to D), but the rack rate is the easy part. The binding term is the link rate: each rack carries on the order of hundreds of external links, so a floor setting a dozen racks a day must light and verify several thousand links a day to keep pace. That number, not the rack number, is what you staff, stage, and tool against.

Now apply the levers as multipliers on that link rate. Pre-terminated trunking raises the links/day a fixed crew can route by roughly an order of magnitude versus field termination, because the slow step is gone. Off-line optics screening keeps the rate from collapsing into rework excursions, because the modules that reach the cage already work. Automated topology verification keeps the rate from collapsing at the end, because mis-maps are cleared continuously instead of discovered in a terminal acceptance sweep. A realistic budget therefore reads: target link rate, fiber-tech crew sized for the pre-terminated (not field) rate, a staging operation sized to hold a few days of trunks ahead of rack delivery, a screening throughput that matches the optics install rate, and a verification cadence that clears mis-maps faster than they accumulate. The moment any one of these falls below the rack-set rate, the floor backs up and the goodput date moves — which is why the budget is written, owned, and tracked daily, exactly like the power ramp it parallels.

Where the velocity boundary sits: factory vs field

How much of this chapter's work is even on your floor is itself a decision, fixed upstream at the L11/L12 integration boundary. Pull cabling and optics integration into the factory — racks delivered with in-rack cabling done, manifolds coupled, optics seated and screened, the assembly tested as a unit — and you shrink the field buildout to inter-rack trunking and the final fabric lacing, cutting floor time from months toward weeks. Push it into the field — barebones racks integrated on site — and you own the full physical-layer buildout at the slowest, most labor-constrained point in the chain. The factory-integration choice is a velocity choice before it is a logistics choice, and it is why the same rack BOM can yield wildly different time-to-goodput depending on where it was built.

The countervailing pressures are real and belong to Chapter 7.14: a factory-integrated rack ships heavier and more fragile, shipping wet vs dry changes the logistics and the risk, and serviceability and RMA models shift when more is sealed at the factory. The point for this chapter is narrow and firm: the further left (toward the factory) you set the integration boundary, the less of the install-rate constraint lands on your critical path — and the more your time-to-power advantage actually converts into time-to-goodput rather than evaporating against a cable backlog. The L1–L12 manufacturing-level model, the build-vs-buy fork, and the wet-vs-dry shipping tradeoff are owned in Chapter 7.14; the acceptance and validation that proves the installed cluster works — burn-in, fabric BER soak, the reference training run, go-live, and handover — are owned across Chapter 13.6, Chapter 13.7, Chapter 13.8, and Chapter 13.9.

Velocity is bought off the floor, not on it

The unifying idea across every lever in this chapter: a fast buildout is one where the floor only ever does fast, parallelizable, low-skill-gated work, because all the slow, serial, skill-gated, error-prone work has been moved into a factory or a staging operation that runs ahead of the racks. Pre-terminated trunks move termination off the floor. Off-line optics screening moves burn-in off the floor. Factory L11/L12 integration moves rack cabling off the floor. Automated topology verification moves the mis-cabling hunt out of the acceptance sweep and into a continuous background process. The install rate is high not because the floor crew works faster, but because the floor crew has less to do. Engineer the buildout so that the rack lands into a position where its links are already staged, screened, labeled, and waiting — and time-to-goodput collapses toward the irreducible minimum of set, connect, verify.

This chapter owns the install act; its inputs and outputs live across the guide. The fiber-plant components deployed here — single-mode vs multimode, MPO trunking, polarity, loss budgets — are engineered in Chapter 8.10, and the optics reliability and copper/optics reach frontier in Chapter 8.9. The rail-optimized topology whose geometry makes a single mis-cabled link so destructive is in Chapter 8.5. Where the rack gets built — the L1–L12 model, factory vs field integration, wet-vs-dry shipping — is Chapter 7.14. The schedule and procurement machinery that feeds the floor is in Chapter 2.1 and Chapter 2.3. And the acceptance, validation, and go-live that begin where this chapter ends are owned in Chapter 13.6, Chapter 13.7, Chapter 13.8, and Chapter 13.9.