Chapter 8.7

Management, Out-of-Band Fabric & PTP/IEEE-1588 Timing

An AI cluster has two fabrics that almost nobody scopes on purpose: the out-of-band network that lets you reach a wedged node when the data plane is gone, and the timing plane that gives every telemetry record, RoCE counter, and training step a common clock. When either is missing, you discover it during the incident, which is the most expensive possible time to learn the lesson.

What you'll decide here

Whether the management/out-of-band fabric is a true physically-separate network (its own switches, cabling, and addressing) or a logically-segmented VLAN riding the production fabric — and therefore whether you can still reach a BMC when the data plane, the controller, or the in-band NIC is down.
Which management protocol generation you standardize the fleet on — modern Redfish/PLDM-over-MCTP with secure OOB, or legacy IPMI — because that choice sets your firmware-update story, your zero-touch bring-up velocity, and your attack surface for the life of the building.
Whether you deploy a real PTP/IEEE-1588 timing plane (grandmasters, GNSS reference, boundary clocks in every tier, hardware-timestamping NICs) or settle for NTP — the fork between nanosecond-coherent telemetry and millisecond-blurry logs you cannot correlate.
Where time enters the building and how it survives loss of the satellite reference — the GNSS antenna plant, the holdover oscillator class, and the holdover budget you are willing to underwrite before time-of-day drifts past your correlation tolerance.
Who owns time-sync accuracy as a commissioning acceptance gate, and what the measured offset bound is that a hall must demonstrate before it is declared ready to take training load.

Every prior chapter in Part 8 has been about the fabrics that do the work — the scale-up domain that binds GPUs into one accelerator, the scale-out fabric that carries the all-reduce. This chapter is about the two fabrics that exist so the working fabrics can be operated, observed, and trusted. They carry no training traffic and earn no FLOPs, which is exactly why they get scoped last, under-funded, and discovered missing during an outage. The first is the out-of-band (OOB) management fabric: the independent network through which you reach a server's baseboard management controller (BMC) to power-cycle it, re-image it, read its sensors, and update its firmware — especially when the in-band data plane is dead. The second is the timing plane: the PTP/IEEE-1588 distribution that gives every node a clock accurate to nanoseconds, so that a telemetry record from one switch and a RoCE drop counter from another and a NCCL stall on a third can be laid on a single timeline and actually correlated.

This chapter works through four forks — physical separation vs logical segmentation, Redfish vs IPMI, PTP vs NTP, boundary vs transparent clocks — and the cost of choosing each one wrong. Both fabrics are insurance you buy before the incident: their value is invisible until the night the controller is unreachable and you need to find which of 4,000 GPUs stalled the run, and you find you cannot reach the BMC and your logs are 200 ms out of alignment. By then it is too late to add either one.

Why two operational fabrics, and why they must be separate

The management fabric exists to answer one question reliably: can I reach this machine when everything else has failed? A GPU server in a modern cluster has, in effect, two computers in it — the host (CPUs, GPUs, the in-band NICs that carry training traffic) and the BMC, a small always-on management processor with its own CPU, its own dedicated NIC, and its own power domain that stays alive on standby power even when the host is off. The BMC is how you read inlet temperatures and fan speeds, power the chassis on and off, mount a virtual ISO to re-image, capture a serial console, and push firmware. If the path to the BMC shares fate with the production data plane, then the exact failure modes you most need management for — a wedged data-plane NIC, a black-holing leaf switch, a runaway in-band control agent, a misapplied ACL that severs the host — also sever your ability to fix them. You are left with a remote-hands dispatch and a row walk, which at AI-cluster scale is hours of stranded GPUs.

That is the entire argument for keeping the management fabric out of band: it must not share fate with the thing it manages. In a well-built cluster the OOB fabric is its own physical network — its own management switches (typically modest 1/10/25 GbE top-of-rack switches, one per rack or per row), its own cabling on its own pathways, its own IP addressing and DNS, and its own uplink to a management aggregation tier that lands in the operations/orchestration zone rather than the tenant data plane. The BMC port, the PDU management port, the CDU and cooling-controller ports, the switch management/console ports, and the GNSS/timing appliances all home onto this fabric. It is small, cheap, and low-bandwidth relative to the data plane — and it is the single most leveraged dollar in the building when a row goes dark.

The master fork: physical OOB vs logical segmentation

Decide physical-vs-logical separation before you cable the hall. Physical out-of-band means a genuinely separate network: separate switches, separate cabling, separate failure domain, reachable when the production fabric is wholly down. Logical (in-band) segmentation means a management VLAN/VRF riding the same physical switches as production traffic. It is cheaper and needs fewer cables, but it shares fate: a control-plane failure, a fabric-wide misconfiguration, or a power event that takes the production switches also takes your management path. The rule of thumb that survives contact with real incidents: physical OOB for anything you must reach precisely when the data plane has failed (BMCs, switch consoles, PDUs, cooling controllers), and logical segmentation only for management traffic that can tolerate sharing fate with production. Most large AI operators run a hybrid: a physical OOB spine for last-resort reachability, plus an in-band management overlay for routine high-volume telemetry that would overwhelm a 1 GbE OOB fabric. Settle which traffic lives where as part of the same decision, because retrofitting a physical OOB fabric into a fully-built row is a recabling project, not a config change. → segmentation and isolation policy in Chapter 11.7.

BMC, Redfish, and the management-protocol generation

The management fabric is only as useful as the protocol you speak over it, and here there is a real generational fork. The legacy answer is IPMI — the Intelligent Platform Management Interface — a 1998-era protocol that does power control, sensor reads, and serial-over-LAN. It works, it is everywhere, and it is also a security and operability liability: a clunky binary protocol, a long history of CVEs, weak authentication, and no clean model for the firmware-update and inventory operations a modern fleet needs. The modern answer is Redfish — a DMTF standard that exposes the BMC as a RESTful, JSON, HTTPS-secured API with a structured resource model. Redfish is what makes a fleet programmable: you can enumerate inventory, drive power and boot order, stream telemetry, and orchestrate firmware updates against a uniform API across vendors instead of scripting a different IPMI dialect per OEM. Underneath, component firmware updates increasingly flow over PLDM-over-MCTP (the OCP-blessed transport for talking to GPUs, NICs, and other devices behind the BMC), with a hardware root-of-trust gating what firmware is allowed to land.

The consequence of standardizing on Redfish rather than IPMI is felt in three places that all bear on goodput and ramp velocity. First, bring-up velocity: zero-touch provisioning — PXE/HTTP boot, image push, and config — is dramatically cleaner against a Redfish API, and bare-metal bring-up is an under-appreciated economic lever when you are racing a depreciation clock to first-job. Second, fleet firmware management: GPU/NIC/BMC firmware drift is a real source of silent performance regressions and RoCE pathologies, and a Redfish + PLDM update path with secure OOB is how you keep a 50,000-accelerator fleet on a known-good firmware baseline. Third, attack surface: the management plane is a high-value target precisely because it can power, re-image, and re-flash every machine, so the move to authenticated, TLS-protected, RoT-gated management is a security decision as much as an operational one. → firmware-update and root-of-trust depth lands with the security part; management-plane isolation in Chapter 11.7 and the threat model in Chapter 11.1.
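To make the "uniform API" point concrete, here is a minimal sketch of a power-reset call expressed as a Redfish request. It uses the standard `ComputerSystem.Reset` action path and `ResetType` property from the DMTF schema; the BMC hostname and the system ID `"1"` are hypothetical (system IDs vary by vendor), and authentication (session token or basic auth over TLS) is deliberately left out. The request is built but not sent.

```python
import json
import urllib.request


def build_reset_request(bmc_host: str, system_id: str,
                        reset_type: str = "ForceRestart") -> urllib.request.Request:
    """Build (but do not send) a Redfish ComputerSystem.Reset request.

    bmc_host and system_id are placeholders; a real client would also attach
    credentials (for example an X-Auth-Token header) and verify the BMC's TLS cert.
    """
    url = (f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
           "/Actions/ComputerSystem.Reset")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical OOB hostname, for illustration only.
req = build_reset_request("bmc-r12-n03.oob.example", "1")
print(req.get_method(), req.full_url)
```

The same request shape applies to any conformant BMC, which is the operational difference from per-OEM IPMI scripting: only the host and system ID change across the fleet.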

Management-fabric decision: what lives out-of-band, and how

Endpoint class	Why managed	Separation it needs	Protocol	Failure it must survive
Server BMC (host)	Power, re-image, sensors, console, FW	Physical OOB	Redfish/HTTPS (legacy IPMI)	Dead in-band NIC; black-holed leaf; host hang
Switch management / console	Config, recovery, firmware	Physical OOB + serial console server	SSH / NETCONF / gNMI; RS-232 console	Fabric-wide misconfig; data-plane outage
PDU / rack power	Remote power-cycle, metering	Physical OOB	Redfish / SNMP / vendor API	Host and data plane both down
CDU / cooling controller	Coolant flow, leak, throttle telemetry	Physical OOB (ideally air-gapped from tenant)	Modbus / BACnet / Redfish gateway	Cooling event independent of compute health
GNSS / timing appliance	Time reference distribution	Physical OOB management; timing on its own plane	PTP (timing) + Redfish/SNMP (mgmt)	Data-plane loss must not blind the clock
Routine bulk telemetry	Metrics, logs at high volume	In-band overlay acceptable	gNMI / OTLP / streaming	Can share fate with production

The segmentation decision per endpoint class. "Physical OOB" = reachable when production fabric is fully down.
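The table can be turned into a check that runs against a cabling or IPAM inventory before the hall is built. The sketch below is one hedged way to do that; the class names, the `"oob"`/`"in_band"` labels, and the inventory shape are all hypothetical conventions, not part of any standard.

```python
# Endpoint classes that, per the table above, must stay reachable when the
# production fabric is fully down. The class names are illustrative.
MUST_BE_PHYSICAL_OOB = {
    "server_bmc", "switch_console", "pdu", "cdu", "timing_appliance",
}


def placement_violations(inventory: dict[str, tuple[str, str]]) -> list[str]:
    """Return endpoints that need physical OOB but are homed in-band.

    inventory maps endpoint name -> (endpoint_class, fabric),
    where fabric is "oob" or "in_band" (hypothetical labels).
    """
    return sorted(
        name
        for name, (endpoint_class, fabric) in inventory.items()
        if endpoint_class in MUST_BE_PHYSICAL_OOB and fabric != "oob"
    )


example_inventory = {
    "r12-n03-bmc": ("server_bmc", "oob"),
    "r12-pdu-a": ("pdu", "in_band"),          # violation: shares fate
    "r12-telemetry": ("bulk_telemetry", "in_band"),  # acceptable in-band
}
print(placement_violations(example_inventory))
```

Bulk telemetry is intentionally absent from the must-be-OOB set, matching the last table row.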

The timing plane: why nanoseconds, not milliseconds

The second operational fabric distributes time. The instinct is to file this under "NTP is fine" — and for billing, log rotation, and Kerberos it is. For an AI cluster it is not, and the reason is correlation. When a 4,000-GPU run stalls, the diagnosis is a forensic reconstruction across thousands of independent telemetry streams: a switch's egress-drop counter, a NIC's RoCE NACK, a CDU's flow dip, a GPU's thermal throttle, a NCCL collective timeout. To reconstruct what happened first you have to lay those events on a single timeline — and if every clock is independently drifting by milliseconds, the timeline is a smear in which cause and effect are indistinguishable. NTP gets you to roughly the millisecond over a LAN; that is three to six orders of magnitude too coarse to order events on a microsecond-scale RDMA fabric. PTP/IEEE-1588 gets you to the nanosecond-to-sub-microsecond range, which is what makes the timeline real.
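The correlation argument reduces to a simple test: two events stamped by different clocks can be ordered only if their timestamps differ by more than the combined clock uncertainty. A minimal sketch, assuming each clock is within a symmetric error bound of true time (the function name and the example numbers are illustrative):

```python
def can_order(t_a_ns: int, t_b_ns: int, clock_error_ns: int) -> bool:
    """True if events A and B can be ordered unambiguously.

    Each timestamp may be off by up to +/- clock_error_ns, so the two
    uncertainty windows must not overlap.
    """
    return abs(t_a_ns - t_b_ns) > 2 * clock_error_ns


# Two events stamped 5 microseconds apart on different devices.
gap_ns = 5_000
print("NTP-class (1 ms error):  ", can_order(0, gap_ns, 1_000_000))
print("PTP-class (100 ns error):", can_order(0, gap_ns, 100))
```

With millisecond-class error the 5 µs gap is inside the noise and the ordering is undecidable; with 100 ns error it is unambiguous.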

The trick that makes PTP work where NTP cannot is hardware timestamping. NTP timestamps in software, so its accuracy is poisoned by OS scheduling jitter, interrupt latency, and queueing — the exact noise that dominates at the microsecond scale. PTP timestamps the sync packets in the NIC and switch silicon, at the wire, bypassing the software stack entirely. Each NIC carries a PHC — a PTP Hardware Clock — that the daemon disciplines to the grandmaster; the host system clock is then slaved to the PHC. On modern silicon the numbers are striking: NVIDIA reports Spectrum switches holding PTP accuracy around 10 ns with internal ASIC sync error under 4 ns, and ConnectX-class NICs timestamping with under 4 ns of variance. Meta, deploying PTP across its fleet, reported sub-microsecond precision on commodity servers using hardware timestamping, and went on to publish a simplified variant (SPTP) in 2024 to cut the CPU/memory/network cost of running it at fleet scale. The headline: nanosecond-class time is not exotic anymore — it is a property of the switch and NIC silicon you already bought, if you turn it on and distribute it properly.
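What the hardware timestamps feed is a small piece of arithmetic. In the IEEE 1588 delay request-response exchange, four timestamps yield the clock offset and the mean path delay, under the assumption that the path is symmetric. The sketch below shows that calculation; the function name and example values are illustrative.

```python
def ptp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Offset of the local clock from the master, and mean one-way path delay.

    t1: master sends Sync          (master clock)
    t2: local clock receives Sync  (local clock)
    t3: local clock sends Delay_Req (local clock)
    t4: master receives Delay_Req  (master clock)
    Assumes forward and reverse path delays are equal.
    """
    forward = t2 - t1   # path delay + offset
    reverse = t4 - t3   # path delay - offset
    offset = (forward - reverse) / 2
    delay = (forward + reverse) / 2
    return offset, delay


# Example in nanoseconds: local clock runs 250 ns ahead, one-way delay 800 ns.
offset_ns, delay_ns = ptp_offset_and_delay(t1=1000, t2=2050, t3=5000, t4=5550)
print(offset_ns, delay_ns)
```

Any jitter in when `t1` through `t4` are captured lands directly in the offset, which is why taking them in NIC and switch silicon rather than in software matters so much.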

Time is a dependency of your diagnostics, not a feature of them

The subtle reason precise time matters more in AI clusters than in classical data centers: the failures you chase are distributed and fast. A straggler, a flapping link, a microburst-induced PFC storm, an ECN-marking cascade — these unfold across many devices in microseconds, and the only way to attribute them is a coherent clock. Precise time is therefore not a nice-to-have telemetry feature; it is the substrate on which RoCE diagnostics, congestion-event reconstruction, and NCCL stall analysis are even possible. Build the observability stack on millisecond NTP and you have built a microscope with a blurred lens: you will see that something broke, but not the order in which it broke, which is exactly the information you need. Worse, multi-DC asynchronous training and some in-network and TSN-style scheduling schemes assume a shared time base; without it, cross-site event ordering and any time-aware shaping degrade silently. → congestion and in-network compute in Chapter 8.6; multi-campus scale-across in Chapter 8.8.

Boundary clocks vs transparent clocks: the topology fork

Distributing time across a multi-tier Clos fabric is not just a matter of a grandmaster shouting the time; every switch hop adds and varies delay, and PTP has two architectural answers for compensating it. A boundary clock (BC) terminates PTP at each switch: the switch is a slave to its upstream parent and a master to everything downstream, recovering and regenerating time at every tier. This isolates each segment from the packet-delay variation of the segments above it, scales cleanly through a deep Clos (a few hundred nanoseconds of accumulated error across the tiers is typical), and is the default for large data-center fabrics — but it requires every switch in the timing path to be a capable, configured BC. A transparent clock (TC) takes the opposite approach: the switch does not recover time, it simply measures how long each PTP packet dwelt inside it and writes that residence time into the message's 64-bit correction field, so the end clock can subtract out the accumulated switch delay. TC compensation is extraordinarily precise per-hop (the correction field resolves to sub-nanosecond), but the end node must trust and sum corrections across the whole path, and asymmetry between the forward and reverse path remains the dominant residual error.
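The transparent-clock mechanism can be sketched numerically. The correction field carries nanoseconds scaled by 2^16, which is where the sub-nanosecond resolution comes from; the endpoint sums the corrections along each direction and subtracts them before computing the offset. The helper names and the example residence times below are illustrative assumptions.

```python
SCALE = 1 << 16  # correctionField unit: nanoseconds multiplied by 2**16


def to_correction_field(residence_ns: float) -> int:
    """Encode a residence time in the scaled-nanosecond integer format."""
    return round(residence_ns * SCALE)


def path_correction_ns(correction_fields: list[int]) -> float:
    """Total correction, in nanoseconds, accumulated across all hops."""
    return sum(correction_fields) / SCALE


def corrected_offset_ns(t1, t2, t3, t4, corr_fwd_ns, corr_rev_ns) -> float:
    """Offset after removing switch residence time in each direction."""
    forward = (t2 - t1) - corr_fwd_ns
    reverse = (t4 - t3) - corr_rev_ns
    return (forward - reverse) / 2


# Two switches in the path; the clocks actually agree; 500 ns of cable delay each way.
fwd = path_correction_ns([to_correction_field(1200.5), to_correction_field(300.25)])
rev = path_correction_ns([to_correction_field(100.0), to_correction_field(150.0)])
t1, t2 = 0.0, 500.0 + fwd          # Sync leg: cable delay + forward residence
t3, t4 = 10_000.0, 10_500.0 + rev  # Delay_Req leg: cable delay + reverse residence
print("uncorrected:", corrected_offset_ns(t1, t2, t3, t4, 0.0, 0.0))
print("corrected:  ", corrected_offset_ns(t1, t2, t3, t4, fwd, rev))
```

Uncorrected, the unequal queueing in the two directions shows up as a spurious offset. Corrected, it disappears, but any asymmetry the switches cannot measure (for example unequal fiber lengths) still lands in the result as half the asymmetry.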

The rule that falls out: boundary clocks for the scaled, multi-tier production fabric — they bound error per tier and don't ask the endpoint to reason about the whole path — and transparent clocks where you have a shallow topology or want maximum per-hop fidelity with minimal switch state. Most modern data-center switch ASICs (NVIDIA Spectrum-2 and later, Broadcom Tomahawk/Jericho families, Intel Tofino-class) can do both in hardware, one-step or two-step, so the fork is a design choice rather than a silicon limitation. The expensive mistake is mixing modes inconsistently across tiers, or running a tier of non-PTP-aware switches in the timing path — every such switch injects uncompensated, variable delay that no amount of grandmaster precision can recover. The timing plane is only as good as its weakest hop.

PTP clock-type decision for an AI fabric

Clock type	What it does	Per-hop error	Best fit	Cost / caveat
Grandmaster (GM)	Authoritative source; disciplined to GNSS/PPS	Reference (≤ tens of ns to UTC)	Top of timing tree, ≥2 for redundancy (BMCA failover)	Needs GNSS antenna plant + holdover oscillator
Boundary clock (BC)	Slave up / master down; regenerates time per tier	~tens of ns/tier, error bounded per segment	Scaled multi-tier Clos production fabric	Every switch in path must be a configured BC
Transparent clock (TC)	Writes residence time into 64-bit correction field	Sub-ns per hop; endpoint sums path corrections	Shallow topology; max per-hop fidelity	Endpoint must trust/sum; asymmetry still bites
Ordinary clock (OC)	End node: NIC PHC slaved to GM via ptp4l	Bounded by path + servo + asymmetry	Every server (the consumer of time)	Quality of time is set by the worst hop upstream

Boundary vs transparent vs ordinary clock. "Per-hop error" is the residual a well-configured device contributes; real budgets accumulate across the timing path and are dominated by path asymmetry.

Where time enters the building, and how it survives losing the satellite

The timing tree has a root, and the root has to get time from somewhere. In nearly all data-center deployments that source is GNSS — GPS and its peers — delivered to a grandmaster appliance via a rooftop antenna, a coax/fiber down-lead, and a one-pulse-per-second (PPS) plus time-of-day feed that disciplines the grandmaster's internal oscillator. On Linux that chain is concrete and worth naming because it is what you actually operate: ts2phc steers a NIC's PHC from the external PPS/GNSS timestamps; ptp4l implements IEEE-1588 to distribute and recover that time across the fabric; phc2sys slaves the host's system clock (CLOCK_REALTIME) to the disciplined PHC. The Best Master Clock Algorithm (BMCA) elects the active grandmaster and fails over to a standby if the primary degrades — which is why you deploy at least two grandmasters, ideally fed by independent antennas on independent pathways.
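As a hedged illustration of how the three linuxptp daemons divide the work, the sketch below assembles their command lines for two roles without executing anything. The flags used (`-i`, `-f`, `-m` for ptp4l; `-s`, `-c`, `-w`, `-m` for phc2sys; `-c`, `-s`, `-m` for ts2phc) are taken from the linuxptp man pages as commonly documented; the interface name and config path are placeholders, the role split is a simplification (a grandmaster host may also run phc2sys), and a real ts2phc deployment additionally needs pin and serial-port settings in a config file.

```python
import shlex


def linuxptp_commands(iface: str, ptp4l_conf: str,
                      grandmaster: bool = False) -> dict[str, list[str]]:
    """Return argv lists for the linuxptp daemons on one host (nothing is run)."""
    cmds = {
        # ptp4l speaks IEEE-1588 on the interface; -m also logs to stdout.
        "ptp4l": ["ptp4l", "-i", iface, "-f", ptp4l_conf, "-m"],
    }
    if grandmaster:
        # ts2phc steers the NIC's PHC from an external PPS, with
        # time-of-day taken from an NMEA source.
        cmds["ts2phc"] = ["ts2phc", "-c", iface, "-s", "nmea", "-m"]
    else:
        # phc2sys disciplines CLOCK_REALTIME from the NIC's PHC;
        # -w waits for ptp4l to synchronize first.
        cmds["phc2sys"] = ["phc2sys", "-s", iface, "-c", "CLOCK_REALTIME", "-w", "-m"]
    return cmds


# "eth0" and the config path are placeholders.
for name, argv in linuxptp_commands("eth0", "/etc/ptp4l.conf").items():
    print(f"{name}: {shlex.join(argv)}")
```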

The decision that separates a robust timing plane from a fragile one is holdover: what happens when the GNSS reference is lost to a jammed or spoofed signal, a failed antenna, a cut down-lead, or simply a bad-weather fade. When the satellite fix drops, the grandmaster has to coast on its internal oscillator, and how long it can coast before time-of-day drifts past your correlation tolerance is set chiefly by the class of oscillator you paid for. A TCXO drifts fast; an OCXO holds longer; a rubidium or chip-scale-atomic-clock (CSAC) reference holds far longer still. For the same drift budget, the spread runs from seconds to days. This is a money decision dressed as an engineering one: you size the holdover oscillator against the worst-case GNSS outage you are willing to ride through without your telemetry timeline degrading and without dependent systems (security event correlation, multi-DC ordering) losing trust in the clock. GNSS is also a security surface, jammable and spoofable from outside the fence line, so the holdover budget doubles as your defense against a denial-of-time attack. → management-plane and timing-source security in Chapter 11.7.

Figure	What it describes	Source
~10 ns	PTP accuracy held across an NVIDIA Spectrum switch (one-step, across speeds/FEC)	NVIDIA Technical Blog, 'Calculating and Synchronizing Time on the Spectrum Switch' (2025)
< 4 ns	Internal ASIC sync error (Spectrum) and ConnectX-class NIC timestamp variance	NVIDIA Technical Blog, Spectrum / ConnectX timestamping (2025)
sub-microsecond	Fleet PTP precision on commodity servers via hardware timestamping; SPTP published 2024	Engineering at Meta, 'Simple Precision Time Protocol (SPTP)' (2024)
~1 ms	Typical NTP accuracy over a LAN, 3–6 orders coarser than PTP	linuxptp / SUSE & Red Hat PTP tuning guides (2025)
64-bit	PTP correction field a transparent clock writes residence time into (sub-ns resolution)	IEEE 1588-2019 (PTPv2.1); Microchip TC app note (2019)
ptp4l / phc2sys / ts2phc	The linuxptp daemons: distribute IEEE-1588, slave system clock to PHC, steer PHC from GNSS/PPS	linuxptp documentation; Red Hat OpenShift PTP guide (2025)
Redfish + PLDM/MCTP	Modern OOB fleet management & firmware-update transport with secure OOB	OCP GPU Firmware Update Spec v1.0; DMTF Redfish (2025)
≥ 2 grandmasters	GNSS-disciplined GMs with BMCA failover; holdover oscillator (OCXO→Rb/CSAC) sets coast time	IEEE 1588 BMCA; vendor timing-appliance practice (2025)

Time-sync accuracy as a commissioning acceptance gate

A timing plane that is designed but not measured is a timing plane you are trusting on faith — and faith fails at the worst moment. The discipline that closes the loop is to make time-sync accuracy an explicit commissioning acceptance gate: before a hall is declared ready to take training load, it must demonstrate, by measurement, that every node's clock sits within a stated offset bound of the grandmaster, that the bound holds under fabric load (PTP accuracy must not degrade when the data plane is saturated), and that grandmaster failover and a defined holdover window behave as specified. The measured artifact — offset-from-master distributions across the fleet, holdover drift over a simulated GNSS outage, BMCA failover time — is what converts "we configured PTP" into "the clock is trustworthy."
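One way to produce the measured artifact is to harvest the offsets ptp4l already logs on every node. The sketch below parses `master offset` lines in the form ptp4l commonly emits, keeps only samples taken in the locked servo state (`s2`), and applies a pass/fail bound. The sample log text and the 100 ns bound are made up for illustration, and the log format should be confirmed against the linuxptp version actually deployed.

```python
import re

# Matches e.g. "master offset         -3 s2 freq   -3972 path delay      9155"
OFFSET_RE = re.compile(r"master offset\s+(-?\d+)\s+s(\d)")


def locked_offsets_ns(log_text: str) -> list[int]:
    """Offsets (ns) from ptp4l log lines where the servo is locked (state s2)."""
    return [int(m.group(1)) for m in OFFSET_RE.finditer(log_text)
            if m.group(2) == "2"]


def passes_gate(offsets_ns: list[int], bound_ns: int) -> bool:
    """A node passes only if it has locked samples and none exceeds the bound."""
    return bool(offsets_ns) and max(abs(o) for o in offsets_ns) <= bound_ns


# Illustrative log excerpt, not captured from a real system.
sample_log = """\
ptp4l[349.358]: master offset      -1731 s1 freq   -3972 path delay      9155
ptp4l[350.358]: master offset         12 s2 freq   -3960 path delay      9155
ptp4l[351.358]: master offset         -3 s2 freq   -3972 path delay      9155
ptp4l[352.358]: master offset        -41 s2 freq   -4010 path delay      9156
"""
offsets = locked_offsets_ns(sample_log)
print(offsets, passes_gate(offsets, bound_ns=100))
```

Running the same collection while the data plane is saturated, and again across a forced grandmaster failover, gives the under-load and failover evidence the gate calls for.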

The same gate logic applies to the management fabric: commissioning should prove that every BMC, PDU, switch console, and cooling controller is reachable over the OOB path with the production data plane deliberately severed — because an OOB fabric that has only ever been tested with the data plane up has never actually been tested. The two acceptance criteria belong together in the network-fabric commissioning plan, alongside the RoCE and bisection-bandwidth validation, because they are the operability and observability preconditions for trusting every other test result. → fabric commissioning and validation in Chapter 13.7.
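The reachability half of the gate can be as simple as a TCP connect sweep over the OOB inventory, run from the management zone while the production fabric is down. A minimal sketch using only the standard library; the function name is hypothetical, and a connect test proves only that the port answers, not that the management service behind it is healthy.

```python
import socket


def unreachable(endpoints: list[tuple[str, int]],
                timeout_s: float = 2.0) -> list[tuple[str, int]]:
    """Return the (host, port) pairs that do not accept a TCP connection."""
    failed = []
    for host, port in endpoints:
        try:
            # Open and immediately close a connection to the management port.
            with socket.create_connection((host, port), timeout=timeout_s):
                pass
        except OSError:
            # Covers refused connections, timeouts, and resolution failures.
            failed.append((host, port))
    return failed
```

In use, the endpoint list would come from the OOB inventory (for example BMCs on 443 for Redfish and console servers on 22 for SSH), and the acceptance criterion is an empty return value with the data plane deliberately severed.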

Deep dive: the holdover budget as a quantified design decision

Holdover is where the timing plane's robustness is actually bought, and it is worth working the decision quantitatively rather than picking an oscillator off a datasheet by reflex. The chain of reasoning starts from a tolerance: what is the maximum clock offset your dependent systems can absorb before they lose trust in the timeline? For telemetry correlation on an RDMA fabric that might be a microsecond or two; for some security-event ordering and multi-DC training schemes it may be tighter. Call that tolerance T.

Now the physics: when GNSS is lost, the grandmaster's drift is governed by its oscillator's frequency stability, and the accumulated time error grows roughly as the oscillator's fractional frequency offset multiplied by elapsed time (plus aging and temperature terms). A free-running TCXO at ~1 ppm accumulates a microsecond of error in about a second; an OCXO holds for minutes to tens of minutes; a rubidium or CSAC reference at well below a part per billion holds for hours to days against the same T. So the design question becomes: what is the worst-case GNSS outage I must ride through (an antenna fault waiting on remote hands, a regional jamming/spoofing event, a multi-hour weather fade), and which oscillator class keeps drift under T for that duration? That is the holdover budget. Under-buy it and a routine antenna failure silently degrades your telemetry timeline and any multi-DC ordering until someone climbs to the roof; over-buy it and you have spent rubidium money to ride out a fault that resolves in minutes. The honest move is to set T from the dependent systems, estimate the outage duration you must survive from your siting and operations reality, size the oscillator and the second, independently-fed grandmaster to that, and then verify it in commissioning with a simulated GNSS pull rather than assuming the datasheet. → the acceptance test that proves it lives in Chapter 13.7.
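The first-order version of this budget is a single division: time to exceed T is T divided by the fractional frequency offset. The sketch below works it for three oscillator classes. Only the ~1 ppm TCXO figure comes from the text; the OCXO and rubidium values are order-of-magnitude assumptions chosen for illustration, and aging and temperature terms are ignored.

```python
def holdover_seconds(tolerance_s: float, fractional_freq_offset: float) -> float:
    """First-order time until accumulated error reaches the tolerance T.

    error(t) ~= fractional_freq_offset * t, ignoring aging and temperature.
    """
    return tolerance_s / fractional_freq_offset


T = 1e-6  # 1 microsecond correlation tolerance

# Illustrative fractional frequency offsets; check real datasheets.
oscillators = {
    "TCXO (~1 ppm, from the text)": 1e-6,
    "OCXO (assumed ~1e-9)": 1e-9,
    "Rubidium (assumed ~1e-11)": 1e-11,
}
for name, offset in oscillators.items():
    print(f"{name}: {holdover_seconds(T, offset):,.0f} s")
```

Under these assumptions the three classes land at roughly one second, about 17 minutes, and about 28 hours for the same 1 µs tolerance, which is the spread the oscillator choice is buying.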

The weakest-hop and shared-fate failure modes

Two failure modes recur because they are invisible until tested under stress. First, the weakest hop in the timing path: a single non-PTP-aware switch, an unconfigured boundary clock, or a tier running the wrong one-step/two-step mode injects uncompensated, variable delay that caps the accuracy of every downstream node — no grandmaster precision recovers it, and it often only manifests as correlation drift under load. Second, shared fate in the management fabric: a "separate" management VLAN that in fact rides the production switches, or an OOB fabric whose uplink lands in the same power/cooling domain as the thing it manages, fails precisely when you need it. The discipline against both is the same — design for the failure case, then commission against the failure case: validate PTP under a saturated data plane, and validate OOB reachability with the data plane deliberately down. An operational fabric tested only in the happy path has not been tested. → Chapter 13.7; reliability framing in Chapter 12.2.

How time and management underpin the rest of the cluster

It is worth making explicit how far these two quiet fabrics reach into the workloads that earn the revenue. Telemetry correlation — the entire observability stack of metrics, traces, and logs that you use to find a regression or attribute a stall — is only as trustworthy as the clock its records are stamped with; PTP is what makes cross-device traces orderable. RoCE diagnostics — chasing PFC storms, ECN cascades, and microburst-induced drops on a lossless Ethernet fabric — are a microsecond-scale forensic exercise that simply cannot be done on a millisecond clock. Multi-DC training — asynchronous and hierarchical schemes that span campuses — depends on a shared time base to order cross-site events and bound staleness. And the management fabric is the substrate of goodput recovery itself: when a node stalls a synchronous run, your ability to reach its BMC, read why, power-cycle or drain it, and return the cluster to useful work is measured in minutes-of-stranded-GPUs, which is real money against the depreciation clock.

The strategic read is that these are not cost centers to minimize but goodput and ramp enablers to fund deliberately. A physical OOB fabric and a measured PTP plane are a small fraction of a cluster's capex and an outsized fraction of its operability — the difference between an incident that resolves in minutes from a console and one that resolves in hours from a row walk, and between a telemetry timeline you can trust and one you cannot. Scope them with the same rigor as the data plane, gate them in commissioning, and they become invisible in the best way: you stop noticing them, because they never fail you at the moment you need them.

This chapter sits inside the broader fabric design: AI traffic characterization and the fundamentals in Chapter 8.1; scale-out topology and oversubscription in Chapter 8.5; the congestion-control and in-network-compute mechanisms whose diagnosis depends on precise time in Chapter 8.6; and the multi-campus scale-across fabric whose cross-site ordering assumes a shared time base in Chapter 8.8. The acceptance gates that prove both the OOB fabric and the timing plane live in Chapter 13.7. Management-plane isolation, GNSS/timing-source security, and segmentation policy are deepened in Chapter 11.7 and framed by the threat model in Chapter 11.1. The goodput-vs-availability lens that explains why management reachability is a reliability lever sits in Chapter 12.2; topology-aware scheduling that consumes node health/telemetry in Chapter 10.2; and the checkpoint math behind why fast node recovery matters in Chapter 9.4.