
Chapter 10.9

Customer Onboarding, Delivery & Productization

A GPU cluster is not a product until someone can buy it, get a job running on it, be metered fairly for it, and leave it — and where you sit on the value-stack ladder (bare-metal up to serverless) deterministically sets your isolation model, your SLA exposure, your billing engine, and the gross margin you keep on every GPU-hour you sell.

GOODPUT · POWER-BOUND

What you'll decide here

Which rung of the value-stack ladder you sell — bare-metal, GPUaaS, managed, or serverless — because that single choice sets your isolation boundary, the share of goodput risk you carry versus the customer, your billing granularity, and the margin you can keep.
Your isolation model — physical-node / bare-metal vs VM vs MIG/time-slice fractional — because it is simultaneously a security boundary, a noisy-neighbor boundary, and a billing-granularity boundary, and the three pull in different directions.
What you actually commit to in the SLA — facility availability (the easy promise) versus goodput / effective-training-time (the promise the customer actually values) — and the credit schedule that prices a miss.
Your capacity-and-billing model across the on-demand / spot / reserved / take-or-pay spectrum — the split that sets your revenue predictability, your debt capacity, and how much utilization risk you have transferred to the customer.
Your time-to-first-job target and the onboarding automation that hits it — because TTFJ is the conversion metric of the whole business, and the offboarding/portability story is what a sophisticated buyer checks before they sign.

Every chapter in Part 10 builds the machine; this is the chapter that turns the machine into a product someone can buy. The orchestration plane, the multi-tenancy isolation, the provisioning automation, the health telemetry, the fault-tolerance loop, the training frameworks — all of it is internal plumbing until it is wrapped in a consumption model, an API, an SLA, a billing meter, and an onboarding path. Productization is the act of drawing a clean line between what the operator owns and what the customer owns, and then pricing the responsibility on each side of that line. Draw the line in the wrong place and you either give away margin you should have kept or carry risk you should have transferred.

We lay out the value-stack ladder from bare-metal to serverless and show how each rung moves the boundary; we work the multi-tenancy and isolation fork as a joint security/noisy-neighbor/billing decision; we build the control plane, API, and provisioning surface that makes the product self-serve; we separate the SLA you can promise (facility uptime) from the SLA the customer values (goodput); we map the billing and capacity spectrum from spot to take-or-pay; and we close on onboarding, time-to-first-job, and offboarding/portability — the metrics that decide conversion and the exit story that decides trust. The rung you pick on the ladder sets every other choice here.

The value-stack ladder: the master commercial fork

The same physical GPUs can be sold at four very different altitudes, and the altitude you pick is the master commercial decision of this chapter. At the bottom is bare-metal: you hand the customer a dedicated, single-tenant node (or a whole cluster) with raw access to the hardware, an OS image, and a network fabric — and almost nothing else. You keep a thin margin on raw capacity, you carry no software risk, and the customer owns goodput, drivers, schedulers, and uptime above the metal. One rung up is GPUaaS (the virtualized or containerized IaaS tier): VMs or Kubernetes with the GPU passed through, multi-tenant fabric, managed networking and storage, a self-serve API. Higher still is managed: the operator runs the orchestration plane for the customer — managed Slurm, managed Kubernetes, validated training stacks, active health-checks, automatic node draining — and prices the operational labor. At the top is serverless: the customer never sees a node at all; they send a request or a function, the platform scales GPUs up from zero and back down, and the meter runs in milliseconds.

Moving up the ladder is a systematic exchange of responsibility for margin. Each rung adds operator-owned software and operator-carried risk, and each rung raises the price you can charge for carrying it. A bare-metal hour and a serverless second of the same H100 can differ by an order of magnitude in effective $/GPU-hr, and the difference is not arbitrage: it is the operator absorbing scheduling, idle-time, cold-start, and goodput risk that the bare-metal customer would otherwise carry themselves. The fork is not 'which is better'; it is which risks you want to own and charge for, and which you want to push to the customer at a lower price. → multi-tenancy mechanics in Chapter 10.3; provisioning automation in Chapter 10.5.
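As a rough illustration of why the price gap between rungs is not arbitrage, the sketch below converts a list rate into a cost per useful GPU-hour. The function name, the 25% utilization, and the $9.00 serverless-equivalent rate are assumptions chosen for illustration; only the $2.29 figure echoes the neocloud median quoted later in this chapter.

```python
def cost_per_useful_gpu_hour(list_rate_per_hr: float, useful_fraction: float) -> float:
    """List price divided by the fraction of paid hours that do useful work."""
    if not 0 < useful_fraction <= 1:
        raise ValueError("useful_fraction must be in (0, 1]")
    return list_rate_per_hr / useful_fraction


# Hypothetical tenant: rents bare-metal at $2.29/hr but keeps the GPU busy 25% of the time.
bare_metal_effective = cost_per_useful_gpu_hour(2.29, 0.25)

# Hypothetical serverless rate of $9.00 per GPU-hour-equivalent, billed only while work runs.
serverless_effective = cost_per_useful_gpu_hour(9.00, 1.0)

print(f"bare-metal: ${bare_metal_effective:.2f}/useful hr")
print(f"serverless: ${serverless_effective:.2f}/useful hr")
```

Under these assumed numbers the two rungs cost the tenant about the same per useful hour; the idle time has simply moved from the customer's bill to the operator's fleet.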

The master fork: capacity vendor vs goodput vendor

Decide which side of the capacity/goodput line you sell before anything else. Selling bare-metal or raw GPUaaS makes you a capacity vendor: you promise the hardware is powered, networked, and reachable, and the customer is responsible for turning capacity into useful work. Selling managed or serverless makes you a goodput vendor: you promise effective, useful compute — successful training steps, served tokens within an SLO — and you own the entire stack between the metal and that outcome. These are different businesses with different cost structures, different SLA exposure, and different margins. A capacity vendor competes on $/GPU-hr and time-to-power; a goodput vendor competes on reliability, time-to-first-job, and the operational labor it absorbs. The mistake is selling a goodput-shaped SLA on a capacity-shaped cost structure — promising effective-training-time while staffing and pricing as if you only owe a powered node. Decide which side of this line you are on before you write the SLA, because the credit schedule will bankrupt the wrong one. → SLA framing in Chapter 12.4.

The value-stack ladder — what each rung owns, sells, and risks

Rung	Operator owns	Customer owns	Billing unit	Goodput owner	Margin posture
Bare-metal	Power, cooling, fabric, OS image	Drivers, scheduler, uptime, jobs	GPU-hour (reserved/dedicated)	Customer	Thinnest; pure capacity
GPUaaS (IaaS)	Hypervisor/K8s, network, storage, API	Workloads, orchestration logic	GPU-hour / GPU-second	Shared	Thin-to-moderate; volume tier
Managed	Orchestration, health-checks, validated stacks	Model, data, hyperparameters	GPU-hour + managed-service fee	Operator	Moderate; priced operational labor
Serverless	Whole stack, autoscale, cold-start	The request / function only	Per-second / per-token / per-request	Operator	Highest per-unit; absorbs idle+cold-start

Synthesis of SemiAnalysis ClusterMAX 2.0, Crusoe managed-Slurm, Rafay GPU-PaaS, and Introl serverless comparisons, 2025-2026. 'Goodput owner' = who is contractually on the hook for effective useful compute, not just powered hardware.

Multi-tenancy and isolation: one decision, three boundaries

The moment you sell anything above dedicated bare-metal, you have to decide how tenants share hardware — and the isolation model you pick is simultaneously a security boundary, a noisy-neighbor boundary, and a billing-granularity boundary. These three pull in different directions, which is why the decision is harder than it looks. The taxonomy runs from hard to soft. Physical / bare-metal isolation gives each tenant a whole node or cluster — the strongest boundary, the coarsest billing unit (you cannot sell a fraction of a node), and zero noisy-neighbor risk. VM-level isolation with GPU passthrough multiplexes tenants per host behind a hypervisor; the boundary is strong but the east-west fabric and shared storage become the contended surface. Fractional isolation — NVIDIA MIG (hardware-partitioned) or time-slicing/MPS (software-shared) — lets you sell a slice of a single GPU, which is the only way to make small-inference and serverless economics work, but it is the weakest boundary on every axis.

The consequence to name explicitly: fractional sharing is a billing enabler and a security liability at the same time. MIG partitions are hardware-enforced and the strongest of the sub-GPU options, but time-slicing and MPS are not robust confidentiality boundaries: documented covert and side channels bypass MPS/MIG isolation, and real vGPU CVEs (e.g. the 2025 NVIDIA vGPU advisories) have shown that partitioning is not, by itself, a security boundary you can sell to an adversarial multi-tenant workload. The decision rule follows: if your tenants are mutually untrusting (a public neocloud), you owe them either bare-metal/VM isolation or hardware-attested confidential computing, and you give up the finest billing granularity. If your tenants are one organization (an internal platform), fractional sharing is the right efficiency lever and the security objection mostly evaporates. → the full isolation engineering and its failure modes are Chapter 10.3; model/weight-in-use protection is Chapter 11.8.

Isolation models scored on the three boundaries they straddle

Isolation model	Security boundary	Noisy-neighbor	Billing granularity	Adversarial-safe?
Bare-metal / dedicated node	Strongest (physical)	None	Whole node/cluster	Yes
VM passthrough	Strong (hypervisor)	Fabric/storage contention	Per-GPU / per-VM	Yes, with fabric isolation
MIG (hardware partition)	Moderate (HW-enforced)	Low (partitioned)	GPU fraction (fixed slices)	Caution — side channels demonstrated
Time-slicing / MPS	Weakest (software)	High (shared SMs)	Finest (arbitrary share)	No — not a confidentiality boundary
Confidential computing (TEE)	Strong (attested, in-use)	Per underlying model	Per attested instance	Yes — designed for it

Synthesis of NVIDIA MIG/Confidential Computing docs, Introl/Aarna multi-tenancy taxonomies, and NVIDIA vGPU security advisories, 2025-2026. 'Adversarial-safe' = defensible boundary for mutually-untrusting public tenants.
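The rule that falls out of the table can be written as a small policy check. This is a minimal sketch: the enum, the set, and the function name are hypothetical, and the "adversarial-safe" membership simply mirrors the last column of the table above rather than any vendor guarantee.

```python
from enum import Enum


class Isolation(Enum):
    BARE_METAL = "bare-metal / dedicated node"
    VM_PASSTHROUGH = "VM passthrough"
    MIG = "MIG (hardware partition)"
    TIME_SLICE_MPS = "time-slicing / MPS"
    CONFIDENTIAL = "confidential computing (TEE)"


# Models the table marks as defensible for mutually-untrusting public tenants.
# VM passthrough is only safe "with fabric isolation"; that condition is assumed met here.
ADVERSARIAL_SAFE = {
    Isolation.BARE_METAL,
    Isolation.VM_PASSTHROUGH,
    Isolation.CONFIDENTIAL,
}


def allowed_isolation(mutually_untrusting: bool) -> set[Isolation]:
    """Return the isolation models the chapter's rule permits for a tenant mix."""
    if mutually_untrusting:
        return set(ADVERSARIAL_SAFE)
    # Single-organization platform: fractional sharing is on the table.
    return set(Isolation)


print(sorted(m.value for m in allowed_isolation(mutually_untrusting=True)))
```

Encoding the rule at the provisioning layer, rather than in a sales playbook, is one way to make sure a fractional slice is never handed to a public tenant by accident.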

The control plane, API and provisioning surface

Productization is, concretely, the act of putting an API in front of the cluster so a customer can self-serve the lifecycle — request capacity, provision an environment, run a job, observe it, tear it down — without a human in the loop. The control plane that backs that API is the difference between a colo full of GPUs and a cloud. Below the API sits the bring-up automation from Chapter 10.5 (Redfish/IPMI, PXE, image pipelines, infrastructure-as-code), the scheduler from Chapter 10.1 (Slurm or Kubernetes, with fair-share, quota, and preemption), and the telemetry from Chapter 10.6 feeding the meter and the health dashboard. The product decision is how much of that surface you expose: a bare-metal vendor exposes node lifecycle; a managed vendor exposes a job-and-cluster abstraction and hides the nodes; a serverless vendor exposes only a function or an inference endpoint.

The consequence of getting the control plane wrong is measured in operational labor that should have been software. Every manual step between 'customer clicks provision' and 'job is running' is a margin leak and a TTFJ penalty; the maturity benchmark in the market (SemiAnalysis ClusterMAX) scores providers in part on exactly this: orchestration, lifecycle automation, and observability are explicit rated dimensions. A neocloud that brings up nodes by hand cannot hit a serverless cold-start budget or a managed-Slurm SLA, and it cannot grow sales without growing its NOC headcount in step. The control plane is where the productization either compounds (software that serves the next 1,000 customers at near-zero marginal cost) or fails to (a services business whose NOC headcount scales linearly with revenue).

Deep dive: the API surface a credible GPU product must expose

A productized control plane is not one API but a layered set, and a buyer evaluates each layer. Identity and tenancy: per-tenant projects/namespaces, RBAC, quotas, and budget caps — the substrate of multi-tenancy and the thing that stops one tenant from spending another's capacity. Capacity: request, reserve, and release GPUs against on-demand, spot, and reserved pools, with the reservation model exposed as a first-class object (Google Cloud, instructively, makes capacity reservations distinct API objects from committed-use discounts — separating 'is the hardware held for me' from 'have I committed to pay'). Lifecycle: provision an environment (image, drivers, NCCL, scheduler), run/checkpoint/resume a job, drain/replace a faulty node — ideally declaratively via Terraform-style IaC so the customer can version their cluster. Observability and metering: per-job and per-tenant utilization, goodput/health signals, and a usage feed that the billing engine consumes. Egress and data: object and high-throughput storage, dataset staging, and — critically for the offboarding story — bulk export.

The design tension is abstraction versus control. Training customers running 3D-parallel jobs want low-level control: topology-aware placement, specific NCCL/fabric tuning, bare-metal performance with no hypervisor tax. Inference and app-layer customers want the opposite: a high abstraction that hides nodes entirely. A single product cannot be maximally low-level and maximally abstracted at once, which is the deeper reason the value-stack ladder exists as separate rungs rather than a single dial. Most credible operators ship two or three rungs (e.g. bare-metal/Slurm for training tenants, serverless endpoints for inference tenants) on one underlying fleet, and let the customer pick the abstraction that matches the workload. → scheduling plane in Chapter 10.1; bring-up IaC in Chapter 10.5.
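The separation of 'is the hardware held for me' from 'have I committed to pay' can be sketched as two distinct objects. The class names, fields, and helper below are hypothetical and are not any provider's actual schema; they only illustrate why keeping the two promises apart makes gaps between them visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityReservation:
    """'Is the hardware held for me?' (hypothetical shape)."""
    tenant: str
    gpu_count: int
    pool: str  # "on-demand", "spot", or "reserved"


@dataclass(frozen=True)
class SpendCommitment:
    """'Have I committed to pay?' Tracked separately from the hardware hold."""
    tenant: str
    gpu_hours_committed: float
    term_months: int


def uncovered_gpu_hours(res: CapacityReservation,
                        com: SpendCommitment,
                        hours_in_term: float) -> float:
    """GPU-hours held for the tenant that no spend commitment covers."""
    held = res.gpu_count * hours_in_term
    return max(0.0, held - com.gpu_hours_committed)


# Illustrative values: 8 GPUs held for a 720-hour month, 4,000 GPU-hours committed.
reservation = CapacityReservation(tenant="acme", gpu_count=8, pool="reserved")
commitment = SpendCommitment(tenant="acme", gpu_hours_committed=4000.0, term_months=1)
print(uncovered_gpu_hours(reservation, commitment, hours_in_term=720.0))
```

The uncovered hours are capacity the operator is holding without contracted revenue behind it, which is exactly the exposure a fused reservation-plus-discount object would hide.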

SLAs and reliability commitments: facility uptime vs goodput

Here is the commercial trap that catches new operators. The easy SLA to write is facility availability — 'your node will be powered and reachable 99.9% of the month' — because it maps to the uptime tiers everyone already understands (Tier III ~99.982%, ~1.6 hr/yr down; Tier IV ~99.995%, ~26 min/yr; Uptime Institute). The SLA the customer actually values is goodput — the fraction of contracted GPU-time that produced useful work. For a training tenant, a node that is 'available' but flapping, throttling, or failing NCCL all-reduce is worthless; what they bought was effective training time, and the industry now measures it: goodput averages ~90% across operators, with best-in-class clusters marketed near ~96% (SemiAnalysis ClusterMAX; Google Cloud goodput definition). The gap between those two SLA shapes is the gap between selling capacity and selling outcomes.
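The downtime figures attached to each availability tier are plain arithmetic, which the following sketch reproduces. The function name is an assumption, and the calculation uses a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be down while still meeting the availability figure."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for label, pct in [("99.9% SLA", 99.9), ("Tier III", 99.982), ("Tier IV", 99.995)]:
    hours = annual_downtime_hours(pct)
    print(f"{label}: {hours:.2f} hr/yr ({hours * 60:.0f} min)")
```

Note that none of these numbers says anything about goodput: a node can sit inside its downtime allowance all year and still deliver poor effective training time.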

The trade is direct: the SLA shape you choose dictates which failures you pay for and how much you must invest to avoid them. Commit only to facility availability and your obligation is met by redundant power and cooling; the customer eats the goodput loss from a bad GPU or a fabric brownout, and you compete on price. Commit to goodput and you must build the whole fault-tolerance loop — active health-checks, fast node draining, automatic restart-from-checkpoint, burn-in and acceptance testing — because every percentage point of badput is now a credit you owe. The reliability overhead to hold goodput high runs an estimated 6–21% of TCO; that spend is optional for a capacity vendor and mandatory for a goodput vendor. The credit schedule is where this gets priced: tiered service credits (e.g. 10% of the affected charges below 99.9%, escalating with the miss) must be calibrated against your actual goodput distribution, or a single bad month of badput erases a quarter of margin. → goodput-vs-availability engineering in Chapter 12.2; the contract mechanics in Chapter 12.4; the fault-tolerance loop in Chapter 10.7.
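A tiered credit schedule of the kind described here can be sketched as a lookup. Only the first step (10% of affected charges below 99.9%) comes from the example in the text; the deeper tiers, the thresholds, and the names are assumptions that a real contract would replace.

```python
# (minimum measured %, credit as a fraction of affected charges), checked top down.
# Only the 10%-below-99.9% step is from the text; the 25% and 50% tiers are assumed.
CREDIT_TIERS = [
    (99.9, 0.00),
    (99.0, 0.10),
    (95.0, 0.25),
    (0.0, 0.50),
]


def service_credit(measured_pct: float, affected_charges: float) -> float:
    """Credit owed for a billing period, given the measured SLA metric."""
    for floor, fraction in CREDIT_TIERS:
        if measured_pct >= floor:
            return affected_charges * fraction
    # Below every floor (only possible for negative input): apply the deepest tier.
    return affected_charges * CREDIT_TIERS[-1][1]


for measured in (99.95, 99.5, 97.0, 90.0):
    print(measured, service_credit(measured, affected_charges=100_000.0))
```

Running a schedule like this against a historical distribution of the measured metric, before signing, is one way to see what a bad month actually costs.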

Don't sell a goodput SLA you can't measure

A goodput commitment you cannot instrument is a liability with no defense. If your telemetry cannot attribute lost training time to a root cause — your bad GPU versus the customer's buggy code versus a checkpoint they forgot to enable — every dispute resolves against you, because you carry the burden of proof on your own infrastructure. Before you write 'effective-training-time' into a contract, you need: per-job health attribution, an agreed definition of badput (and whose badput counts), burn-in/acceptance criteria the customer signs off on at handover, and a measurement methodology both sides trust. ClusterMAX exists in part because buyers learned not to take goodput claims on faith. The failure mode is concrete: an operator markets ~96% goodput, contracts credits below 95%, lacks the attribution telemetry to prove a given month's badput was the customer's fault, and pays credits on losses it did not cause. Measure first, commit second. → acceptance/burn-in and health attribution in Chapter 10.6 and Chapter 10.7.
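The difference attribution makes can be shown with a small calculation. This is a sketch under assumptions: the cause labels, the set of operator-attributed causes, and the sample month are all hypothetical, and a real contract would define them explicitly.

```python
from dataclasses import dataclass


@dataclass
class LostInterval:
    hours: float
    cause: str  # e.g. "gpu_fault", "fabric", "customer_code", "no_checkpoint"


# Assumed contract term: which root causes count against the operator.
OPERATOR_CAUSES = {"gpu_fault", "fabric", "node_drain"}


def raw_goodput(contracted_hours: float, lost: list[LostInterval]) -> float:
    """Goodput counting every lost hour, regardless of who caused it."""
    return (contracted_hours - sum(i.hours for i in lost)) / contracted_hours


def contractual_goodput(contracted_hours: float, lost: list[LostInterval]) -> float:
    """Goodput counting only badput attributed to the operator."""
    operator_lost = sum(i.hours for i in lost if i.cause in OPERATOR_CAUSES)
    return (contracted_hours - operator_lost) / contracted_hours


# Hypothetical month: 720 contracted hours, 20 lost to a GPU fault, 30 to a tenant bug.
month = [LostInterval(20.0, "gpu_fault"), LostInterval(30.0, "customer_code")]
print(f"raw: {raw_goodput(720.0, month):.3f}")
print(f"attributed: {contractual_goodput(720.0, month):.3f}")
```

In this made-up month the raw figure lands below a 95% credit line and the attributed figure lands above it, so the operator without attribution telemetry pays a credit that the operator with it does not.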

Billing, metering and capacity reservation

The billing model is where the consumption decision becomes cash, and it spans a spectrum from pure-spot to take-or-pay that mirrors the value-stack ladder in a second dimension. On-demand bills per second or per hour at the highest unit rate, with no commitment — the customer pays for optionality and you carry the utilization risk of an unfilled fleet. Spot / preemptible sells interruptible capacity at a steep discount (often 60–80% off on-demand) to drain idle inventory; the customer carries the interruption risk, which is acceptable for checkpoint-tolerant training and batch but fatal for online inference. Reserved commits the customer to a term (months to years) for a discount and a capacity guarantee — and, importantly, capacity reservation and the price commitment are separable: holding hardware for a tenant ('is it mine?') is a different promise from committing to pay for it ('have I agreed the spend?'), and mature platforms expose them as distinct objects. At the far end, take-or-pay obligates the customer to pay for a contracted block whether or not they use it — the structure that converts merchant utilization risk into contracted revenue, and the one that underwrites the debt capacity behind the build.

The consequence the operator must internalize: the on-demand/spot/reserved/take-or-pay split is a risk-transfer dial, and where you set it determines your debt capacity and your survival of a downturn. A fleet sold entirely on-demand is maximally flexible for customers and maximally exposed for you — you sit directly on the ~70% breakeven utilization cliff (a 1,024-GPU cluster swings from −$330k/month at 55% utilization to +$340k at 85%; the ~70% figure is contested and single-source). A fleet anchored by take-or-pay has transferred that utilization risk to tenants, which is exactly why lenders price GPU-backed debt against contracted backlog rather than merchant hope. The trap on the take-or-pay side is concentration: a backlog that is take-or-pay but lives in two or three anchor tenants converts utilization risk into counterparty risk, and one non-renewal can strand a campus. The metering substrate underneath all of this — per-second granularity, per-token billing for inference, transparent egress, no surprise fees — is itself a competitive axis; ClusterMAX scores pricing transparency as a rated dimension precisely because hidden egress and rounding games erode trust. → the firm-level unit economics of this revenue (metered $/token and $/GPU-hr → margin) are worked in Chapter 1.8.
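The ~70% breakeven can be recovered from the two quoted data points by linear interpolation. The sketch assumes monthly profit is linear in utilization between 55% and 85%, which the source does not state; the function name is also an assumption.

```python
def breakeven_utilization(u_lo: float, pnl_lo: float, u_hi: float, pnl_hi: float) -> float:
    """Utilization at which profit crosses zero, assuming profit is linear in utilization."""
    slope = (pnl_hi - pnl_lo) / (u_hi - u_lo)  # dollars per unit of utilization
    return u_lo - pnl_lo / slope


# Figures quoted in the text for a 1,024-GPU cluster (monthly P&L).
breakeven = breakeven_utilization(0.55, -330_000, 0.85, 340_000)
print(f"breakeven utilization: {breakeven:.1%}")
```

Under the linearity assumption, each utilization point in this example is worth roughly $22k per month, which gives a sense of how steep the cliff is on either side of breakeven.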

The capacity-and-billing spectrum — who carries which risk

Model	Commitment	Indicative discount vs on-demand	Utilization risk	Best fit
On-demand	None	0% (baseline)	Operator	Bursty, unpredictable demand
Spot / preemptible	None (interruptible)	~60-80% off	Customer (interruption)	Checkpoint-tolerant training, batch
Reserved (term)	Months-years	~30-60% off	Shared	Steady, forecastable workloads
Take-or-pay	Pay-regardless block	Largest, plus capacity guarantee	Customer (transferred)	Anchor tenants; debt-underwriting

Synthesis of Google Cloud GPU pricing/reservations docs, Compute Exchange reserved-vs-on-demand, and SemiAnalysis market-structure analysis, 2025-2026. Discounts are indicative ranges, highly volatile.
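From the customer's side, the reserved discount implies a utilization threshold. The sketch below assumes a reserved block is paid for every hour of the term while on-demand is paid only for hours used, and it ignores the value of the capacity guarantee; the function names are hypothetical.

```python
def min_utilization_for_reserved(discount: float) -> float:
    """Utilization above which reserved costs less than on-demand for the same work.

    Reserved cost per term hour is (1 - discount) * rate; on-demand cost is
    utilization * rate. Reserved wins when utilization > 1 - discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def reserved_beats_on_demand(discount: float, expected_utilization: float) -> bool:
    return expected_utilization > min_utilization_for_reserved(discount)


# The ends of the indicative ~30-60% reserved discount range from the table.
for d in (0.30, 0.60):
    print(f"{d:.0%} off: reserved wins above {min_utilization_for_reserved(d):.0%} utilization")
```

This is the risk-transfer dial seen from the other side: the deeper the discount, the lower the utilization at which the customer is willing to take on the idle-capacity risk.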

Onboarding, time-to-first-job and the productization lifecycle

Time-to-first-job (TTFJ) is the conversion metric of the entire business — the wall-clock from 'customer signs up' to 'customer's first useful GPU job is running' — and it is where the productization either delights or dies. On a bare-metal rung, TTFJ is dominated by node provisioning, image deployment, driver/NCCL setup, and acceptance burn-in; a manual operator measures this in days, an automated one in hours, and the gap is pure product quality. On a serverless rung, TTFJ collapses to cold-start latency — the seconds from request to first token — and the numbers are now precise: warm-pool serverless H100 endpoints hit single-digit-to-low-teens seconds to first token, cold (scale-from-zero) containers run 30–90 seconds depending on image size, and snapshot/weights-in-VRAM tricks claim up to ~10x cold-start improvements (Modal/RunPod comparisons, 2025–2026). The same metric, TTFJ, has completely different physics at the two ends of the ladder.
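On the serverless rung, the latency a customer experiences is a blend of warm and cold starts. The sketch below computes the expected time to first token from a warm-pool hit rate; the 10 s warm and 60 s cold values are placeholders inside the ranges quoted above, and the 90% hit rate and the function name are assumptions.

```python
def expected_time_to_first_token(warm_hit_rate: float, warm_s: float, cold_s: float) -> float:
    """Expected seconds to first token when a share of requests lands on a warm pool."""
    if not 0 <= warm_hit_rate <= 1:
        raise ValueError("warm_hit_rate must be in [0, 1]")
    return warm_hit_rate * warm_s + (1 - warm_hit_rate) * cold_s


# Placeholder latencies within the quoted ranges; 90% warm-pool hit rate is assumed.
blended = expected_time_to_first_token(warm_hit_rate=0.9, warm_s=10.0, cold_s=60.0)
print(f"expected time to first token: {blended:.1f} s")
```

The mean hides the tail: one request in ten still waits the full cold-start time in this example, which is why the warm-pool size is a product decision and not only a cost line.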

The consequence is that onboarding automation is not a nicety; it is the marginal-cost structure of the product. Each manual onboarding step caps how many customers one engineer can land, which caps growth at the speed of hiring. The fix is the same automation that backs the control plane — IaC-driven provisioning, self-serve quota/RBAC, pre-validated stacks, and acceptance tests that run without a human — so that the 1,000th customer onboards as cheaply as the first. The productization lifecycle that follows TTFJ is the long tail: usage growth, expansion to reserved/committed capacity, support tiers, and the metering that bills it all. But the lifecycle never starts if TTFJ is bad, which is why it is the first thing a sophisticated buyer tests with a trial workload before committing budget.

And then there is the decision most operators underweight and the best buyers check first: offboarding and portability. A customer evaluating a multi-year, take-or-pay GPU commitment is implicitly asking 'how hard is it to leave?' — because the answer prices their lock-in risk into the deal. The portability story has three parts: data egress (can I bulk-export my datasets and checkpoints without a punitive egress bill or a multi-week throughput bottleneck?), stack portability (is my orchestration standard — vanilla Slurm/Kubernetes, OCI containers — or a proprietary control plane that traps my workflows?), and commitment exit (does my reserved/take-or-pay block have real termination economics or a secondary market, or am I locked to a depreciating asset for the full term?). The conclusion is counterintuitive: a credible, low-friction exit story is a sales asset, not a risk. The operator who builds on open standards and clean egress wins the cautious enterprise buyer precisely because the lack of lock-in lowers the buyer's perceived risk of signing — and a secondary market for reserved blocks (an emerging structure for reserved-capacity trading) turns a take-or-pay commitment from a trap into a liquid, transferable asset. The operator who maximizes lock-in wins the term but loses the next deal to a competitor the buyer trusts more.
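The 'multi-week throughput bottleneck' in the data-egress question is easy to quantify. This sketch assumes decimal units (1 TB = 10^12 bytes), a sustained rate with no protocol overhead, and illustrative dataset sizes; none of these numbers come from the source.

```python
SECONDS_PER_DAY = 86_400


def export_days(dataset_tb: float, sustained_gbit_per_s: float) -> float:
    """Days needed to bulk-export a dataset at a sustained network rate."""
    if sustained_gbit_per_s <= 0:
        raise ValueError("sustained_gbit_per_s must be positive")
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (sustained_gbit_per_s * 1e9)
    return seconds / SECONDS_PER_DAY


# Illustrative: 1 PB and 5 PB of checkpoints and datasets over a sustained 10 Gbit/s path.
for size_tb in (1_000, 5_000):
    print(f"{size_tb} TB at 10 Gbit/s: {export_days(size_tb, 10):.1f} days")
```

A buyer can run the same arithmetic against the operator's published egress limits, which is why stating a sustained export rate up front is part of a credible exit story.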

~90% / ~96%

goodput (effective training time): industry average vs best-in-class marketed; reliability overhead 6-21% of TCO

2025 · SemiAnalysis ClusterMAX / CoreWeave

10 dimensions / 5 tiers

ClusterMAX 2.0 GPU-cloud rating: Security, Lifecycle, Orchestration, Storage, Networking, Reliability, Monitoring, Pricing, Partnerships, Availability — Platinum to UnderPerform

2025 · SemiAnalysis ClusterMAX 2.0

~70%

breakeven utilization for a debt-financed fleet; the cliff the on-demand/take-or-pay mix transfers or carries (contested — single-source)

2025 · AM Compute / McKinsey

8-15s warm / 30-90s cold

serverless GPU time-to-first-token (H100): warm-pool vs scale-from-zero; snapshots claim ~10x cold-start gains

2026 · RunPod / Modal serverless comparisons

~60-80% off

spot/preemptible discount vs on-demand; the price of transferring interruption risk to the customer

2026 · Spheron / GCP GPU pricing synthesis

~$1.03 - $12.29/GPU-hr

H100 on-demand ladder: spot floor to Azure managed; neocloud median ~$2.29-3.50 (the value-stack premium, monetized)

2026 · SemiAnalysis H100 Index / AM Compute

99.982% / 99.995%

Tier III vs Tier IV facility availability (~1.6 hr vs ~26 min/yr) — the easy SLA, distinct from goodput

2025 · Uptime Institute

The unit-economics tie-back

Everything in this chapter resolves into a single revenue line that Chapter 1.8 turns into a return. The value-stack rung sets your price per unit; the isolation model sets your finest billable granularity; the SLA shape sets your reliability spend and your credit exposure; the capacity model sets how much of your revenue is contracted versus merchant; and TTFJ sets your conversion. Together they determine metered revenue per token and per GPU-hour, which — net of the cost stack from Part 5 through Part 9 and the depreciation debate from Chapter 1.8 — is the gross margin the asset earns. The clean way to read this chapter is as the revenue-side complement to 1.8's cost-side analysis: 1.8 asks whether the asset earns its cost of capital; this chapter decides how the revenue that feeds that question actually gets priced, packaged, metered, and collected. The recurring warning from 1.8 applies in full here: underwrite the metered revenue with an explicit price-decline curve (inference $/token has fallen ~10x/yr at fixed quality), never a flat line, because the product you price profitably today is re-priced by a cheaper rung or a more efficient serving stack within quarters. → inference serving efficiency, the governor of $/token, is engineered in Chapter 10.11; data-handling obligations that ride alongside the commercial contract are Chapter 10.10.
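The warning about flat-line underwriting can be made concrete with a simple exponential price curve. This is a sketch assuming a constant decline factor applied continuously over the year; the ~10x/yr figure is the one quoted above for inference $/token at fixed quality, and the function name and starting price are hypothetical.

```python
def price_after(p0: float, months: float, annual_decline_factor: float = 10.0) -> float:
    """Price after `months`, if price falls by `annual_decline_factor` every 12 months."""
    if annual_decline_factor <= 0:
        raise ValueError("annual_decline_factor must be positive")
    return p0 / annual_decline_factor ** (months / 12)


# Hypothetical starting price of $1.00 per million tokens.
for m in (0, 6, 12, 24):
    print(f"month {m:>2}: ${price_after(1.00, m):.4f} per million tokens")
```

Under that assumption a product priced at $1.00 sells for about $0.32 after six months, so a revenue model that holds the launch price flat for even two quarters overstates the second quarter by roughly a factor of three.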

The altitude marker

This chapter is the commercial wrapper around the Part 10 machine — how internal infrastructure becomes a buyable, billable, leaveable product. It deliberately stops at the boundary of three neighbors. The engineering of the things it productizes lives upstream: multi-tenancy isolation in Chapter 10.3, provisioning in Chapter 10.5, observability in Chapter 10.6, fault tolerance in Chapter 10.7, inference serving in Chapter 10.11. The contractual depth of the SLA — goodput contracts, credit math, availability modeling — lives downstream in Chapter 12.4. The economics of the revenue this chapter prices lives in Chapter 1.8. Keep the altitudes separate: this chapter is about the product surface and the responsibility line, not the silicon beneath it or the balance sheet above it.

The workload archetypes that determine which rung of the ladder a customer needs are framed in Chapter 1.1, and the firm-level economics this chapter feeds are Chapter 1.8. The internal machinery being productized lives across Part 10: the scheduling plane in Chapter 10.1, multi-tenancy and isolation in Chapter 10.3, provisioning and bring-up in Chapter 10.5, observability and health in Chapter 10.6, fault tolerance in Chapter 10.7, and inference serving in Chapter 10.11. The SLA shape introduced here is engineered as goodput-vs-availability in Chapter 12.2 and contracted in Chapter 12.4. Weight/model-in-use protection behind the isolation boundary is Chapter 11.8; the customer-data governance riding alongside the commercial contract is Chapter 10.10; data residency and sovereignty constraints on where the product can be sold are Chapter 3.12.