Chapter 10.9
Customer Onboarding, Delivery & Productization
A GPU cluster is not a product until someone can buy it, get a job running on it, be metered fairly for it, and leave it — and where you sit on the value-stack ladder (bare-metal up to serverless) deterministically sets your isolation model, your SLA exposure, your billing engine, and the gross margin you keep on every GPU-hour you sell.
What you'll decide here
- Which rung of the value-stack ladder you sell — bare-metal, GPUaaS, managed, or serverless — because that single choice sets your isolation boundary, the share of goodput risk you carry versus the customer, your billing granularity, and the margin you can keep.
- Your isolation model — physical-node / bare-metal vs VM vs MIG/time-slice fractional — because it is simultaneously a security boundary, a noisy-neighbor boundary, and a billing-granularity boundary, and the three pull in different directions.
- What you actually commit to in the SLA — facility availability (the easy promise) versus goodput / effective-training-time (the promise the customer actually values) — and the credit schedule that prices a miss.
- Your capacity-and-billing model across the on-demand / spot / reserved / take-or-pay spectrum — the split that sets your revenue predictability, your debt capacity, and how much utilization risk you have transferred to the customer.
- Your time-to-first-job target and the onboarding automation that hits it — because TTFJ is the conversion metric of the whole business, and the offboarding/portability story is what a sophisticated buyer checks before they sign.
Every other chapter in Part 10 builds the machine; this is the chapter that turns the machine into a product someone can buy. The orchestration plane, the multi-tenancy isolation, the provisioning automation, the health telemetry, the fault-tolerance loop, and the training frameworks are all internal plumbing until they are wrapped in a consumption model, an API, an SLA, a billing meter, and an onboarding path. Productization is the act of drawing a clean line between what the operator owns and what the customer owns, and then pricing the responsibility on each side of that line. Draw the line in the wrong place and you either give away margin you should have kept or carry risk you should have transferred.
We lay out the value-stack ladder from bare-metal to serverless and show how each rung moves the boundary; we work the multi-tenancy and isolation fork as a joint security/noisy-neighbor/billing decision; we build the control plane, API, and provisioning surface that makes the product self-serve; we separate the SLA you can promise (facility uptime) from the SLA the customer values (goodput); we map the billing and capacity spectrum from spot to take-or-pay; and we close on onboarding, time-to-first-job, and offboarding/portability — the metrics that decide conversion and the exit story that decides trust. The rung you pick on the ladder sets every other choice here.
The value-stack ladder: the master commercial fork
The same physical GPUs can be sold at four very different altitudes, and the altitude you pick is the master commercial decision of this chapter. At the bottom is bare-metal: you hand the customer a dedicated, single-tenant node (or a whole cluster) with raw access to the hardware, an OS image, and a network fabric — and almost nothing else. You keep a thin margin on raw capacity, you carry no software risk, and the customer owns goodput, drivers, schedulers, and uptime above the metal. One rung up is GPUaaS (the virtualized or containerized IaaS tier): VMs or Kubernetes with the GPU passed through, multi-tenant fabric, managed networking and storage, a self-serve API. Higher still is managed: the operator runs the orchestration plane for the customer — managed Slurm, managed Kubernetes, validated training stacks, active health-checks, automatic node draining — and prices the operational labor. At the top is serverless: the customer never sees a node at all; they send a request or a function, the platform scales GPUs up from zero and back down, and the meter runs in milliseconds.
The consequence of moving up the ladder is a systematic trade of margin for responsibility. Each rung adds operator-owned software and operator-carried risk, and each rung raises the price you can charge for carrying them. A bare-metal hour and a serverless second of the same H100 can differ by an order of magnitude in effective $/GPU-hr, and the difference is not arbitrage: it is the operator absorbing scheduling, idle-time, cold-start, and goodput risk that the bare-metal customer would otherwise carry themselves. The fork is not 'which is better'; it is which risks you want to own and charge for, and which you want to push to the customer at a lower price. → multi-tenancy mechanics in Chapter 10.3; provisioning automation in Chapter 10.5.
| Rung | Operator owns | Customer owns | Billing unit | Goodput owner | Margin posture |
|---|---|---|---|---|---|
| Bare-metal | Power, cooling, fabric, OS image | Drivers, scheduler, uptime, jobs | GPU-hour (reserved/dedicated) | Customer | Thinnest; pure capacity |
| GPUaaS (IaaS) | Hypervisor/K8s, network, storage, API | Workloads, orchestration logic | GPU-hour / GPU-second | Shared | Thin-to-moderate; volume tier |
| Managed | Orchestration, health-checks, validated stacks | Model, data, hyperparameters | GPU-hour + managed-service fee | Operator | Moderate; priced operational labor |
| Serverless | Whole stack, autoscale, cold-start | The request / function only | Per-second / per-token / per-request | Operator | Highest per-unit; absorbs idle+cold-start |
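The gap between a bare-metal hour and a serverless second is easier to reason about once both are expressed in $/GPU-hr. The sketch below does that conversion and then discounts the serverless figure by the fraction of time the GPU is actually billed, which is the idle and cold-start risk the operator absorbs. Every price and the busy fraction are hypothetical placeholders chosen for round numbers, not market quotes.

```python
def effective_rate_per_gpu_hour(price_per_second: float) -> float:
    """Convert a per-second list price into the equivalent $/GPU-hr."""
    return price_per_second * 3600


def realized_rate_per_provisioned_hour(price_per_second: float,
                                       busy_fraction: float) -> float:
    """Revenue collected per hour a GPU is held ready, given the share of
    that hour in which a customer request is actually being billed."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    return effective_rate_per_gpu_hour(price_per_second) * busy_fraction


# Hypothetical prices for the same GPU sold on two rungs (assumptions).
BARE_METAL_PER_HOUR = 2.00      # $/GPU-hr, dedicated node
SERVERLESS_PER_SECOND = 0.005   # $/GPU-second, billed only while serving

billed_rate = effective_rate_per_gpu_hour(SERVERLESS_PER_SECOND)
realized_rate = realized_rate_per_provisioned_hour(SERVERLESS_PER_SECOND, 0.25)

print(f"serverless list rate: ${billed_rate:.2f}/GPU-hr")
print(f"realized at 25% busy: ${realized_rate:.2f}/GPU-hr")
print(f"bare-metal rate:      ${BARE_METAL_PER_HOUR:.2f}/GPU-hr")
```

With these placeholder inputs the serverless list rate is several times the bare-metal rate, but the operator only realizes it on the busy fraction; the premium is payment for carrying the idle time.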
Multi-tenancy and isolation: one decision, three boundaries
The moment you sell anything above dedicated bare-metal, you have to decide how tenants share hardware — and the isolation model you pick is simultaneously a security boundary, a noisy-neighbor boundary, and a billing-granularity boundary. These three pull in different directions, which is why the decision is harder than it looks. The taxonomy runs from hard to soft. Physical / bare-metal isolation gives each tenant a whole node or cluster — the strongest boundary, the coarsest billing unit (you cannot sell a fraction of a node), and zero noisy-neighbor risk. VM-level isolation with GPU passthrough multiplexes tenants per host behind a hypervisor; the boundary is strong but the east-west fabric and shared storage become the contended surface. Fractional isolation — NVIDIA MIG (hardware-partitioned) or time-slicing/MPS (software-shared) — lets you sell a slice of a single GPU, which is the only way to make small-inference and serverless economics work, but it is the weakest boundary on every axis.
The consequence to name explicitly: fractional sharing is a billing enabler and a security liability at the same time. MIG partitions are hardware-enforced and the strongest of the sub-GPU options, but time-slicing and MPS are not robust confidentiality boundaries: documented covert and side channels bypass MPS/MIG isolation, and real vGPU CVEs (e.g. the 2025 NVIDIA vGPU advisories) have shown that partitioning is not, by itself, a security boundary you can sell to an adversarial multi-tenant workload. What you owe your tenants therefore depends on who they are. If they are mutually untrusting (a public neocloud), you owe them either bare-metal/VM isolation or hardware-attested confidential computing, and you give up the finest billing granularity. If they are one organization (an internal platform), fractional sharing is the right efficiency lever and the security objection mostly evaporates. → the full isolation engineering and its failure modes are Chapter 10.3; model/weight-in-use protection is Chapter 11.8.
| Isolation model | Security boundary | Noisy-neighbor | Billing granularity | Adversarial-safe? |
|---|---|---|---|---|
| Bare-metal / dedicated node | Strongest (physical) | None | Whole node/cluster | Yes |
| VM passthrough | Strong (hypervisor) | Fabric/storage contention | Per-GPU / per-VM | Yes, with fabric isolation |
| MIG (hardware partition) | Moderate (HW-enforced) | Low (partitioned) | GPU fraction (fixed slices) | Caution — side channels demonstrated |
| Time-slicing / MPS | Weakest (software) | High (shared SMs) | Finest (arbitrary share) | No — not a confidentiality boundary |
| Confidential computing (TEE) | Strong (attested, in-use) | Per underlying model | Per attested instance | Yes — designed for it |
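The table above can be read as a policy: given whether tenants trust each other, which isolation models may be offered. The sketch below encodes that reading. The class, field names, and model identifiers are hypothetical, and treating MIG's "Caution" rating as not adversarial-safe is a conservative assumption of this sketch rather than a claim from the table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationModel:
    name: str
    adversarial_safe: bool   # acceptable for mutually untrusting tenants
    sub_gpu_billing: bool    # can bill a fraction of one GPU


MODELS = [
    IsolationModel("bare-metal", adversarial_safe=True, sub_gpu_billing=False),
    # "Yes, with fabric isolation" in the table; fabric isolation is assumed.
    IsolationModel("vm-passthrough", adversarial_safe=True, sub_gpu_billing=False),
    # Table says "Caution"; treated as unsafe here (conservative assumption).
    IsolationModel("mig", adversarial_safe=False, sub_gpu_billing=True),
    IsolationModel("time-slicing-mps", adversarial_safe=False, sub_gpu_billing=True),
    IsolationModel("confidential-computing", adversarial_safe=True, sub_gpu_billing=False),
]


def allowed_models(tenants_mutually_untrusting: bool) -> list[str]:
    """Return the isolation models that may be offered to this tenant mix."""
    if tenants_mutually_untrusting:
        return [m.name for m in MODELS if m.adversarial_safe]
    # A single organization can use every model, including fractional sharing.
    return [m.name for m in MODELS]


print(allowed_models(tenants_mutually_untrusting=True))
print(allowed_models(tenants_mutually_untrusting=False))
```

Under this encoding, the untrusting-tenant list contains no model with sub-GPU billing, which is the trade stated above: safety for adversarial tenants costs the finest billing granularity.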
The control plane, API and provisioning surface
Productization is, concretely, the act of putting an API in front of the cluster so a customer can self-serve the lifecycle — request capacity, provision an environment, run a job, observe it, tear it down — without a human in the loop. The control plane that backs that API is the difference between a colo full of GPUs and a cloud. Below the API sits the bring-up automation from Chapter 10.5 (Redfish/IPMI, PXE, image pipelines, infrastructure-as-code), the scheduler from Chapter 10.1 (Slurm or Kubernetes, with fair-share, quota, and preemption), and the telemetry from Chapter 10.6 feeding the meter and the health dashboard. The product decision is how much of that surface you expose: a bare-metal vendor exposes node lifecycle; a managed vendor exposes a job-and-cluster abstraction and hides the nodes; a serverless vendor exposes only a function or an inference endpoint.
The consequence of getting the control plane wrong is measured in operational labor that should have been software. Every manual step between 'customer clicks provision' and 'job is running' is a margin leak and a TTFJ penalty; the maturity benchmark in the market (SemiAnalysis ClusterMAX) scores providers in part on exactly this: orchestration, lifecycle automation, and observability are explicit rated dimensions. A neocloud that brings up nodes by hand cannot hit a serverless cold-start budget or a managed-Slurm SLA. The control plane is where the productization either compounds (software that serves the next 1,000 customers at near-zero marginal cost) or fails to (a services business that scales its NOC headcount linearly with revenue).
Deep dive: the API surface a credible GPU product must expose
A productized control plane is not one API but a layered set, and a buyer evaluates each layer. Identity and tenancy: per-tenant projects/namespaces, RBAC, quotas, and budget caps — the substrate of multi-tenancy and the thing that stops one tenant from spending another's capacity. Capacity: request, reserve, and release GPUs against on-demand, spot, and reserved pools, with the reservation model exposed as a first-class object (Google Cloud, instructively, makes capacity reservations distinct API objects from committed-use discounts — separating 'is the hardware held for me' from 'have I committed to pay'). Lifecycle: provision an environment (image, drivers, NCCL, scheduler), run/checkpoint/resume a job, drain/replace a faulty node — ideally declaratively via Terraform-style IaC so the customer can version their cluster. Observability and metering: per-job and per-tenant utilization, goodput/health signals, and a usage feed that the billing engine consumes. Egress and data: object and high-throughput storage, dataset staging, and — critically for the offboarding story — bulk export.
The design tension is abstraction versus control. Training customers running 3D-parallel jobs want low-level control: topology-aware placement, specific NCCL/fabric tuning, bare-metal performance with no hypervisor tax. Inference and app-layer customers want the opposite: a high abstraction that hides nodes entirely. A single product cannot be maximally low-level and maximally abstracted at once, which is the deeper reason the value-stack ladder exists as separate rungs rather than a single dial. Most credible operators ship two or three rungs (e.g. bare-metal/Slurm for training tenants, serverless endpoints for inference tenants) on one underlying fleet, and let the customer pick the abstraction that matches the workload. → scheduling plane in Chapter 10.1; bring-up IaC in Chapter 10.5.
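One way to picture the identity and capacity layers described above is as a handful of separate objects, with the hardware hold and the spend commitment kept distinct. The sketch below is a minimal data model under that assumption. All class and field names are hypothetical illustrations, not any provider's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class Pool(Enum):
    ON_DEMAND = "on-demand"
    SPOT = "spot"
    RESERVED = "reserved"


@dataclass
class CapacityReservation:
    """Answers 'is the hardware held for me?' and says nothing about price."""
    tenant: str
    gpu_count: int
    pool: Pool


@dataclass
class SpendCommitment:
    """Answers 'have I committed to pay?' and says nothing about which GPUs."""
    tenant: str
    gpu_hours_per_month: float
    term_months: int


@dataclass
class TenantQuota:
    """Per-tenant cap that stops one tenant consuming another's capacity."""
    tenant: str
    max_gpus: int

    def admits(self, gpus_in_use: int, request: CapacityReservation) -> bool:
        """True if the request fits under this tenant's GPU cap."""
        return gpus_in_use + request.gpu_count <= self.max_gpus


quota = TenantQuota(tenant="acme", max_gpus=64)
request = CapacityReservation(tenant="acme", gpu_count=32, pool=Pool.RESERVED)
print(quota.admits(gpus_in_use=16, request=request))  # fits under the cap
print(quota.admits(gpus_in_use=40, request=request))  # would exceed the cap
```

Keeping `CapacityReservation` and `SpendCommitment` as separate types means a tenant can hold hardware without a term discount, or commit spend without pinning specific capacity, which is the separation the text attributes to mature platforms.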
SLAs and reliability commitments: facility uptime vs goodput
Here is the commercial trap that catches new operators. The easy SLA to write is facility availability — 'your node will be powered and reachable 99.9% of the month' — because it maps to the uptime tiers everyone already understands (Tier III ~99.982%, ~1.6 hr/yr down; Tier IV ~99.995%, ~26 min/yr; Uptime Institute). The SLA the customer actually values is goodput — the fraction of contracted GPU-time that produced useful work. For a training tenant, a node that is 'available' but flapping, throttling, or failing NCCL all-reduce is worthless; what they bought was effective training time, and the industry now measures it: goodput averages ~90% across operators, with best-in-class clusters marketed near ~96% (SemiAnalysis ClusterMAX; Google Cloud goodput definition). The gap between those two SLA shapes is the gap between selling capacity and selling outcomes.
The trade is direct: the SLA shape you choose dictates which failures you pay for and how much you must invest to avoid them. Commit only to facility availability and your obligation is met by redundant power and cooling; the customer eats the goodput loss from a bad GPU or a fabric brownout, and you compete on price. Commit to goodput and you must build the whole fault-tolerance loop — active health-checks, fast node draining, automatic restart-from-checkpoint, burn-in and acceptance testing — because every percentage point of badput is now a credit you owe. The reliability overhead to hold goodput high runs an estimated 6–21% of TCO; that spend is optional for a capacity vendor and mandatory for a goodput vendor. The credit schedule is where this gets priced: tiered service credits (e.g. 10% of the affected charges below 99.9%, escalating with the miss) must be calibrated against your actual goodput distribution, or a single bad month of badput erases a quarter of margin. → goodput-vs-availability engineering in Chapter 12.2; the contract mechanics in Chapter 12.4; the fault-tolerance loop in Chapter 10.7.
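The arithmetic behind the availability tiers and a tiered credit schedule can be sketched in a few lines. The downtime conversion uses the percentages quoted above. In the credit schedule only the first tier (10% of affected charges below 99.9%) comes from the text; the deeper tiers are hypothetical placeholders that a real schedule would calibrate against the operator's measured goodput distribution.

```python
HOURS_PER_YEAR = 8760


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# (threshold %, credit as a fraction of affected charges), strictest first.
# Only the 99.9% -> 10% tier is from the text; the rest are assumptions.
CREDIT_TIERS = [(99.9, 0.10), (99.0, 0.25), (95.0, 0.50)]


def service_credit(measured_pct: float, affected_charges: float) -> float:
    """Credit owed for the period: the deepest tier the measurement falls below."""
    fraction = 0.0
    for threshold, credit in CREDIT_TIERS:
        if measured_pct < threshold:
            fraction = credit  # later tiers are deeper misses, so they override
    return affected_charges * fraction


print(f"Tier III: {annual_downtime_hours(99.982):.2f} hr/yr")
print(f"Tier IV:  {annual_downtime_hours(99.995) * 60:.1f} min/yr")
print(f"credit at 99.5%: ${service_credit(99.5, 100_000):,.0f}")
print(f"credit at 98.0%: ${service_credit(98.0, 100_000):,.0f}")
```

The same function works whether `measured_pct` is facility availability or goodput; what changes is how often the operator lands below the first threshold, which is why the schedule has to be calibrated against whichever metric the SLA names.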
Billing, metering and capacity reservation
The billing model is where the consumption decision becomes cash, and it spans a spectrum from pure-spot to take-or-pay that mirrors the value-stack ladder in a second dimension. On-demand bills per second or per hour at the highest unit rate, with no commitment — the customer pays for optionality and you carry the utilization risk of an unfilled fleet. Spot / preemptible sells interruptible capacity at a steep discount (often 60–80% off on-demand) to drain idle inventory; the customer carries the interruption risk, which is acceptable for checkpoint-tolerant training and batch but fatal for online inference. Reserved commits the customer to a term (months to years) for a discount and a capacity guarantee — and, importantly, capacity reservation and the price commitment are separable: holding hardware for a tenant ('is it mine?') is a different promise from committing to pay for it ('have I agreed the spend?'), and mature platforms expose them as distinct objects. At the far end, take-or-pay obligates the customer to pay for a contracted block whether or not they use it — the structure that converts merchant utilization risk into contracted revenue, and the one that underwrites the debt capacity behind the build.
The consequence the operator must internalize: the on-demand/spot/reserved/take-or-pay split is a risk-transfer dial, and where you set it determines your debt capacity and your survival of a downturn. A fleet sold entirely on-demand is maximally flexible for customers and maximally exposed for you — you sit directly on the ~70% breakeven utilization cliff (a 1,024-GPU cluster swings from −$330k/month at 55% utilization to +$340k at 85%; the ~70% figure is contested and single-source). A fleet anchored by take-or-pay has transferred that utilization risk to tenants, which is exactly why lenders price GPU-backed debt against contracted backlog rather than merchant hope. The trap on the take-or-pay side is concentration: a backlog that is take-or-pay but lives in two or three anchor tenants converts utilization risk into counterparty risk, and one non-renewal can strand a campus. The metering substrate underneath all of this — per-second granularity, per-token billing for inference, transparent egress, no surprise fees — is itself a competitive axis; ClusterMAX scores pricing transparency as a rated dimension precisely because hidden egress and rounding games erode trust. → the firm-level unit economics of this revenue (metered $/token and $/GPU-hr → margin) are worked in Chapter 1.8.
| Model | Commitment | Indicative discount vs on-demand | Utilization risk | Best fit |
|---|---|---|---|---|
| On-demand | None | 0% (baseline) | Operator | Bursty, unpredictable demand |
| Spot / preemptible | None (interruptible) | ~60–80% off | Customer (interruption) | Checkpoint-tolerant training, batch |
| Reserved (term) | Months-years | ~30–60% off | Shared | Steady, forecastable workloads |
| Take-or-pay | Pay-regardless block | Largest, plus capacity guarantee | Customer (transferred) | Anchor tenants; debt-underwriting |
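The breakeven figure quoted above can be checked against the two profit-and-loss points given for the 1,024-GPU cluster. The sketch below interpolates between them. It assumes monthly P&L is linear in utilization between the two points, which is a simplification introduced here; the source gives only the endpoints and flags the ~70% figure as contested.

```python
def breakeven_utilization(u_low: float, pnl_low: float,
                          u_high: float, pnl_high: float) -> float:
    """Utilization at which monthly P&L crosses zero, assuming P&L is
    linear in utilization between the two observed points."""
    slope = (pnl_high - pnl_low) / (u_high - u_low)  # $ per unit of utilization
    return u_low - pnl_low / slope


# The two points quoted in the text for a 1,024-GPU cluster.
breakeven = breakeven_utilization(0.55, -330_000, 0.85, 340_000)
print(f"breakeven utilization: {breakeven:.1%}")
```

Under the linear assumption the two quoted points are consistent with a breakeven just under 70%, and each point of utilization is worth roughly $22k a month to this cluster, which is the size of the risk a take-or-pay contract moves to the tenant.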
Onboarding, time-to-first-job and the productization lifecycle
Time-to-first-job (TTFJ) is the conversion metric of the entire business — the wall-clock from 'customer signs up' to 'customer's first useful GPU job is running' — and it is where the productization either delights or dies. On a bare-metal rung, TTFJ is dominated by node provisioning, image deployment, driver/NCCL setup, and acceptance burn-in; a manual operator measures this in days, an automated one in hours, and the gap is pure product quality. On a serverless rung, TTFJ collapses to cold-start latency — the seconds from request to first token — and the numbers are now precise: warm-pool serverless H100 endpoints hit single-digit-to-low-teens seconds to first token, cold (scale-from-zero) containers run 30–90 seconds depending on image size, and snapshot/weights-in-VRAM tricks claim up to ~10x cold-start improvements (Modal/RunPod comparisons, 2025–2026). The same metric, TTFJ, has completely different physics at the two ends of the ladder.
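On the bare-metal rung, TTFJ is a sum of stage durations, so it helps to instrument each stage and see which one dominates. The sketch below does that for the four stages named above. The stage names follow the text; the durations are hypothetical placeholders for an automated pipeline, not measurements.

```python
from __future__ import annotations

from datetime import timedelta


def time_to_first_job(stage_durations: dict[str, timedelta]) -> timedelta:
    """Total wall-clock from sign-up to first running job."""
    return sum(stage_durations.values(), timedelta())


def dominant_stage(stage_durations: dict[str, timedelta]) -> str:
    """Name of the stage that contributes the most wall-clock time."""
    return max(stage_durations, key=stage_durations.get)


# Hypothetical durations for an automated bare-metal onboarding (assumptions).
AUTOMATED_BARE_METAL = {
    "node_provisioning": timedelta(minutes=20),
    "image_deployment": timedelta(minutes=15),
    "driver_nccl_setup": timedelta(minutes=10),
    "acceptance_burn_in": timedelta(minutes=120),
}

print("TTFJ:", time_to_first_job(AUTOMATED_BARE_METAL))
print("dominant stage:", dominant_stage(AUTOMATED_BARE_METAL))
```

Tracking the dominant stage tells you where the next hour of automation effort buys the most TTFJ; with these placeholder numbers it is the burn-in, not provisioning.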
The consequence is that onboarding automation is not a nicety; it is the marginal-cost structure of the product. Each manual onboarding step caps how many customers one engineer can land, which caps growth at the speed of hiring. The fix is the same automation that backs the control plane — IaC-driven provisioning, self-serve quota/RBAC, pre-validated stacks, and acceptance tests that run without a human — so that the 1,000th customer onboards as cheaply as the first. The productization lifecycle that follows TTFJ is the long tail: usage growth, expansion to reserved/committed capacity, support tiers, and the metering that bills it all. But the lifecycle never starts if TTFJ is bad, which is why it is the first thing a sophisticated buyer tests with a trial workload before committing budget.
And then there is the decision most operators underweight and the best buyers check first: offboarding and portability. A customer evaluating a multi-year, take-or-pay GPU commitment is implicitly asking 'how hard is it to leave?' — because the answer prices their lock-in risk into the deal. The portability story has three parts: data egress (can I bulk-export my datasets and checkpoints without a punitive egress bill or a multi-week throughput bottleneck?), stack portability (is my orchestration standard — vanilla Slurm/Kubernetes, OCI containers — or a proprietary control plane that traps my workflows?), and commitment exit (does my reserved/take-or-pay block have real termination economics or a secondary market, or am I locked to a depreciating asset for the full term?). The conclusion is counterintuitive: a credible, low-friction exit story is a sales asset, not a risk. The operator who builds on open standards and clean egress wins the cautious enterprise buyer precisely because the lack of lock-in lowers the buyer's perceived risk of signing — and a secondary market for reserved blocks (an emerging structure for reserved-capacity trading) turns a take-or-pay commitment from a trap into a liquid, transferable asset. The operator who maximizes lock-in wins the term but loses the next deal to a competitor the buyer trusts more.
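The data-egress question has a quick arithmetic form: how long does a bulk export take at the throughput the operator sustains, and what does it cost? The sketch below computes both. The dataset size, throughputs, and per-GB fee are hypothetical inputs a buyer would replace with figures from the contract, and decimal units (1 TB = 1,000 GB) are assumed.

```python
SECONDS_PER_DAY = 86_400


def egress_days(dataset_tb: float, sustained_gbps: float) -> float:
    """Days to export a dataset at a sustained network throughput.
    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s."""
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (sustained_gbps * 1e9)
    return seconds / SECONDS_PER_DAY


def egress_cost(dataset_tb: float, fee_per_gb: float) -> float:
    """Total egress fee in dollars at a flat per-GB rate."""
    return dataset_tb * 1000 * fee_per_gb


# Hypothetical buyer scenario: 500 TB of datasets and checkpoints.
print(f"at 10 Gbps: {egress_days(500, 10):.1f} days")
print(f"at 2 Gbps:  {egress_days(500, 2):.1f} days")
print(f"fee at $0.05/GB: ${egress_cost(500, 0.05):,.0f}")
```

With these placeholder inputs, the difference between a days-long and a multi-week exit is entirely the sustained throughput, which is why a buyer should ask for it as a contractual number rather than a port speed.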
The unit-economics tie-back
Everything in this chapter resolves into a single revenue line that Chapter 1.8 turns into a return. The value-stack rung sets your price per unit; the isolation model sets your finest billable granularity; the SLA shape sets your reliability spend and your credit exposure; the capacity model sets how much of your revenue is contracted versus merchant; and TTFJ sets your conversion. Together they determine metered revenue per token and per GPU-hour, which — net of the cost stack from Part 5 through Part 9 and the depreciation debate from Chapter 1.8 — is the gross margin the asset earns. The clean way to read this chapter is as the revenue-side complement to 1.8's cost-side analysis: 1.8 asks whether the asset earns its cost of capital; this chapter decides how the revenue that feeds that question actually gets priced, packaged, metered, and collected. The recurring warning from 1.8 applies in full here: underwrite the metered revenue with an explicit price-decline curve (inference $/token has fallen ~10x/yr at fixed quality), never a flat line, because the product you price profitably today is re-priced by a cheaper rung or a more efficient serving stack within quarters. → inference serving efficiency, the governor of $/token, is engineered in Chapter 10.11; data-handling obligations that ride alongside the commercial contract are Chapter 10.10.
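The warning about flat-line underwriting can be quantified with a simple decline curve. The sketch below compares twelve months of revenue at a flat $/token against the same volume priced on a curve that falls 10x per year, the rate cited above. Holding token volume constant is an assumption made only to isolate the price effect; in practice volume would likely move as price falls.

```python
def price_after(p0: float, years: float, annual_decline_factor: float = 10.0) -> float:
    """Price after `years`, falling by `annual_decline_factor` each year."""
    return p0 / (annual_decline_factor ** years)


def revenue_over_months(p0: float, tokens_per_month: float, months: int,
                        annual_decline_factor: float = 1.0) -> float:
    """Sum of monthly revenue, repricing at the start of each month.
    A decline factor of 1.0 reproduces the flat-line case."""
    return sum(
        price_after(p0, m / 12, annual_decline_factor) * tokens_per_month
        for m in range(months)
    )


# Hypothetical starting price and volume (assumptions for illustration).
P0 = 1.0e-6          # $ per token
VOLUME = 1.0e12      # tokens per month, held constant

flat = revenue_over_months(P0, VOLUME, 12)
declining = revenue_over_months(P0, VOLUME, 12, annual_decline_factor=10.0)
print(f"flat-line revenue:     ${flat:,.0f}")
print(f"decline-curve revenue: ${declining:,.0f}")
print(f"ratio: {declining / flat:.0%}")
```

At constant volume, the decline curve collects well under half of what the flat line projects over the first year, so a model underwritten on the flat line overstates revenue by more than a factor of two.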