
Chapter 10.10

Data Governance, Privacy & the Training-Data Legal Regime

The legal and privacy posture of the data flowing through an AI data center is not a paperwork problem bolted on at the end — it is an architecture decision that, made wrong, can require you to retrain a model, geo-fence a hall, or hand twenty million customer conversations to opposing counsel.


What you'll decide here

Which provenance tier your training corpus is built on — fully licensed, lawfully-acquired-and-transformed, or scraped-and-hope — because that single choice sets your settlement exposure, your EU disclosure obligations, and whether a single court ruling can strand the weights you spent nine figures training.
Whether your data-residency commitment to customers is logical (contractual, software-enforced) or physical (the bytes never leave the jurisdiction) — and therefore whether one workload can run in a shared global fleet or must be pinned to in-region silicon.
How you will honor a deletion / right-to-erasure request that touches data already baked into model weights — the retention architecture, the lineage graph, and the retrain-vs-filter fallback you commit to before the first request arrives.
Where the controller/processor line sits for every tenant: what you may and may not do with customer prompts and outputs, who your sub-processors are, and how tenant data is isolated so one customer's data never trains a model that serves another.
Which decisions are data-governance (this chapter) versus model/weight security (Chapter 11.8) — because conflating 'protect the data' with 'protect the weights' produces controls that satisfy neither auditor.

An AI data center is a machine for turning data into a model and then turning user data into outputs. Every other part of this guide treats data as a payload — bytes to store, move, and compute on. This chapter treats data as a legal and regulated object: something with a provenance, an owner, a jurisdiction, a retention clock, and a set of rights attached to it that the operator inherits whether or not anyone scoped for them. The decisions here are not compliance theater. In 2025 they became the difference between a defensible asset and a stranded one: a model trained on the wrong corpus can be the subject of the largest copyright settlement in US history; a customer-data hall sited in the wrong country can be uncontractable; a deletion request you cannot honor can be a regulatory finding.

We work through five governance surfaces — training-data provenance and the copyright/fair-use frontier; the PII and privacy program; data-residency-by-jurisdiction and cross-border transfer; retention, deletion and the right-to-erasure plumbing; and the customer-facing contract stack (DPAs, sub-processors, tenant isolation). Most of these are architecture decisions, not legal ones. You cannot bolt residency onto a global fleet, or erasure onto a model with no lineage graph, after the concrete is poured and the weights are trained. This chapter governs the data; the companion problem of protecting the model and its weights — at rest, in transit, in confidential execution — is a different threat model handled in Chapter 11.8.

The master fork: provenance tier of the training corpus

Choose the provenance tier of your corpus before you crawl a single page. Every training corpus sits in one of three provenance tiers, and the tier you pick is the dominant lever on your legal exposure, your disclosure burden, and the durability of the weights.

Tier 1, fully licensed: every source under a license or contract that explicitly permits AI training. Highest data cost, lowest legal risk, cleanest EU disclosure.
Tier 2, lawfully acquired and transformed: works you bought, scanned, or accessed legitimately and used for a transformative purpose, the posture US courts have so far blessed as fair use.
Tier 3, scraped-and-hope: bulk web crawl and shadow-library downloads, betting that fair use or obscurity will hold.

In June 2025 a US court drew the line precisely on the Tier 2 / Tier 3 boundary: training on lawfully-acquired books was "spectacularly" transformative fair use, but downloading pirated copies to build the library was infringement. The bill for crossing into Tier 3 was a $1.5 billion settlement, ~$3,000 per work, plus an order to destroy the pirated copies (Bartz v. Anthropic, N.D. Cal., 2025). The tier is chosen at corpus-assembly time, it is effectively irreversible once the weights are trained, and it cannot be cured by a content filter at the output. Decide it before you crawl.

Training-data provenance and the copyright/fair-use frontier

The legal regime around training data crystallized in 2025 from speculation into case law, and the shape it took is jurisdiction-specific in a way that breaks any single global policy. The operator's job is to understand the fork between three regimes and build a corpus that survives the most demanding one its model will be deployed under.

The US: fair use, but the acquisition matters. The watershed is Bartz v. Anthropic. Judge Alsup's June 2025 summary-judgment ruling split the question in two. Using lawfully-acquired books to train an LLM that returns new text was held "among the most transformative" uses imaginable — fair use under §107. But Anthropic had also downloaded millions of books from shadow libraries (LibGen, PiLiMi) to build a permanent "central library," and that acquisition was not excused by the downstream transformative training. The September 2025 settlement — a minimum of $1.5 billion across roughly 500,000 works, plus destruction of the pirated copies — is the most expensive lesson in the industry: fair use can protect the training act while the data-acquisition act remains independently infringing. The consequence for an operator is concrete: provenance is per-document, not per-corpus, and the acquisition method is a load-bearing fact you must be able to prove for every source.

The UK: copies, not weights. In Getty Images v. Stability AI (UK High Court, November 2025), Getty abandoned its primary training-copy claims mid-trial (those copies were made outside the UK, beyond jurisdiction) and lost on secondary infringement: the court held that AI model weights are not an "infringing copy" under the CDPA, because the weights are statistically trained parameters, not stored or reconstructable copies of the images. The narrow win that survived was a trademark finding for generated images bearing Getty's watermark. The lesson: in the UK the model weights themselves are not contraband, which shifts the fight to where the training copies were physically made, a siting question as much as a legal one.

The EU: a statutory regime, not a fair-use doctrine. The EU has no fair use. Training on copyrighted works relies on the text-and-data-mining (TDM) exception of the 2019 DSM Directive, which lets rightsholders opt out via machine-readable reservations (robots.txt, TDM Reservation Protocol, metadata). Honor the opt-out and the mining is lawful; ignore it and the exception evaporates. Layered on top, the EU AI Act's GPAI obligations (applicable to new models since 2 August 2025) require every general-purpose model provider to (a) maintain a copyright policy that respects opt-outs, and (b) publish a public summary of training content using the Commission's mandatory template (finalized July 2025). The disclosure is tiered: thin for licensed and private data, detailed for publicly-available datasets. This transparency obligation has no direct counterpart in the other regimes covered here, and it makes "we don't disclose our training data" a non-option for any model placed on the EU market.
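The opt-out check described above can be sketched with Python's standard-library robots.txt parser. This is a minimal illustration, not a complete TDM compliance check: the crawler token `example-ai-trainer` is hypothetical, and the `tdm-reservation` header name and its `1` value are an assumption about how a TDM Reservation Protocol signal might arrive, which should be verified against the protocol text before use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler token, used only for this illustration.
CRAWLER_AGENT = "example-ai-trainer"


def robots_allows(robots_txt: str, url: str, agent: str = CRAWLER_AGENT) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def may_mine(robots_txt: str, headers: dict, url: str) -> bool:
    """Allow mining only when no machine-readable reservation is found."""
    # Assumption: a "tdm-reservation: 1" response header signals an opt-out,
    # and header names have already been lower-cased by the caller.
    reserved = headers.get("tdm-reservation", "0").strip() == "1"
    return robots_allows(robots_txt, url) and not reserved


# Example robots.txt that reserves /books/ against the hypothetical crawler.
ROBOTS = """\
User-agent: example-ai-trainer
Disallow: /books/

User-agent: *
Allow: /
"""
```

With this example file, `may_mine(ROBOTS, {}, "https://example.org/books/1")` is refused by the robots.txt rule, and any URL is refused when the response carries the assumed reservation header. Recording the outcome of each check alongside the document is what later makes the opt-out status provable.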

The training-data legal regime by jurisdiction (the fork that breaks a single global policy)

Jurisdiction	Governing doctrine	What it permits	What it forbids / requires	Operator consequence
United States	Fair use (§107), fact-specific	Transformative training on lawfully-acquired works (Bartz v. Anthropic, 2025)	No fair-use shield for the acquisition of pirated copies; per-document provenance	Prove lawful acquisition for every source; shadow-library data is a settlement-scale liability ($1.5B / ~$3k per work)
European Union	DSM TDM exception + AI Act GPAI rules	Mining works whose rightsholders have not opted out	Must honor machine-readable opt-outs; must publish training-content summary (template, since Aug 2025)	Crawler must respect robots.txt / TDM reservations; disclosure is mandatory, not optional
United Kingdom	CDPA; TDM only for non-commercial	Narrow research TDM; weights are not an infringing 'copy' (Getty v. Stability, 2025)	No commercial TDM exception; physical location of copy-making matters	Where training copies are made is a jurisdictional fact; weights themselves are not contraband
Japan	Art. 30-4 Copyright Act	Broad permission to use works for ML / information analysis	Limited where use unreasonably prejudices the rightsholder's interests	The most permissive major regime; a siting consideration for corpus assembly

2025-2026 state of the law. The rightmost column is the operator consequence, not legal advice. US fair use is fact-specific and unsettled above the district-court level; EU obligations phase in through 2027.

The defensive artifact that ties this together is dataset documentation — a per-corpus record of source, acquisition method, license terms, opt-out status, collection date, and any filtering applied. Datasheets-for-datasets and data cards are no longer academic hygiene; they are the evidence you produce when a plaintiff's expert asks how a specific book entered your weights, and they are the raw material for the EU training-content summary. An operator who cannot answer "where did this document come from and were we allowed to use it" for an arbitrary training sample has, in effect, chosen Tier 3 by omission.
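One way to picture the per-document record is as a small typed structure plus a function that maps it to a provenance tier. The sketch below is illustrative only: the field names, the `Acquisition` categories, and the mapping rules are assumptions made for this example, not a legal classification, and a real ledger would carry far more evidence per source.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Acquisition(Enum):
    LICENSED_FOR_TRAINING = "licensed_for_training"
    PURCHASED_OR_LAWFUL_ACCESS = "purchased_or_lawful_access"
    WEB_CRAWL = "web_crawl"
    SHADOW_LIBRARY = "shadow_library"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ProvenanceRecord:
    doc_id: str
    source: str
    acquisition: Acquisition
    license_terms: Optional[str]      # reference to the license or contract
    opt_out_reserved: Optional[bool]  # None means the opt-out was never checked
    collected_on: date


def provenance_tier(rec: ProvenanceRecord) -> int:
    """Map one document to tier 1, 2, or 3 (simplified, assumed rules)."""
    if rec.acquisition is Acquisition.LICENSED_FOR_TRAINING and rec.license_terms:
        return 1
    if rec.acquisition is Acquisition.PURCHASED_OR_LAWFUL_ACCESS:
        return 2
    # Crawled, pirated, or undocumented sources all land in tier 3.
    return 3


def corpus_tier(records) -> int:
    """A corpus is only as clean as its worst document."""
    return max(provenance_tier(r) for r in records)


# Hypothetical two-document corpus: one licensed feed, one undocumented dump.
EXAMPLE = [
    ProvenanceRecord("d1", "publisher-feed", Acquisition.LICENSED_FOR_TRAINING,
                     "contract-123", False, date(2025, 3, 1)),
    ProvenanceRecord("d2", "unlabeled-dump", Acquisition.UNKNOWN,
                     None, None, date(2025, 3, 2)),
]
```

Taking the maximum over documents encodes the point that provenance is per-document: a single record with unknown acquisition pulls the whole example corpus into Tier 3.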

PII handling, minimization, and the privacy program

Copyright governs creative works in the corpus; privacy law governs personal data — and the two regimes are orthogonal, so a corpus can be copyright-clean and privacy-dirty at the same time. The binding question under GDPR and its global cousins (CCPA/CPRA, LGPD, PIPL, India's DPDP) is whether you have a lawful basis to process personal data for training, and whether the model that results still "contains" that personal data.

The lawful-basis fork. The EDPB's Opinion 28/2024 (December 2024) set the European frame: training on personal data generally relies on legitimate interest (Art. 6(1)(f)), which must survive a three-step test — a legitimate interest, the necessity of the processing, and a balancing against data-subject rights. The Opinion sets a deliberately high bar for claiming the trained model is "anonymous": you must show that extracting personal data from the model, or regurgitating it, is not reasonably likely. The sting in the tail: if a model was trained on unlawfully-processed personal data, that can taint the lawfulness of deploying the model — unless it has been genuinely anonymized. This is the privacy analogue of the Bartz acquisition problem: an upstream data sin can follow the model downstream.

The enforcement is already here. Italy's Garante fined OpenAI €15 million in December 2024 for training ChatGPT without an adequate legal basis and for inadequate transparency, and ordered a six-month public-awareness campaign. That is the template: regulators are not waiting for the AI Act to mature before applying ordinary GDPR principles to training pipelines.

The engineering response is minimization at ingest, not cleanup at audit. The cheapest place to handle PII is before it enters the corpus: PII detection and redaction in the data pipeline, de-duplication (which both improves the model and reduces the memorization that drives privacy risk), and aggressive filtering of high-sensitivity categories. The expensive place is after training, when removing a person's data may mean retraining. Build the privacy program as a pipeline stage with a data-protection impact assessment (DPIA) for the training run, a record of processing activities (ROPA), and a documented lawful basis — the same artifacts a DPA will ask for, produced before the run rather than after the complaint.
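As a rough sketch of minimization at ingest, the stage below redacts two obvious PII patterns and drops exact duplicates after normalization. The regular expressions are deliberately crude and are an assumption of this example; a production pipeline would use dedicated PII detectors and near-duplicate detection rather than exact hashing.

```python
import hashlib
import re

# Crude illustrative patterns; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def ingest(docs):
    """Redact each document, then keep only the first copy of each duplicate."""
    seen = set()
    kept = []
    for doc in docs:
        clean = redact(doc)
        # Normalize case and whitespace so trivial variants hash identically.
        normalized = " ".join(clean.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(clean)
    return kept


DOCS = [
    "Contact jane@example.com or +1 415 555 0100.",
    "contact  JANE@example.com or +1 415 555 0100.",
    "No personal data here.",
]
```

Redaction runs before hashing on purpose: two records that differ only in the personal data they contain collapse into one, which serves both the de-duplication and the minimization goals.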

Memorization is the bridge between the copyright problem and the privacy problem

The two halves of this chapter, copyright and privacy, share one technical root cause: memorization. A model that regurgitates a training example verbatim is simultaneously a copyright-output risk (it reproduced a protected work) and a privacy risk (it leaked personal data), and it undermines the EDPB's "the model is anonymous" defense in one shot. The levers that suppress memorization largely overlap with the ones that improve the model: de-duplication of the corpus and scale of data relative to parameters, backed by output-side filters. Where the regulated act sits is still contested. The Hamburg DPA's 2024 discussion paper argued that storing an LLM is not itself processing personal data (the weights are not a structured record of individuals), a position that, if it holds, makes the regulated act the training and the output, not the static weight file. Whether that argument survives is one of the load-bearing open questions of 2026; design as if regurgitation is the regulated event, because that is the version of the rule that is hardest to comply with.

Data residency, cross-border transfer, and the controller/processor line

Residency is where data governance becomes a fleet-architecture decision. The question a customer or regulator asks is deceptively simple — "where does our data live and who can reach it?" — and the honest answer determines whether a workload can run on shared global silicon or must be pinned to in-region hardware.

The fork is logical vs physical residency. Logical residency is a contractual and software-enforced promise: data is tagged with a region and the control plane routes it to in-region storage and compute, but the underlying fleet is global and a misconfiguration can leak it across a border. Physical residency is a hard guarantee: the bytes are processed only on silicon physically located in the jurisdiction, enforced by separate clusters, separate key custody, and often separate operational staff. Physical residency is dramatically more expensive — it fragments your fleet, strands capacity in low-utilization regions, and forecloses the global load-balancing that makes inference economics work — but it is what sovereign, government, and regulated-industry customers increasingly require. Choosing physical residency for a workload is choosing a smaller, more expensive, less elastic deployment, and that choice cascades into siting (you need a hall in that jurisdiction) and capacity planning (you cannot pool demand across borders).
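The logical/physical fork can be illustrated as a placement filter over the fleet. This is a sketch under assumed names: `Cluster.dedicated` stands in for the separate hardware, key custody, and staff that physical residency implies, and the fleet shown is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Residency(Enum):
    NONE = "none"          # may run anywhere in the global fleet
    LOGICAL = "logical"    # routed in-region by the control plane
    PHYSICAL = "physical"  # dedicated in-jurisdiction hardware only


@dataclass(frozen=True)
class Cluster:
    name: str
    jurisdiction: str
    dedicated: bool  # separate hardware, key custody, and operations


@dataclass(frozen=True)
class Workload:
    tenant: str
    jurisdiction: str
    residency: Residency


def eligible_clusters(w: Workload, fleet):
    """Return the clusters this workload is allowed to run on."""
    if w.residency is Residency.NONE:
        return list(fleet)
    in_region = [c for c in fleet if c.jurisdiction == w.jurisdiction]
    if w.residency is Residency.PHYSICAL:
        in_region = [c for c in in_region if c.dedicated]
    return in_region


def place(w: Workload, fleet) -> Cluster:
    """Fail closed: refuse placement rather than spill across a border."""
    candidates = eligible_clusters(w, fleet)
    if not candidates:
        raise RuntimeError(f"no compliant capacity for {w.tenant} in {w.jurisdiction}")
    return candidates[0]


FLEET = [
    Cluster("us-shared", "US", dedicated=False),
    Cluster("eu-shared", "EU", dedicated=False),
    Cluster("eu-sovereign", "EU", dedicated=True),
]
```

The shrinking candidate list is the cost the paragraph describes: a physical-residency workload in the EU can use one of the three example clusters, and one in a jurisdiction with no hall cannot be placed at all.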

Cross-border transfer mechanisms. Where data does cross a border, GDPR requires a transfer mechanism: an adequacy decision (the EU-US Data Privacy Framework, for participating US importers), Standard Contractual Clauses with a transfer impact assessment, or Binding Corporate Rules. China's PIPL and India's DPDP add their own localization and transfer-approval regimes that do not map onto the European ones. The operator consequence is that "the model is served from a US region" can be a compliance event for an EU customer's prompts, and the cross-border posture must be designed per-data-class, not per-company. The deeper sovereignty stack — export controls, control-of-stack, and the geopolitics of where compute is allowed to sit — is a siting problem treated in Chapter 3.12; this chapter handles the customer-data residency obligations that sit on top of it.
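Designing the cross-border posture per data class suggests a register keyed by origin, destination, and data class. The sketch below is a hypothetical illustration: the register entries, data-class names, and mechanism labels are invented for the example and say nothing about which mechanism is legally appropriate for a given flow.

```python
from typing import Optional

# Hypothetical register: (origin, destination, data class) -> documented mechanism.
TRANSFER_REGISTER = {
    ("EU", "US", "tenant_prompts"): "SCCs with transfer impact assessment",
    ("EU", "US", "billing_contacts"): "adequacy decision (EU-US DPF)",
}


def transfer_mechanism(origin: str, destination: str, data_class: str,
                       register=TRANSFER_REGISTER) -> Optional[str]:
    """Look up the documented mechanism for one cross-border path, if any."""
    if origin == destination:
        return "in-region"
    return register.get((origin, destination, data_class))


def require_transfer_mechanism(origin: str, destination: str, data_class: str) -> str:
    """Fail closed when a path has no documented mechanism."""
    mechanism = transfer_mechanism(origin, destination, data_class)
    if mechanism is None:
        raise PermissionError(
            f"no transfer mechanism on file for {data_class}: {origin} -> {destination}"
        )
    return mechanism
```

Keying on the data class rather than the company is the design point: in this example, prompts and billing contacts cross the same border under different mechanisms, and an unregistered class is blocked by default.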

The controller/processor line governs what you may do with the data at all. When you serve another company's users, you are typically a processor: you may process their data only on documented instructions, and — critically — you may not use it to train your own models unless the contract explicitly permits it. When you collect data for your own model, you are a controller with the full weight of lawful-basis, transparency, and data-subject-rights obligations. Misplacing this line is the most common governance failure in multi-tenant AI: silently training the foundation model on tenant prompts is a controller act dressed up as a processor convenience, and it is exactly the kind of thing that turns a customer's data into a competitor's model.
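A training pipeline can enforce this line with an explicit eligibility gate in front of the corpus. The following is a minimal sketch with assumed field names (`dpa_permits_training`, `lawful_basis`); it simplifies the legal analysis considerably, and a contract that permits training would in practice still need its own lawful-basis review.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    CONTROLLER = "controller"  # data collected for the operator's own purposes
    PROCESSOR = "processor"    # data handled on a tenant's instructions


@dataclass(frozen=True)
class DataRecord:
    tenant: str
    role: Role
    dpa_permits_training: bool = False
    lawful_basis: Optional[str] = None  # e.g. a reference to the documented basis


def may_train_on(rec: DataRecord) -> bool:
    """Default deny: tenant data trains nothing unless the contract says so."""
    if rec.role is Role.PROCESSOR:
        return rec.dpa_permits_training
    # As controller, require a documented lawful basis before training.
    return rec.lawful_basis is not None


def training_set(records):
    """Filter a candidate batch down to records that pass the gate."""
    return [r for r in records if may_train_on(r)]
```

Because `dpa_permits_training` defaults to `False`, a record with no contractual metadata is excluded, which keeps a missing field from becoming an accidental permission.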

Figure	What it refers to	Date	Source
$1.5B	Bartz v. Anthropic settlement: largest US copyright payout; ~$3,000 per work across ~500,000 works; pirated copies ordered destroyed	2025	Bartz v. Anthropic (N.D. Cal.); Authors Guild; Fortune
€15M	Italian Garante fine on OpenAI for training ChatGPT without adequate legal basis + transparency failures; plus a 6-month awareness campaign	Dec 2024	Garante per la protezione dei dati personali
Aug 2, 2025	EU AI Act GPAI obligations apply to new models; training-content summary template (Commission) mandatory	2025	European Commission; EU AI Act
Aug 2, 2027	Deadline for pre-existing (placed before Aug 2025) GPAI models to publish their training-content summary	2025	European Commission; Mayer Brown analysis
20M logs	ChatGPT conversation logs OpenAI was ordered to produce in NYT v. OpenAI discovery; output logs deemed relevant to fair-use defense	2025	NYT v. OpenAI (S.D.N.Y.); Bloomberg Law
3-step test	EDPB legitimate-interest assessment for AI training (interest, necessity, balancing); high bar to claim a trained model is anonymous	Dec 2024	EDPB Opinion 28/2024
weights ≠ copy	UK High Court: AI model weights are not an infringing 'copy' under the CDPA (statistical parameters, not stored images)	Nov 2025	Getty Images v. Stability AI (UK High Court)
opt-out	EU DSM TDM exception is opt-out by default; crawlers must honor machine-readable reservations (robots.txt / TDM Reservation Protocol)	2025	EU DSM Directive 2019/790, Art. 4; AI Act Code of Practice

Retention, deletion, and the right-to-erasure plumbing

The right to erasure (GDPR Art. 17, mirrored by CCPA deletion rights and others) is the governance obligation that AI architecture handles worst, because the naive implementation — "delete the row" — does not reach the place the data actually lives. Personal data in an AI system exists in at least four locations, and a deletion request must be reasoned about for each: (1) the operational store (databases, prompt logs, vector indexes), (2) the training corpus (the dataset snapshot the model was trained from), (3) the model weights (where the data may have been memorized), and (4) downstream artifacts (caches, backups, derived datasets, fine-tunes).

Locations (1) and (2) are tractable with disciplined plumbing. Location (3) is the hard one, and it is where the fork lives. You cannot surgically delete an individual's contribution from a trained weight tensor — the data is distributed across billions of parameters in a way that is not addressable. The realistic responses are three, in increasing order of cost and decreasing order of frequency:

Filter at output. Suppress regurgitation of the specific data via output filters and guardrails. Cheap, fast, and the de facto first line, but it does not remove the data from the model, and a regulator may not accept it as true erasure.
Machine unlearning. Apply an algorithm that approximately removes a data point's influence without a full retrain. An active research area as of 2026, not yet a turnkey production control, and hard to prove to an auditor's satisfaction.
Retrain or fine-tune away. Exclude the data from the corpus and retrain — the only response that unambiguously satisfies erasure, and the most expensive. This is why erasure obligations push toward not memorizing in the first place (de-duplication, minimization) and toward retention windows that bound how long raw data is kept before the corpus is frozen.

The architecture that makes any of this possible is data lineage — a graph that records, for every model and dataset, which sources fed it, when, and under what basis. Lineage is not a compliance nicety; it is the artifact that answers "is this person's data in this model?" and therefore "what is the cheapest valid response to their erasure request?" Without lineage, every erasure request is unanswerable and every audit is a fishing expedition. With it, you can scope the blast radius of a deletion to the smallest set of artifacts that must change.
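The blast-radius idea can be sketched as a breadth-first walk over a lineage graph whose edges point downstream, from a data subject to every artifact derived from their data. The node names and graph below are hypothetical; a real lineage store would also record dates and legal basis on each edge.

```python
from collections import deque

# Hypothetical lineage: each node maps to the artifacts derived from it.
LINEAGE = {
    "subject:alice": ["dataset:support-logs-2025q1"],
    "dataset:support-logs-2025q1": ["dataset:corpus-v7", "index:support-rag"],
    "dataset:corpus-v7": ["model:base-v7"],
    "model:base-v7": ["model:base-v7-ft-acme"],
}


def blast_radius(graph, start):
    """Return every artifact reachable downstream of `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def by_kind(artifacts):
    """Group artifact ids by their prefix (dataset, index, model)."""
    groups = {}
    for artifact in sorted(artifacts):
        kind = artifact.split(":", 1)[0]
        groups.setdefault(kind, []).append(artifact)
    return groups
```

Grouping the result by kind separates the artifacts that can be deleted directly (datasets, indexes) from the models that need one of the three responses listed earlier. A subject absent from the graph yields an empty set, which is itself a useful answer to an erasure request.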

Your logs are discoverable — retention is a litigation surface, not just a storage cost

The instinct to retain everything "in case we need it" is now a quantified liability. In NYT v. OpenAI, the court ordered OpenAI to produce 20 million ChatGPT conversation logs in discovery — and crucially held that output logs are relevant to the fair-use defense even when they do not reproduce a plaintiff's work, because the pattern of outputs bears on whether the model substitutes for the copyrighted material. The lesson for any operator: prompt and output logs you retain are discoverable evidence, subject to litigation holds that override your normal deletion schedule, and a target for subpoena. This collides head-on with the privacy program, which wants short retention and aggressive deletion. The reconciliation is a deliberate, documented retention policy — minimal retention by default, defensible legal-hold exceptions, and a deletion schedule you actually execute — not an accidental "keep forever" that becomes a 20-million-record disclosure. Retention is a fork with two expensive wrong answers: keep too much and it is discoverable; keep too little and you cannot honor a legal hold.
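The reconciliation described above, deletion by default with legal holds as documented exceptions, might look like the following sweep. It is a sketch under stated assumptions: the 30-day window is an arbitrary placeholder, and holds are modeled per tenant purely for brevity, whereas real holds are usually scoped by matter, custodian, and date range.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical default window


@dataclass(frozen=True)
class LogRecord:
    record_id: str
    tenant: str
    created_at: datetime


def due_for_deletion(rec, now, held_tenants, retention=RETENTION) -> bool:
    """A record is deleted once it ages out, unless a legal hold covers it."""
    if rec.tenant in held_tenants:
        return False  # the hold overrides the normal schedule
    return now - rec.created_at >= retention


def sweep(records, now, held_tenants):
    """Return the ids the retention engine should delete on this run."""
    return [r.record_id for r in records if due_for_deletion(r, now, held_tenants)]


NOW = datetime(2026, 2, 15, tzinfo=timezone.utc)
RECORDS = [
    LogRecord("r1", "acme", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    LogRecord("r2", "globex", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    LogRecord("r3", "acme", datetime(2026, 2, 10, tzinfo=timezone.utc)),
]
```

In this example, `sweep(RECORDS, NOW, held_tenants={"globex"})` selects only the aged-out record of the tenant with no hold. Logging each sweep and each hold gives the documented, executed schedule the paragraph calls for.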

DPAs, sub-processors, and tenant data isolation

The contract stack is where governance becomes enforceable against the operator. For every tenant whose data you process, a Data Processing Agreement (GDPR Art. 28) defines the controller/processor relationship: the scope and purpose of processing, the prohibition on using data outside instructions (the no-training clause again), the security measures, the data-subject-rights support you must provide, and the breach-notification timeline. The DPA is the document that makes "we will not train on your data" legally binding rather than a marketing claim, and in many enterprise deals the absence of an explicit no-training clause is treated in practice as a silent license to do exactly that, even though a processor may act only on documented instructions.

Sub-processor management is the part operators underestimate. Every party you hand tenant data to — the cloud region, the GPU neocloud you burst to, the vector-database SaaS, the content-moderation API — is a sub-processor, and the DPA typically requires you to maintain a published sub-processor list, give advance notice of changes, flow your obligations down to each one, and remain liable for their conduct. An AI serving stack assembled from a dozen vendors is a dozen sub-processors, and a tenant's right to object to a new one can constrain your own procurement. The governance consequence: your supply chain is part of your compliance surface, and a neocloud burst-out (see Chapter 1.8) is a sub-processor event that must be papered before the traffic flows.
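Papering a sub-processor before the traffic flows can be expressed as a gate in the routing path. The sketch below is hypothetical throughout: the 30-day notice period, the field names, and the per-tenant objection set are assumptions, and the real values come from each tenant's DPA.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

NOTICE_PERIOD = timedelta(days=30)  # hypothetical; set by the actual DPA


@dataclass
class SubProcessor:
    name: str
    obligations_flowed_down: bool       # contract mirrors the DPA obligations
    notified_on: Optional[date] = None  # when tenants were told of the addition
    objections: set = field(default_factory=set)  # tenants that objected


def may_send(tenant: str, sp: SubProcessor, today: date) -> bool:
    """Allow tenant data to reach a vendor only once it is fully papered."""
    if not sp.obligations_flowed_down:
        return False
    if sp.notified_on is None or today - sp.notified_on < NOTICE_PERIOD:
        return False
    return tenant not in sp.objections


BURST = SubProcessor("burst-neocloud", True, date(2026, 1, 1), {"globex"})
```

With this example vendor, a tenant that objected stays blocked even after the notice period, while other tenants become routable once the period has elapsed.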

Tenant data isolation is where data governance meets the multi-tenancy architecture. The governance requirement is simple to state and hard to guarantee: one tenant's data must never be readable by another, never leak through a shared cache or vector index, and never train a model that serves a different tenant. The enforcement mechanisms — namespace isolation, per-tenant encryption keys, isolated KV-cache and embedding stores, and the strict separation of any data used for model improvement — are the data-plane half of a problem whose compute-plane half (MIG, vGPU, confidential computing as security boundaries) lives in Chapter 11.6. The distinction this chapter insists on: isolating tenants is about the data not crossing the boundary; protecting the model weights from extraction is a different objective handled in Chapter 11.8. An operator who treats these as one problem builds controls that satisfy neither the privacy auditor nor the security auditor.
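On the data-plane side, the simplest illustration is a cache whose keys cannot be built without a tenant. The class below is a hypothetical in-memory sketch, not a description of any particular serving stack: a cache keyed on the prompt hash alone would let one tenant hit another's entry, so the tenant is made part of every key.

```python
import hashlib


class TenantScopedCache:
    """In-memory cache whose keys always include the tenant identifier."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(tenant: str, prompt: str):
        # Hash the prompt so the raw text is not held as a dictionary key.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return (tenant, digest)

    def put(self, tenant: str, prompt: str, value) -> None:
        self._entries[self._key(tenant, prompt)] = value

    def get(self, tenant: str, prompt: str):
        return self._entries.get(self._key(tenant, prompt))

    def purge_tenant(self, tenant: str) -> int:
        """Drop every entry for one tenant, e.g. on offboarding or erasure."""
        doomed = [k for k in self._entries if k[0] == tenant]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```

Because there is no tenant-less lookup method, an identical prompt from a second tenant is a cache miss by construction. The same keying also makes per-tenant purge a cheap operation, which helps with the deletion obligations discussed earlier.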

Deep dive: why 'we'll just delete it later' fails for memorized training data

The single most consequential misunderstanding in AI data governance is the belief that erasure is a database operation. It is not, and the reason is architectural. When a person's data is in your operational store, deletion is a DELETE statement and a backup-expiry policy. When that same data has been included in a training corpus and the model has been trained, the data has undergone a one-way transformation: it has been smeared across billions of floating-point parameters via gradient descent, in a representation that is not indexed by individual, not addressable, and not reversible. There is no row to delete. The information may not even be recoverable from the model, but "may not be recoverable" is not the same as "has been erased," and a regulator applying the EDPB's high anonymity bar will ask you to demonstrate that extraction is not reasonably likely rather than simply assert it.

This is why the governance burden shifts upstream. The cheapest erasure is the one you never have to perform on the weights, which means: minimize what enters the corpus, de-duplicate aggressively (de-duplication is the single most effective memorization suppressant), keep raw personal data only inside a bounded retention window before the corpus is frozen, and maintain lineage so you can scope any future request. The realistic production posture in 2026 is a tiered response — output filtering for the common case, lineage-scoped corpus exclusion plus retrain-on-next-cycle for verified high-stakes requests, and machine unlearning held in reserve as the techniques mature. What you cannot credibly commit to is on-demand surgical deletion from a frozen model, and a DPA that promises it is a liability you cannot honor. Design the retention windows and the retrain cadence so that erasure requests resolve within a model generation, and disclose that timeline rather than promising the impossible.
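The tiered response and the disclosed timeline can be sketched as two small functions. Both are illustrative assumptions: the action strings are placeholders, and the worst-case formula assumes a request that arrives just after a corpus freeze must wait one full retrain cadence plus the time to train and roll out the next model.

```python
def erasure_plan(in_weights: bool, verified_high_stakes: bool) -> list:
    """Pick the tiered response for one erasure request (assumed policy)."""
    plan = ["delete from operational stores and indexes"]
    if in_weights:
        # Common case: suppress regurgitation immediately.
        plan.append("add output filter for the affected data")
        if verified_high_stakes:
            # Escalation: remove the data from the next model generation.
            plan.append("exclude from corpus and retrain on next cycle")
    return plan


def worst_case_days(retrain_cadence_days: int, train_and_rollout_days: int) -> int:
    """Longest wait before a retrained model without the data is serving."""
    return retrain_cadence_days + train_and_rollout_days
```

For example, with a hypothetical 180-day retrain cadence and 45 days to train and roll out, the disclosed worst case would be 225 days. Machine unlearning is left out of the plan on purpose, matching its held-in-reserve status in the text.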

Deep dive: building the data-governance control plane as infrastructure, not policy

The recurring failure mode is treating governance as a set of PDFs reviewed by a legal team, when the obligations are enforceable only if they are wired into the data plane. The artifacts that actually hold up under audit are systems, not documents:

A provenance ledger at corpus-assembly time: every source tagged with acquisition method, license, opt-out status, and collection date — the evidence for a Bartz-style acquisition question and the raw material for the EU training-content summary.
PII handling as a pipeline stage: detection, redaction, and de-duplication enforced in the ingest path, with a DPIA and lawful-basis record generated per training run.
A residency-aware control plane: data tagged with a jurisdiction, routing that honors it, and a transfer-impact register for every cross-border path — so "logical residency" is a software guarantee with a failure alarm, not a promise.
A lineage graph connecting data subjects to datasets to models, so an erasure request resolves to a bounded set of artifacts instead of a shrug.
A retention engine that executes deletion schedules by default and applies legal holds as documented exceptions — reconciling the privacy program's pull toward deletion with litigation's pull toward preservation.
A sub-processor and DPA registry that flows obligations down the supply chain and gates procurement on contractual coverage.

The point is that each of these is an engineering deliverable with an owner and an SLO, sitting alongside the orchestration and storage planes (Chapter 10.1), not a quarterly compliance review. Governance that lives only in policy is governance that fails the first time it is tested by a discovery request, a DPA inquiry, or an erasure demand.

Anti-patterns

The same governance mis-scopes recur, each one from treating data as a payload instead of a regulated object:

Tier-3 by omission. Crawling and training without a provenance ledger, discovering only under subpoena that you cannot prove lawful acquisition for the sources a plaintiff cares about. The fix is upstream: tag provenance at assembly, not at audit.
The silent no-training violation. Training the foundation model on tenant prompts without an explicit DPA clause permitting it — a controller act dressed as a processor convenience, and the fastest way to turn a customer's data into a competitor's model and a regulatory finding.
Promising surgical erasure from frozen weights. A DPA or privacy notice that commits to on-demand deletion of memorized data the architecture cannot deliver. Commit instead to a lineage-scoped, retention-windowed, retrain-cadence response and disclose the timeline.
Keep-everything logging. Retaining prompt and output logs indefinitely "for debugging," then discovering they are a 20-million-record discoverable liability. Minimal retention by default; legal holds as exceptions.
Conflating data governance with weight security. Building one control set for "protect the data" that an auditor reads as also covering "protect the weights," satisfying neither. They are different threat models — this chapter and Chapter 11.8 respectively.

This chapter governs the data; the model and its weights are protected as a distinct asset in Chapter 11.8, and the multi-tenant compute-isolation boundary (MIG, vGPU, confidential computing) is engineered in Chapter 11.6. The sovereignty, export-control, and geopolitical layer beneath data residency is sited in Chapter 3.12, with the market-cluster and site-scoring mechanics in Chapter 3.13. The orchestration and storage planes that the governance control plane plugs into are in Chapter 10.1; the KV-cache and embedding stores that must be tenant-isolated are the memory hierarchy of Chapter 9.7. The economics of the neocloud burst-out that creates sub-processor obligations live in Chapter 1.8; the metric and vocabulary backbone is Chapter 0.3; and the SLA framing that DPAs sit alongside is Chapter 12.4.