Technology Law

Training Data Licensing Agreements

Data-for-AI agreements for South African dataset owners and AI builders — permitted training scope, the trained-model survival clause, provenance warranties, and POPIA-compliant de-identification.

Written by

Martin Kotze

Attorney, Conveyancer & Notary Public

Quick answer

A training-data licence governs the supply of datasets to train, fine-tune or evaluate AI models. Three legal layers stack: (1) IP — an original compiled dataset can attract copyright as a compilation, while the underlying records may carry separate third-party rights the compilation licence does not clear; (2) POPIA — personal information in a dataset needs a lawful ground for both the disclosure to the AI builder and the training use itself, and only de-identification to the point where the data cannot be re-identified takes it out of POPIA's scope; (3) contract — the licence defines the permitted training scope (which models, exclusivity, derivatives), whether trained models survive termination of the data licence, and the provenance warranties each side gives. South Africa has no text-and-data-mining copyright exception, so scraping-based training carries real infringement risk here. Bespoke drafting from R12,500.

What rights live inside a dataset?

A dataset is not one asset — it is a bundle of overlapping rights, and a training-data licence has to clear each layer separately. The compilation may be protected even where its contents are not; the contents may carry rights the compiler never owned; and personal information inside the records answers to POPIA regardless of who owns the copyright. Four layers, each needing its own clearance:

Copyright in the compilation

The Copyright Act 98 of 1978 includes tables and compilations within "literary works", so an original compiled dataset — one assembled with sufficient skill, labour and judgment in its selection, verification or arrangement — can attract copyright as a compilation. The raw facts and data points themselves are not protected, and South Africa has no EU-style standalone database right, so where the compilation copyright is thin, contract and confidentiality carry the protective load.

Rights in the underlying records

Each photograph, article, post, transcript or code snippet inside the dataset may be a separate third-party copyright work. A licence over the compilation does not clear the contents — the licensor can only pass on rights it actually holds in each record. This is why the provenance of a dataset (created in-house, licensed in, scraped) determines its clearance burden and, ultimately, its price.

Confidentiality and trade secrets

Many commercially valuable datasets — pricing histories, operational telemetry, proprietary labels and annotations — derive their value from secrecy rather than copyright. Confidentiality obligations, use restrictions and trade-secret protection do the work copyright cannot, and need to survive in a usable form even after the data has been transformed into model weights.

Personal information

POPIA protects identifiable natural persons and, unusually by international standards, juristic persons too. Disclosing a dataset to an AI builder is one act of processing and training on it is another — each needs its own lawful ground under section 11. De-identification takes data out of POPIA entirely, but only if it cannot be re-identified; anything less keeps the Act fully in play.

The eight clauses that matter

A data-for-AI agreement borrows from software licensing and from data-sharing agreements, but neither template survives contact with the central problem: once data has been trained into model weights, it cannot be handed back. That single fact reshapes every clause — scope, termination, deletion, audit — and explains why a training-data licence is its own species of contract rather than an API licence with the names changed:

1. Dataset description + delivery and refresh

Precisely what is licensed: fields, record counts, coverage period, format, labelling and annotation standards, and whether delivery is a one-off snapshot or a continuing feed with a defined refresh cadence. Most dataset disputes start where the dataset definition is vague.

2. Permitted training scope

Which activities are licensed (pre-training, fine-tuning, evaluation and benchmarking), which model families and versions, whether models may be commercialised to customers or used for internal R&D only, and whether the licence is exclusive, field-limited or non-exclusive. Express prohibitions matter as much as the grant — resale of raw data, redistribution, and training models that compete with the licensor's own business.

3. Derivative and trained-model status on termination

The "model survives the licence" clause — consistently the most negotiated point in the deal. A trained model cannot practically "unlearn" the data, so the agreement must say whether termination obliges the licensee to destroy or retire models trained on the dataset, merely ends further training while existing models continue, or lets models survive against a continuing royalty.

4. Provenance + non-infringement warranties

How the dataset was assembled — created in-house, licensed in, collected from users, scraped — recorded as a contractual description, with warranties of title and non-infringement scaled to that provenance. Licensees want absolute warranties; licensors of third-party-sourced data resist them with knowledge qualifiers and indemnity caps.

5. POPIA allocation

Who holds the section 11 lawful ground for the disclosure, and on what ground the AI builder trains; whether the builder processes as an operator or as an independent responsible party; the de-identification standard the dataset must meet, backed by an express re-identification prohibition; and the section 72 transfer conditions that apply if the dataset leaves South Africa.

6. Security + deletion

Safeguards appropriate to the sensitivity of the data, breach-notification obligations, and deletion plus certification of the raw dataset on exit — with the trained-model carve-out stated expressly, so deletion of the data is not read as destruction of the model.

7. Audit

Records of training runs, dataset versions and model lineage sufficient to verify that training stayed within scope, with audit or inspection rights for the licensor — or third-party attestation as the compromise where the licensee will not open its training infrastructure.

8. Indemnities for dataset defects

Who pays when the dataset turns out to infringe third-party rights, to contain unlawfully processed personal information, or to be corrupted or poisoned in a way that damages the model. Caps, baskets and carve-outs should track the warranty allocation — uncapped data-defect indemnities are how dataset deals sink licensors.
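The scope and audit clauses above only work if the licensee's training records can be checked against the grant. The following is a minimal Python sketch of that check, not a prescribed format: the class names, field names and sample values (`LicenceScope`, `TrainingRun`, "2024-Q1", "support-assistant") are all hypothetical, and whether a run after termination is actually prohibited depends on what the agreement says.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class LicenceScope:
    """Hypothetical summary of the scope clause; field names are illustrative."""
    dataset_versions: FrozenSet[str]
    activities: FrozenSet[str]        # e.g. "fine-tuning", "evaluation"
    model_families: FrozenSet[str]
    active: bool = True               # False once the licence has terminated


@dataclass(frozen=True)
class TrainingRun:
    """One entry in the licensee's training log."""
    run_id: str
    dataset_version: str
    activity: str
    model_family: str


def scope_exceptions(run: TrainingRun, scope: LicenceScope) -> List[str]:
    """Return the reasons a logged run falls outside the recorded scope."""
    reasons = []
    if not scope.active:
        reasons.append("licence no longer active")
    if run.dataset_version not in scope.dataset_versions:
        reasons.append(f"dataset version {run.dataset_version!r} not licensed")
    if run.activity not in scope.activities:
        reasons.append(f"activity {run.activity!r} not permitted")
    if run.model_family not in scope.model_families:
        reasons.append(f"model family {run.model_family!r} not covered")
    return reasons


# Assumed sample data for illustration only.
scope = LicenceScope(
    dataset_versions=frozenset({"2024-Q1"}),
    activities=frozenset({"fine-tuning", "evaluation"}),
    model_families=frozenset({"support-assistant"}),
)
log = [
    TrainingRun("run-001", "2024-Q1", "fine-tuning", "support-assistant"),
    TrainingRun("run-002", "2024-Q2", "pre-training", "support-assistant"),
]
for run in log:
    print(run.run_id, scope_exceptions(run, scope) or "within scope")
```

A log in roughly this shape is also what a third-party attestation would need to review where the licensee will not open its training infrastructure directly.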

Can you train on scraped or "publicly available" data?

Training begins with copying: works are reproduced when they are collected, cleaned and processed into a corpus. Several jurisdictions have built statutory text-and-data-mining carve-outs of varying width into their copyright laws; South Africa has not. The Copyright Act's fair-dealing grounds are narrow and purpose-specific, and were never written for machine-scale reproduction, so every in-copyright work in a scraped corpus is copied at the trainer's risk unless a licence covers it. That does not make scraping-based training automatically unlawful, since much scraped content is not protected at all and some uses may be defensible, but the burden of clearing it sits on the party doing the scraping, which is precisely why provenance warranties dominate dataset negotiations.

Copyright is only the first barrier. Website terms and conditions routinely prohibit scraping, automated access and AI-training use, and operate as a contractual restriction that applies even to content copyright would not protect. The strength of that restriction varies with how the terms were presented — terms accepted on registration bind more cleanly than terms linked in a footer — but a well-drafted prohibition creates genuine dispute risk either way. A useful shorthand: copyright governs what the data is; contract governs how you got it. A training pipeline needs to clear both.

And "publicly available" does not mean POPIA-free. The Act applies to personal information whether or not it is public. Public accessibility, and whether the data subject deliberately made the information public, can influence which section 11 lawful ground is realistically available, but it never operates as a blanket licence to process. A scraped corpus containing identifiable people therefore needs the same lawful-ground analysis as any other personal-information processing, plus compliance with the section 72 transfer conditions if the corpus is stored, or the training takes place, outside South Africa. The honest framing for SA businesses: scraping for training is neither categorically forbidden nor safely "public domain". It is an unallocated risk, and the licence is where that risk gets allocated.
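The three barriers discussed in this section (copyright, access terms and POPIA) can be tracked per source in a simple clearance register. The sketch below is one hypothetical way to do that in Python; the `SourceClearance` fields and the sample entries are assumptions made for illustration, and the output is a bookkeeping aid that flags gaps, not a legal conclusion about any source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceClearance:
    """Hypothetical register entry for one source feeding a training corpus."""
    source: str
    copyright_basis: Optional[str]         # e.g. "licensed", "not protected"; None = nothing recorded
    terms_permit_training: Optional[bool]  # None = access terms not yet reviewed
    contains_personal_info: bool
    popia_ground: Optional[str] = None     # section 11 ground relied on, if any


def open_issues(entry: SourceClearance) -> List[str]:
    """List the clearance questions still unanswered for this source."""
    issues = []
    if entry.copyright_basis is None:
        issues.append("copyright: no licence or other basis recorded")
    if entry.terms_permit_training is not True:
        issues.append("contract: access terms prohibit training or were not reviewed")
    if entry.contains_personal_info and entry.popia_ground is None:
        issues.append("POPIA: no section 11 ground recorded")
    return issues


# Assumed sample entries for illustration only.
register = [
    SourceClearance("in-house support tickets", "first-party", True, True, "consent"),
    SourceClearance("scraped forum posts", None, None, True),
]
for entry in register:
    print(entry.source, "->", open_issues(entry) or "no open issues")
```

Keeping the register per source rather than per corpus mirrors the point that the licensor can only warrant what it can trace.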

Selling YOUR data: when customers' data becomes a product

The other side of the same deal: South African businesses sitting on years of transaction histories, support conversations, industry-specific records or labelled operational data are increasingly approached by AI builders who want exactly that. The asset is real — domain-specific, current, hard-to-replicate data is what frontier models lack — but it arrives encumbered, and the encumbrances determine whether it can be sold at all:

Consent provenance

The value of a customer dataset depends on what data subjects were told when the data was collected. Licensing it to an AI builder is further processing under section 15 of POPIA, which must be compatible with the original purpose of collection — measured against your privacy notice and your customers' reasonable expectations — failing which a fresh lawful ground, usually consent, is needed. A modest dataset with a documented notice-and-consent trail is worth more than a larger one without it, because the trail is the first thing the buyer's lawyers will ask for.

The anonymisation threshold

De-identification takes data out of POPIA only where it cannot be re-identified — a demanding standard. Stripping names is not enough if the remaining fields can be used, manipulated or linked to identify someone, and record-level data is far harder to anonymise robustly than aggregated insights. Contractual re-identification prohibitions should back the technical measures, because data that is non-identifiable today can become re-identifiable when combined with tomorrow's datasets.

The reputational dimension

Even a lawful licence can be a customer-trust event. Headlines about a company "selling customer data to an AI firm" do not quote section numbers, and the commercial damage from a trust backlash can exceed any licence fee. Transparency with customers, genuine opt-outs and a clear-eyed business case belong in the decision alongside the bare compliance analysis.
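On the anonymisation threshold above, a quick way to see why stripping names is not enough is to count how many records share each combination of the remaining indirect identifiers. The Python sketch below uses a k-anonymity-style group count as a stand-in technique: it is not the POPIA statutory test, a high count does not establish that data "cannot be re-identified", and the field names and records are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group_size(records: List[Dict], quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values.

    A result of 1 means at least one record is unique on those fields alone,
    even though no name appears in the data.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


# Invented, already name-stripped records; field names are assumptions.
records = [
    {"postcode": "0181", "birth_year": 1984, "gender": "F", "spend": 1200},
    {"postcode": "0181", "birth_year": 1984, "gender": "F", "spend": 300},
    {"postcode": "0002", "birth_year": 1991, "gender": "M", "spend": 950},
]

k = smallest_group_size(records, ["postcode", "birth_year", "gender"])
print(f"smallest group size: {k}")
```

Here the third record is unique on postcode, birth year and gender, so anyone holding another dataset with those fields could link it to a person. That linkage risk is the reason a contractual re-identification prohibition should sit behind the technical measures.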

One drafting point peculiar to derived and synthetic datasets: where a dataset is itself computer-generated — synthetic records, machine-produced labels, model-derived embeddings — the Copyright Act attributes authorship of a computer-generated work to the person who undertook the arrangements necessary for its creation. In a deal where both parties' systems contributed to generating the derived data, that test does not produce an obvious answer, so the agreement should allocate ownership of derived and synthetic datasets expressly rather than leaving it to statutory inference.

Frequently asked

Is a dataset protected by copyright in South Africa?

It can be. The Copyright Act 98 of 1978 includes tables and compilations within "literary works", so an original compiled dataset — original in the sense that sufficient skill, labour and judgment went into selecting, verifying, structuring or arranging it — can attract copyright as a compilation. Two limits matter: raw facts and data points are not themselves protected, only the original compilation of them; and copyright in the compilation is separate from rights in the underlying records, which may belong to third parties. South Africa has no standalone sui generis database right of the EU kind, so the compilation analysis — supplemented by confidentiality and contract — carries the protective load.

Can I train an AI model on scraped public data in South Africa?

There is no flat prohibition, but the risk profile is real and SA-specific. Training starts with copying — works are reproduced when collected and processed — and the Copyright Act contains no text-and-data-mining exception, so every in-copyright work in a scraped corpus is reproduced at the scraper's risk unless a licence or one of the narrow fair-dealing grounds covers it. Website terms of use can operate as a separate contractual barrier prohibiting automated access. And POPIA applies to personal information even when it is publicly available — public accessibility may influence which lawful ground is plausible, but it does not switch the Act off. The analysis is corpus-by-corpus: what is in the data, whose rights attach, what terms governed access, and what ground supports any personal-information processing.

Does POPIA apply if the data is anonymised?

POPIA does not apply to information that has been de-identified to the point where it cannot be re-identified — that is the statutory threshold, and it is demanding. Removing names is not enough if the remaining fields can be used, manipulated or linked to identify someone, and pseudonymised or keyed data, where a method of re-identification exists, remains personal information. Two further wrinkles: the act of de-identifying is itself processing that needs a lawful ground, and a well-drafted licence backs the technical standard with a contractual re-identification prohibition, because a dataset that is non-identifiable today can become re-identifiable when combined with tomorrow's data.

What happens to a model trained on licensed data when the licence ends?

Whatever the contract says — and if the contract says nothing, both parties have a problem. A trained model does not contain the dataset in retrievable form, and "untraining" specific data from model weights is not commercially feasible, so the realistic options are structural: trained models survive termination (with further training ending but inference continuing, sometimes against a continuing royalty), or the licensee must destroy or retire the affected models. Licensors push for destruction or royalties; licensees insist on survival, because a model that dies with the data licence is a model nobody will invest in training. This is consistently the most negotiated clause in a training-data deal — price it upfront rather than discovering it at termination.

What warranties should a data licensor give — and resist?

Give warranties about what you actually control: that you compiled or lawfully acquired the dataset, that the provenance description in the agreement is accurate, and that you hold the rights needed to grant the licence. Resist absolute non-infringement and "contains no personal information" warranties over third-party-sourced or web-scale data you cannot fully audit — qualify them by knowledge, tie them to the disclosed provenance, and cap the matching indemnity. The licensee's sensible counter is to scale warranty strength to provenance: first-party data the licensor generated itself should carry strong warranties; aggregated or scraped third-party data justifies qualified warranties and a price adjustment, not a pretence of certainty.

Can we license our customers' data to an AI builder?

Only after a three-part clearance. First, POPIA section 15: licensing data collected to provide your service is further processing for a new purpose, and must be compatible with the original purpose of collection — judged against what your privacy notice said and what customers would reasonably expect — failing which you need a fresh lawful ground, usually consent. Second, consent provenance: documentary proof of what data subjects were told and agreed to at collection is the first thing the AI builder's due-diligence team will request. Third, your own customer contracts may restrict or prohibit the licence regardless of POPIA. If anything marketing-adjacent is in play, section 69's opt-in regime for electronic direct marketing adds a further layer. De-identification to the cannot-be-re-identified standard is often the cleaner route, because it takes the data outside POPIA altogether.

How does exclusivity affect training-data pricing?

Dramatically. A non-exclusive snapshot licence prices as a commodity; exclusivity — even field-of-use exclusivity limited to a model family or industry — prices as a competitive moat, often at multiples. Two dynamics drive the negotiation: data loses uniqueness once a competitor's model has learned from it, so exclusivity is partly the licensor agreeing not to dilute its own asset; and continuing refresh feeds outprice static snapshots, because model performance decays without current data. Expect exclusivity to be paired with minimum-commitment terms, and watch the interaction with termination — exclusivity over data already embedded in a surviving model is worth less than it appears on paper.

What does a training-data licensing agreement cost?

From R12,500 for a single-dataset licence in one direction — dataset description, permitted training scope, trained-model termination treatment, provenance warranties and the POPIA allocation. Two-way or multi-dataset frameworks, a de-identification protocol annex, or an operator-style data processing addendum where the AI builder processes personal information on your behalf are quoted on scope — typically R18,000–R30,000 for the full stack.

This page is general information, not legal advice. The lawfulness of any particular dataset, training use or data sale depends on the content of the data, how it was collected, and the terms that governed it — speak to an attorney about your specific circumstances.

Why you can trust this: Martin Kotze has been an admitted Attorney of the High Court of South Africa, registered Conveyancer, and Notary Public since 2014, practising from Pretoria. The firm is regulated by the Legal Practice Council under firm registration F17333.
