Navigate corpus licensing by separating three distinct questions: may you analyse the texts, may you redistribute them, and what may others do with your corpus. The answers differ for public-domain and in-copyright material, and a research text-and-data-mining exception that lets you analyse a work rarely lets you republish it. Record a rights status for every document up front, and your whole collection stays defensible rather than a liability waiting for a takedown.
Why treat analysis and redistribution as separate questions?
The single most common licensing mistake is assuming that permission to analyse implies permission to share. The two are governed by different rules. In the UK and EU, researchers working for non-commercial purposes may copy in-copyright works they have lawful access to in order to run computational analysis under a text-and-data-mining (TDM) exception. That same exception gives you no right to post the readable texts online. Keep the two questions apart from the start and you avoid the trap of building a corpus you can study but never release.
What can I do with in-copyright versus public-domain text?
| Action | Public domain | In-copyright (non-commercial research) |
|---|---|---|
| Analyse computationally | yes | yes, under a TDM exception (lawful access) |
| Store a working copy | yes | yes, for the analysis |
| Redistribute full text | yes | generally no, needs a licence |
| Share derived data | yes | usually yes (frequency lists, n-grams) |
The "share derived data" row is the practical escape hatch: you can almost always publish counts, n-grams, or token offsets even when you cannot publish the source, because those rarely reproduce a substantial part of the protected expression.
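As a minimal sketch of what derived data can look like, the snippet below turns one text into a unigram frequency list and bigram counts using only the Python standard library. The lowercase regex tokeniser and the function name `derive_counts` are illustrative assumptions, not a recommended pipeline.

```python
import re
from collections import Counter


def derive_counts(text, n=2):
    """Return (unigram counts, n-gram counts) for one document.

    Hypothetical helper: the regex tokeniser is a simplifying assumption.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    # Slide a window of length n over the token list to count n-grams.
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    return unigrams, ngrams


unigrams, bigrams = derive_counts("the cat sat on the mat and the cat slept")
print(unigrams.most_common(2))   # [('the', 3), ('cat', 2)]
print(bigrams[("the", "cat")])   # 2
```

Whether a given derivative is safe to publish still depends on how much of the original a reader could reconstruct from it, so short n-grams and aggregate counts are the more cautious choice.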
How do I record licensing across a whole collection?
Per-document rights belong in your metadata, expressed with a controlled vocabulary rather than free text so they can be filtered by machine:
```csv
doc_id,source,year,rights_uri
0001,EEBO,1623,http://rightsstatements.org/vocab/NoC-OKLR/1.0/
0002,own-scan,1955,https://creativecommons.org/licenses/by/4.0/
0003,archive.org,1701,https://creativecommons.org/publicdomain/mark/1.0/
```
With a rights_uri column you can programmatically build a shareable subset:
```python
# meta is a pandas DataFrame holding the metadata table; na=False skips rows with no rights_uri
shareable = meta[meta.rights_uri.str.contains("publicdomain|licenses", na=False)]
```
Standardised URIs from RightsStatements.org or Creative Commons make this trivial; ad-hoc notes like "probably fine" do not.
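Before filtering, it can help to check that every row really uses a known URI. The following is a sketch under stated assumptions: the `ALLOWED_PREFIXES` tuple and the `invalid_rights` helper are hypothetical names, and the inline CSV repeats the sample rows with one deliberately bad, made-up entry (`0004`) added for illustration.

```python
import csv
import io

# Assumed controlled vocabulary: URI prefixes this project accepts.
ALLOWED_PREFIXES = (
    "http://rightsstatements.org/vocab/",
    "https://creativecommons.org/licenses/",
    "https://creativecommons.org/publicdomain/",
)

# Sample metadata; row 0004 is an invented example of a free-text note.
SAMPLE = """doc_id,source,year,rights_uri
0001,EEBO,1623,http://rightsstatements.org/vocab/NoC-OKLR/1.0/
0002,own-scan,1955,https://creativecommons.org/licenses/by/4.0/
0003,archive.org,1701,https://creativecommons.org/publicdomain/mark/1.0/
0004,unknown,1890,probably fine
"""


def invalid_rights(rows):
    """Return doc_ids whose rights_uri is empty or outside the vocabulary."""
    return [
        row["doc_id"]
        for row in rows
        if not (row.get("rights_uri") or "").startswith(ALLOWED_PREFIXES)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(invalid_rights(rows))  # ['0004']
```

Running a check like this on every metadata update keeps free-text notes from creeping back into the rights column.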
Are public-domain texts always free to reuse?
Not automatically, and this catches people out. The original work may be out of copyright, but the digitisation you obtained it from can carry separate database rights, the edition may have its own editorial copyright, and the platform hosting it often imposes terms of use that restrict bulk download or redistribution. Always check the source's terms, not merely the death date of the author. A 1650 pamphlet is public domain; one library's high-resolution scan of it may still come with strings attached.
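One way to keep these layers apart is to record each one separately and treat a public-domain source as redistributable only when all of them are clear. The sketch below assumes a hypothetical `SourceRights` record with three boolean fields; real terms are rarely this binary, so it is a prompt for manual review rather than a legal test.

```python
from dataclasses import dataclass


@dataclass
class SourceRights:
    """Hypothetical record: one flag per layer of rights on a source."""

    work_public_domain: bool         # the underlying work is out of copyright
    digitisation_unrestricted: bool  # no database or edition rights on this copy
    platform_allows_reuse: bool      # the host's terms permit redistribution

    def may_redistribute(self):
        # Every layer must be clear; a single restriction blocks release.
        return (
            self.work_public_domain
            and self.digitisation_unrestricted
            and self.platform_allows_reuse
        )


# A 1650 pamphlet obtained from a host whose terms restrict redistribution.
pamphlet_scan = SourceRights(True, True, False)
print(pamphlet_scan.may_redistribute())  # False
```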
What belongs in a corpus licence statement?
When you release a corpus, accompany it with a clear statement covering four things: the licence of the corpus as a compilation; the rights status of the constituent texts; any embedded third-party material with its own terms; and an explicit list of what reusers may and may not do. Ambiguity here is precisely what generates disputes and takedown requests months later. A two-paragraph statement written carefully at release saves a stressful email exchange down the line.
What is a working licensing checklist?
Run this before you publish anything:
- Confirm you had lawful access to every source.
- Record a controlled-vocabulary rights status per document.
- Separate items you may share in full from those you may not.
- For restricted items, prepare derived data instead of full text.
- Check digitisation and platform terms, not just the work's age.
- Write a compilation licence and a plain-language reuse statement.
- Keep a provenance log of where each text came from and under what terms.
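Parts of this checklist can be automated. The sketch below assumes each document is a dict with hypothetical `doc_id`, `rights_uri`, and `provenance` keys, and that a Creative Commons licence or public-domain mark URI means the full text may be shared. Both assumptions would need adjusting to your own vocabulary, and the lawful-access and platform-terms checks still need a human.

```python
# Assumption: these URI prefixes mark items that may be shared in full.
SHAREABLE_PREFIXES = (
    "https://creativecommons.org/licenses/",
    "https://creativecommons.org/publicdomain/",
)


def release_plan(docs):
    """Sort documents into full-text, derived-only, and blocked lists."""
    plan = {"full_text": [], "derived_only": [], "blocked": []}
    for doc in docs:
        if not doc.get("rights_uri") or not doc.get("provenance"):
            # Missing rights status or provenance: hold back until fixed.
            plan["blocked"].append(doc["doc_id"])
        elif doc["rights_uri"].startswith(SHAREABLE_PREFIXES):
            plan["full_text"].append(doc["doc_id"])
        else:
            # Rights recorded but not clearly shareable: publish derived data only.
            plan["derived_only"].append(doc["doc_id"])
    return plan


docs = [
    {"doc_id": "0001", "rights_uri": "http://rightsstatements.org/vocab/NoC-OKLR/1.0/", "provenance": "EEBO"},
    {"doc_id": "0002", "rights_uri": "https://creativecommons.org/licenses/by/4.0/", "provenance": "own-scan"},
    {"doc_id": "0003", "rights_uri": "", "provenance": "archive.org"},
]
print(release_plan(docs))
# {'full_text': ['0002'], 'derived_only': ['0001'], 'blocked': ['0003']}
```

Defaulting anything ambiguous to derived-only or blocked is a deliberately conservative design choice: an item wrongly held back can be released later, while a wrongly released one cannot be recalled.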
Key Takeaways
- Treat analysis, redistribution, and downstream reuse as three separate permissions.
- A TDM research exception lets you mine in-copyright text but not republish it.
- Share derived data (frequencies, n-grams) when you cannot share the source text.
- Record per-document rights with controlled URIs so subsets can be built by machine.
- Public-domain works can still carry digitisation, edition, or platform restrictions.
- Release every corpus with a compilation licence and a plain-language reuse statement.
Frequently Asked Questions
Does building a research corpus require permission from rights holders?
It depends on the texts and your jurisdiction. Public-domain works are free to use; in-copyright works may be covered by a text-and-data-mining exception for non-commercial research in the UK and EU, but redistribution of the texts usually still needs a licence.
Can I share a corpus that contains in-copyright text?
Often not the full text. A common workaround is to distribute derived data - frequency lists, n-grams, or token offsets - rather than the readable source, since these may not reproduce a substantial part of the protected expression.
What is the UK text and data mining exception?
It permits copying lawfully-accessed works to carry out computational analysis for non-commercial research, and contract terms cannot override it. It allows analysis, but it does not grant a right to republish the underlying texts.
How do I record licensing per document in a large corpus?
Add a rights field to your metadata table for every item, using a controlled vocabulary such as the RightsStatements.org URIs or SPDX identifiers, so the licence travels with the text and can be filtered programmatically.
Are public-domain texts always safe to reuse?
Not always. The work may be public domain, but a specific digitisation or edition can carry its own database or layout rights, and the hosting platform's terms may add restrictions. Check the source's terms, not just the age of the original work.
What should a corpus licence statement include?
It should state the licence of the corpus as a whole, the rights status of the constituent texts, any third-party material with its own terms, and what reusers may and may not do. Ambiguity here is what causes takedowns later.