Best Practices to Distinguish Copyright from Database Rights

A dataset can carry two independent legal layers: copyright protects the original selection and arrangement of contents, while the sui generis database right protects the substantial investment in obtaining, verifying or presenting those contents. To distinguish them reliably, ask two separate questions of every dataset and record each answer: "is the structure creative?" (copyright) and "did collecting, checking or presenting the contents take substantial investment?" (database right). They are not the same test, and a single catalogue can fail one and pass the other.

What exactly does copyright protect in a dataset?

Copyright in a compilation subsists only in the intellectual creation expressed through the selection or arrangement of its contents — not in the facts themselves. A telephone directory ordered alphabetically is the textbook example of something with too little creativity to qualify. By contrast, a curated anthology of correspondence, where you chose which 200 of 5,000 letters to include and grouped them thematically, can attract copyright in that editorial structure.

Crucially, copyright never protects individual facts: a death date, a coordinate, a shelfmark. Those are free to copy. What may be protected is your particular expression of the whole.

What does the database right protect instead?

The sui generis right (EU Directive 96/9/EC; in the UK, the Copyright and Rights in Databases Regulations 1997) protects the investment in a database, regardless of any creativity. The legal test is whether there was "substantial investment in obtaining, verifying or presenting" the contents.

The CJEU's British Horseracing Board v William Hill ruling drew a sharp line: investment in creating the underlying data does not count — only investment in collecting, checking and presenting pre-existing data does. So if you generated fixtures and then catalogued them, the cataloguing effort counts but the fixture-generation does not.
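The line drawn in that ruling can be illustrated with a small tally. This is a rough sketch only: the activity labels and hour figures are hypothetical, and whether the resulting total counts as "substantial" remains a legal judgement with no numeric threshold.

```python
# Activities that count towards the database right, per the legal test quoted above.
QUALIFYING_ACTIVITIES = {"obtaining", "verifying", "presenting"}


def qualifying_investment(effort_log):
    """Sum only the effort spent on obtaining, verifying or presenting data.

    effort_log is a list of (activity, hours) pairs; effort spent on
    creating the underlying data is deliberately left out of the total.
    """
    return sum(hours for activity, hours in effort_log
               if activity in QUALIFYING_ACTIVITIES)


# Hypothetical effort log for a fixtures catalogue.
effort_log = [
    ("creating", 400),    # generating the fixtures: does not count
    ("obtaining", 120),   # gathering existing records
    ("verifying", 60),    # checking entries against sources
    ("presenting", 40),   # cataloguing and layout
]

print(qualifying_investment(effort_log))
```

Keeping the "creating" entries in the log, rather than deleting them, makes it visible to a later reader which effort was excluded and why.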

How do I tell which one applies?

Run both tests independently. The table below shows how four common heritage datasets typically fall.

| Dataset | Creative structure? (copyright) | Substantial investment in collation? (database right) |
| --- | --- | --- |
| Alphabetical name index | No | Often yes |
| Curated themed exhibition catalogue | Usually yes | Yes |
| Raw OCR text dump | No | Maybe (verification effort) |
| Coordinate gazetteer of place-names | No | Yes |

Notice that the index and the gazetteer typically carry only database rights. This is a frequent trap: they "feel" copyright-free, and people forget the second layer.
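The two independent tests can be sketched in a few lines of Python. This is only an illustration: the function name `protection_layers` and its inputs are assumptions made here, and the yes/no answers themselves still have to come from a human assessment.

```python
from typing import List, Optional


def protection_layers(creative_structure: bool,
                      substantial_investment: Optional[bool]) -> List[str]:
    """Return the legal layers that appear to apply to a dataset.

    Passing None for substantial_investment means "not yet decided",
    which corresponds to the "Maybe" row in the table above.
    """
    layers = []
    if creative_structure:
        layers.append("copyright")
    if substantial_investment is None:
        layers.append("database right (needs assessment)")
    elif substantial_investment:
        layers.append("database right")
    return layers


# Rows from the table above, with hypothetical yes/no answers filled in.
datasets = {
    "Alphabetical name index": (False, True),
    "Curated themed exhibition catalogue": (True, True),
    "Raw OCR text dump": (False, None),
    "Coordinate gazetteer of place-names": (False, True),
}

for name, (creative, investment) in datasets.items():
    layers = protection_layers(creative, investment)
    print(f"{name}: {layers or ['no layer identified']}")
```

Because each test is a separate argument, neither answer can silently stand in for the other.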

A working checklist you can document

```text
For each dataset, record:
[ ] Jurisdiction of creation and of hosting
[ ] Copyright: is selection/arrangement an intellectual creation? (Y/N + note)
[ ] Database right: substantial investment in obtaining/verifying/presenting? (Y/N + note)
[ ] Who is the maker (database right) vs author (copyright)?
[ ] Term: 70 yrs p.m.a. (copyright) / 15 yrs rolling (database right)
[ ] Licence chosen for EACH layer (e.g. CC0 + PDDL, or CC BY + ODbL)
[ ] Date and name of assessor
```

Storing this beside the data — as a RIGHTS.md or a structured field — is what makes your decisions defensible months later when someone asks "can we re-publish this?".
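One way to keep the checklist as a structured field is a small record type serialised to JSON. The sketch below is an assumption about how such a record might look: the class name `RightsAssessment`, its field names and the sample values are all hypothetical, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RightsAssessment:
    """One dated rights assessment, mirroring the checklist items."""
    dataset: str
    jurisdiction_created: str
    jurisdiction_hosted: str
    copyright_applies: bool
    copyright_note: str
    database_right_applies: bool
    database_right_note: str
    author: str              # copyright layer
    maker: str               # database right layer
    copyright_licence: str
    database_licence: str
    assessor: str
    assessed_on: str         # ISO 8601 date, e.g. "2024-05-01"

    def to_json(self) -> str:
        """Serialise the record so it can be stored beside the data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example values for illustration only.
record = RightsAssessment(
    dataset="place-name-gazetteer",
    jurisdiction_created="UK",
    jurisdiction_hosted="UK",
    copyright_applies=False,
    copyright_note="Fields are factual; arrangement dictated by function.",
    database_right_applies=True,
    database_right_note="Substantial effort verifying coordinates.",
    author="n/a",
    maker="Example Archive",
    copyright_licence="CC0",
    database_licence="PDDL",
    assessor="A. Cataloguer",
    assessed_on="2024-05-01",
)

print(record.to_json())
```

The output can be written to a file next to the dataset, or pasted into a RIGHTS.md, so that the two decisions and their date travel with the data.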

Why does the licence choice differ?

Creative Commons 4.0 licences now cover sui generis database rights, but earlier versions (3.0 and before) generally did not. If you only ever apply a copyright-only statement or licence, an EU/UK reuser may still be blocked by an un-waived database right. For pure data, the Open Data Commons suite was built for this: PDDL (public-domain dedication for data), ODbL (share-alike for data), and ODC-BY (attribution for data).

```text
Pure facts, no creativity, want full openness  -> CC0 1.0 or PDDL
Facts + attribution required                   -> ODC-BY (or CC BY 4.0)
Facts + share-alike                            -> ODbL
Creative compilation + share-alike             -> CC BY-SA 4.0
```
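The mapping above can also be written as a lookup table. This is a minimal sketch: the function name `suggest_licence` and the requirement labels are assumptions chosen here, and combinations that the mapping does not list return `None` rather than a guess.

```python
from typing import Optional

# Keys are (creative_compilation, requirement); values follow the mapping above.
LICENCE_TABLE = {
    (False, "open"): "CC0 or PDDL",
    (False, "attribution"): "ODC-BY (or CC BY 4.0)",
    (False, "share-alike"): "ODbL",
    (True, "share-alike"): "CC BY-SA 4.0",
}


def suggest_licence(creative_compilation: bool, requirement: str) -> Optional[str]:
    """Look up a licence suggestion, or None if the case is not listed."""
    return LICENCE_TABLE.get((creative_compilation, requirement))


print(suggest_licence(False, "open"))
print(suggest_licence(True, "share-alike"))
print(suggest_licence(True, "open"))  # not listed, so no suggestion is made
```

Returning `None` for unlisted cases keeps the sketch from implying a recommendation that the mapping never made.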

What about extracted subsets and AI training?

The database right is infringed by extracting or re-using a substantial part — judged qualitatively or quantitatively. Systematic extraction of insubstantial parts can also infringe if it amounts to rebuilding the database. For text-and-data-mining, the UK and EU have specific exceptions, but they come with conditions (lawful access, opt-outs) that vary, so do not assume a blanket right to scrape catalogue data even where copyright is absent.

Key Takeaways

Copyright and the database right are separate layers; clear and license both.
Copyright protects creative selection/arrangement, never raw facts.
The database right protects substantial investment in collation, not in creating the data (per BHB v William Hill).
The database right is essentially an EU/UK phenomenon and lasts 15 years on a rolling basis.
Use Open Data Commons (PDDL/ODbL/ODC-BY) or CC 4.0 to address data rights explicitly.
Record both decisions in a stored, dated rights assessment so they stay defensible.

Frequently Asked Questions

Can a single dataset be protected by both copyright and database rights?

Yes. Copyright can protect the original selection and arrangement (a creative structure), while the sui generis database right protects substantial investment in obtaining, verifying or presenting the contents. They are separate layers and you must clear both.

Does the database right exist outside the EU and UK?

The sui generis database right is largely an EU/UK construct under the 1996 Database Directive and the UK's post-Brexit equivalent. Most jurisdictions, including the US, do not recognise it, so a US-hosted copy of a 'thin' database may face no database-right claim there.

How long does the database right last compared to copyright?

The sui generis right lasts 15 years from completion or publication, but a substantial new investment resets the clock for another 15 years. Copyright in the UK typically lasts 70 years after the author's death, so the two clocks rarely align.
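The 15-year clock can be sketched as a date calculation. The example assumes the counting rule in Article 10 of Directive 96/9/EC, under which the term runs from 1 January of the year after completion, or after first making the database available if that happens before the first term ends. The function name and sample dates are hypothetical, and the rule should be checked for your own jurisdiction.

```python
from datetime import date
from typing import Optional

TERM_YEARS = 15


def database_right_expiry(completed: date,
                          first_made_available: Optional[date] = None) -> date:
    """Return the first day on which the database right no longer applies."""
    # Term counted from 1 January of the year following completion.
    completion_expiry = date(completed.year + 1 + TERM_YEARS, 1, 1)
    # Making the database available within that term starts a fresh count.
    if (first_made_available is not None
            and completed <= first_made_available < completion_expiry):
        return date(first_made_available.year + 1 + TERM_YEARS, 1, 1)
    return completion_expiry


print(database_right_expiry(date(2010, 6, 15)))
print(database_right_expiry(date(2010, 6, 15), date(2012, 3, 1)))
```

To model a reset after a substantial new investment, call the function again with the completion date of the substantially changed database.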

Is a CSV of catalogue records covered by copyright?

Usually not by copyright if the fields are factual and the arrangement is dictated by function rather than creativity. It may still attract database rights if you invested substantially in compiling and verifying it, which is the common situation for catalogues.

Why does this distinction matter for open data publishing?

If you only think about copyright, you might apply a copyright-only statement or an older licence (such as a CC 3.0 licence) and still leave database rights unaddressed in the EU/UK, creating legal ambiguity. CC0 and the CC 4.0 licences cover the database right, and Open Data Commons licences such as ODbL and PDDL address it explicitly.
