Linked Open Data

Introducing linked open data (LOD) to a heritage collection means re-expressing your catalogue as RDF statements where every entity has a resolvable HTTP URI, then linking those URIs to shared authorities like Wikidata, GeoNames and the Getty vocabularies. The fastest route to a usable result is to pick one small, well-described series, model it against established vocabularies, publish it as a Turtle file plus a SPARQL endpoint, and only then expand. Do not start by converting everything.

Why bother with LOD for a collection?

The concrete payoff is that other systems can discover and reuse your records without scraping HTML. When your "Bristol" links to https://sws.geonames.org/2654675/, every dataset that also uses that URI becomes a potential bridge to yours. You gain federated discovery, automatic disambiguation, and provenance that survives platform migrations. The cost is real but bounded: a few weeks of modelling per collection, not a new repository system.

What do you actually need before you start?

A short checklist that I run before any pilot:

  • A collection of a few hundred to a few thousand records with consistent fields.
  • At least one column already under authority control (names, places, or subject terms).
  • A namespace you control, for example https://data.myarchive.org/.
  • A way to export to CSV or already-structured XML.

If your subject column is free text with thirty spellings of one place, fix that first in OpenRefine. LOD amplifies whatever order or chaos already exists in the data.
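Before opening OpenRefine, it can help to see how bad the spelling drift actually is. The sketch below is loosely modelled on the idea behind OpenRefine's key-collision "fingerprint" clustering: it groups raw values that differ only in case, accents, punctuation, or word order. The sample values and function names are illustrative assumptions, not part of any tool's API.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Collapse case, accents, punctuation and word order into one key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then sort and de-duplicate the tokens.
    tokens = re.sub(r"[^\w\s]", " ", plain).split()
    return " ".join(sorted(set(tokens)))


def spelling_clusters(values):
    """Group raw values by fingerprint; keep only groups with several variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


# Hypothetical values from a free-text place column.
places = ["Bristol", "bristol", "Bristol.", " BRISTOL ", "Bath",
          "Bristol, England", "England, Bristol"]
clusters = spelling_clusters(places)
for key, variants in clusters.items():
    print(key, "->", variants)
```

Each cluster with more than one variant is a candidate for merging before you mint any URIs. A human should still confirm every merge, because two places can share a fingerprint.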

Step one: design your URIs before anything else

URIs are the foundation and the hardest thing to change later. Use a pattern like https://data.myarchive.org/item/{id} and never embed software details or file extensions in the path. Keep the identifier opaque and stable.

https://data.myarchive.org/item/MS-0421      # good: stable, technology-neutral
https://cms.myarchive.org/node/8837?v=2       # bad: leaks the CMS, has a query string
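The rules above can be turned into a small pre-publication check. This is a sketch under stated assumptions: the `SOFTWARE_HINTS` list and the extension pattern are hypothetical house rules rather than any standard, and the base namespace is the example one used in this article.

```python
import re
from urllib.parse import quote, urlsplit

BASE = "https://data.myarchive.org/"  # example namespace from this article

# Hypothetical house rules: tokens that suggest a software product in the URI.
SOFTWARE_HINTS = ("cms", "node", "wp", "drupal", "php")
FILE_EXTENSION = re.compile(r"\.[A-Za-z0-9]{2,5}$")


def mint_item_uri(identifier):
    """Build an item URI following the /item/{id} pattern."""
    return f"{BASE}item/{quote(identifier.strip(), safe='')}"


def uri_problems(uri):
    """Return a list of reasons a URI may not age well (empty list = fine)."""
    parts = urlsplit(uri)
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("not an HTTP(S) URI")
    if parts.query:
        problems.append("has a query string")
    if FILE_EXTENSION.search(parts.path):
        problems.append("path ends in a file extension")
    # Split host and path into lowercase tokens and look for software names.
    tokens = re.split(r"[^a-z0-9]+", (parts.netloc + parts.path).lower())
    hits = [token for token in tokens if token in SOFTWARE_HINTS]
    if hits:
        problems.append("mentions software: " + ", ".join(hits))
    return problems


print(mint_item_uri("MS-0421"))
print(uri_problems("https://data.myarchive.org/item/MS-0421"))
print(uri_problems("https://cms.myarchive.org/node/8837?v=2"))
```

Running a check like this over every minted URI before the first publication is cheap. Changing the URIs afterwards is not.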

How do you turn a spreadsheet into RDF?

You do not hand-write triples. Map columns to properties and let a tool emit Turtle. A minimal mapping for a manuscript record:

turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://data.myarchive.org/item/MS-0421>
    a schema:CreativeWork ;
    dct:title "Letter to John Aubrey" ;
    dct:created "1685"^^xsd:gYear ;
    dct:spatial <https://sws.geonames.org/2654675/> ;
    dct:subject <http://vocab.getty.edu/aat/300026877> .

Tools that handle the CSV-to-RDF mapping well: OpenRefine with the RDF extension, Tarql for SPARQL-based CSV mapping, and LinkedDataHub or Python's rdflib for scripted control.
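To make "map columns to properties" concrete, here is a minimal standard-library sketch of what such a mapping does under the hood. It is not how any of the tools above is implemented. The column names (`id`, `title`, `year`, `place_uri`, `subject_uri`) are hypothetical, and it assumes identifiers are already URI-safe. For real data, prefer one of the dedicated tools, which also validate the output.

```python
import csv
import io
import re

PREFIXES = """\
@prefix dct: <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""
BASE = "https://data.myarchive.org/item/"

# Hypothetical spreadsheet columns mapped to (property, kind of object).
MAPPING = {
    "title": ("dct:title", "literal"),
    "year": ("dct:created", "gYear"),
    "place_uri": ("dct:spatial", "uri"),
    "subject_uri": ("dct:subject", "uri"),
}


def turtle_literal(text):
    """Escape a string for use as a double-quoted Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped.replace("\n", "\\n").replace("\r", "\\r") + '"'


def turtle_iri(value):
    """Wrap a value in angle brackets, refusing characters an IRI cannot hold."""
    if re.search(r'[\s<>"{}|\\^`]', value):
        raise ValueError(f"not a clean IRI: {value!r}")
    return f"<{value}>"


def row_to_turtle(row):
    lines = ["    a schema:CreativeWork"]
    for column, (prop, kind) in MAPPING.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip empty cells rather than emit empty literals
        if kind == "uri":
            obj = turtle_iri(value)
        elif kind == "gYear":
            obj = turtle_literal(value) + "^^xsd:gYear"
        else:
            obj = turtle_literal(value)
        lines.append(f"    {prop} {obj}")
    return turtle_iri(BASE + row["id"]) + "\n" + " ;\n".join(lines) + " .\n"


def csv_to_turtle(handle):
    records = [row_to_turtle(row) for row in csv.DictReader(handle)]
    return PREFIXES + "\n" + "\n".join(records)


sample = io.StringIO(
    "id,title,year,place_uri,subject_uri\n"
    "MS-0421,Letter to John Aubrey,1685,"
    "https://sws.geonames.org/2654675/,http://vocab.getty.edu/aat/300026877\n"
)
turtle = csv_to_turtle(sample)
print(turtle)
```

The mapping table is the part worth keeping under version control: it documents which column became which property, whatever tool ends up emitting the triples.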

Which vocabularies should you reuse?

Need                    | Reuse first                               | Notes
Basic description       | Dublin Core Terms, schema.org             | Lowest barrier, widely understood
Subjects / concepts     | Getty AAT, Library of Congress, your SKOS | Link, do not invent
Places                  | GeoNames, World Historical Gazetteer      | Resolvable, coordinate-rich
People / orgs           | Wikidata, VIAF                            | Reconcile in OpenRefine
Events / deep modelling | CIDOC CRM                                 | Powerful but heavy; defer it

Resist inventing new properties. Every term you mint is a term nobody else understands.

How do you publish and prove it works?

Load the Turtle into a triplestore. For a pilot, Apache Jena Fuseki runs in one command and gives you a SPARQL endpoint:

bash
fuseki-server --file=collection.ttl /ds

Then verify that a real link resolves and that a query returns sensible results:

sparql
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?title ?place WHERE {
  ?item dct:title ?title ; dct:spatial ?place .
} LIMIT 10

If that returns rows and your URIs dereference in a browser, you have published genuine linked data.
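The same check can be scripted so it runs after every reload. The sketch below uses only the Python standard library and the SPARQL 1.1 Protocol (query by HTTP GET, results as SPARQL JSON). It assumes Fuseki is listening on its default port 3030 and exposes the `/ds` dataset at `/ds/sparql`; adjust `ENDPOINT` if your setup differs. The sample payload is invented so the parsing step can be tried offline.

```python
import json
import urllib.parse
import urllib.request

# Assumption: local Fuseki, default port, dataset name /ds as in the command above.
ENDPOINT = "http://localhost:3030/ds/sparql"

QUERY = """\
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?title ?place WHERE {
  ?item dct:title ?title ; dct:spatial ?place .
} LIMIT 10
"""


def build_request(endpoint, query):
    """Prepare a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def rows_from_results(payload):
    """Flatten a SPARQL JSON result document into dicts of variable -> value."""
    bindings = json.loads(payload)["results"]["bindings"]
    return [{var: cell["value"] for var, cell in row.items()} for row in bindings]


def run_query(endpoint=ENDPOINT, query=QUERY):
    """Send the query to a running endpoint (needs network access)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=10) as response:
        return rows_from_results(response.read().decode("utf-8"))


# Offline demonstration with an invented result document.
SAMPLE = json.dumps({
    "head": {"vars": ["title", "place"]},
    "results": {"bindings": [{
        "title": {"type": "literal", "value": "Letter to John Aubrey"},
        "place": {"type": "uri", "value": "https://sws.geonames.org/2654675/"},
    }]},
})
rows = rows_from_results(SAMPLE)
print(rows)
# With Fuseki running, replace the line above with: rows = run_query()
```

An empty list from `run_query()` means the endpoint is up but the pattern matched nothing, which usually points at a prefix or property mismatch rather than a server problem.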

Key Takeaways

  • Start with one small, well-described series, not the whole catalogue.
  • Design stable, technology-neutral URIs before publishing anything.
  • Reuse Dublin Core, schema.org, SKOS, Getty, GeoNames and Wikidata; invent almost nothing.
  • Generate RDF from a spreadsheet with OpenRefine or Tarql instead of hand-writing triples.
  • Defer CIDOC CRM until your basic model is solid.
  • Prove the pilot with a running SPARQL endpoint and dereferenceable URIs before scaling.

Frequently Asked Questions

What is linked open data in a heritage context?

It is collection data expressed as RDF statements where every record, place, person and concept gets a resolvable HTTP URI, so machines can follow links between your catalogue and external sources like Wikidata or the Getty vocabularies.

Do I need to convert my whole catalogue at once?

No. Start with one well-described series of a few hundred records, model that thoroughly, publish it, and iterate. A small clean dataset linked to authorities is far more useful than a large unmodelled dump.

What vocabularies should a beginner pick?

Begin with Dublin Core Terms and schema.org for description, SKOS for your thesauri, and reuse Getty AAT, Wikidata and GeoNames URIs for subjects and places before considering CIDOC CRM.

How much RDF do I need to learn first?

Enough to read a Turtle file and recognise subject-predicate-object structure. You can produce valid RDF from a spreadsheet without hand-writing triples, so deep ontology theory can wait.

What is the single most common beginner mistake?

Minting non-persistent URIs tied to a software product (for example, a CMS internal ID in the path). Design stable, technology-neutral URIs before you publish anything.

Is LOD worth it for a small archive?

Yes, if you have authority-controlled names, places or subjects worth linking. The payoff is discoverability and reuse; the cost is a few weeks of modelling, not a new platform.