Best Practices to Resolve toponyms in text with NLP

Resolving toponyms in text with NLP means doing two distinct jobs well: first finding which spans are place names, then linking each to one specific gazetteer record with coordinates and an era. The best practice is to treat these as separate, auditable stages — recognition then resolution — because most of the error and almost all of the ambiguity live in the second stage. Below is a working checklist that keeps results consistent, documented and defensible across a whole corpus rather than one lucky document.

Why is toponym resolution harder than named-entity recognition?

NER will happily mark "Tripoli" as a place. Resolution has to decide which Tripoli — Libya, Lebanon, or one of several in Greece and the United States — and pin it to a record with a stable identifier. That decision depends on context the tagger never sees: the document's period, the other places mentioned nearby, and the gazetteer you trust for that era. Keep the stages separate so you can swap the gazetteer or the disambiguation logic without retraining the recogniser.

How do I choose the right gazetteer for the period?

Match the gazetteer to the material, not the other way round. A classical itinerary resolved against modern GeoNames will silently anchor ancient settlements to unrelated modern towns.

Source period	Primary gazetteer	Notes
Ancient / classical	Pleiades	Stable URIs, period-aware
Pre-modern, cross-cultural	World Historical Gazetteer	Temporal attestations, variant names
Early-modern to modern	GeoNames + domain set	Huge coverage, weak on history
National / parish	Local authority file	Best for fine-grained admin units

Record the gazetteer name and release date in your output. A resolution that cannot name its gazetteer version is not reproducible.

A practical resolution pipeline

A dependable pipeline normalises text, recognises spans, generates candidates, scores them with context, and emits every candidate — not just the winner.

python

import spacy

nlp = spacy.load("en_core_web_trf")

def candidates(name, gazetteer):
    # returns list of records with id, lat, lon, population/prominence prior
    return gazetteer.lookup(name)

def resolve(doc_text, gazetteer):
    doc = nlp(doc_text)
    places = [e.text for e in doc.ents if e.label_ == "GPE" or e.label_ == "LOC"]
    resolved = []
    for name in places:
        cands = candidates(name, gazetteer)
        for c in cands:
            c["score"] = 0.6 * c["prior"] + 0.4 * spatial_support(c, resolved)
        best = max(cands, key=lambda c: c["score"]) if cands else None
        resolved.append({"name": name, "best": best, "all": cands})
    return resolved

The key line is keeping all candidates with their scores. That single habit makes every later audit and threshold tweak possible without re-running the tagger.

How do I disambiguate same-name places reliably?

Use two priors and one tie-breaker. A prominence prior (population, attestation count, or PageRank in a place network) handles the common case. A spatial coherence prior — does this candidate sit near other toponyms already resolved in the document? — handles the rest. When the document is geographically focused, spatial support outweighs raw population. Reserve the era-appropriate gazetteer as the final tie-breaker so a 19th-century parish never resolves to a same-named modern suburb.

Does OCR quality change the approach?

Substantially. Historical resolution usually runs on OCR or HTR output, where "Newcastle" may read as "Newcaftle". Apply targeted normalisation before recognition: collapse long-s, repair end-of-line hyphenation, and fix a short list of confusion pairs you actually observe in your corpus. This is far cheaper than re-running the pipeline and recovers toponyms the recogniser would otherwise drop entirely.

How do I measure and document quality?

Build a gold sample of 150 to 300 resolved mentions drawn from your own corpus, not a benchmark from another century. Report accuracy@161km (the standard 100-mile tolerance) and mean error distance, and break results down by gazetteer and by document quality. Re-run the same metric on every pipeline change so improvements are provable rather than felt.

Key Takeaways

Separate recognition from resolution; the ambiguity and most errors live in resolution.
Match the gazetteer to the source period and always record its version.
Keep every candidate with its score — it is the foundation of every later audit.
Disambiguate with prominence plus spatial coherence; use the era gazetteer to break ties.
Normalise OCR before tagging; it is the cheapest accuracy gain available.
Evaluate on your own gold sample with accuracy@161km, not a foreign benchmark.

Frequently Asked Questions

What is toponym resolution and how is it different from NER?

Named-entity recognition (NER) only finds the span of text that is a place; toponym resolution goes further and links that span to a specific gazetteer record with coordinates. Resolution is the harder, ambiguity-laden step.

Which gazetteer should I resolve historical toponyms against?

For pre-modern and classical material use Pleiades and the World Historical Gazetteer; for early-modern to modern English-language sources GeoNames plus a domain gazetteer works well. Always record which gazetteer and which version you used.

How do I disambiguate two places with the same name?

Combine population or prominence priors with spatial context from nearby resolved toponyms in the same document, then break ties with the era-appropriate gazetteer. Log the score for every candidate so a reviewer can audit the choice.

What accuracy can I expect from automated toponym resolution?

On clean modern text, accuracy@161km figures of 0.7 to 0.9 are common; on noisy OCR of historical sources expect a substantial drop, often to 0.4 to 0.6 before manual review. Always evaluate on a sample from your own corpus.

Should I correct OCR before resolving toponyms?

Yes where feasible. Even light normalisation of long-s, hyphenation and common OCR confusions recovers many toponyms the tagger would otherwise miss, and it is cheaper than re-running the whole pipeline later.

How do I make toponym resolution reproducible?

Pin tool and model versions, snapshot the gazetteer release, store every candidate with its score, and keep a held-out gold sample with a stable metric so reruns are comparable.

Why is toponym resolution harder than named-entity recognition? ​

How do I choose the right gazetteer for the period? ​

A practical resolution pipeline ​

How do I disambiguate same-name places reliably? ​

Does OCR quality change the approach? ​

How do I measure and document quality? ​

Key Takeaways ​

Frequently Asked Questions ​

What is toponym resolution and how is it different from NER? ​

Which gazetteer should I resolve historical toponyms against? ​

How do I disambiguate two places with the same name? ​

What accuracy can I expect from automated toponym resolution? ​

Should I correct OCR before resolving toponyms? ​

How do I make toponym resolution reproducible? ​

Related reading ​