How to use gazetteers to boost place NER

To use gazetteers to boost place NER, run a curated list of known place names as a dictionary matcher alongside your statistical model, then merge the two outputs so the gazetteer catches historical toponyms the model misses while the model catches novel names the list lacks. The gazetteer's biggest contribution is recall on period-specific names and, as a bonus, instant resolution to coordinates or identifiers. Expect a recall jump on historical place names that general models systematically overlook.

A modern NER model has probably seen few or no examples of "Eboracum", "Salopia", or a parish hamlet that vanished in 1700. A period-appropriate gazetteer can list all of them. Pairing the two is one of the highest-leverage moves for place extraction in historical text.

How does a gazetteer actually boost recall?

The statistical model recognises patterns from training data; a gazetteer recognises membership in a list. They fail largely independently. The gazetteer tags every "Eboracum" on its list even if the model's training data rarely contained the name, while the model tags an unexpected place the list never anticipated. Union the two and recall is at least as high as either alone, and usually higher, with the gazetteer also handing you a coordinate or identifier for free.
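The effect of the union on recall can be checked with a toy calculation. This is a minimal sketch with invented span offsets; `gold`, `model_spans`, `gazetteer_spans`, and `recall` are illustrative names, not part of any library.

```python
# Toy illustration: spans are (start, end) token offsets; all values are invented.
gold = {(0, 1), (5, 6), (9, 11), (14, 15)}   # places a human annotator marked
model_spans = {(0, 1), (9, 11)}              # statistical model output
gazetteer_spans = {(0, 1), (5, 6)}           # dictionary matcher output


def recall(predicted, gold):
    """Fraction of gold spans that were found."""
    return len(predicted & gold) / len(gold)


merged = model_spans | gazetteer_spans

print(recall(model_spans, gold))      # 0.5
print(recall(gazetteer_spans, gold))  # 0.5
print(recall(merged, gold))           # 0.75
```

Each source misses a different span, so the union recovers three of the four gold places. The fourth is missed by both, which is why the union improves recall without guaranteeing full coverage.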

Which gazetteers should I use?

Match the gazetteer to your period and region:

Gazetteer	Best for	Note
GeoNames	broad modern coverage	huge, but modern names
Pleiades	ancient Mediterranean world	scholarly, well-identified
World Historical Gazetteer	period-correct historical names	links to other resources
Project gazetteer	your specific corpus	highest precision, you build it

The strongest setup combines a broad source (GeoNames) with a curated local list of attested historical forms for your sources.
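One way to combine the two sources is a single case-insensitive lookup in which the project list overrides the broad source on collisions. The sketch below assumes each gazetteer has already been loaded into a name-to-identifier dictionary; the identifiers are placeholders, not real GeoNames or project IDs, and `combine` is a hypothetical helper.

```python
# Placeholder data: identifiers are invented for illustration only.
broad = {
    "York": "broad:001",
    "Chester": "broad:002",
}
project = {
    "Eboracum": "project:001",
    "Cestria": "project:002",
    "York": "project:003",   # the project's own, period-specific record
}


def combine(broad, project):
    """Merge two name -> identifier maps, keyed by lowercased name.

    Project entries are applied last, so they win on collision.
    """
    merged = {name.lower(): ident for name, ident in broad.items()}
    merged.update({name.lower(): ident for name, ident in project.items()})
    return merged


lookup = combine(broad, project)
print(lookup["eboracum"])  # project:001
print(lookup["york"])      # project:003
print(lookup["chester"])   # broad:002
```

Giving the curated list priority reflects its higher precision for your sources; the broad source fills in everything the project list has not yet covered.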

How do I wire a gazetteer into spaCy?

Use a PhraseMatcher for the dictionary layer and run it next to the model:

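```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_trf")
# attr="LOWER" compares lowercased token text, so matching ignores case
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

places = ["Eboracum", "Salopia", "Cestria", "Lindum"]   # from your gazetteer
# make_doc only tokenizes, which is all the matcher needs for its patterns
matcher.add("PLACE", [nlp.make_doc(p) for p in places])

text = "The legion marched from Lindum to Eboracum."   # example input
doc = nlp(text)
# (start, end) token offsets of place entities found by the model
model_places = {(e.start, e.end) for e in doc.ents if e.label_ in {"GPE", "LOC"}}
# (start, end) token offsets of gazetteer hits
gaz_places   = {(s, e) for _, s, e in matcher(doc)}

all_spans = model_places | gaz_places   # union for higher recall
```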

Setting attr="LOWER" makes matching case-insensitive, which matters for inconsistently capitalised historical text.

How do I resolve overlaps and conflicts?

When the model and gazetteer both fire on overlapping text, apply two rules:

Prefer the more specific span. If the gazetteer matches "New Sarum" and the model only "Sarum", keep the longer, identified match.
Let context break type ties. When the gazetteer says PLACE but the model says PERSON for "Lincoln", weigh the surrounding words — a preceding "of" or "at" favours place; a title favours person.

A small ranked rule set handles the recurring cases; send the genuinely ambiguous ones to review.
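The first rule, preferring the more specific span, can be sketched in plain Python. This is one possible implementation under stated assumptions: spans are `(start, end, source)` tuples of token offsets, and `resolve_overlaps` is a hypothetical helper. spaCy's `spacy.util.filter_spans` applies a similar longest-first rule to `Span` objects.

```python
def resolve_overlaps(spans):
    """Keep the longest span in each overlapping group.

    On equal length, a gazetteer span (which carries an identifier)
    is preferred over a model span.
    """
    def priority(span):
        start, end, source = span
        # Longer first, then gazetteer before model, then leftmost.
        return (-(end - start), source != "gazetteer", start)

    kept = []
    for span in sorted(spans, key=priority):
        start, end, _ = span
        # Accept the span only if it overlaps nothing already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


spans = [
    (3, 5, "gazetteer"),  # "New Sarum"
    (4, 5, "model"),      # "Sarum"
    (9, 10, "model"),     # a place only the model found
]
print(resolve_overlaps(spans))
# [(3, 5, 'gazetteer'), (9, 10, 'model')]
```

The shorter model span is dropped because it sits inside the gazetteer match, while the non-overlapping model span survives untouched.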

What about the place that is also a person's name?

Toponym-as-surname is the classic false positive: "Lincoln", "Washington", "York". Do not let the gazetteer blindly override the model here. Keep these high-ambiguity names in a watch list and require contextual support — a locative preposition, a "shire"/"-ton" pattern, or co-occurrence with other places — before accepting the place reading.
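A watch list with a context check might look like the following. It is a deliberately small sketch: the word lists are illustrative rather than exhaustive, only the immediately preceding token is inspected, and `accept_place` is a hypothetical helper name.

```python
WATCH_LIST = {"lincoln", "washington", "york"}
LOCATIVE_PREPOSITIONS = {"at", "in", "of", "near", "from", "to"}
TITLES = {"mr", "mrs", "sir", "lord", "lady", "dr", "president"}


def accept_place(tokens, index):
    """Decide whether the gazetteer hit at tokens[index] is accepted as a place."""
    word = tokens[index].lower()
    if word not in WATCH_LIST:
        return True  # names off the watch list pass straight through
    previous = tokens[index - 1].lower().rstrip(".") if index > 0 else ""
    if previous in TITLES:
        return False  # a title points to the person reading
    # Watch-list names need positive evidence: a locative preposition.
    return previous in LOCATIVE_PREPOSITIONS


print(accept_place("He preached at Lincoln in June".split(), 3))  # True
print(accept_place("Then President Lincoln spoke".split(), 2))    # False
print(accept_place("Lincoln replied at once".split(), 0))         # False
print(accept_place("They rode to Eboracum".split(), 3))           # True
```

The default for a watch-list name is rejection, so a bare "Lincoln" with no supporting context is left to the model or to review rather than tagged as a place.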

How do I cover historical spellings the gazetteer lacks?

Three escalating options:

Add variants — the cheapest and most reliable: extend the gazetteer with attested forms ("Glowcester", "Glocester").
Normalise first — run spelling normalisation so variants collapse to a canonical form before matching.
Fuzzy match — allow a small edit distance, but keep the threshold tight (one or two edits) or you flood the output with false hits.
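The variant table and the tight fuzzy threshold can be combined in one lookup, as in this sketch. The names `VARIANTS`, `CANONICAL`, `lookup`, and `max_edits` are assumptions for illustration, and "Glouceter" is an invented misspelling used only to exercise the fuzzy path, not an attested form.

```python
def edit_distance(a, b):
    """Levenshtein distance using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


VARIANTS = {"glowcester": "Gloucester", "glocester": "Gloucester"}
CANONICAL = ["Gloucester", "Chester", "Lincoln"]


def lookup(token, max_edits=1):
    """Try the variant table first, then a tight fuzzy match."""
    key = token.lower()
    if key in VARIANTS:
        return VARIANTS[key]
    matches = [n for n in CANONICAL if edit_distance(key, n.lower()) <= max_edits]
    # Accept a fuzzy hit only when it is unambiguous.
    return matches[0] if len(matches) == 1 else None


print(lookup("Glocester"))  # Gloucester (from the variant table)
print(lookup("Glouceter"))  # Gloucester (one edit away)
print(lookup("Leicester"))  # None (too far from every entry)
```

Checking attested variants before falling back to edit distance keeps the reliable path first, and returning `None` on ambiguous or distant matches avoids flooding the output with false hits.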

Can I grow a gazetteer from my own corpus?

Yes, and you should — this bootstrapping loop is how project gazetteers become powerful. Extract candidate place spans, have a person confirm each and attach coordinates or an identifier, then fold the confirmed list back in. Each pass lifts recall on your specific sources, and within a few iterations the project gazetteer outperforms any generic list for your material.
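One pass of that loop can be sketched as a function that takes the current gazetteer, a list of candidate names, and a confirmation step. Everything here is hypothetical: `bootstrap` and `confirm` are illustrative names, the reviewer is simulated by a dictionary, and the identifiers are placeholders.

```python
def bootstrap(gazetteer, candidates, confirm):
    """Return a new gazetteer extended with the candidates a reviewer confirmed.

    confirm(name) returns an identifier for an accepted place, or None to reject.
    """
    updated = dict(gazetteer)
    for name in candidates:
        key = name.lower()
        if key in updated:
            continue  # already known, nothing to review
        identifier = confirm(name)
        if identifier is not None:
            updated[key] = identifier
    return updated


# Stand-in for the human review step: placeholder decisions and identifiers.
reviewed = {"Salopia": "project:010", "Lindum": "project:011"}

gazetteer = {"eboracum": "project:001"}
candidates = ["Eboracum", "Salopia", "Lindum", "Michaelmas"]

gazetteer = bootstrap(gazetteer, candidates, reviewed.get)
print(sorted(gazetteer))
# ['eboracum', 'lindum', 'salopia']
```

In practice `confirm` would be a review interface rather than a dictionary lookup; the point of the sketch is that only confirmed, identified names are folded back in.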

Key Takeaways

A gazetteer boosts place NER mainly by recovering recall on historical toponyms.
Run the gazetteer matcher and statistical model together, then union the spans.
Combine a broad source (GeoNames) with a curated project list.
Resolve overlaps by preferring the more specific, identified span.
Guard against place-as-surname false positives with context rules.
Add attested spelling variants rather than loosening fuzzy thresholds.
Bootstrap a project gazetteer from confirmed corpus extractions.

Frequently Asked Questions

How does a gazetteer improve place NER?

A gazetteer supplies a curated list of known place names that is matched alongside the model, so the pipeline can catch historical toponyms that a general model misses and resolve them to coordinates or identifiers. It mainly boosts recall on names the model has rarely or never seen.

Which gazetteers are best for historical places?

GeoNames for modern coverage, Pleiades for the ancient world, and the World Historical Gazetteer or a project-specific list for period-correct names. Combine a broad source with a curated local one for best results.

Should the gazetteer run before or after the model?

Run both and merge. The gazetteer matcher catches known names with high precision; the statistical model catches novel ones. Union the spans and resolve overlaps in favour of the more specific match.

How do I handle a place name that is also a surname?

Use context. "Lincoln" can be a person or a city. Keep a small disambiguation rule set or let the statistical model's label win when the gazetteer and model disagree on type.

What about historical spellings not in the gazetteer?

Add a variant list and use fuzzy matching with a tight threshold, or normalise spelling first. Expanding the gazetteer with attested historical forms is the most reliable fix.

Can I build a gazetteer from my own corpus?

Yes. Extract candidate place spans, have a human confirm and geolocate them, and feed the confirmed list back as a project gazetteer. This bootstrapping loop steadily improves recall on your sources.
