Appearance
To use gazetteers to boost place NER, run a curated list of known place names as a dictionary matcher alongside your statistical model, then merge the two outputs so the gazetteer catches historical toponyms the model misses while the model catches novel names the list lacks. The gazetteer's biggest contribution is recall on period-specific names and, as a bonus, instant resolution to coordinates or identifiers. Expect a recall jump on historical place names that general models systematically overlook.
A modern NER model has never seen "Eboracum", "Salopia", or a parish hamlet that vanished in 1700. A gazetteer has. Pairing them is the single highest-leverage move for place extraction in historical text.
How does a gazetteer actually boost recall?
The statistical model recognises patterns from training data; a gazetteer recognises membership in a list. They fail independently. The gazetteer reliably tags every "Eboracum" even though no modern corpus contains it, while the model tags an unexpected place the list never anticipated. Union the two and your recall is higher than either alone, with the gazetteer also handing you a coordinate or identifier for free.
Which gazetteers should I use?
Match the gazetteer to your period and region:
| Gazetteer | Best for | Note |
|---|---|---|
| GeoNames | broad modern coverage | huge, but modern names |
| Pleiades | ancient Mediterranean world | scholarly, well-identified |
| World Historical Gazetteer | period-correct historical names | links to other resources |
| Project gazetteer | your specific corpus | highest precision, you build it |
The strongest setup combines a broad source (GeoNames) with a curated local list of attested historical forms for your sources.
How do I wire a gazetteer into spaCy?
Use a PhraseMatcher for the dictionary layer and run it next to the model:
python
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_trf")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
places = ["Eboracum", "Salopia", "Cestria", "Lindum"] # from your gazetteer
matcher.add("PLACE", [nlp.make_doc(p) for p in places])
doc = nlp(text)
model_places = {(e.start, e.end) for e in doc.ents if e.label_ in {"GPE", "LOC"}}
gaz_places = {(s, e) for _, s, e in matcher(doc)}
all_spans = model_places | gaz_places # union for higher recallSetting attr="LOWER" makes matching case-insensitive, which matters for inconsistently capitalised historical text.
How do I resolve overlaps and conflicts?
When the model and gazetteer both fire on overlapping text, apply two rules:
- Prefer the more specific span. If the gazetteer matches "New Sarum" and the model only "Sarum", keep the longer, identified match.
- Let context break type ties. When the gazetteer says PLACE but the model says PERSON for "Lincoln", weigh the surrounding words — a preceding "of" or "at" favours place; a title favours person.
A small ranked rule set handles the recurring cases; send the genuinely ambiguous ones to review.
What about the place that is also a person's name?
Toponym-as-surname is the classic false positive: "Lincoln", "Washington", "York". Do not let the gazetteer blindly override the model here. Keep these high-ambiguity names in a watch list and require contextual support — a locative preposition, a "shire"/"-ton" pattern, or co-occurrence with other places — before accepting the place reading.
How do I cover historical spellings the gazetteer lacks?
Three escalating options:
- Add variants — the cheapest and most reliable: extend the gazetteer with attested forms ("Glowcester", "Glocester").
- Normalise first — run spelling normalisation so variants collapse to a canonical form before matching.
- Fuzzy match — allow a small edit distance, but keep the threshold tight (one or two edits) or you flood the output with false hits.
Can I grow a gazetteer from my own corpus?
Yes, and you should — this bootstrapping loop is how project gazetteers become powerful. Extract candidate place spans, have a person confirm each and attach coordinates or an identifier, then fold the confirmed list back in. Each pass lifts recall on your specific sources, and within a few iterations the project gazetteer outperforms any generic list for your material.
Key Takeaways
- A gazetteer boosts place NER mainly by recovering recall on historical toponyms.
- Run the gazetteer matcher and statistical model together, then union the spans.
- Combine a broad source (GeoNames) with a curated project list.
- Resolve overlaps by preferring the more specific, identified span.
- Guard against place-as-surname false positives with context rules.
- Add attested spelling variants rather than loosening fuzzy thresholds.
- Bootstrap a project gazetteer from confirmed corpus extractions.
Frequently Asked Questions
How does a gazetteer improve place NER?
A gazetteer gives the model a curated list of known place names, so it can catch historical toponyms that a general model misses and resolve them to coordinates or identifiers. It mainly boosts recall on names the model has never seen.
Which gazetteers are best for historical places?
GeoNames for modern coverage, Pleiades for the ancient world, and the World Historical Gazetteer or a project-specific list for period-correct names. Combine a broad source with a curated local one for best results.
Should the gazetteer run before or after the model?
Run both and merge. The gazetteer matcher catches known names with high precision; the statistical model catches novel ones. Union the spans and resolve overlaps in favour of the more specific match.
How do I handle a place name that is also a surname?
Use context. "Lincoln" can be a person or a city. Keep a small disambiguation rule set or let the statistical model's label win when the gazetteer and model disagree on type.
What about historical spellings not in the gazetteer?
Add a variant list and use fuzzy matching with a tight threshold, or normalise spelling first. Expanding the gazetteer with attested historical forms is the most reliable fix.
Can I build a gazetteer from my own corpus?
Yes. Extract candidate place spans, have a human confirm and geolocate them, and feed the confirmed list back as a project gazetteer. This bootstrapping loop steadily improves recall on your sources.