Merging multiple gazetteers means building a crosswalk between equivalent place records while keeping every source intact — never overwriting one gazetteer with another. The reliable workflow is: normalise each source separately, block to reduce comparisons, score candidate matches, review the uncertain band by hand, and emit a merged view as a regenerable last step. Done this way, provenance survives, mistakes are reversible, and the result is defensible. This guide runs that workflow end to end with reusable examples.
Step 1 — Normalise each gazetteer on its own
Bring every source to a common shape before comparing anything: a source-prefixed id, a normalised name, coordinates in one CRS (WGS84), and a feature type. Crucially, keep every original field too.
```python
import pandas as pd

def normalise(df, source):
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df["uid"] = source + ":" + df["id"].astype(str)  # source-prefixed id
    df["name_norm"] = df["name"].str.lower().str.strip()
    df["ftype"] = df["feature_type"].fillna("unknown")
    return df  # original columns are all still present

pleiades = normalise(pd.read_csv("pleiades.csv"), "pleiades")
local = normalise(pd.read_csv("local_gaz.csv"), "local")
```

The source prefix guarantees that pleiades:579885 and local:42 never collide, which is the foundation of preserving provenance.
Step 2 — Why does blocking matter before matching?
Comparing every record against every other is O(n²): two 50,000-row gazetteers produce 2.5 billion candidate pairs, far too many to score. Blocking groups records by a cheap key so you only compare plausible pairs.
```python
pleiades["block"] = pleiades["name_norm"].str[0] + "_" + pleiades["ftype"]
local["block"] = local["name_norm"].str[0] + "_" + local["ftype"]
pairs = pleiades.merge(local, on="block", suffixes=("_p", "_l"))
```

Block on something stable but discriminating: first letter plus feature type, or a coarse geographic grid. Too loose and you save nothing; too tight and you miss real matches across spelling variants.
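The coarse geographic grid mentioned above could be sketched roughly as follows. This is a minimal illustration using only the standard library: the names `grid_block` and `neighbour_blocks` and the `cell_deg` parameter are hypothetical, and the 0.5 degree cell size is an arbitrary choice, not a recommendation.

```python
import math

def grid_block(lat, lon, cell_deg=0.5):
    """Return a grid-cell key for a WGS84 point, e.g. '103_-6'."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return f"{row}_{col}"

def neighbour_blocks(lat, lon, cell_deg=0.5):
    """Return the keys of the point's own cell plus its 8 surrounding cells."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return {
        f"{row + dr}_{col + dc}"
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    }
```

A grid key alone would separate two records that sit just either side of a cell edge, which is one of the "too tight" failures described above. Comparing against the neighbouring cells as well is one possible way to soften that, at the cost of more candidate pairs.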
Step 3 — How do I decide two records are the same place?
Score, do not guess. Combine three signals and set a threshold with a review band around it.
| Signal | Cheap measure | Weight (typical) |
|---|---|---|
| Name similarity | Jaro–Winkler or token ratio | 0.45 |
| Distance | Metres between coordinates | 0.40 |
| Feature type agreement | Exact / compatible / no | 0.15 |
```python
from rapidfuzz.distance import JaroWinkler

def score(p, l):
    # haversine_m is assumed to be a helper that returns the distance in
    # metres between the two records' coordinates; it is not defined here.
    name = JaroWinkler.normalized_similarity(p["name_norm"], l["name_norm"])
    dist = max(0, 1 - haversine_m(p, l) / 5000)  # 0 beyond 5 km
    ftype = 1.0 if p["ftype"] == l["ftype"] else 0.3  # exact match vs. anything else
    return 0.45*name + 0.40*dist + 0.15*ftype
```

Auto-accept above ~0.9, auto-reject below ~0.5, and send the middle band to a human. Never auto-merge on name alone: that is how "Newport, Wales" fuses with "Newport, Rhode Island".
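The scoring function relies on a `haversine_m` helper and on a three-way decision rule, neither of which is spelled out. One possible sketch of both is shown here. It assumes each record exposes WGS84 decimal degrees under the hypothetical keys `lat` and `lon`, treats the Earth as a sphere, and uses a hypothetical `triage` function whose defaults are the approximate thresholds quoted above.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation

def haversine_m(p, l):
    """Great-circle distance in metres between two records.

    Assumes hypothetical "lat" / "lon" keys holding WGS84 decimal degrees.
    """
    lat1, lon1 = math.radians(p["lat"]), math.radians(p["lon"])
    lat2, lon2 = math.radians(l["lat"]), math.radians(l["lon"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def triage(match_score, accept=0.9, reject=0.5):
    """Sort a score into 'accept', 'reject', or the human 'review' band."""
    if match_score >= accept:
        return "accept"
    if match_score < reject:
        return "reject"
    return "review"
```

With these weights, a perfect name and feature-type match more than 5 km apart scores 0.45 + 0.15 = 0.60, which lands in the review band rather than being accepted automatically.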
Step 4 — Crosswalk, not destructive merge
Store matches in a separate table, not by overwriting records. This single decision is what keeps the merge reversible and auditable.
```text
pleiades_uid,local_uid,score,decision,reviewer
pleiades:579885,local:42,0.94,accept,ER
pleiades:579885,local:88,0.61,reject,ER
```

From this crosswalk you regenerate a merged view on demand. If you later find a wrong match, you edit one row instead of trying to un-bake a destroyed record.
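Regenerating a merged view from the crosswalk might look something like the sketch below. It uses only the standard library. The `merged_view` function, the two record dictionaries, and the placeholder names inside them are all hypothetical; in practice the records would come from the normalised sources.

```python
import csv
import io

CROSSWALK_CSV = """pleiades_uid,local_uid,score,decision,reviewer
pleiades:579885,local:42,0.94,accept,ER
pleiades:579885,local:88,0.61,reject,ER
"""

# Placeholder source records keyed by uid (made-up values for illustration).
pleiades_records = {"pleiades:579885": {"name": "example name (pleiades)"}}
local_records = {
    "local:42": {"name": "example name (local 42)"},
    "local:88": {"name": "example name (local 88)"},
}

def merged_view(crosswalk_csv, left, right):
    """Build merged rows from accepted crosswalk entries only.

    Source records are referenced, never modified, so the view can be
    thrown away and rebuilt whenever the crosswalk changes.
    """
    view = []
    for row in csv.DictReader(io.StringIO(crosswalk_csv)):
        if row["decision"] != "accept":
            continue
        view.append({
            "pleiades_uid": row["pleiades_uid"],
            "local_uid": row["local_uid"],
            "score": float(row["score"]),
            "reviewer": row["reviewer"],
            "pleiades": left[row["pleiades_uid"]],
            "local": right[row["local_uid"]],
        })
    return view

view = merged_view(CROSSWALK_CSV, pleiades_records, local_records)
```

Changing the first row's decision to `reject` and calling `merged_view` again would drop that pairing from the view while leaving both source records exactly as they were.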
How do I handle conflicting coordinates?
Do not average two coordinates blindly — averaging a precise survey point with a vague centroid produces a location that is true to neither. Prefer the more authoritative or more precise source, keep both originals, and if you must choose one, set an uncertainty radius large enough to cover the disagreement. Record which source won and why.
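As a rough sketch of that rule, the function below picks one coordinate and widens its uncertainty to cover the disagreement. It assumes each record carries a hypothetical `precision_m` field (its stated positional uncertainty in metres) and uses "smaller `precision_m` wins" as a stand-in for "more authoritative or more precise". Real projects may well rank sources differently. The coordinates in the usage lines are made up.

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres on a spherical Earth (approximation)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6_371_000 * math.asin(math.sqrt(a))

def resolve_coordinates(a, b):
    """Choose one coordinate, keep the other, and record why."""
    winner, loser = (a, b) if a["precision_m"] <= b["precision_m"] else (b, a)
    gap = distance_m(winner["lat"], winner["lon"], loser["lat"], loser["lon"])
    return {
        "lat": winner["lat"],
        "lon": winner["lon"],
        # Radius covers both the winner's own precision and the disagreement.
        "uncertainty_m": max(winner["precision_m"], gap),
        "coord_source": winner["uid"],
        "coord_reason": "smaller stated precision_m",
        # The losing coordinate is kept, not discarded.
        "alternatives": [{"uid": loser["uid"], "lat": loser["lat"], "lon": loser["lon"]}],
    }

# Made-up records: a precise survey point and a vague centroid.
survey = {"uid": "local:42", "lat": 10.00, "lon": 20.00, "precision_m": 10}
centroid = {"uid": "pleiades:579885", "lat": 10.01, "lon": 20.00, "precision_m": 2000}
resolved = resolve_coordinates(survey, centroid)
```

Here the survey point wins, and its uncertainty radius grows from 10 m to roughly 1.1 km so that the centroid's position still falls inside it.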
How do I keep provenance through the whole process?
Provenance is preserved structurally, not by good intentions: source-prefixed ids mean every value traces home, kept original fields mean nothing is lost, and the crosswalk records who decided what. A merged record should always be able to answer "which gazetteer said this, and who agreed?"
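One way to make that question answerable in code is to store each merged value alongside the uid it came from. The sketch below is only an illustration: the `fields` and `match` layout, the `trace` and `gazetteer_of` helpers, and the sample values are all hypothetical.

```python
def gazetteer_of(uid):
    """'pleiades:579885' -> 'pleiades' (the part before the first colon)."""
    return uid.split(":", 1)[0]

def trace(merged, field):
    """Answer 'which gazetteer said this, and who agreed?' for one field."""
    entry = merged["fields"][field]
    return {
        "value": entry["value"],
        "source_uid": entry["source_uid"],
        "gazetteer": gazetteer_of(entry["source_uid"]),
        "accepted_by": merged["match"]["reviewer"],
    }

# A made-up merged record: every value carries the uid it was taken from,
# and the crosswalk row that justified the merge travels with it.
merged = {
    "match": {
        "pleiades_uid": "pleiades:579885",
        "local_uid": "local:42",
        "decision": "accept",
        "reviewer": "ER",
    },
    "fields": {
        "name": {"value": "example name", "source_uid": "pleiades:579885"},
        "ftype": {"value": "settlement", "source_uid": "local:42"},
    },
}

answer = trace(merged, "ftype")
```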
Key Takeaways
- Normalise each gazetteer separately to a common shape, but keep every original field.
- Use source-prefixed identifiers so records never collide and provenance always traces home.
- Block before matching; all-pairs comparison is intractable on real gazetteers.
- Score matches on name, distance and feature type, with a human-reviewed middle band.
- Keep a crosswalk table rather than overwriting — it makes the merge reversible and auditable.
- Never average conflicting coordinates blindly; prefer the better source and widen uncertainty.
Frequently Asked Questions
How do I merge two gazetteers without losing provenance?
Give every incoming record a source-prefixed identifier, keep all original fields, and create a separate match table linking equivalent records rather than overwriting one with the other. Provenance survives because no source record is ever destroyed.
What is the safest order of operations for merging?
Normalise each gazetteer separately, block candidate matches to cut comparisons, score and review matches, then build a crosswalk table. Only generate a merged view as the final, regenerable step.
How do I decide when two records are the same place?
Combine name similarity, distance between coordinates, and feature type agreement into a score, then set a threshold with a manual review band around it. Never auto-merge on name alone.
Should I physically merge records or keep a crosswalk?
Keep a crosswalk wherever possible. A crosswalk lets you regenerate a merged view, correct mistakes, and trace every value back to its source, which a destructive merge makes impossible.
How do I handle conflicting coordinates between gazetteers?
Do not average them blindly. Prefer the more authoritative or more precise source, record both, and set an uncertainty radius that covers the disagreement when you must pick one.
What is blocking and why does it matter when merging?
Blocking groups records by a cheap key — first letter, region, feature type — so you only compare records that could plausibly match. It turns an impossible all-pairs comparison into a tractable one on large gazetteers.