Skip to content
Python for Historians

To geocode places in Python, the practical workflow is: deduplicate your place names, look up each unique name once through geopy against an appropriate service, cache every result locally, and join the coordinates back to your records. For historical material, add a gazetteer step for places that have moved, merged, or disappeared — modern geocoders only know today's map.

This guide walks the full pipeline an archivist can reuse, from a column of messy place strings to a clean table of coordinates.

Step 1: Deduplicate before you geocode

Historical sources repeat the same places endlessly — "Manchester" might appear 4,000 times. Geocoding each row wastes thousands of API calls. Reduce to unique names first:

python
import pandas as pd

df = pd.read_csv("records.csv")
unique_places = df["place"].dropna().str.strip().unique()
print(f"{len(df)} rows -> {len(unique_places)} unique places to geocode")

You geocode the small unique set, then merge results back to the full table.

Step 2: Geocode with geopy and Nominatim

geopy gives one interface over many services. Nominatim (OpenStreetMap) is free and good for a first pass, but its usage policy is strict:

python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="parish-history-project ([email protected])")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # 1 req/sec

result = geocode("Whitby, North Yorkshire, England")
if result:
    print(result.latitude, result.longitude, result.address)

The RateLimiter enforces the one-request-per-second rule; a descriptive user_agent is mandatory. Ignore either and the public endpoint will block you.

Why must I cache geocoding results?

Geocoding is slow, rate-limited, and prone to outages. Caching turns repeated runs from hours into seconds and makes your pipeline reproducible. A tiny SQLite or JSON cache is enough:

python
import json, os

CACHE = "geocache.json"
cache = json.load(open(CACHE)) if os.path.exists(CACHE) else {}

def lookup(name):
    if name not in cache:
        r = geocode(name)
        cache[name] = [r.latitude, r.longitude] if r else None
        json.dump(cache, open(CACHE, "w"))
    return cache[name]

How do I geocode places that no longer exist?

This is the crux of historical geocoding. A modern service cannot find a deserted medieval village or a renamed town. Route those through a historical gazetteer:

ResourceStrengthUse when
Nominatim / OSMFree, current placesModern, well-known locations
GeoNamesAlternate and former namesMultilingual, renamed places
PleiadesAncient worldClassical and late-antique sites
World Historical GazetteerDated place changesPlaces that moved or vanished

The pattern is: try the modern geocoder first, and send the failures to a gazetteer that records former names and date ranges.

How do I deal with ambiguous names?

"Boston", "Newport", and "Richmond" exist in dozens of places. Disambiguate by adding context to the query and inspecting the returned display_name:

python
result = geocode("Boston, Lincolnshire, England")  # not Massachusetts
# keep the returned address so a reviewer can audit the choice
matched = result.address if result else "NO MATCH"

Always store the geocoder's returned name alongside the coordinates. Silent wrong matches are the most damaging error in historical mapping, and a saved display name lets you audit them.

Step 3: Join coordinates back to your records

With unique places resolved and cached, merge back:

python
coords = pd.DataFrame(
    [(p, *(lookup(p) or (None, None))) for p in unique_places],
    columns=["place", "lat", "lon"],
)
geocoded = df.merge(coords, on="place", how="left")
unresolved = geocoded[geocoded["lat"].isna()]["place"].unique()

Review unresolved by hand — these are usually the historically interesting names that deserve a gazetteer lookup.

Key Takeaways

  • Deduplicate place names before geocoding to cut API calls dramatically.
  • Use geopy with a descriptive user-agent and a one-request-per-second RateLimiter for Nominatim.
  • Cache every result to JSON or SQLite for reproducibility and resilience to outages.
  • Route vanished or renamed places through historical gazetteers like Pleiades or the World Historical Gazetteer.
  • Disambiguate names by adding county/country context and storing the returned display name for audit.
  • Geocode unique names once, then join the coordinates back to your full dataset.

Frequently Asked Questions

Which Python library should I use to geocode place names?

Use geopy, which wraps many services behind one interface. For historical work, pair it with Nominatim for free general lookups and with a historical gazetteer like the World Historical Gazetteer or Pleiades for places that have moved or vanished.

How do I respect Nominatim's usage policy?

Send a descriptive user-agent string, cap requests to one per second, and cache every result so you never re-query the same name. Bulk runs that ignore the rate limit get blocked, and the public endpoint forbids heavy automated use.

Why should I cache geocoding results?

Geocoding is slow and rate-limited, and the same place names recur constantly in historical sources. Caching to a local file or SQLite turns a multi-hour run into seconds on reruns and protects you against API outages.

How do I geocode historical place names that no longer exist?

Modern geocoders will fail on vanished or renamed places. Match those against a historical gazetteer such as Pleiades, GeoNames, or the World Historical Gazetteer, which record former names and date ranges rather than only the present landscape.

How do I handle ambiguous place names like 'Boston' or 'Newport'?

Add context — county, country, or a bounding box — to the query, and inspect the returned match before trusting it. Always keep the geocoder's confidence or returned display name so you can audit doubtful matches later.

Should I geocode every row live or batch with a cache?

Batch with a cache. Deduplicate place names first, geocode each unique name once, store the coordinates, then join them back to your full table. This minimises API calls and makes the whole process reproducible.