Troubleshooting: Use spatial joins for historical data

If a spatial join on historical data misbehaves, the cause is usually one of four things: a CRS mismatch, invalid geometry, anachronistic boundaries, or a wrong join predicate. Check them in that order. Roughly nine out of ten "no matches" cases are simply two layers sitting in different coordinate systems, so they never physically overlap. Below is a fault-tree you can run top to bottom.

Why does my spatial join return zero matches?

Start with the coordinate reference system. If your points are in EPSG:4326 and your parishes are in EPSG:27700, they occupy completely different numeric spaces and nothing overlaps. Confirm and align first:

python

import geopandas as gpd

points = gpd.read_file("burials.geojson")
parishes = gpd.read_file("parishes_1851.shp")

print(points.crs, parishes.crs)          # are they the same?
points = points.to_crs(parishes.crs)      # align to one projected CRS

joined = gpd.sjoin(points, parishes, predicate="within", how="left")
print(joined["index_right"].isna().sum(), "unmatched points")

Only after both layers share a CRS does it make sense to look further.

Why do points fall outside their polygons?

Two distinct causes, and they need different fixes:

Invalid geometry. Self-intersections and ring errors make within quietly fail. Validate before joining.
Wrong historical vintage. A 1901 address joined to 1851 parish boundaries will land "outside" wherever boundaries moved. This is a data problem, not a GIS bug.

python

from shapely.validation import make_valid
parishes["geometry"] = parishes.geometry.apply(make_valid)

Always pair a geometry fix with a date check — a clean polygon of the wrong year is still the wrong answer.

What predicate should I use?

Historical coordinates are approximate: a geocoded 18th-century address may sit 30 m off. A strict within then drops legitimate matches. Choose deliberately:

Predicate / join	When to use	Risk
`within`	Precise points, trusted boundaries	Drops near-edge points
`intersects`	Points possibly on boundaries	Can match two polygons
`sjoin_nearest`	Fuzzy historical coordinates	May match a wrong neighbour
`contains` (reverse)	Counting points per polygon	Direction matters

For uncertain positions, gpd.sjoin_nearest(..., max_distance=50) often recovers points a within join would discard.

Why did my join duplicate rows?

Duplication means a feature matched more than one polygon — a point on a shared edge, or overlapping polygons from a sloppy digitisation. Inspect, then resolve:

python

dupes = joined[joined.index.duplicated(keep=False)]
# keep the polygon of largest overlap, or simply the first match
joined = joined[~joined.index.duplicated(keep="first")]

Better still, fix the source topology so boundaries do not overlap; deduplicating after the fact hides the underlying data quality issue.

How do I make a slow join fast?

A naive join compares every feature against every other — fine for hundreds, painful for hundreds of thousands. Use a spatial index. GeoPandas' sjoin already builds an R-tree; for large recurring jobs, push the work into PostGIS:

sql

CREATE INDEX idx_parishes_geom ON parishes USING GIST (geom);
SELECT b.id, p.parish
FROM burials b
JOIN parishes p ON ST_Within(b.geom, p.geom);

The GiST index lets the planner skip non-overlapping candidates entirely.

Which tool should I reach for?

QGIS ("Join attributes by location") is best for a quick visual check. GeoPandas is best when the join belongs in a reproducible script. PostGIS wins when the data is large, must be re-run, or needs an audit trail. All three implement the same DE-9IM predicates, so results agree once CRS and geometry are clean.

Key Takeaways

Diagnose in order: CRS mismatch → invalid geometry → wrong date → wrong predicate.
Zero matches is almost always a CRS mismatch, not a real "no overlap".
Validate polygons with make_valid before joining.
Anachronistic boundaries cause "outside" points — check the historical vintage.
For fuzzy coordinates prefer sjoin_nearest with a sensible max_distance.
Duplicates signal shared edges or overlapping polygons — fix the topology.
Index geometries (R-tree / GiST) before blaming the algorithm for slowness.

Frequently Asked Questions

Why does my spatial join return zero matches?

Almost always a CRS mismatch: the two layers are in different coordinate systems, so they do not physically overlap. Reproject both to one projected CRS and re-run the join before suspecting anything else.

Why do points fall outside the polygons they belong to?

Either invalid polygon geometry (self-intersections, slivers) or boundaries that did not exist at the date of the points. Run a geometry validity check first, then confirm the polygons are the right historical vintage.

What is the difference between a point-in-polygon join and a nearest join?

Point-in-polygon (predicate 'within' or 'intersects') assigns each point to the polygon containing it. A nearest join assigns each feature to the closest one regardless of overlap, useful when historical coordinates are approximate and rarely land exactly inside.

Why did my join duplicate rows?

A point fell on a shared boundary or inside overlapping polygons, so it matched more than one. Fix the polygon topology, or use a join that keeps only the largest-overlap or first match.

How do I speed up a slow spatial join?

Build a spatial index. In GeoPandas use the default sjoin (it uses an R-tree); in PostGIS add a GiST index on the geometry columns. Indexing turns an O(n*m) scan into something usable on large historical datasets.

Should I join in QGIS, GeoPandas or PostGIS?

Use QGIS for one-off exploratory joins, GeoPandas for scripted reproducible pipelines, and PostGIS when datasets are large or the join must be repeated and audited. They share the same DE-9IM predicates underneath.

Why does my spatial join return zero matches? ​

Why do points fall outside their polygons? ​

What predicate should I use? ​

Why did my join duplicate rows? ​

How do I make a slow join fast? ​

Which tool should I reach for? ​

Key Takeaways ​

Frequently Asked Questions ​

Why does my spatial join return zero matches? ​

Why do points fall outside the polygons they belong to? ​

What is the difference between a point-in-polygon join and a nearest join? ​

Why did my join duplicate rows? ​

How do I speed up a slow spatial join? ​

Should I join in QGIS, GeoPandas or PostGIS? ​

Related reading ​

Why does my spatial join return zero matches?

Why do points fall outside their polygons?

What predicate should I use?

Why did my join duplicate rows?

How do I make a slow join fast?

Which tool should I reach for?

Key Takeaways

Frequently Asked Questions

Why does my spatial join return zero matches?

Why do points fall outside the polygons they belong to?

What is the difference between a point-in-polygon join and a nearest join?

Why did my join duplicate rows?

How do I speed up a slow spatial join?

Should I join in QGIS, GeoPandas or PostGIS?

Related reading