Appearance
If a spatial join on historical data misbehaves, the cause is usually one of four things: a CRS mismatch, invalid geometry, anachronistic boundaries, or a wrong join predicate. Check them in that order. Roughly nine out of ten "no matches" cases are simply two layers sitting in different coordinate systems, so they never physically overlap. Below is a fault-tree you can run top to bottom.
Why does my spatial join return zero matches?
Start with the coordinate reference system. If your points are in EPSG:4326 and your parishes are in EPSG:27700, they occupy completely different numeric spaces and nothing overlaps. Confirm and align first:
python
import geopandas as gpd
points = gpd.read_file("burials.geojson")
parishes = gpd.read_file("parishes_1851.shp")
print(points.crs, parishes.crs) # are they the same?
points = points.to_crs(parishes.crs) # align to one projected CRS
joined = gpd.sjoin(points, parishes, predicate="within", how="left")
print(joined["index_right"].isna().sum(), "unmatched points")Only after both layers share a CRS does it make sense to look further.
Why do points fall outside their polygons?
Two distinct causes, and they need different fixes:
- Invalid geometry. Self-intersections and ring errors make
withinquietly fail. Validate before joining. - Wrong historical vintage. A 1901 address joined to 1851 parish boundaries will land "outside" wherever boundaries moved. This is a data problem, not a GIS bug.
python
from shapely.validation import make_valid
parishes["geometry"] = parishes.geometry.apply(make_valid)Always pair a geometry fix with a date check — a clean polygon of the wrong year is still the wrong answer.
What predicate should I use?
Historical coordinates are approximate: a geocoded 18th-century address may sit 30 m off. A strict within then drops legitimate matches. Choose deliberately:
| Predicate / join | When to use | Risk |
|---|---|---|
within | Precise points, trusted boundaries | Drops near-edge points |
intersects | Points possibly on boundaries | Can match two polygons |
sjoin_nearest | Fuzzy historical coordinates | May match a wrong neighbour |
contains (reverse) | Counting points per polygon | Direction matters |
For uncertain positions, gpd.sjoin_nearest(..., max_distance=50) often recovers points a within join would discard.
Why did my join duplicate rows?
Duplication means a feature matched more than one polygon — a point on a shared edge, or overlapping polygons from a sloppy digitisation. Inspect, then resolve:
python
dupes = joined[joined.index.duplicated(keep=False)]
# keep the polygon of largest overlap, or simply the first match
joined = joined[~joined.index.duplicated(keep="first")]Better still, fix the source topology so boundaries do not overlap; deduplicating after the fact hides the underlying data quality issue.
How do I make a slow join fast?
A naive join compares every feature against every other — fine for hundreds, painful for hundreds of thousands. Use a spatial index. GeoPandas' sjoin already builds an R-tree; for large recurring jobs, push the work into PostGIS:
sql
CREATE INDEX idx_parishes_geom ON parishes USING GIST (geom);
SELECT b.id, p.parish
FROM burials b
JOIN parishes p ON ST_Within(b.geom, p.geom);The GiST index lets the planner skip non-overlapping candidates entirely.
Which tool should I reach for?
QGIS ("Join attributes by location") is best for a quick visual check. GeoPandas is best when the join belongs in a reproducible script. PostGIS wins when the data is large, must be re-run, or needs an audit trail. All three implement the same DE-9IM predicates, so results agree once CRS and geometry are clean.
Key Takeaways
- Diagnose in order: CRS mismatch → invalid geometry → wrong date → wrong predicate.
- Zero matches is almost always a CRS mismatch, not a real "no overlap".
- Validate polygons with
make_validbefore joining. - Anachronistic boundaries cause "outside" points — check the historical vintage.
- For fuzzy coordinates prefer
sjoin_nearestwith a sensiblemax_distance. - Duplicates signal shared edges or overlapping polygons — fix the topology.
- Index geometries (R-tree / GiST) before blaming the algorithm for slowness.
Frequently Asked Questions
Why does my spatial join return zero matches?
Almost always a CRS mismatch: the two layers are in different coordinate systems, so they do not physically overlap. Reproject both to one projected CRS and re-run the join before suspecting anything else.
Why do points fall outside the polygons they belong to?
Either invalid polygon geometry (self-intersections, slivers) or boundaries that did not exist at the date of the points. Run a geometry validity check first, then confirm the polygons are the right historical vintage.
What is the difference between a point-in-polygon join and a nearest join?
Point-in-polygon (predicate 'within' or 'intersects') assigns each point to the polygon containing it. A nearest join assigns each feature to the closest one regardless of overlap, useful when historical coordinates are approximate and rarely land exactly inside.
Why did my join duplicate rows?
A point fell on a shared boundary or inside overlapping polygons, so it matched more than one. Fix the polygon topology, or use a join that keeps only the largest-overlap or first match.
How do I speed up a slow spatial join?
Build a spatial index. In GeoPandas use the default sjoin (it uses an R-tree); in PostGIS add a GiST index on the geometry columns. Indexing turns an O(n*m) scan into something usable on large historical datasets.
Should I join in QGIS, GeoPandas or PostGIS?
Use QGIS for one-off exploratory joins, GeoPandas for scripted reproducible pipelines, and PostGIS when datasets are large or the join must be repeated and audited. They share the same DE-9IM predicates underneath.