Troubleshooting: Analyse large image collections

When analysing large image collections goes wrong, the cause is nearly always one of four things: memory exhaustion from loading full-resolution images, crashes on corrupt files, a preprocessing mismatch that scrambles results, or I/O starvation that makes the GPU look slow. Diagnose by isolating one image and one batch first, then scale. Fix the data-loading layer before you touch the model — that is where most pipelines actually break.

Computer vision lets you cluster, search and measure thousands of artworks or photographs by visual content rather than metadata. The standard pipeline extracts a numeric embedding per image with a pretrained model, then runs similarity, clustering or classification on those vectors. Most failures happen in the plumbing around the model, not the model itself.

Why does my script run out of memory?

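This is the most common wall. The root cause is holding every decoded image in RAM at once. A decoded RGB image takes 3 bytes per pixel no matter how small the JPEG is on disk, so a 50,000-image collection at full resolution runs to tens or even hundreds of gigabytes of pixels.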

The fix is batching plus incremental writes:

python

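import numpy as np
import torch
from torch.utils.data import DataLoader

# Assumes `dataset` yields fixed-size image tensors, `model` is already on
# the GPU and returns 512-dimensional embeddings, and n_images == len(dataset).
embeddings = np.memmap("emb.dat", dtype="float32",
                       mode="w+", shape=(n_images, 512))
model.eval()  # inference mode: disables dropout, freezes batch-norm statistics
i = 0
for batch in DataLoader(dataset, batch_size=64, num_workers=4):
    with torch.no_grad():
        v = model(batch.cuda()).cpu().numpy()
    embeddings[i:i+len(v)] = v  # write this batch straight into the on-disk array
    i += len(v)
embeddings.flush()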

Resize images to the model's input size before stacking, and write to a memory-mapped array so RAM stays flat regardless of collection size.
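To see why resizing and batching matter, it helps to run the arithmetic. The sketch below is a back-of-envelope estimate only: the 1024x768 scan size is a hypothetical figure chosen for illustration, and the 512-dimensional float32 embedding matches the shape used in the snippet above.

```python
def decoded_bytes(n_images, width, height, channels=3, bytes_per_value=1):
    """Bytes needed to hold n_images decoded images in RAM at once."""
    return n_images * width * height * channels * bytes_per_value


def embedding_bytes(n_images, dim=512, bytes_per_value=4):
    """Bytes needed for a float32 embedding matrix of shape (n_images, dim)."""
    return n_images * dim * bytes_per_value


GB = 1e9
MB = 1e6

# Hypothetical collection: 50,000 scans at 1024x768, 8-bit RGB.
full_res = decoded_bytes(50_000, 1024, 768)
# The same collection after resizing to a 224x224 model input.
resized = decoded_bytes(50_000, 224, 224)
# One float32 batch of 64 resized images, which is all a batched loop holds.
one_batch = decoded_bytes(64, 224, 224, bytes_per_value=4)
# The full output: 50,000 embeddings of 512 float32 values.
emb = embedding_bytes(50_000)

print(f"all images at 1024x768: {full_res / GB:.1f} GB")   # 118.0 GB
print(f"all images at 224x224:  {resized / GB:.1f} GB")    # 7.5 GB
print(f"one batch of 64:        {one_batch / MB:.1f} MB")  # 38.5 MB
print(f"embedding matrix:       {emb / MB:.1f} MB")        # 102.4 MB
```

Under these assumptions the embedding matrix is about a thousand times smaller than the decoded images, which is why only the embeddings need to persist.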

Why do corrupt images crash a long run?

Heritage downloads are full of truncated JPEGs and zero-byte files. One bad file should never kill a 50,000-image job. Validate and quarantine instead:

python

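from PIL import Image, ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = False  # raise on truncated files instead of filling in missing data

def safe_open(path):
    """Return an RGB image, or None if the file cannot be read."""
    try:
        with Image.open(path) as im:
            im.verify()               # structural check; the image object is unusable afterwards
        with Image.open(path) as im:  # so reopen before decoding
            return im.convert("RGB")  # forces a full decode, which catches truncation
    except Exception as e:
        log_failure(path, str(e))     # log_failure is your own logging helper
        return None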

Log every failure to a CSV with the file ID and error, then review the quarantine list. A 0.5% corruption rate is normal; a 20% rate signals a broken ingest.
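The `log_failure` helper is not defined in the snippet above. One possible shape for it, using only the standard library, is sketched here. The log file name, the column names, and the choice to derive the file ID from the file name are all assumptions for illustration, not a fixed convention.

```python
import csv
import os
from datetime import datetime, timezone

FAILURE_LOG = "failed_images.csv"  # hypothetical log location


def log_failure(path, error, log_path=FAILURE_LOG):
    """Append one quarantined file to the CSV, writing a header on first use."""
    is_new = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["file_id", "error", "logged_at"])
        # Assumption: the file name without its extension is the file ID.
        file_id = os.path.splitext(os.path.basename(path))[0]
        writer.writerow([file_id, error, datetime.now(timezone.utc).isoformat()])


def corruption_rate(n_total, log_path=FAILURE_LOG):
    """Fraction of the collection that ended up in the quarantine log."""
    with open(log_path, newline="", encoding="utf-8") as f:
        n_failed = sum(1 for _ in csv.DictReader(f))
    return n_failed / n_total
```

After a run, `corruption_rate(len(all_paths))` gives the figure to compare against the 0.5% and 20% thresholds, where `all_paths` stands for whatever list of files you fed to the pipeline.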

Why do my similarity results look random?

When visually identical images score as unrelated, suspect a preprocessing mismatch. The usual culprits:

Symptom	Likely cause	Fix
Random nearest neighbours	BGR vs RGB channel order	Load with Pillow and `convert("RGB")`, or reverse the channel axis of OpenCV arrays
Everything similar to everything	Missing mean/std normalisation	Apply the model's exact transforms
Slightly shifted clusters	Wrong resize interpolation	Match the training resize
All-white (saturated) inputs	Reading 16-bit TIFFs as 8-bit	Rescale to 8-bit explicitly

The model expects the exact preprocessing it was trained with. Copy the official transform pipeline; do not improvise normalisation constants.
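As an illustration of three of the mismatches in the table (channel order, bit depth, and normalisation), here is a minimal NumPy sketch. The mean and standard deviation are the widely used ImageNet statistics and are an assumption here; the function name `to_model_input` is hypothetical. Resizing is deliberately left out, since the right interpolation depends on the model.

```python
import numpy as np

# Assumption: the common ImageNet statistics. Check your model's
# documentation and use its values instead if they differ.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_input(pixels, channel_order="RGB"):
    """Turn an HxWx3 uint8 or uint16 array into a normalised 3xHxW float32 array."""
    if pixels.dtype == np.uint16:
        scale = 65535.0  # 16-bit data, e.g. from some TIFFs
    elif pixels.dtype == np.uint8:
        scale = 255.0
    else:
        raise TypeError(f"unexpected dtype {pixels.dtype}")
    if channel_order == "BGR":      # e.g. arrays read with OpenCV
        pixels = pixels[..., ::-1]  # reverse the channel axis to get RGB
    x = pixels.astype(np.float32) / scale  # values now in [0, 1]
    x = (x - MEAN) / STD                   # per-channel normalisation
    return np.ascontiguousarray(x.transpose(2, 0, 1))  # HWC -> CHW


# A pure-red image, once as RGB and once with channels swapped to BGR.
red_rgb = np.zeros((2, 2, 3), dtype=np.uint8)
red_rgb[..., 0] = 255
red_bgr = red_rgb[..., ::-1].copy()

same = np.allclose(to_model_input(red_rgb), to_model_input(red_bgr, "BGR"))
mismatch = np.allclose(to_model_input(red_rgb), to_model_input(red_bgr))
print(same, mismatch)  # True False
```

The second comparison shows the failure mode: the same picture fed in the wrong channel order produces a different input tensor, and therefore a different embedding.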

How do I scale beyond one machine?

Do not re-run the model for every query. Extract embeddings once, then index them:

python

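import faiss
# embeddings: C-contiguous float32 array of shape (n_images, 512)
faiss.normalize_L2(embeddings)        # in place; on unit vectors, inner product equals cosine
index = faiss.IndexFlatIP(512)        # exact inner-product search
index.add(embeddings)
faiss.normalize_L2(query_vec)         # query_vec: float32 array of shape (1, 512)
D, I = index.search(query_vec, k=20)  # scores and row indices of the 20 nearest artworks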

A FAISS or Annoy index turns million-image similarity search into milliseconds and decouples analysis from the heavy image store. Keep a manifest mapping each row back to its source image ID and checksum.
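To make clear what an exact inner-product index computes, the following is a brute-force NumPy equivalent on random data. It is a sketch for understanding and sanity checks on small samples, not a replacement for the index; `normalize_rows` and `top_k` are hypothetical helper names.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length so that inner product equals cosine."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)  # guard against all-zero rows


def top_k(embeddings, query, k):
    """Exact inner-product search: returns (scores, row indices), best first."""
    scores = embeddings @ query
    order = np.argsort(-scores)[:k]
    return scores[order], order


rng = np.random.default_rng(0)
emb = normalize_rows(rng.normal(size=(1000, 512)).astype(np.float32))

query = emb[42]  # search for an image that is already in the collection
scores, rows = top_k(emb, query, k=5)
print(rows[0])   # 42: an image is always its own nearest neighbour
```

Querying with a vector already in the collection is a cheap sanity check: if row 42 does not come back first with a score near 1.0, the vectors were probably not normalised consistently.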

Why is my GPU barely faster than CPU?

Because the GPU is starving. The model is fast; decoding and reading JPEGs from disk is slow. Diagnose by timing the data loader alone versus the model alone. Fixes, in order of impact: raise num_workers, pre-resize images to a cache, and store them on a fast local SSD rather than a network share. Once the loader keeps up, the GPU speedup appears.
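One way to time the two stages separately is sketched below with the standard library only. `time_stages` is a hypothetical helper; `batches` can be any iterable (such as a DataLoader) and `model_fn` any callable that runs inference on one batch.

```python
import time


def time_stages(batches, model_fn):
    """Return (seconds waiting on the loader, seconds spent in model_fn)."""
    load_s = 0.0
    model_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent reading, decoding and collating
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, model_fn should end with torch.cuda.synchronize():
        # CUDA calls are asynchronous, so without it the model looks free.
        model_fn(batch)
        t2 = time.perf_counter()
        load_s += t1 - t0
        model_s += t2 - t1
    return load_s, model_s
```

If `load_s` dominates, the loader is the bottleneck and the fixes listed above apply. If `model_s` dominates, the GPU is already busy and faster I/O will not help much.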

What keeps image analysis reproducible?

Pin the model weights hash and every library version.
Record the preprocessing transform verbatim.
Seed any clustering (KMeans, UMAP) so runs match.
Manifest every embedding to a source ID and file checksum.

Without this, you cannot explain to a reviewer why a re-run produced different clusters.
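A manifest can be as simple as a CSV written alongside the embeddings. The sketch below assumes that row i of the manifest describes row i of the embedding matrix and that the file name without its extension serves as the source ID; both are illustrative choices, and the function names are hypothetical.

```python
import csv
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large scans never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(paths, manifest_path):
    """Write one line per image: embedding row, source ID, file checksum."""
    with open(manifest_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "source_id", "sha256"])
        for row, path in enumerate(paths):
            # Assumption: the file name without extension is the source ID.
            source_id = os.path.splitext(os.path.basename(path))[0]
            writer.writerow([row, source_id, sha256_of(path)])
```

Pass `paths` in exactly the order the images were embedded. On a re-run, comparing fresh checksums against the manifest shows whether different clusters come from changed files or from the analysis itself.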

Key Takeaways

Most large-collection failures are in data loading, not the model.
Batch, resize early, and write embeddings to a memory-mapped file to bound RAM.
Validate and quarantine corrupt images so one bad file never halts a run.
Match the model's exact preprocessing or similarity results become meaningless.
Index embeddings with FAISS or Annoy to scale search to millions of images.
A starved GPU usually means I/O is the bottleneck — fix the loader first.

Frequently Asked Questions

Why does my embedding script run out of memory on a big collection?

You are almost certainly loading every full-resolution image into RAM at once. Process in batches, resize to the model's input size (often 224x224) before stacking, and write embeddings to disk incrementally rather than holding them all in memory.

Why are corrupt or truncated images crashing my pipeline?

Heritage collections contain partial downloads and broken files. Wrap image loading in a try/except, verify with Pillow's Image.verify(), and log failures to a CSV instead of letting one bad file halt a 50,000-image run.

Why do my visual similarity results look random?

Usually a colour-channel or normalisation mismatch: feeding BGR to a model trained on RGB, or skipping the mean/std normalisation it expects. Match the exact preprocessing used to train the model.

How do I handle a collection too large to fit on one machine?

Extract embeddings once and store them in a vector index like FAISS or Annoy. Then all downstream similarity, clustering and search runs against the compact index rather than the images, scaling to millions of items.

Why is GPU inference barely faster than CPU?

Your bottleneck is likely disk I/O and image decoding, not the model. Use a DataLoader with multiple worker processes, pre-resize images to a cache, and keep them on a fast local SSD rather than a network share so the GPU is not waiting on the disk.

How do I make image analysis reproducible?

Pin the model weights and library versions, record the exact preprocessing, set random seeds for clustering, and store a manifest linking every embedding back to its source image ID and checksum.
