Python for Historians

Use Scrapy to crawl historical archives when you need to traverse many linked pages, harvest thousands of records, and keep concurrency, retries, and rate limiting under control. For a single page or a dozen known URLs, requests plus BeautifulSoup is simpler. Crawl only after you have checked for a bulk download or API and confirmed the archive's terms allow automated access.

When does Scrapy beat requests and BeautifulSoup?

The dividing line is structure and scale. Scrapy is a full crawling framework: it gives you an asynchronous engine, a request scheduler, automatic retries, throttling, response caching, and item pipelines for cleaning and storing data. That machinery pays off when your task involves following links rather than fetching a fixed list.

| Situation | Better tool |
| --- | --- |
| One catalogue page, known URL | requests + BeautifulSoup |
| 10-50 known item URLs | requests in a loop |
| Paginated finding aid, thousands of records | Scrapy |
| Follow links across a whole collection | Scrapy CrawlSpider |
| Bulk OAI-PMH or REST API exists | Neither: use the API |

If the archive publishes an OAI-PMH endpoint, a CSV export, or a documented API, use that first. A crawl is the method of last resort, not the default.
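To show how little code the OAI-PMH route can need, here is a minimal sketch that parses one ListRecords response carrying Dublin Core metadata. The sample XML, the identifier, and the function name `parse_list_records` are illustrative assumptions; a real harvest would fetch each page from the archive's endpoint and repeat the request with the returned resumption token until none is left.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses that carry Dublin Core records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "date": rec.findtext(".//dc:date", default="", namespaces=NS),
        })
    # An empty or missing token means this was the last page.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, token or None

# Hypothetical response, trimmed to the parts the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:letters/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Letter to the Board of Trade</dc:title>
          <dc:date>1843-05-02</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
print(records, token)
```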

Should you crawl this archive at all?

Before writing a spider, run a short due-diligence checklist. Crawling a fragile institutional server is an ethical and sometimes legal matter, not just a technical one.

  • Read robots.txt and the site's terms of use. Projects generated by scrapy startproject set ROBOTSTXT_OBEY = True in settings.py (Scrapy's built-in default is False), so keep it on.
  • Check copyright status of the digitised items; metadata and public-domain scans differ from in-copyright text.
  • Look for a sitemap, an API, or a research-data download that makes crawling unnecessary.
  • Email the archive. Many will hand you a database dump rather than have you hammer their site.
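  • Estimate load: pages multiplied by average page size gives the volume you will pull, and pages multiplied by the delay gives the duration. A 50,000-page crawl at one request per second takes about 14 hours, and that slowness is deliberate.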
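Two items on the checklist can be checked in a few lines before any spider exists. The sketch below uses only the standard library; the robots.txt text, the `archive_crawler` agent name, and the helper names are assumptions for illustration, and in practice you would download the archive's real robots.txt first.

```python
from urllib.robotparser import RobotFileParser

def crawl_hours(pages, delay_seconds):
    """Lower bound on crawl duration in hours at one request per delay."""
    return pages * delay_seconds / 3600

def allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content for the example archive.
ROBOTS = """User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

print(round(crawl_hours(50_000, 1.0), 1))  # 13.9
print(allowed(ROBOTS, "archive_crawler",
              "https://archive.example.org/collection/letters?page=1"))
print(allowed(ROBOTS, "archive_crawler",
              "https://archive.example.org/admin/login"))
```

Doubling the delay doubles the estimate, which is a useful number to include when you email the archive about your plans.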

How do you build a polite Scrapy spider?

Start a project, then write a spider that extracts records and follows pagination. The settings below are the polite baseline I use for every archive.

python
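# settings.py
BOT_NAME = "archive_crawler"
ROBOTSTXT_OBEY = True  # respect the archive's robots.txt rules
USER_AGENT = "ElaraReed-research (+https://digitalrelics.uk; mailto:[email protected])"  # say who you are
DOWNLOAD_DELAY = 2.0  # base delay in seconds; AutoThrottle never goes below it
AUTOTHROTTLE_ENABLED = True  # adapt the delay to observed server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request in flight at a time
HTTPCACHE_ENABLED = True  # store responses on disk so re-runs skip the network
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # hard ceiling on parallel requests per host

python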
# spiders/finding_aid.py
import scrapy

class FindingAidSpider(scrapy.Spider):
    name = "finding_aid"
    start_urls = ["https://archive.example.org/collection/letters?page=1"]

    def parse(self, response):
        for row in response.css("article.record"):
            yield {
                "ref": row.css("span.ref::text").get(),
                "title": row.css("h3 a::text").get(),
                "date": row.css("time::attr(datetime)").get(),
                "url": response.urljoin(row.css("h3 a::attr(href)").get()),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

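Run it and export to a JSON Lines file. Lowercase -o appends to the file, so records already written survive an interruption; uppercase -O would overwrite the file on every run:

bash
scrapy crawl finding_aid -o records.jsonl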
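Records scraped from finding aids usually need light tidying before export, which is what Scrapy's item pipelines are for. The following is a minimal sketch, not a prescribed design: the class name `CleanRecordPipeline` and the extra `date_valid` field are hypothetical, and it assumes the spider yields plain dicts with the `ref`, `title`, and `date` keys used above. It flags partial or malformed dates rather than dropping the record, since incomplete dates are common in archival metadata.

```python
import re
from datetime import date

class CleanRecordPipeline:
    """Hypothetical item pipeline: tidy whitespace and flag unparseable dates."""

    # Scrapy calls process_item(item, spider); spider defaults to None here
    # only so the class can be exercised outside a crawl.
    def process_item(self, item, spider=None):
        for key in ("ref", "title"):
            value = item.get(key)
            if value is not None:
                # Collapse runs of whitespace left over from the HTML layout.
                item[key] = re.sub(r"\s+", " ", value).strip()
        try:
            # Accepts full ISO dates such as 1843-05-02; "1843" alone fails.
            date.fromisoformat(item.get("date") or "")
            item["date_valid"] = True
        except ValueError:
            item["date_valid"] = False
        return item

cleaned = CleanRecordPipeline().process_item(
    {"ref": " MS  12/3 ", "title": "Letter\n  to the Board", "date": "1843"}
)
print(cleaned)
```

To use it in a crawl, register the class in the ITEM_PIPELINES setting with its import path in your own project, for example a pipelines.py module inside the project package.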

What do AutoThrottle and caching actually save you?

AutoThrottle watches server latency and slows down when the archive struggles, so you stay courteous without hand-tuning delays. HTTPCACHE_ENABLED stores every response on disk; during development you re-run the spider against the cache instead of re-hitting the server, which makes iterating on selectors free and invisible to the institution. For long jobs, set a JOBDIR so a crawl can pause and resume, and keep the lowercase -o so a resumed run appends to the export instead of overwriting it:

bash
scrapy crawl finding_aid -s JOBDIR=crawls/letters-01 -o records.jsonl
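To make the throttling behaviour concrete, here is a simplified model of a single adjustment step, based on the rules listed in Scrapy's AutoThrottle documentation: the target delay is latency divided by target concurrency, the next delay is the average of the previous delay and that target, non-200 responses may not shorten the delay, and the result is clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. The function `next_delay` is my own illustration, not Scrapy code, and the real extension differs in detail.

```python
def next_delay(previous_delay, latency, status=200, target_concurrency=1.0,
               min_delay=2.0, max_delay=60.0):
    """Simplified model of one AutoThrottle adjustment step."""
    target = latency / target_concurrency
    new_delay = (previous_delay + target) / 2
    # Responses other than 200 are not allowed to shorten the delay.
    if status != 200 and new_delay < previous_delay:
        new_delay = previous_delay
    # Never go below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY.
    return max(min_delay, min(new_delay, max_delay))

delay = 5.0  # Scrapy's default AUTOTHROTTLE_START_DELAY
for latency in (0.4, 0.5, 6.0, 12.0, 0.6):
    delay = next_delay(delay, latency)
    print(f"latency {latency:>4.1f}s -> delay {delay:.2f}s")
```

With the settings above, a fast server keeps the delay at the 2-second floor, while a server that starts taking 6 or 12 seconds to answer pushes the delay up within a couple of responses.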

When is Scrapy the wrong tool?

Scrapy struggles with sites that render content entirely in JavaScript, because the default downloader sees only the empty shell. You then need scrapy-playwright or a headless browser, at which point a simpler Playwright script may be clearer. Scrapy is also overkill for one-off scrapes. Its FilesPipeline can download PDFs and IIIF images, but it is rarely the cleanest path for large image harvests, where a IIIF manifest gives you canonical URLs directly.
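As an illustration of the manifest route, the sketch below lists image URLs from a manifest that has already been loaded as a dict. It assumes the IIIF Presentation API 2 layout (sequences, canvases, images) and builds Image API URLs with the `max` size keyword, which older image servers may not accept (they expect `full`). The function name and example URLs are hypothetical, and Presentation 3 manifests are structured differently.

```python
def image_urls(manifest, size="max"):
    """List full-image URLs from a IIIF Presentation 2 manifest dict."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                service = resource.get("service", {})
                if isinstance(service, dict) and "@id" in service:
                    base = service["@id"].rstrip("/")
                    # Image API pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
                    urls.append(f"{base}/full/{size}/0/default.jpg")
                elif "@id" in resource:
                    # No image service advertised: fall back to the static file.
                    urls.append(resource["@id"])
    return urls

# Hypothetical two-page manifest, reduced to the keys the function reads.
manifest = {"sequences": [{"canvases": [
    {"images": [{"resource": {
        "@id": "https://iiif.example.org/img/1/full/full/0/default.jpg",
        "service": {"@id": "https://iiif.example.org/img/1"}}}]},
    {"images": [{"resource": {"@id": "https://iiif.example.org/static/2.jpg"}}]},
]}]}

for url in image_urls(manifest):
    print(url)
```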

Key Takeaways

  • Crawl with Scrapy only when you must follow links across many pages at meaningful scale; otherwise requests plus BeautifulSoup is simpler.
  • Always check for an API, OAI-PMH feed, or bulk download before crawling — it is faster and kinder to the server.
  • Keep ROBOTSTXT_OBEY = True, set a DOWNLOAD_DELAY, identify yourself with a contact email, and enable AUTOTHROTTLE.
  • HTTPCACHE makes selector development free and stops you re-fetching pages during testing.
  • Export to JSON Lines and use a JOBDIR so long crawls are resumable and the dataset is reproducible.
  • JavaScript-heavy archives need scrapy-playwright; pure-image harvests are often better served by IIIF manifests than by a crawl.

Frequently Asked Questions

When should I use Scrapy instead of requests plus BeautifulSoup?

Reach for Scrapy once you need to follow links across many pages, manage concurrency, retries and rate limits, or harvest more than a few thousand records. For a single page or a handful of known URLs, requests and BeautifulSoup are simpler and faster to write.

Is it legal to crawl a digitised archive?

It depends on the archive's terms of use, copyright on the digitised items, and your jurisdiction. Always read robots.txt and the terms page, prefer a documented API or bulk download, and email the institution if intent is unclear.

How do I avoid getting my crawler blocked or banned?

Set a real DOWNLOAD_DELAY (1 to 3 seconds), enable AUTOTHROTTLE, send a descriptive User-Agent with a contact email, obey robots.txt, and cache responses with HTTPCACHE so you never re-fetch a page during development.

Can Scrapy download IIIF images or PDFs as well as metadata?

Yes. Scrapy's FilesPipeline and ImagesPipeline download binary assets and record each file's storage path and checksum alongside the item, but for large IIIF or PDF harvests a dedicated downloader or the IIIF manifest is often cleaner than a full crawl.

What is the difference between a Spider and a CrawlSpider?

A plain Spider requires you to yield each follow-up request yourself, giving precise control. A CrawlSpider uses Rule and LinkExtractor objects to follow links automatically by pattern, which suits broad, uniform catalogue structures.

How do I make a crawl reproducible for a research project?

Pin Scrapy in a requirements file, commit the spider and settings, enable HTTPCACHE, record the crawl date and the JOBDIR for resumable runs, and export to a versioned JSON Lines file so the dataset can be regenerated.
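One way to record those details is a small provenance file written next to the export. This is a sketch under my own assumptions: the function name `describe_dataset` and the field names are illustrative, and the Scrapy version is passed in by hand to match whatever you pinned.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def describe_dataset(jsonl_path, scrapy_version):
    """Build a small provenance record for a JSON Lines export."""
    data = Path(jsonl_path).read_bytes()
    # One non-empty line per record in a JSON Lines file.
    lines = [line for line in data.splitlines() if line.strip()]
    return {
        "file": Path(jsonl_path).name,
        "records": len(lines),
        "sha256": hashlib.sha256(data).hexdigest(),
        "scrapy_version": scrapy_version,
        "described_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

Serialising the returned dict with json.dumps and committing it alongside the spider lets a later reader confirm that a regenerated records.jsonl matches the one you analysed, or see exactly how it differs.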