Use Scrapy to crawl historical archives when you need to traverse many linked pages, harvest thousands of records, and keep concurrency, retries, and rate limiting under control. For a single page or a dozen known URLs, requests plus BeautifulSoup is simpler. Crawl only after you have checked for a bulk download or API and confirmed the archive's terms allow automated access.
When does Scrapy beat requests and BeautifulSoup?
The dividing line is structure and scale. Scrapy is a full crawling framework: it gives you an asynchronous engine, a request scheduler, automatic retries, throttling, response caching, and item pipelines for cleaning and storing data. That machinery pays off when your task involves following links rather than fetching a fixed list.
| Situation | Better tool |
|---|---|
| One catalogue page, known URL | requests + BeautifulSoup |
| 10-50 known item URLs | requests in a loop |
| Paginated finding aid, thousands of records | Scrapy |
| Follow links across a whole collection | Scrapy CrawlSpider |
| Bulk OAI-PMH or REST API exists | Neither — use the API |
If the archive publishes an OAI-PMH endpoint, a CSV export, or a documented API, use that first. A crawl is the method of last resort, not the default.
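When an OAI-PMH endpoint exists, harvesting is a short paging loop rather than a crawl. The sketch below uses only the Python standard library to build a ListRecords request URL and parse one page of results. The endpoint https://archive.example.org/oai and the sample XML are invented for illustration; a real harvest would fetch each URL politely and continue until no resumptionToken comes back.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespaces for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build the ListRecords URL for the first page or a follow-up page."""
    if resumption_token:
        # A resumption token is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (records, next_token) from one ListRecords response body."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "date": rec.findtext(".//dc:date", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    # An empty or missing token means this was the last page.
    return records, (token or None)


# Invented sample response, standing in for a real HTTP body.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:letters/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Letter to the editor</dc:title>
          <dc:date>1843-05-02</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
print(records)
print(list_records_url("https://archive.example.org/oai", token))
```

Each follow-up request carries only the token, so the server decides the page size and you never have to guess a pagination scheme.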
Should you crawl this archive at all?
Before writing a spider, run a short due-diligence checklist. Crawling a fragile institutional server is an ethical and sometimes legal matter, not just a technical one.
- Read robots.txt and the site's terms of use. Projects created with scrapy startproject set ROBOTSTXT_OBEY = True in settings.py; keep it on.
- Check copyright status of the digitised items; metadata and public-domain scans differ from in-copyright text.
- Look for a sitemap, an API, or a research-data download that makes crawling unnecessary.
- Email the archive. Many will hand you a database dump rather than have you hammer their site.
- Estimate load: pages multiplied by average response size. A 50,000-page crawl at one request per second still takes about 14 hours, and that slowness is deliberate.
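The arithmetic in the last checklist item is worth running before you commit to a crawl. This is a minimal sketch, assuming a fixed delay between requests and ignoring latency and retries, so it gives a lower bound. The 80 KB average page size is a made-up figure; substitute one measured from a few sample pages.

```python
def crawl_hours(pages, seconds_per_request):
    """Lower bound on crawl duration in hours for a sequential, delayed crawl."""
    if pages < 0 or seconds_per_request <= 0:
        raise ValueError("pages must be >= 0 and seconds_per_request > 0")
    return pages * seconds_per_request / 3600


def transfer_megabytes(pages, avg_page_kb):
    """Approximate total download volume in megabytes."""
    return pages * avg_page_kb / 1024


print(round(crawl_hours(50_000, 1.0), 1))  # 13.9 hours at one request per second
print(round(crawl_hours(50_000, 2.0), 1))  # 27.8 hours with DOWNLOAD_DELAY = 2.0
print(transfer_megabytes(50_000, 80))      # 3906.25 MB at an assumed 80 KB per page
```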
How do you build a polite Scrapy spider?
Start a project, then write a spider that extracts records and follows pagination. The settings below are the polite baseline I use for every archive.
```python
# settings.py
BOT_NAME = "archive_crawler"
ROBOTSTXT_OBEY = True
USER_AGENT = "ElaraReed-research (+https://digitalrelics.uk; mailto:[email protected])"
DOWNLOAD_DELAY = 2.0
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
HTTPCACHE_ENABLED = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```

```python
# spiders/finding_aid.py
import scrapy


class FindingAidSpider(scrapy.Spider):
    name = "finding_aid"
    start_urls = ["https://archive.example.org/collection/letters?page=1"]

    def parse(self, response):
        # Yield one item per catalogue record on the page.
        for row in response.css("article.record"):
            yield {
                "ref": row.css("span.ref::text").get(),
                "title": row.css("h3 a::text").get(),
                "date": row.css("time::attr(datetime)").get(),
                "url": response.urljoin(row.css("h3 a::attr(href)").get()),
            }

        # Follow the pagination link until there is no next page.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run it and export to a JSON Lines file, a format that appends safely and survives interruptions because every line is a complete record (note that -O overwrites an existing file, while -o appends to it):
```bash
scrapy crawl finding_aid -O records.jsonl
```
What do AUTOTHROTTLE and caching actually save you?
AUTOTHROTTLE watches server latency and slows down when the archive struggles, so you stay courteous without hand-tuning delays. HTTPCACHE_ENABLED stores every response on disk; during development you re-run the spider against the cache instead of re-hitting the server, which makes iterating on selectors free and invisible to the institution. For long jobs, set a JOBDIR so a crawl can pause and resume:
```bash
# -o appends, so a resumed run adds to records.jsonl instead of overwriting it
scrapy crawl finding_aid -s JOBDIR=crawls/letters-01 -o records.jsonl
```
When is Scrapy the wrong tool?
Scrapy struggles with sites that render content entirely in JavaScript — the default downloader sees the empty shell. You then need scrapy-playwright or a headless browser, at which point a simpler Playwright script may be clearer. Scrapy is also overkill for one-off scrapes, and its FilesPipeline, while capable of downloading PDFs and IIIF images, is rarely the cleanest path for large image harvests where a IIIF manifest gives you canonical URLs directly.
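For image harvests, a manifest usually replaces the crawl entirely. Here is a minimal sketch, assuming a IIIF Presentation API 2.x manifest (version 3 manifests use a different items/body layout) and an Image API 2.x service. The manifest below is invented for illustration, and image_urls is a hypothetical helper name.

```python
def image_urls(manifest, size="full"):
    """Collect one image URL per canvas image from a Presentation 2.x manifest."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                service = resource.get("service")
                if isinstance(service, dict) and "@id" in service:
                    # Image API 2.x pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
                    base = service["@id"].rstrip("/")
                    urls.append(f"{base}/full/{size}/0/default.jpg")
                elif "@id" in resource:
                    # No image service advertised: fall back to the static URL.
                    urls.append(resource["@id"])
    return urls


# Invented two-page manifest: one image with a service, one static file.
MANIFEST = {
    "@id": "https://archive.example.org/iiif/letters-01/manifest",
    "sequences": [{
        "canvases": [
            {"images": [{"resource": {
                "@id": "https://archive.example.org/iiif/img/p1/full/full/0/default.jpg",
                "service": {"@id": "https://archive.example.org/iiif/img/p1"},
            }}]},
            {"images": [{"resource": {
                "@id": "https://archive.example.org/scans/p2.jpg",
            }}]},
        ]
    }],
}

for url in image_urls(MANIFEST):
    print(url)
```

Passing a size such as "1000," requests a 1000-pixel-wide derivative instead of the full image, which is often enough for research and lighter on the server.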
Key Takeaways
- Crawl with Scrapy only when you must follow links across many pages at meaningful scale; otherwise requests plus BeautifulSoup is simpler.
- Always check for an API, OAI-PMH feed, or bulk download before crawling; it is faster and kinder to the server.
- Keep ROBOTSTXT_OBEY = True, set a DOWNLOAD_DELAY, identify yourself with a contact email, and enable AUTOTHROTTLE.
- HTTPCACHE makes selector development free and stops you re-fetching pages during testing.
- Export to JSON Lines and use a JOBDIR so long crawls are resumable and the dataset is reproducible.
- JavaScript-heavy archives need scrapy-playwright; pure-image harvests are often better served by IIIF manifests than by a crawl.
Frequently Asked Questions
When should I use Scrapy instead of requests plus BeautifulSoup?
Reach for Scrapy once you need to follow links across many pages, manage concurrency, retries and rate limits, or harvest more than a few thousand records. For a single page or a handful of known URLs, requests and BeautifulSoup are simpler and faster to write.
Is it legal to crawl a digital archive with Scrapy?
It depends on the archive's terms of use, copyright on the digitised items, and your jurisdiction. Always read robots.txt and the terms page, prefer a documented API or bulk download, and email the institution if it is unclear whether automated access is permitted.
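Reading robots.txt can be automated as a first pass, although it never replaces reading the terms page. This small sketch uses the standard library's urllib.robotparser and runs against an invented robots.txt so that it needs no network access. The user agent, URLs, and the robots_policy helper name are placeholders.

```python
from urllib.robotparser import RobotFileParser


def robots_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Invented robots.txt for illustration.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

print(robots_policy(
    SAMPLE_ROBOTS,
    "archive_crawler",
    "https://archive.example.org/collection/letters?page=1",
))
print(robots_policy(
    SAMPLE_ROBOTS,
    "archive_crawler",
    "https://archive.example.org/admin/login",
))
```

If the file declares a Crawl-delay larger than your DOWNLOAD_DELAY, raising the setting to match is the courteous choice.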
How do I avoid getting my crawler blocked or banned?
Set a real DOWNLOAD_DELAY (1 to 3 seconds), enable AUTOTHROTTLE, send a descriptive User-Agent with a contact email, obey robots.txt, and cache responses with HTTPCACHE so you never re-fetch a page during development.
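To see why AUTOTHROTTLE helps, it is useful to trace how the delay reacts to latency. The function below is a simplified sketch of the adjustment rule as Scrapy's AutoThrottle documentation describes it: the target delay is latency divided by target concurrency, the next delay is the average of the old delay and the target, non-200 responses never lower the delay, and the result stays between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. It is not Scrapy's actual code, and the function and parameter names are illustrative.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=1.0, min_delay=2.0, max_delay=60.0):
    """One AutoThrottle-style update of the download delay, in seconds."""
    target = latency / target_concurrency
    # Move halfway from the current delay towards the target.
    new_delay = (current_delay + target) / 2.0
    # Clamp between the configured floor and ceiling.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses often come back quickly; do not let them speed us up.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


print(next_delay(2.0, latency=6.0))               # 4.0: slow server, back off
print(next_delay(4.0, latency=0.5))               # 2.25: server recovered, ease down
print(next_delay(2.25, latency=0.5))              # 2.0: never below the floor
print(next_delay(4.0, latency=0.2, status=503))   # 4.0: errors cannot lower the delay
```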
Can Scrapy download IIIF images or PDFs as well as metadata?
Yes. Scrapy's FilesPipeline and ImagesPipeline download binary assets and record each file's stored path and checksum, but for large IIIF or PDF harvests a dedicated downloader or the IIIF manifest is often cleaner than a full crawl.
What is the difference between a Spider and a CrawlSpider?
A plain Spider requires you to yield each follow-up request yourself, giving precise control. A CrawlSpider uses Rule and LinkExtractor objects to follow links automatically by pattern, which suits broad, uniform catalogue structures.
How do I make a crawl reproducible for a research project?
Pin Scrapy in a requirements file, commit the spider and settings, enable HTTPCACHE, record the crawl date and the JOBDIR for resumable runs, and export to a versioned JSON Lines file so the dataset can be regenerated.
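Much of that provenance can be captured in a small manifest file committed next to the spider. This is a sketch using only the standard library; the field names, the crawl date, the version string, and the sample records are illustrative values, not a standard.

```python
import hashlib
import json


def build_manifest(records_bytes, records_name, crawl_date, scrapy_version, jobdir):
    """Summarise one JSON Lines export so a later run can be compared with it."""
    lines = [ln for ln in records_bytes.splitlines() if ln.strip()]
    for ln in lines:
        json.loads(ln)  # raises ValueError if a line is truncated or malformed
    return {
        "records_file": records_name,
        "sha256": hashlib.sha256(records_bytes).hexdigest(),
        "record_count": len(lines),
        "crawl_date": crawl_date,
        "scrapy_version": scrapy_version,
        "jobdir": jobdir,
    }


# Invented sample standing in for the bytes of a real records.jsonl file.
SAMPLE = (
    b'{"ref": "L/1", "title": "Letter one", "date": "1843-05-02"}\n'
    b'{"ref": "L/2", "title": "Letter two", "date": "1843-06-11"}\n'
)

manifest = build_manifest(
    SAMPLE,
    records_name="records.jsonl",
    crawl_date="2024-01-15",      # example value
    scrapy_version="2.11.0",      # example value; record the version you pinned
    jobdir="crawls/letters-01",
)
print(json.dumps(manifest, indent=2))
```

In a real project you would pass the bytes read from the exported file; the checksum then tells you at a glance whether a regenerated dataset matches the original.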