Python for Historians

To extract text from PDFs in Python, first decide which kind of PDF you have. Born-digital PDFs carry a real text layer you can read with pdfplumber or pypdf in a few lines. Scanned PDFs are just images, so you must render each page and run OCR with pytesseract. Getting this distinction right at the start saves hours of fighting empty output.

Here is the full workflow, from detecting the PDF type to handling columns, tables, and OCR.

Step 1: Is this a born-digital PDF or a scan?

The fastest test is to attempt extraction and see if anything comes back:

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    first = pdf.pages[0].extract_text() or ""
    print("Text layer present" if first.strip() else "Likely a scan -> needs OCR")

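If every page returns an empty string, the PDF has no text layer and you are in OCR territory. A born-digital PDF returns clean text immediately. Test a few pages beyond the first, because a scanned cover can sit in front of a born-digital body.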
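Some files mix typed and scanned pages, so it can help to run the same emptiness test on every page. The sketch below does that on a plain list of strings; `pages_needing_ocr` and its `min_chars` cutoff are hypothetical names and values chosen for illustration, not part of pdfplumber.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is too short to trust.

    page_texts: one string (or None) per page, e.g. from page.extract_text().
    min_chars:  assumed cutoff; a stray page number or stamp can leave a few
                characters even on a scanned page.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if len((text or "").strip()) < min_chars:
            flagged.append(number)
    return flagged


# Stand-in strings; in practice the list could be built with
#   [page.extract_text() for page in pdf.pages]
sample = ["", "Minutes of the parish council, 4 March 1887 ...", None, "  12  "]
print(pages_needing_ocr(sample))
```

Only the flagged pages then need to go through the slower OCR route described in Step 4.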

Step 2: Extracting text from a born-digital PDF

For real-text PDFs, pdfplumber gives you page-level control and keeps you closer to the layout than pypdf:

python
import pdfplumber

# extract_text() can return None for a blank page, so fall back to ""
with pdfplumber.open("report.pdf") as pdf:
    pages = [page.extract_text() or "" for page in pdf.pages]
full_text = "\n\n".join(pages)

Save the result per-page rather than as one blob — it makes citing a specific page trivial later.
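One way to store the text per page is a folder of numbered files. This is a minimal sketch, assuming `pages` is a list of strings like the one built above; the `save_pages` helper and the `page_001.txt` naming scheme are illustrative choices, not a convention of any library.

```python
from pathlib import Path


def save_pages(pages, out_dir):
    """Write each page's text to out_dir/page_001.txt, page_002.txt, ...

    Files are numbered from 1 so the file name matches the PDF page number.
    Returns the list of paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(pages, start=1):
        path = out / f"page_{number:03d}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Calling `save_pages(pages, "report_pages")` would then leave one text file per page, so a quotation can be traced back to its page without searching the whole document.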

Step 3: How do I handle multi-column pages?

A two-column journal or directory page is the classic trap: naive extraction reads straight across, interleaving the columns. Crop each column with a bounding box and extract them separately:

python
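import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    w, h = page.width, page.height
    # Crop boxes are (x0, top, x1, bottom); here the split is the page midpoint
    left = page.crop((0, 0, w / 2, h)).extract_text() or ""
    right = page.crop((w / 2, 0, w, h)).extract_text() or ""
ordered = left + "\n" + right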

Adjust the split point to the actual gutter; printed directories rarely divide exactly at the centre.
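The split point can also be estimated from the words themselves. The sketch below looks for the widest horizontal gap that no word covers near the middle of the page. It assumes word dictionaries with `x0` and `x1` keys, as pdfplumber's `page.extract_words()` returns; `find_gutter` and its `band` parameter are hypothetical, and a full-width heading would hide the gap, so such lines may need filtering out first.

```python
def find_gutter(words, page_width, band=(0.25, 0.75)):
    """Estimate the x-coordinate of the column gutter.

    words: dicts with "x0" and "x1" keys (left and right edge of each word).
    band:  assumed search window, as fractions of the page width.
    """
    width = int(page_width)
    covered = [False] * (width + 1)
    for word in words:
        start = max(int(word["x0"]), 0)
        end = min(int(word["x1"]), width)
        for x in range(start, end + 1):
            covered[x] = True

    lo, hi = int(width * band[0]), int(width * band[1])
    best_start, best_len, run_start = None, 0, None
    for x in range(lo, hi + 1):
        if covered[x]:
            run_start = None  # a word sits here, so any gap ends
            continue
        if run_start is None:
            run_start = x
        run_len = x - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len

    if best_start is None:
        return page_width / 2  # no gap found: fall back to the midpoint
    return best_start + best_len / 2


# Two synthetic words standing in for a left and a right column
sample_words = [
    {"text": "Abbott", "x0": 50, "x1": 270},
    {"text": "Baker", "x0": 310, "x1": 550},
]
print(find_gutter(sample_words, page_width=600))
```

The returned value can replace `w / 2` in the two crop boxes.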

Step 4: OCR for scanned historical documents

When there is no text layer, render each page to an image and pass it to Tesseract via pytesseract. Pre-processing is where the accuracy comes from:

python
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("scan.pdf", dpi=300)   # 300 dpi is the sweet spot
for i, img in enumerate(pages, start=1):  # number files from 1, like PDF pages
    text = pytesseract.image_to_string(img, lang="eng")  # try 'deu', 'lat', etc.
    with open(f"page_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(text)

Render at 300 DPI — too low loses fine type, too high wastes time without gains. Deskew and threshold damaged pages before OCR, and pick the right language model; Latin and Fraktur need their own trained data.
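As an illustration of the thresholding step, here is a sketch of Otsu's method, one common way to choose a black/white cutoff from a greyscale histogram. It is offered as an example technique, not as the only option, and it does not cover deskewing. The function works on a plain 256-entry list, which is the shape Pillow's `Image.histogram()` returns for a greyscale image.

```python
def otsu_threshold(hist):
    """Pick the grey level that best separates dark ink from light paper.

    hist: 256 pixel counts, index 0 = black, 255 = white.
    Returns t such that pixels <= t are treated as ink.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_dark, sum_dark = 0, 0
    best_t, best_score = 0, -1.0
    for t, count in enumerate(hist):
        weight_dark += count
        sum_dark += t * count
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue  # nothing on the dark side yet
        if weight_light == 0:
            break  # everything is on the dark side; no further splits
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance, up to a constant factor
        score = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Synthetic histogram: ink clustered at level 50, paper at level 200
hist = [0] * 256
hist[50] = 100
hist[200] = 100
print(otsu_threshold(hist))
```

With a Pillow page image `img`, the cutoff could be applied in three steps: `gray = img.convert("L")`, then `t = otsu_threshold(gray.histogram())`, then `bw = gray.point(lambda p: 255 if p > t else 0)`, and `bw` passed to `pytesseract.image_to_string`.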

Which library should I choose?

Library                 | Best for                     | Notes
pdfplumber              | Layout-aware text and tables | Slower, but precise bounding boxes
pypdf                   | Quick bulk text, metadata    | Lightweight; weaker on layout
pdf2image + pytesseract | Scanned documents            | OCR pipeline; needs Tesseract and Poppler installed
camelot                 | Ruled tables                 | Pairs well with pdfplumber for tabular sources

Step 5: Extracting tables from the page

Tabular sources — census schedules, price lists — are common in history. pdfplumber finds them directly:

python
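import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for table in pdf.pages[0].extract_tables():
        for row in table:
            print(row)  # each row is a list of cell strings (or None)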

Always spot-check against the original page. Merged cells, rotated headers, and ruled-versus-whitespace tables routinely need a manual correction pass before the data is trustworthy.
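For the correction pass it is often easier to work in a spreadsheet than in printed Python lists. The sketch below turns one extracted table into CSV text using only the standard library; it assumes the list-of-rows shape that `extract_tables()` yields, and `table_to_csv` is a hypothetical helper name.

```python
import csv
import io


def table_to_csv(rows):
    """Serialise one extracted table (a list of rows) as CSV text.

    Cells that came back as None become empty strings, and line breaks
    inside a cell are collapsed to single spaces.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([" ".join((cell or "").split()) for cell in row])
    return buffer.getvalue()


# Synthetic rows standing in for a census schedule
sample = [
    ["Name", "Age", "Occupation"],
    ["Mary Holt", "34", None],
    ["John\nHolt", "36", "Weaver"],
]
print(table_to_csv(sample))
```

Empty cells stay visible as blank fields in the CSV, which makes gaps from merged cells easy to spot when checking against the page.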

Key Takeaways

  • Detect the PDF type first: born-digital has a text layer, scans need OCR.
  • Use pdfplumber for layout-aware extraction and pypdf for quick bulk text.
  • Multi-column pages must be cropped per column or the reading order scrambles.
  • For scans, render at 300 DPI and OCR with pytesseract, choosing the correct language model.
  • Pre-processing (deskew, threshold) improves OCR accuracy more than swapping libraries.
  • Verify extracted tables against the page; merged cells and rotated headers need manual fixes.

Frequently Asked Questions

Which Python library should I use to extract PDF text?

For born-digital PDFs with a real text layer, pdfplumber or pypdf works well. For scanned PDFs with no text layer you need OCR via pytesseract after rendering pages to images; a text-extraction library alone cannot read pixels.

Why does my extracted text come out empty or garbled?

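An empty result usually means the PDF is a scanned image with no text layer, so you need OCR. Garbled text often comes from custom font encodings or from two-column layouts that linearise wrongly. Cropping each column with pdfplumber fixes the layout case; when the font encoding itself is broken, OCR on the rendered page is usually the practical fallback.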

How do I tell if a PDF needs OCR?

Try extracting text first; if a page returns nothing or only whitespace, it is almost certainly a scan. You can also check whether the file size is large relative to the page count, which signals embedded images rather than text.
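The file-size check can be made concrete as bytes per page. This is a rough sketch only: the 100 kB cutoff is an assumed value for illustration, not a standard, and born-digital files with many figures can exceed it. The page count could come from `len(pdf.pages)` in pdfplumber.

```python
import os


def bytes_per_page(path, page_count):
    """Average file size per page, in bytes."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return os.path.getsize(path) / page_count


def looks_scanned(path, page_count, cutoff=100_000):
    """Rough guess: True when the average page is heavier than `cutoff` bytes.

    The default cutoff is an assumption; tune it on files you already know.
    """
    return bytes_per_page(path, page_count) > cutoff
```

Treat the result as a hint to confirm with the text-extraction test, not as a verdict on its own.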

How do I keep the reading order of multi-column pages?

Either detect the column boundaries from the word positions pdfplumber reports, or crop each column with a fixed bounding box, then extract the columns in reading order. Naive extraction reads straight across both columns and scrambles the text.

Can I extract tables from a historical PDF?

Yes. pdfplumber's extract_tables() and the camelot library both detect ruled and whitespace-separated tables. Always verify the output against the page, since merged cells and rotated headers often need manual fixes.

How accurate is OCR on old printed documents?

Modern Tesseract reaches well above 95 percent character accuracy on clean 19th-century print, but drops sharply on damaged, skewed, or Gothic-type pages. Pre-processing — deskewing, thresholding, and the right language model — matters more than the library choice.
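Accuracy on your own material is best measured rather than assumed. A minimal sketch, assuming you have hand-transcribed a short sample to compare against: character accuracy is taken here as one minus the edit distance divided by the length of the reference transcription. Both function names are illustrative, and other definitions of accuracy exist.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_text):
    """Share of reference characters the OCR got right, between 0 and 1."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1 - errors / len(reference))


# Invented example strings, not real OCR output
reference = "The parish register for 1843"
ocr_text = "The pari5h regi5ter for 1843"
print(round(character_accuracy(reference, ocr_text), 3))
```

Scoring a few representative pages before and after pre-processing shows whether a given cleaning step is actually helping.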