How to Extract Text from PDFs in Python

To extract text from PDFs in Python, first decide which kind of PDF you have. Born-digital PDFs carry a real text layer you can read with pdfplumber or pypdf in a few lines. Scanned PDFs are just images, so you must render each page and run OCR with pytesseract. Getting this distinction right at the start saves hours of fighting empty output.

Here is the full workflow, from detecting the PDF type to handling columns, tables, and OCR.

Step 1: Is this a born-digital PDF or a scan?

The fastest test is to attempt extraction and see if anything comes back:

python

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    first = pdf.pages[0].extract_text() or ""
    print("Text layer present" if first.strip() else "Likely a scan -> needs OCR")

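If the page returns an empty string, it has no text layer and you are in OCR territory. Test more than the first page before deciding, since a blank cover or a scanned title page can sit in front of pages that do carry text. A born-digital PDF usually returns readable text immediately.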
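To apply the same test to every page rather than one, here is a minimal sketch. The helper names `pages_needing_ocr` and `scan_report` are hypothetical, and the 20-character cutoff is an assumed value rather than a pdfplumber setting.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is too short
    to count as a real text layer.

    min_chars is an assumed cutoff for illustration; tune it per collection.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def scan_report(pdf_path, min_chars=20):
    """Extract every page with pdfplumber and list the pages that probably need OCR."""
    import pdfplumber  # imported here so the pure helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    return pages_needing_ocr(texts, min_chars)
```

Calling `scan_report("document.pdf")` returns a list of page numbers. An empty list suggests a born-digital file, while a list covering every page suggests a pure scan. Mixed results are common in digitised volumes that have a typed preface bound in.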

Step 2: Extracting text from a born-digital PDF

For real-text PDFs, pdfplumber gives you page-level control and keeps you closer to the layout than pypdf:

python

import pdfplumber

# Keep one string per page so page numbers survive the extraction
with pdfplumber.open("report.pdf") as pdf:
    pages = [page.extract_text() or "" for page in pdf.pages]
full_text = "\n\n".join(pages)

Keep the per-page list alongside the joined text, and save the result per page rather than as one blob; it makes citing a specific page trivial later.
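One way to store the result per page is sketched below, assuming a list of page strings like the one built above. The function name `save_pages` and the `page_001.txt` naming scheme are illustrative choices, not part of any library.

```python
from pathlib import Path


def save_pages(pages, out_dir, stem="page"):
    """Write each page's text to its own UTF-8 file and return the paths in page order.

    Files are numbered from 1 so that the file name matches the PDF page number.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        path = out / f"{stem}_{number:03d}.txt"
        path.write_text(text or "", encoding="utf-8")  # None becomes an empty file
        paths.append(path)
    return paths
```

Numbering from 1 is a deliberate choice here: a citation to "p. 14" then maps directly to `page_014.txt` without an off-by-one adjustment.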

Step 3: How do I handle multi-column pages?

A two-column journal or directory page is the classic trap: naive extraction reads straight across, interleaving the columns. Crop each column with a bounding box and extract them separately:

python

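import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    w, h = page.width, page.height
    # crop() takes (x0, top, x1, bottom) in PDF points
    left = page.crop((0, 0, w / 2, h)).extract_text() or ""
    right = page.crop((w / 2, 0, w, h)).extract_text() or ""
ordered = left + "\n" + right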

Adjust the split point to the actual gutter; printed directories rarely divide exactly at the centre.
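Finding the gutter can be partly automated. The sketch below looks for the widest vertical strip that no word touches, using the `x0` and `x1` keys that pdfplumber's `page.extract_words()` returns for each word. The function name `find_gutter` and the assumption that the gutter lies in the middle half of the page are illustrative, not a pdfplumber feature.

```python
def find_gutter(words, page_width, search=(0.25, 0.75)):
    """Estimate the x-coordinate of the column gutter from word boxes.

    words: dicts with "x0" and "x1" keys (as from pdfplumber's extract_words()).
    search: assumed fraction of the page width in which to look for the gutter.
    Returns the midpoint of the widest empty strip, or half the width if none exists.
    """
    width = float(page_width)
    n_bins = int(width) + 1
    covered = [False] * n_bins  # one bin per PDF point
    for word in words:
        start = max(int(word["x0"]), 0)
        end = min(int(word["x1"]), n_bins - 1)
        for x in range(start, end + 1):
            covered[x] = True

    lo, hi = int(width * search[0]), int(width * search[1])
    best_start, best_len, run_start = None, 0, None
    for x in range(lo, hi + 1):
        if covered[x]:
            run_start = None
            continue
        if run_start is None:
            run_start = x
        run_len = x - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len

    if best_start is None:
        return width / 2
    return best_start + best_len / 2
```

With `gutter = find_gutter(page.extract_words(), page.width)`, the two crops become `(0, 0, gutter, h)` and `(gutter, 0, w, h)`. A headline or rule that spans both columns covers the gutter and defeats the heuristic, so it is worth cropping away the page header first or filtering out very wide words.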

Step 4: OCR for scanned historical documents

When there is no text layer, render each page to an image and pass it to Tesseract via pytesseract. The basic pipeline is short; most of the accuracy comes from pre-processing the images before they reach Tesseract:

python

import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("scan.pdf", dpi=300)  # one PIL image per page
for i, img in enumerate(pages):
    text = pytesseract.image_to_string(img, lang="eng")  # or "deu", "lat", etc.
    # Write each page to its own file; the with-block closes the handle
    with open(f"page_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(text)

Render at about 300 DPI: too low loses fine type, and much higher mostly costs time without improving accuracy. Deskew and threshold damaged pages before OCR, and pick the right language model; Latin and Fraktur each need their own trained data.
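The thresholding step can be sketched with Otsu's method, which picks the grey level that best separates ink from paper. Otsu is one common choice rather than something this workflow prescribes, and the function names `otsu_threshold` and `binarize` are hypothetical. Deskewing is not covered here.

```python
def otsu_threshold(histogram):
    """Return the grey level (0-255) that best separates dark ink from light paper.

    histogram: 256 pixel counts, e.g. from Pillow's image.convert("L").histogram().
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_dark = 0
    sum_dark = 0.0
    best_level, best_variance = 0, -1.0
    for level, count in enumerate(histogram):
        weight_dark += count
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * count
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner split
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image):
    """Convert a Pillow image to pure black and white using the Otsu level."""
    grey = image.convert("L")
    level = otsu_threshold(grey.histogram())
    return grey.point(lambda p: 255 if p > level else 0)
```

In the OCR loop, the call would become `pytesseract.image_to_string(binarize(img), lang="eng")`. A single global threshold suits evenly lit scans; pages with stains or uneven lighting may need a local method instead.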

Which library should I choose?

| Library | Best for | Notes |
| --- | --- | --- |
| `pdfplumber` | Layout-aware text and tables | Slower, but precise bounding boxes |
| `pypdf` | Quick bulk text, metadata | Lightweight; weaker on layout |
| `pdf2image` + `pytesseract` | Scanned documents | OCR pipeline; needs Tesseract and Poppler installed |
| `camelot` | Ruled tables | Pairs well with pdfplumber for tabular sources |

Step 5: Extracting tables from the page

Tabular sources such as census schedules and price lists are common in historical research. pdfplumber finds them directly:

python

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for table in pdf.pages[0].extract_tables():
        for row in table:
            print(row)

Always spot-check against the original page. Merged cells, rotated headers, and tables separated by whitespace rather than ruled lines routinely need a manual correction pass before the data is trustworthy.
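Before that manual pass, it helps to normalise the raw rows. pdfplumber returns each table as a list of rows whose cells are strings or `None`, often with line breaks inside a cell. The sketch below tidies those rows and writes CSV text; `clean_table` and `table_to_csv` are hypothetical helper names.

```python
import csv
import io


def clean_table(rows):
    """Normalise one extracted table: None becomes "", runs of whitespace
    (including line breaks inside a cell) collapse to one space, and rows
    with no content are dropped."""
    cleaned = []
    for row in rows:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def table_to_csv(rows):
    """Serialise a table to CSV text after cleaning it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(clean_table(rows))
    return buffer.getvalue()
```

This only tidies what was extracted. It cannot tell that a blank cell was part of a merged cell on the page, so the comparison against the original is still needed.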

Key Takeaways

Detect the PDF type first: born-digital has a text layer, scans need OCR.
Use pdfplumber for layout-aware extraction and pypdf for quick bulk text.
Multi-column pages must be cropped per column or the reading order scrambles.
For scans, render at 300 DPI and OCR with pytesseract, choosing the correct language model.
Pre-processing (deskew, threshold) improves OCR accuracy more than swapping libraries.
Verify extracted tables against the page; merged cells and rotated headers need manual fixes.

Frequently Asked Questions

Which Python library should I use to extract PDF text?

For born-digital PDFs with a real text layer, pdfplumber or pypdf work well. For scanned PDFs with no text layer you need OCR via pytesseract after rendering pages to images, because text-extraction libraries cannot read pixels.

Why does my extracted text come out empty or garbled?

An empty result usually means the PDF is a scanned image with no text layer, so you need OCR. Garbled text often comes from custom font encodings or two-column layouts that linearise wrongly. Cropping each column with pdfplumber fixes the ordering problem; for badly encoded fonts, OCR on the rendered page may be the only practical option.

How do I tell if a PDF needs OCR?

Try extracting text first; if a page returns nothing or only whitespace, it is almost certainly a scan. You can also check whether the file size is large relative to the page count, which signals embedded images rather than text.

How do I keep the reading order of multi-column pages?

Crop each column with a pdfplumber bounding box and extract the columns in reading order, placing the split at the actual gutter rather than the page centre. Naive extraction reads straight across both columns and scrambles the text.

Can I extract tables from a historical PDF?

Yes. pdfplumber's extract_tables() and the camelot library both detect ruled and whitespace-separated tables. Always verify the output against the page, since merged cells and rotated headers often need manual fixes.

How accurate is OCR on old printed documents?

Modern Tesseract can reach well above 95 percent character accuracy on clean 19th-century print, but accuracy drops sharply on damaged, skewed, or Gothic-type pages. Pre-processing (deskewing and thresholding) and the right language model matter more than the library choice.
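Accuracy figures are only meaningful for your own material, so it is worth measuring on a sample. A minimal sketch, assuming you have hand-transcribed a page or two as a reference: character accuracy is taken here as one minus the Levenshtein distance divided by the reference length. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, reference):
    """Return 1 - (edit distance / reference length).

    The value can fall below 0 when the OCR output is much longer than the reference.
    """
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1 - edit_distance(ocr_text, reference) / len(reference)
```

Comparing the score before and after a pre-processing change shows whether the change helped on that sample. Normalise whitespace and line breaks in both texts first, or layout differences will be counted as errors.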
