OCR historical newspapers by solving layout and article segmentation first, then recognising text within each region, and exporting to ALTO/METS so structure and coordinates survive. The hard part of newspaper OCR is almost never the characters — it is the page: dense multi-column layouts, headlines spanning columns, advertisements, and worn type all conspire to scramble reading order. Get the segmentation right and a searchable, article-level archive follows; get it wrong and you have a keyword soup where columns interleave and a hit can't be traced to its story.
Why is newspaper OCR harder than book OCR?
A book page is usually one column, one font, one reading order. A newspaper page is the opposite: several columns, headlines in display type, body text in tiny serifs, captions, advertisements with their own layouts, and often bleed-through and broken type from cheap newsprint and microfilm. The recognition engine can read the characters; what defeats a naive pipeline is reading order — without column detection, line 1 of column 1 is followed by line 1 of column 2, and the article is gibberish.
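To make the interleaving failure concrete, here is a minimal sketch. It assumes each recognised line has been reduced to a hypothetical `(x, y, text)` tuple and that the left edges of the columns are already known (in ascending order); `naive_order` and `column_order` are illustrative names, not part of any OCR library.

```python
from typing import List, Tuple

Line = Tuple[int, int, str]  # (x of left edge, y of top edge, text)


def naive_order(lines: List[Line]) -> List[str]:
    """Sort top-to-bottom, then left-to-right. This interleaves columns."""
    return [text for _, _, text in sorted(lines, key=lambda l: (l[1], l[0]))]


def column_order(lines: List[Line], column_starts: List[int]) -> List[str]:
    """Assign each line to a column, then read column by column, top to bottom.

    column_starts holds the left edge of each column in ascending order
    (an assumption of this sketch; a layout model would supply it).
    """
    def column_of(x: int) -> int:
        idx = 0
        for i, start in enumerate(column_starts):
            if x >= start:
                idx = i  # last boundary at or to the left of x
        return idx

    ordered = sorted(lines, key=lambda l: (column_of(l[0]), l[1]))
    return [text for _, _, text in ordered]


# Two columns (A on the left, B on the right), two lines each.
lines = [
    (50, 100, "A1"), (50, 130, "A2"),
    (400, 100, "B1"), (400, 130, "B2"),
]

print(naive_order(lines))              # ['A1', 'B1', 'A2', 'B2']
print(column_order(lines, [0, 350]))   # ['A1', 'A2', 'B1', 'B2']
```

The only difference between the two functions is the sort key, which is why column detection has to happen before any text is concatenated.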
How do I split a page into articles?
Article segmentation is a layout problem in two stages:
```text
1. Region detection → columns, headlines, body blocks, ads, images, captions
2. Article grouping → assemble blocks into articles via column flow + heading cues
```

Layout models trained on newspapers (dhSegment, Newspaper Navigator-style detectors, or Transkribus' layout analysis) produce the regions. Grouping then walks the reading order: a headline begins an article, body blocks below and within its column span belong to it until the next headline. This is what turns a page into retrievable stories.
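The headline-cue half of the grouping step can be sketched as follows. This is a simplified illustration, not a production algorithm: it assumes the regions arrive already sorted in reading order, it omits the column-span check described above, and the `Region` and `Article` classes and the `kind` labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    kind: str  # assumed labels: "headline", "body", "ad", "image", "caption"
    text: str


@dataclass
class Article:
    headline: str
    body: List[str] = field(default_factory=list)


def group_articles(regions: List[Region]) -> List[Article]:
    """Walk regions in reading order; each headline opens a new article."""
    articles: List[Article] = []
    current: Optional[Article] = None
    for region in regions:
        if region.kind == "headline":
            current = Article(headline=region.text)
            articles.append(current)
        elif region.kind == "body" and current is not None:
            current.body.append(region.text)
        # Ads, images, captions and body blocks that precede any
        # headline are left out of the article flow in this sketch.
    return articles


page = [
    Region("headline", "Parliament Opens"),
    Region("body", "The session began on Monday."),
    Region("ad", "Fine teas, sold here."),
    Region("body", "Members debated the budget."),
    Region("headline", "Harbour Works Delayed"),
    Region("body", "Storms halted construction."),
]

for article in group_articles(page):
    print(article.headline, "->", len(article.body), "body blocks")
```

A fuller version would also compare each body block's horizontal extent with the headline's column span before attaching it.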
Handling broken type and microfilm noise
Much of the surviving record is microfilmed, which adds contrast loss and speckle on top of already worn type. Preprocessing, tuned per collection, recovers a surprising amount:
```bash
# Per-page adaptive binarisation + despeckle + deskew before OCR
python preprocess.py page.tif \
  --binarise sauvola --window 41 \
  --despeckle --deskew \
  -o page_clean.png
```

Adaptive (Sauvola) binarisation sets the threshold from each pixel's local neighbourhood, so it copes with uneven microfilm exposure where a single global threshold fails. A model trained on degraded newsprint then handles the residual broken glyphs far better than a clean-print model.
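For readers who want to see what the Sauvola step computes, here is a deliberately slow pure-Python sketch. It is not the implementation behind the script above. It applies the standard Sauvola rule, threshold = mean × (1 + k × (std / R − 1)), over a square window clipped at the image borders; the defaults `k=0.2` and `r=128.0` are common choices for 8-bit images and are assumptions here. In practice a vectorised routine such as scikit-image's `threshold_sauvola` would be the usual choice.

```python
from statistics import fmean, pstdev
from typing import List

Image = List[List[int]]  # greyscale rows, values 0..255


def sauvola_binarise(img: Image, window: int = 41,
                     k: float = 0.2, r: float = 128.0) -> Image:
    """Return a black/white image using a per-pixel Sauvola threshold."""
    half = window // 2
    height, width = len(img), len(img[0])
    out: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            # Local neighbourhood, clipped at the image borders.
            patch = [
                img[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = fmean(patch)
            std = pstdev(patch)
            threshold = mean * (1 + k * (std / r - 1))
            # Brighter than the local threshold means background (white).
            row.append(255 if img[y][x] > threshold else 0)
        out.append(row)
    return out


# A flat light background with one dark "ink" pixel in the centre.
sample = [[200] * 5 for _ in range(5)]
sample[2][2] = 20
cleaned = sauvola_binarise(sample, window=3)
print(cleaned[2])  # [255, 255, 0, 255, 255]
```

Because the mean and standard deviation are recomputed around every pixel, a dim region of the film gets a proportionally lower threshold than a bright one.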
Should headlines and body text be treated the same?
No. Headlines are large display type in a different font; feeding them through a body-text model wastes accuracy and can confuse line segmentation. Detect headlines as their own region class and recognise them separately — sometimes with different settings or a model tuned for display type. This also gives you the heading cues that drive article grouping.
| Region type | Challenge | Handling |
|---|---|---|
| Body text | Small, dense, worn | Tuned binarisation + newsprint model |
| Headline | Large display font | Separate region, separate recognition |
| Advertisement | Irregular mini-layouts | Detect and often exclude from article flow |
| Caption | Tiny, near images | Associate with adjacent image region |
| Table / listing | Tabular structure | Cell-aware extraction |
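One way to act on the table is a small routing map from region type to handling. The sketch below is illustrative only: the recogniser labels (`"newsprint-body"`, `"display-type"`, `"cell-aware"`) are hypothetical placeholders rather than real model names, and the choice to keep captions and tables out of the article flow is an assumption of this example.

```python
from typing import Dict, NamedTuple


class Handling(NamedTuple):
    recogniser: str        # hypothetical model or settings profile
    in_article_flow: bool  # whether the text feeds article grouping


ROUTES: Dict[str, Handling] = {
    "body": Handling("newsprint-body", True),
    "headline": Handling("display-type", True),
    "advertisement": Handling("newsprint-body", False),
    "caption": Handling("newsprint-body", False),
    "table": Handling("cell-aware", False),
}


def route(region_type: str) -> Handling:
    """Look up how a detected region should be recognised and used."""
    try:
        return ROUTES[region_type]
    except KeyError:
        raise ValueError(f"unknown region type: {region_type!r}") from None


print(route("headline"))
print(route("advertisement").in_article_flow)  # False
```

Keeping the routing in one table makes it easy to tune handling per collection without touching the recognition code.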
What output format keeps it all together?
Use ALTO XML with METS. ALTO stores every word with its coordinates on the page image (in the unit the file declares, commonly pixels) and keeps blocks, lines and words in reading order; METS records the structure that binds pages and articles into issues. Together they enable the two features readers expect: search hits highlighted on the image, and retrieval at article level rather than whole pages.
```xml
<String CONTENT="Parliament" WC="0.94"
        HPOS="412" VPOS="880" WIDTH="190" HEIGHT="28"/>
```

The bounding box is exactly what a search interface needs to draw a highlight over the scanned word, and the word-level WC (confidence) lets you flag doubtful tokens.
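As a rough illustration of how that element becomes a highlight rectangle, the sketch below parses the snippet with Python's standard `xml.etree.ElementTree`. The `highlight_box` helper is a hypothetical name. Real ALTO files declare an XML namespace, so element lookups in a full document would need namespace-qualified tag names; the bare snippet here does not.

```python
import xml.etree.ElementTree as ET

snippet = (
    '<String CONTENT="Parliament" WC="0.94" '
    'HPOS="412" VPOS="880" WIDTH="190" HEIGHT="28"/>'
)


def highlight_box(string_el: ET.Element) -> dict:
    """Convert an ALTO String element to text, confidence and a corner box."""
    left = float(string_el.get("HPOS"))
    top = float(string_el.get("VPOS"))
    right = left + float(string_el.get("WIDTH"))
    bottom = top + float(string_el.get("HEIGHT"))
    return {
        "text": string_el.get("CONTENT"),
        "confidence": float(string_el.get("WC")),
        "box": (left, top, right, bottom),
    }


word = highlight_box(ET.fromstring(snippet))
print(word["text"], word["box"])  # Parliament (412.0, 880.0, 602.0, 908.0)
```

The `(left, top, right, bottom)` form is what most image viewers expect when drawing an overlay.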
Making the archive searchable
Index the ALTO text with its coordinates into a search engine, store the article segmentation so each hit returns a story, and link every match back to its image region. The result is the experience of the major newspaper archives: type a name, get articles, and see the term highlighted on the original page. Run a final post-correction pass with a period lexicon to fix systematic OCR errors before indexing, since corrections are cheap to apply once and expensive to retrofit across millions of indexed tokens.
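The flow above can be sketched end to end with an in-memory dictionary standing in for the search engine. Everything here is an assumption made for illustration: the `(article_id, token, box)` record shape, the article identifiers, the tiny lexicon, and the use of `difflib.get_close_matches` as a stand-in for lexicon-based post-correction (a real pipeline would need a far larger period lexicon and more careful matching).

```python
from collections import defaultdict
from difflib import get_close_matches
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) on the page image
Word = Tuple[str, str, Box]       # (article_id, token, box)
Hit = Tuple[str, Box]             # (article_id, box)


def correct(token: str, lexicon: List[str], cutoff: float = 0.8) -> str:
    """Return the closest lexicon entry, or the token unchanged if none is close."""
    if token in lexicon:
        return token
    match = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token


def build_index(words: List[Word], lexicon: List[str]) -> Dict[str, List[Hit]]:
    """Correct each token once, then index it with its article and box."""
    index: Dict[str, List[Hit]] = defaultdict(list)
    for article_id, token, box in words:
        index[correct(token.lower(), lexicon)].append((article_id, box))
    return index


def search(index: Dict[str, List[Hit]], query: str) -> List[Hit]:
    """Return (article_id, box) pairs so the UI can show and highlight the story."""
    return index.get(query.lower(), [])


lexicon = ["parliament", "minister"]
words = [
    ("issue1-p1-a1", "Parliarnent", (412, 880, 602, 908)),  # "m" misread as "rn"
    ("issue1-p1-a2", "minister", (90, 300, 210, 326)),
]

index = build_index(words, lexicon)
print(search(index, "Parliament"))  # [('issue1-p1-a1', (412, 880, 602, 908))]
```

Correcting before indexing means the misread token is stored once under its intended spelling, which is the "cheap to apply once" point made above.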
Key Takeaways
- Newspaper OCR is limited by layout and reading order, not character recognition — segment first.
- Detect regions (columns, headlines, ads) then group blocks into articles using heading and column cues.
- Treat headlines as a separate region and recognition task so display type doesn't confuse body text.
- Tune per-collection preprocessing (adaptive binarisation, despeckle, deskew) for microfilm and worn type.
- Export to ALTO XML + METS so word coordinates, reading order and article structure survive.
- Index word-level coordinates and segment by article so search returns highlighted, story-level hits.
Frequently Asked Questions
Why is newspaper OCR so much harder than book OCR?
Newspapers combine multi-column layouts, mixed fonts, headlines, advertisements and worn type at high density, so the reading-order and segmentation problem is far harder than a single-column book page. Layout analysis, not character recognition, is usually the limiting factor.
How do I split a newspaper page into separate articles?
Use a layout model that detects columns, headlines and article blocks, then group blocks into articles using column flow and heading cues. Tools like dhSegment, Newspaper Navigator-style models, or Transkribus layout analysis produce these regions.
What format should newspaper OCR output use?
ALTO XML paired with METS is the standard for newspaper archives because it preserves word coordinates, reading order and article structure, enabling highlighted search hits and article-level retrieval.
How do I make a newspaper archive searchable?
Index the OCR text with word-level coordinates from ALTO into a search engine, link results back to image regions, and segment by article so hits return a meaningful unit rather than a whole page.
Can I improve OCR on broken or worn newspaper type?
Yes — targeted preprocessing (binarisation tuned to the page, despeckling, deskew) and a model trained on degraded newsprint help most. Post-correction with a period lexicon then cleans systematic residual errors.
Should headlines and body text be OCR'd the same way?
Not necessarily — headlines use larger, different fonts and benefit from being detected as separate regions, sometimes with a different model or settings, so their size and styling do not confuse body-text recognition.