Python for Historians

To parse TEI and XML with Python reliably, use the lxml library, register the TEI namespace so your XPath actually matches, and wrap extraction in a per-file function you map across the whole collection while logging failures. The two mistakes that derail almost every beginner are forgetting that TEI lives in a namespace (so plain //title returns nothing) and reading only an element's .text (which silently drops mixed content). Get those two right and the rest is a disciplined, repeatable checklist.

Why use lxml rather than the standard library?

Python ships with xml.etree.ElementTree, which supports only a limited subset of XPath. For TEI you want lxml: it offers full XPath 1.0, proper namespace handling, fast parsing of multi-megabyte files, and schema validation. Install it with pip install lxml and parse a document like this:

python
from lxml import etree

tree = etree.parse("letters/letter_001.xml")
root = tree.getroot()
print(root.tag)   # {http://www.tei-c.org/ns/1.0}TEI

Notice the tag prints with a namespace in braces. That brace prefix is the source of nearly every "my XPath finds nothing" problem.
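
To see what that brace notation contains, the short sketch below splits a tag into its namespace and local name with lxml's etree.QName. The sample document is an invented, in-memory stand-in for a real letter file.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample, parsed from a string so the example needs no file on disk.
sample = f'<TEI xmlns="{TEI_NS}"><text><body><p>Dear friend</p></body></text></TEI>'
root = etree.fromstring(sample)

qname = etree.QName(root)
print(root.tag)         # {http://www.tei-c.org/ns/1.0}TEI
print(qname.namespace)  # http://www.tei-c.org/ns/1.0
print(qname.localname)  # TEI

# The default namespace (declared without a prefix) appears under the key None.
print(root.nsmap)       # {None: 'http://www.tei-c.org/ns/1.0'}
```

Because the document declares TEI as its default namespace, every descendant element carries the same brace prefix even though the source XML never writes one.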

How do you make XPath work with TEI namespaces?

TEI elements belong to the namespace http://www.tei-c.org/ns/1.0. Register a prefix once, then use it in every query:

python
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

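# correct: the step carries the tei: prefix, so this returns matches
titles = tree.xpath("//tei:title", namespaces=NS)

# wrong: an unprefixed name means "title in no namespace", so this silently returns []
no_matches = tree.xpath("//title")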

Make NS a module-level constant and pass it everywhere. This single habit eliminates the most common class of TEI parsing bug.
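
A subtler version of the same bug is prefixing only part of a path. The sketch below, run against an invented in-memory header, suggests that every step needs the prefix and that the prefix label itself is arbitrary as long as it maps to the TEI namespace URI.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Invented sample header for illustration.
sample = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader><fileDesc><titleStmt>
    <title>Letter to a friend</title>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""
root = etree.fromstring(sample)

unprefixed = root.xpath("//title")                                   # no prefix at all
half = root.xpath("//tei:titleStmt/title", namespaces=NS)            # last step unprefixed
full = root.xpath("//tei:titleStmt/tei:title", namespaces=NS)        # every step prefixed
other_label = root.xpath("//t:title", namespaces={"t": TEI_NS})      # different prefix, same URI

print(len(unprefixed), len(half), len(full), len(other_label))  # 0 0 1 1
```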

How do you extract clean text from mixed content?

TEI is full of mixed content: a persName sitting inside a sentence, a date mid-paragraph. An element's .text holds only the text before its first child; whatever follows a child is stored in that child's .tail, so reading .text alone drops it. To get everything, use itertext() and normalise whitespace:

python
import re

def clean_text(el):
    text = "".join(el.itertext())
    return re.sub(r"\s+", " ", text).strip()

This collapses the line breaks and indentation that pretty-printed XML introduces, giving you the readable string a reader would actually see.
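
To make the difference concrete, here is a small comparison on an invented paragraph (the names and place are made up for illustration). It repeats the clean_text helper so the snippet runs on its own.

```python
import re
from lxml import etree

# Invented, pretty-printed paragraph with two inline elements.
sample = """<p xmlns="http://www.tei-c.org/ns/1.0">I met
    <persName>Anna Weber</persName> in
    <placeName>Leipzig</placeName> last June.</p>"""
p = etree.fromstring(sample)

def clean_text(el):
    text = "".join(el.itertext())
    return re.sub(r"\s+", " ", text).strip()

print(repr(p.text))   # 'I met\n    '  (stops at the first child)
print(clean_text(p))  # I met Anna Weber in Leipzig last June.
```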

How do you process a whole collection?

Write one extractor for a single document, then apply it across a folder. Keeping the per-file logic separate makes it testable and keeps batch handling clean:

python
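from glob import glob

import pandas as pd
from lxml import etree

# NS and clean_text are the constant and helper defined in the sections above.

def extract(path):
    """Pull the target fields out of one TEI file."""
    tree = etree.parse(path)
    return {
        "file": path,
        # [0] raises IndexError when the element is missing; the loop below records it
        "title": clean_text(tree.xpath("//tei:titleStmt/tei:title", namespaces=NS)[0]),
        "sender": clean_text(tree.xpath("//tei:correspAction[@type='sent']/tei:persName", namespaces=NS)[0]),
        "people": [clean_text(p) for p in tree.xpath("//tei:body//tei:persName", namespaces=NS)],
    }

rows, failures = [], []
for path in sorted(glob("letters/*.xml")):  # sorted so runs are repeatable
    try:
        rows.append(extract(path))
    except Exception as e:
        # keep the exception type: the bare IndexError message is only "list index out of range"
        failures.append((path, f"{type(e).__name__}: {e}"))

df = pd.DataFrame(rows)
print(f"parsed {len(rows)}, failed {len(failures)}")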

Collecting failures rather than crashing means one malformed file out of 800 does not lose you the other 799.
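
One possible refinement, sketched here as an assumption rather than a required pattern, is a small helper that distinguishes required fields from optional ones. The name first_text and its required flag are hypothetical; the point is that a missing optional field becomes None instead of a failure, while a missing required field raises an error that names the XPath.

```python
import re
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def clean_text(el):
    return re.sub(r"\s+", " ", "".join(el.itertext())).strip()

def first_text(tree, xpath, required=True):
    """Return the cleaned text of the first match, or None if optional and absent."""
    matches = tree.xpath(xpath, namespaces=NS)
    if matches:
        return clean_text(matches[0])
    if required:
        raise LookupError(f"no match for {xpath}")
    return None

# Invented sample with a title but no correspondence metadata.
doc = etree.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Letter 12</title></titleStmt></fileDesc></teiHeader>
</TEI>""")

SENDER = "//tei:correspAction[@type='sent']/tei:persName"
print(first_text(doc, "//tei:titleStmt/tei:title"))  # Letter 12
print(first_text(doc, SENDER, required=False))       # None
```

Calling first_text(doc, SENDER) with the default required=True raises LookupError, which the batch loop would record alongside the filename.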

Should you validate before you parse?

Where a schema exists, validating pays for itself. A quick comparison of approaches:

| Approach | Catches | Cost |
|---|---|---|
| Schema/RelaxNG validation | Structural errors, missing required elements | Setup time, a schema file |
| Try/except around parse | Malformed XML, missing nodes | Minimal |
| Assertions on field counts | Unexpected emptiness, encoding loss | A few lines |

At minimum, assert that key fields are non-empty after extraction — a column that is unexpectedly all blank usually signals a namespace or path mistake, not genuinely empty sources.
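
A minimal sketch of such a check follows, written against a plain list of dicts so it does not depend on pandas. The function names, the sample rows, and the 0.5 threshold are all illustrative assumptions; pick a threshold that suits your collection.

```python
def blank_share(rows, field):
    """Fraction of rows whose value for `field` is missing or empty."""
    if not rows:
        raise ValueError("no rows to check")
    blanks = sum(1 for row in rows if not row.get(field))
    return blanks / len(rows)

def suspect_fields(rows, fields, max_blank=0.5):
    """Return the fields whose blank share exceeds max_blank (hypothetical threshold)."""
    return [f for f in fields if blank_share(rows, f) > max_blank]

# Invented extraction results: every sender is blank, which hints at a path mistake.
rows = [
    {"file": "a.xml", "title": "Letter 1", "sender": ""},
    {"file": "b.xml", "title": "Letter 2", "sender": ""},
    {"file": "c.xml", "title": "", "sender": ""},
]

print(suspect_fields(rows, ["title", "sender"]))  # ['sender']
```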

What does a defensible TEI checklist look like?

  • Parse with lxml rather than defaulting to the standard library.
  • Register the TEI namespace and use the prefix in every XPath.
  • Extract text with itertext() plus whitespace normalisation.
  • Guard missing nodes so a batch never dies on one file.
  • Log failures with the filename and error message.
  • Record which TEI elements you targeted, so the extraction is reproducible.
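
One way to cover the last item, offered as a sketch rather than a prescribed method, is to keep the XPath expressions in a single dict and write that dict out next to the results. The FIELDS name and the manifest layout are assumptions made for this example.

```python
import json
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Single source of truth for what was extracted (hypothetical field set).
FIELDS = {
    "title": "//tei:titleStmt/tei:title",
    "sender": "//tei:correspAction[@type='sent']/tei:persName",
}

def extract_fields(root):
    """Apply every XPath in FIELDS; a missing node yields None instead of an error."""
    out = {}
    for name, xpath in FIELDS.items():
        matches = root.xpath(xpath, namespaces=NS)
        out[name] = " ".join("".join(matches[0].itertext()).split()) if matches else None
    return out

# Invented sample document.
root = etree.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Letter 12</title></titleStmt></fileDesc></teiHeader>
</TEI>""")

record = extract_fields(root)
manifest = json.dumps({"namespaces": NS, "fields": FIELDS}, indent=2)
print(record)    # {'title': 'Letter 12', 'sender': None}
print(manifest)  # save this alongside the output table
```

Storing the manifest with the data means a later reader can see exactly which elements each column came from without reading the code.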

Key Takeaways

  • lxml is the right tool: full XPath, namespaces, speed, validation.
  • TEI's namespace is the number-one cause of empty results — always register and prefix it.
  • Use itertext() to capture mixed content; .text alone drops it.
  • Build a per-file extractor and map it over the collection, logging failures.
  • Validate against a schema where possible, and assert key fields are non-empty.
  • Document the elements you extracted to keep the process auditable.

Frequently Asked Questions

Which library should I use to parse TEI in Python?

Use lxml. It supports full XPath 1.0 and namespaces, handles large files efficiently, and is far more capable than the standard library's ElementTree for the namespaced documents TEI produces.

Why do my XPath queries return nothing on a TEI file?

Almost always because of namespaces. TEI elements live in the http://www.tei-c.org/ns/1.0 namespace, so a plain //title finds nothing — you must register the namespace and write //tei:title.

How do I handle a whole folder of TEI files consistently?

Write one function that extracts your target fields from a single file, then map it over every file with a glob, collecting results into a list of dicts and finally a pandas DataFrame. Log any file that fails rather than stopping.

Should I validate TEI before parsing it?

Yes, where you can. Validating against the project's RelaxNG (or other) schema catches structural problems early, but at minimum wrap parsing in error handling so a single malformed file does not abort a batch of hundreds.
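
As a rough illustration of schema validation with lxml's etree.RelaxNG class, the sketch below uses a deliberately tiny, invented grammar that is nothing like the real TEI schema. In practice you would load your project's own schema file instead, for example with etree.RelaxNG(etree.parse(...)) pointed at that file.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Toy grammar for illustration only: a TEI element containing a teiHeader with one title.
RNG = f"""<grammar xmlns="http://relaxng.org/ns/structure/1.0" ns="{TEI_NS}">
  <start>
    <element name="TEI">
      <element name="teiHeader">
        <element name="title"><text/></element>
      </element>
    </element>
  </start>
</grammar>"""

validator = etree.RelaxNG(etree.fromstring(RNG))

def validation_errors(doc):
    """Return a list of error messages, empty if the document is valid."""
    if validator.validate(doc):
        return []
    return [f"line {e.line}: {e.message}" for e in validator.error_log]

good = etree.fromstring(f'<TEI xmlns="{TEI_NS}"><teiHeader><title>Letter 1</title></teiHeader></TEI>')
bad = etree.fromstring(f'<TEI xmlns="{TEI_NS}"><teiHeader/></TEI>')  # required title missing

print(validation_errors(good))
print(validation_errors(bad))
```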

How do I get clean text out of mixed-content elements?

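Use the element's itertext() to gather all descendant text, or evaluate the XPath string() function on the element, then normalise whitespace. (text_content() exists only on lxml.html elements, not on elements parsed with lxml.etree.) Reading only .text misses content that follows child elements.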
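
For readers who prefer to stay inside XPath, the sketch below compares itertext() with the XPath 1.0 functions string() and normalize-space() on an invented paragraph. Note that normalize-space() collapses only the whitespace characters XPath defines (space, tab, carriage return, line feed).

```python
from lxml import etree

# Invented sample with a line break and indentation inside the paragraph.
p = etree.fromstring("""<p xmlns="http://www.tei-c.org/ns/1.0">Sent from
    <placeName>Leipzig</placeName> on <date when="1848-03-02">2 March</date>.</p>""")

via_itertext = "".join(p.itertext())
via_string = p.xpath("string()")          # same raw text, whitespace untouched
normalised = p.xpath("normalize-space()") # trimmed, runs of whitespace collapsed

print(via_itertext == via_string)  # True
print(normalised)                  # Sent from Leipzig on 2 March.
```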

Can I parse TEI without learning XPath?

You can iterate elements by tag, but XPath is worth the small investment: it expresses 'every persName inside the body' in one readable line and makes your extraction logic auditable.
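
The two styles can be compared side by side. In this sketch (sample data invented), iteration by tag needs the full brace-qualified name and a separate step to reach the body, while the XPath version states the same intent in one expression.

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: one persName in the header, two in the body.
root = etree.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Letter by <persName>Anna Weber</persName></title>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Greetings to <persName>Karl</persName> and <persName>Luise</persName>.</p></body></text>
</TEI>""")

# Without XPath: locate the body, then iterate its descendants by qualified tag.
body = root.find(f".//{TEI}body")
by_tag = [el.text for el in body.iter(f"{TEI}persName")]

# With XPath: "every persName inside the body" in one line.
by_xpath = [el.text for el in root.xpath("//tei:body//tei:persName", namespaces=NS)]

print(by_tag)    # ['Karl', 'Luise']
print(by_xpath)  # ['Karl', 'Luise']
```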