Add search to a digital edition by indexing a clean, normalised reading text — not raw TEI and not styled HTML — and by carrying a stable anchor (an xml:id or character offset) on every indexed chunk so each hit can scroll to and highlight its exact place in the source. Decide early whether the corpus is small enough for a client-side index (Pagefind or Lunr.js, good to a few thousand short documents) or large enough to need a server (Solr, Elasticsearch, OpenSearch). Everything else — facets, original-versus-normalised spelling, citable results — hangs off those two decisions.
What should you actually index?
The most common mistake is feeding the indexer your TEI files or your final HTML. TEI buries the reading text between thousands of tags, so a search for Constantinople fails when an <lb/> element splits the word. HTML carries CSS class names and apparatus you never wanted searched.
Build an intermediate extraction step. For each text, emit a JSON record with separate fields:
```json
{
  "id": "letters/0142",
  "citation": "Reed Letters 142",
  "anchor": "#p_0142_07",
  "text_reg": "the council met at Constantinople in autumn",
  "text_orig": "the councell mett att Constantinople in autumne",
  "lang": "en",
  "date": "1591"
}
```

Strip <note>, <app> and editorial <add>/<del> into their own fields so a reader can choose whether annotations are searchable.
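As a rough illustration of the extraction step, the sketch below uses Python's standard xml.etree.ElementTree to pull a reading text out of one TEI paragraph. The sample paragraph, the make_record helper and the "notes" field name are assumptions made for this example. It only handles <note> and <lb/>, so <app>, <add> and <del> would need the same kind of treatment in a real build.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: a word split by a line break, plus an editorial note.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0" xml:id="p_0142_07">the council met at
Constanti<lb break="no"/>nople in autumn<note>see fol. 3</note></p>"""


def reading_text(elem, notes):
    """Collect the text under elem, diverting <note> content into notes."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "note":
            # Keep the note, but out of the reading text.
            notes.append(" ".join("".join(child.itertext()).split()))
        else:
            # <lb/> and other inline tags contribute no characters themselves.
            parts.append(reading_text(child, notes))
        parts.append(child.tail or "")
    return "".join(parts)


def make_record(xml_string, doc_id, citation):
    """Build one index record (field names follow the JSON example above)."""
    p = ET.fromstring(xml_string)
    notes = []
    text = " ".join(reading_text(p, notes).split())  # collapse whitespace
    return {
        "id": doc_id,
        "citation": citation,
        "anchor": "#" + p.get(XML_ID),
        "text_reg": text,
        "notes": notes,  # assumed field name for the separated annotations
    }


record = make_record(SAMPLE, "letters/0142", "Reed Letters 142")
```

Because the <lb/> element adds nothing to the output, the two halves of the word rejoin before the indexer ever sees them.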
Client-side or server-side index?
| Concern | Client-side (Pagefind / Lunr) | Server (Solr / OpenSearch) |
|---|---|---|
| Corpus size | up to ~5,000 short docs | millions of docs |
| Hosting | static files, zero backend | needs a running service |
| Fuzzy / stemming | basic | rich analysers per language |
| Facets | limited | first-class |
| Long-term cost | near zero | ongoing maintenance |
Pagefind is the pragmatic default for static editions: it builds a fragmented index at deploy time and downloads only the shards a query needs, so a 10 MB edition adds little to first paint.
How do I support original and normalised spelling together?
Encode the variation once in TEI and let the build derive both fields:
```xml
<choice><orig>councell</orig><reg>council</reg></choice>
```

Emit text_orig from the orig branch and text_reg from the reg branch, keep the token offsets aligned, then query both fields and merge by document id. A reader looking for early-modern mett and a reader looking for met both find the same line.
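One way to derive the two fields is sketched below, again with xml.etree.ElementTree. It assumes every <choice> wraps exactly one word and contains both an <orig> and a <reg>, which keeps the two token lists the same length, so position i means the same word in each. The collect and search_both helpers are illustrative names. The whole-token match stands in for whatever query the real search engine runs.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample paragraph matching the record shown earlier.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">the '
    "<choice><orig>councell</orig><reg>council</reg></choice> "
    "<choice><orig>mett</orig><reg>met</reg></choice> "
    "<choice><orig>att</orig><reg>at</reg></choice> "
    "Constantinople in "
    "<choice><orig>autumne</orig><reg>autumn</reg></choice></p>"
)


def _plain(text, orig, reg):
    """Text outside <choice> is identical in both branches."""
    for tok in (text or "").split():
        orig.append(tok)
        reg.append(tok)


def collect(elem, orig, reg):
    """Walk elem, appending one token per word to both lists in step."""
    _plain(elem.text, orig, reg)
    for child in elem:
        if child.tag == TEI + "choice":
            # Assumption: one word per <choice>, both branches present.
            orig.append("".join(child.find(TEI + "orig").itertext()).strip())
            reg.append("".join(child.find(TEI + "reg").itertext()).strip())
        else:
            collect(child, orig, reg)
        _plain(child.tail, orig, reg)


def search_both(records, term):
    """Match whole tokens in either field and merge hits by document id."""
    hits = {}
    for rec in records:
        for field in ("text_orig", "text_reg"):
            if term in rec[field].split():
                hits.setdefault(rec["id"], set()).add(field)
    return hits


orig_tokens, reg_tokens = [], []
collect(ET.fromstring(SAMPLE), orig_tokens, reg_tokens)
record = {
    "id": "letters/0142",
    "text_orig": " ".join(orig_tokens),
    "text_reg": " ".join(reg_tokens),
}
```

Here search_both([record], "mett") and search_both([record], "met") both return the id letters/0142, each tagged with the field that matched.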
Why do hits land on the wrong place?
Because the index threw away position. Every chunk you index must keep a pointer back into the rendered page. Carry the nearest xml:id, or a stable character offset, and store it as anchor. On the results page, link to text.html#p_0142_07 and run a small highlighter that wraps the matched term on arrival.
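The link-and-highlight idea can be sketched as follows. The on-arrival highlighter itself would run in the browser and is not shown. This is a build-time Python analogue that joins the page and the stored anchor, then wraps the matched term in a result snippet with <mark>. The function names are made up for the example, and the word-boundary match assumes the query is a plain word.

```python
import html
import re


def result_link(page, anchor):
    """Join a page name and a stored anchor (the anchor already starts with '#')."""
    return page + anchor


def highlight(snippet, term):
    """Escape the snippet for HTML and wrap whole-word matches of term in <mark>."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    out, last = [], 0
    for match in pattern.finditer(snippet):
        out.append(html.escape(snippet[last:match.start()]))
        out.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    out.append(html.escape(snippet[last:]))
    return "".join(out)


link = result_link("text.html", "#p_0142_07")
marked = highlight("the council met at Constantinople in autumn", "constantinople")
```

Escaping each piece separately keeps the inserted <mark> tags intact while any ampersands or angle brackets in the source text stay safe.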
How do you keep search quality from rotting?
Treat search as a regression test. Maintain a fixture file:
```text
Constantinople -> 17
councell -> 4
"in autumn" -> 9
```

Run it on every rebuild and fail CI if a count moves without an explanation. When you re-encode a manuscript or change your tokeniser, these numbers tell you instantly whether you helped or broke retrieval.
Making results citable
A search hit that cannot be quoted is half useless to a scholar. Put the canonical reference (a CTS URN, a shelfmark plus folio, or your project's citation string) into each record and render it beside the snippet. Pair it with the stable fragment URL so a reader copies both the words and where they came from.
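For instance, a copyable reference could be assembled from the record like this. The base URL https://example.org/edition/ and the citable helper are placeholders, not part of any real project.

```python
from urllib.parse import urljoin


def citable(record, base_url, page):
    """Return 'citation, stable URL' for one indexed record."""
    url = urljoin(base_url, page) + record["anchor"]
    return f'{record["citation"]}, {url}'


record = {"citation": "Reed Letters 142", "anchor": "#p_0142_07"}
reference = citable(record, "https://example.org/edition/", "text.html")
```

Rendering this string beside each snippet gives the reader one thing to copy that carries both the canonical reference and the exact location.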
Key Takeaways
- Index a normalised reading text extracted from TEI, never raw XML or styled HTML.
- Keep a stable anchor (xml:id or offset) on every chunk so hits highlight in place.
- Use Pagefind or Lunr for small static editions; move to Solr/OpenSearch above ~5,000 docs.
- Emit parallel orig and reg fields from TEI choice to search both spellings.
- Lock search quality with a fixture of known queries and expected counts in CI.
- Render a canonical citation and fragment URL with every result.
- Separate notes and apparatus into their own fields so readers can opt them in.
Frequently Asked Questions
Should I index the TEI source or the rendered HTML?
Index a normalised reading text derived from the TEI, not the raw XML and not the styled HTML. Strip apparatus, notes and markup into clean fields so a search for a word does not fail because a tag sits between two characters.
Do I need a server, or can search run in the browser?
For editions under roughly 5,000 short documents, a client-side index built with Lunr.js or Pagefind works without any backend. Above that, or when you need fuzzy matching and facets at scale, move to a server such as Elasticsearch, OpenSearch or Solr.
How do I let users search the original spelling and the normalised spelling?
Store both forms as separate analysed fields that point at the same token positions, then query both and merge hits. Encode the link in TEI with choice, orig and reg so the index build can emit parallel fields.
Why do my search hits land on the wrong line of the page?
Your index lost positional anchors. Carry an xml:id or a character offset for every indexed chunk so a hit can scroll to and highlight the exact pb, lb or seg it came from.
How should I test that search is actually working?
Build a fixed set of 20 to 40 known queries with expected hit counts, run them on every rebuild, and fail the build if counts drift. This turns search quality into a regression test rather than a vibe.
Can I make search results citable?
Yes. Encode the canonical citation and a stable fragment URL (for example a CTS URN or an xml:id anchor) in each indexed record, and render it in the results list so a reader can quote a hit directly.