To extract and normalise historical dates, capture the date phrase verbatim with its offset, classify it (precise, range, approximate, regnal, feast-relative), then convert it to a machine-readable value using ISO 8601 for exact dates and EDTF for everything uncertain. The single most important rule: never discard the original string. Normalisation is an interpretation, and you must be able to show your working.
Historical dates are not clean. You will meet Old Style years, "the morrow of All Souls", "anno regni 12 Eliz", double-dated documents, and centuries-long approximations. A pipeline that assumes YYYY-MM-DD will silently corrupt your timeline.
What format handles fuzzy historical dates?
Plain ISO 8601 covers precise dates but cannot express "around 1650" or "sometime in the 1540s". The Extended Date/Time Format (EDTF), standardised within ISO 8601-2, was designed for this:
| Source phrase | EDTF | Meaning |
|---|---|---|
| 12 May 1649 | 1649-05-12 | exact |
| about 1650 | 1650~ | approximate |
| 1640 or 1641 | [1640,1641] | one of these |
| in the 1540s | 154X | unspecified within decade |
| before 1700 | ../1700 | open start |
| uncertain 1649 | 1649? | possibly this year |
Adopt EDTF as your normalised field and you can represent almost any historical date honestly.
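As an illustration of how a few of the phrase types in the table could be mapped mechanically, here is a minimal sketch. The rule list and the `phrase_to_edtf` helper are hypothetical names, not part of any library, and the patterns cover only the simplest English wordings; anything unrecognised returns `None` so it can be reviewed by hand rather than guessed.

```python
import re

# Hypothetical rules: (pattern, function building the EDTF string).
# A real project would derive these from its own normalisation spec.
RULES = [
    (re.compile(r"^(?:about|around|circa|c\.)\s*(\d{4})$", re.I),
     lambda m: f"{m.group(1)}~"),           # approximate year
    (re.compile(r"^(?:uncertain|possibly|perhaps)\s+(\d{4})$", re.I),
     lambda m: f"{m.group(1)}?"),           # uncertain year
    (re.compile(r"^in the (\d{3})0s$", re.I),
     lambda m: f"{m.group(1)}X"),           # unspecified year within a decade
    (re.compile(r"^before (\d{4})$", re.I),
     lambda m: f"../{m.group(1)}"),         # interval with an open start
]


def phrase_to_edtf(phrase):
    """Return an EDTF string for a recognised phrase, or None to flag for review."""
    cleaned = phrase.strip()
    for pattern, build in RULES:
        match = pattern.match(cleaned)
        if match:
            return build(match)
    return None


for sample in ["about 1650", "in the 1540s", "before 1700", "the morrow of All Souls"]:
    print(sample, "->", phrase_to_edtf(sample))
```

Returning `None` instead of a best guess keeps the rule layer honest: phrases it cannot classify stay visible in the review queue.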
How do I extract dates from messy text?
A model-only approach misses too much. Layer three detectors and union their results:
```python
import re

# Day, month name, year, e.g. "12 May 1649"
NUMERIC = re.compile(r"\b\d{1,2}\s+(January|February|March|April|May|June|July|"
                     r"August|September|October|November|December)\s+\d{3,4}\b")

# Regnal years, e.g. "12 Eliz" or "3 Edward VI"
REGNAL = re.compile(r"\b\d{1,2}\s+(Eliz|Hen|Edw|Edward|Henry|Geo|George|Car)[a-z]*\.?\s*[IVX]*\b")

text = "Sealed 12 May 1649, confirming a grant of 3 Edward VI."  # sample input

for pattern in (NUMERIC, REGNAL):
    for m in pattern.finditer(text):
        print(m.group(), m.start())  # verbatim match and its character offset
```

Add a feast-day table so "the feast of St Michael" resolves to 29 September, and "the Monday after" can be computed from it.
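A feast-day table of that kind might look like the sketch below. The `FEASTS` dictionary and the two helper functions are illustrative names, and the table holds only a handful of fixed-date feasts; movable feasts such as Easter need a separate calculation. One assumption matters a great deal: Python's `datetime.date` is proleptic Gregorian, so the weekday arithmetic here is only valid for dates already expressed in the Gregorian calendar. A Julian-dated source would need converting before any weekday is computed.

```python
from datetime import date, timedelta

# Fixed-date feasts only, as (month, day). Illustrative, not exhaustive.
FEASTS = {
    "lady day": (3, 25),
    "michaelmas": (9, 29),
    "all saints": (11, 1),
    "all souls": (11, 2),
}

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def feast_date(name, year):
    """Look up a fixed feast and return it as a Gregorian date."""
    month, day = FEASTS[name.lower()]
    return date(year, month, day)


def weekday_after(name, year, weekday):
    """Return the first given weekday falling strictly after the feast."""
    start = feast_date(name, year)
    target = WEEKDAYS.index(weekday.lower())
    # 1..7 days ahead; a feast that falls on the target weekday moves a full week.
    delta = (target - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=delta)


print(feast_date("Michaelmas", 1760))
print(weekday_after("Michaelmas", 1760, "Monday"))
```

Treating "after" as strictly after is a judgement call; if your sources use the phrase inclusively, record that in the normalisation spec and adjust the offset.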
How do I deal with Old Style and the 1752 calendar change?
Two distinct problems live here, and conflating them causes year-off errors:
- Calendar system — Julian (Old Style) versus Gregorian (New Style). Convert to a proleptic Gregorian value for sorting, but flag which calendar the source used.
- Year-start — in England the legal year began on 25 March until 1752. A document dated "10 February 1648" by contemporaries is February 1649 in modern reckoning.
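The two adjustments can be kept as separate functions so neither is applied by accident. The sketch below is one way to do it, with hypothetical function names: `modern_year` handles only the year-start shift, and `julian_to_gregorian` handles only the calendar shift, going through the Julian Day Number and assuming its input already uses a 1 January year start.

```python
from datetime import date


def modern_year(year, month, day, year_start=(3, 25)):
    """Shift a year reckoned from Lady Day (25 March) to a 1 January year start."""
    return year + 1 if (month, day) < year_start else year


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to a proleptic Gregorian date.

    Expects a year that already starts on 1 January.
    """
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    # Julian Day Number for a date in the Julian calendar
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # date.fromordinal counts days from 0001-01-01 (proleptic Gregorian)
    return date.fromordinal(jdn - 1721425)


# "10 February 1648" as written in an English source
year = modern_year(1648, 2, 10)             # year-start fix first
sort_key = julian_to_gregorian(year, 2, 10)  # then the calendar fix
print(year, sort_key.isoformat())
```

Applying the year-start fix before the calendar conversion keeps the two steps auditable: the first changes only the year, the second changes only the day count.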
Store all three: the original year, the modern year, and a calendar flag.
```json
{
  "source_text": "10 February 1648",
  "edtf": "1649-02-10",
  "calendar": "julian",
  "year_start": "lady-day",
  "note": "OS year-start before Lady Day; modern year 1649"
}
```

What does a good date record contain?
Every normalised date should carry: the verbatim string, character offsets, an EDTF value, a calendar flag, a precision/uncertainty marker, and a short note on any interpretation. Offsets let you trace the date back to the page; the note documents judgement calls so a reviewer can disagree intelligently.
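One possible shape for such a record is a small frozen dataclass that refuses to exist with mismatched offsets. The class name `DateRecord`, its field names, and the `record_from_text` helper are assumptions made for this sketch; adapt them to whatever schema your project already uses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DateRecord:
    source_text: str  # verbatim phrase, never altered
    start: int        # offset of the first character in the source text
    end: int          # offset one past the last character
    edtf: str         # normalised value
    calendar: str     # e.g. "julian" or "gregorian"
    precision: str    # e.g. "exact", "approximate", "range"
    note: str = ""    # interpretation note for reviewers

    def __post_init__(self):
        if not self.source_text:
            raise ValueError("source_text must not be empty")
        if self.end - self.start != len(self.source_text):
            raise ValueError("offsets do not match the length of source_text")


def record_from_text(text, phrase, **fields):
    """Locate the phrase in the text and build a record with matching offsets."""
    start = text.index(phrase)
    return DateRecord(source_text=phrase, start=start,
                      end=start + len(phrase), **fields)


page = "Sealed at York, 10 February 1648, in the presence of witnesses."
record = record_from_text(
    page, "10 February 1648",
    edtf="1649-02-10", calendar="julian", precision="exact",
    note="OS year-start before Lady Day; modern year 1649",
)
print(json.dumps(asdict(record), indent=2))
```

Checking the offsets at construction time means a record can always be traced back to the exact characters it came from.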
How do I keep results consistent across a whole project?
Write the rules down before you start, not after. A short normalisation spec — which calendar you target, how you treat approximate phrases, your feast-day and regnal tables — turns ad-hoc decisions into a repeatable process. Run a validation pass that re-parses every EDTF string and rejects malformed ones; the edtf Python library does this in a couple of lines.
```python
from edtf import parse_edtf

parse_edtf("154X")  # raises if invalid
```

A practical extraction-and-normalisation checklist
- Capture the verbatim phrase and offsets first.
- Classify: exact, range, approximate, regnal, feast-relative, unknown.
- Convert to EDTF; never force fuzziness into a hard date.
- Flag the calendar and year-start convention.
- Resolve regnal and feast references via lookup tables.
- Keep an interpretation note for every non-trivial conversion.
- Validate all EDTF values programmatically before you ship.
Key Takeaways
- Normalise to ISO 8601 for exact dates and EDTF for everything uncertain.
- Always retain the original verbatim string and its offsets.
- Combine regex, regnal/feast tables, and a model — none alone suffices.
- Separate calendar system from year-start; conflating them causes off-by-one-year errors.
- Store a calendar flag and an interpretation note per date.
- Validate EDTF strings programmatically to catch malformed values.
- Write a normalisation spec up front for project-wide consistency.
Frequently Asked Questions
What standard should I normalise historical dates to?
Use ISO 8601 for precise dates and the EDTF (Extended Date/Time Format) extension for everything imprecise — uncertain, approximate, ranges, and unknown components. EDTF is built for exactly the fuzziness historical dates carry.
How do I handle Old Style versus New Style dates?
Record the date as written, then add a normalised proleptic Gregorian value and flag the calendar. For England before 1752, dates between 1 January and 24 March also need the year adjusted, so store both the original and converted year.
Can NER models extract dates reliably?
They find modern-looking dates well but stumble on regnal years, feast days, and phrases like "the Tuesday after Michaelmas". Pair the model with rule-based patterns and a feast-day lookup table for historical sources.
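Pairing detectors means their hits have to be merged somewhere. A minimal sketch of that union step follows, assuming each detector returns `(start, end, label)` tuples of character offsets; the `merge_spans` name and the tie-breaking rule (keep the longest span, and on equal length the detector listed first) are choices made for this example, not a standard.

```python
def merge_spans(*detector_outputs):
    """Union spans from several detectors, keeping the longest span on overlap."""
    spans = sorted(
        (span for output in detector_outputs for span in output),
        key=lambda s: (s[0], -(s[1] - s[0])),  # by start, longest first
    )
    merged = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: keep whichever is longer.
            prev_start, prev_end, _ = merged[-1]
            if end - start > prev_end - prev_start:
                merged[-1] = (start, end, label)
            continue
        merged.append((start, end, label))
    return merged


# Hypothetical hits from three detectors over the same text
regex_hits = [(10, 26, "numeric")]
model_hits = [(10, 26, "model"), (40, 64, "model")]
feast_hits = [(44, 64, "feast")]

print(merge_spans(regex_hits, model_hits, feast_hits))
```

Preferring the longest span keeps a phrase such as "the Tuesday after Michaelmas" whole instead of reducing it to the feast name alone.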
How should I store an uncertain date like "about 1650"?
In EDTF that is 1650~ (approximate). A decade you are unsure of is 165X or an interval 1645/1655. Never silently collapse "about 1650" to a hard 1650-01-01.
What about regnal years such as "3 Edward VI"?
Convert using a regnal-year table that maps each monarch's accession date to calendar years. Edward VI acceded on 28 January 1547, so "3 Edward VI" runs from 28 January 1549 to 27 January 1550; keep the original string and the resolved range together.
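A regnal-year lookup can be sketched in a few lines. The `ACCESSIONS` table below holds just two monarchs and the function names are illustrative. The sketch assumes each regnal year runs from the anniversary of the accession to the day before the next anniversary, and it uses `datetime.date` purely as a container for dates as contemporaries wrote them (Julian, with a 1 January year start). It does not handle a final regnal year cut short by the monarch's death, nor accession dates that fall on or just after a leap day.

```python
from datetime import date, timedelta

# Accession dates for two monarchs only; extend from a reference work you trust.
ACCESSIONS = {
    "Edward VI": date(1547, 1, 28),
    "Elizabeth I": date(1558, 11, 17),
}


def regnal_year_range(monarch, regnal_year):
    """Return (first day, last day) of the given regnal year."""
    accession = ACCESSIONS[monarch]
    start = accession.replace(year=accession.year + regnal_year - 1)
    end = accession.replace(year=accession.year + regnal_year) - timedelta(days=1)
    return start, end


def to_edtf_interval(start, end):
    """Format a resolved range as an EDTF interval string."""
    return f"{start.isoformat()}/{end.isoformat()}"


start, end = regnal_year_range("Edward VI", 3)
print(to_edtf_interval(start, end))
```

Storing the interval alongside the verbatim "3 Edward VI" keeps the lookup reversible if the accession table is later corrected.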
Why keep the original date string after normalising?
Normalisation is interpretation and can be wrong. Keeping the verbatim source string lets anyone re-check your conversion and lets you reprocess if your rules improve.