Appearance
To normalise historical dates in OpenRefine, keep the original wording in one column and add a second column holding a consistent machine-readable form — usually ISO 8601, YYYY-MM-DD. Parse with an explicit format using toDate(), and where a date is approximate, record a range or a flag rather than forcing a false precision. The golden rule is: never destroy the original.
Why normalise dates at all?
A column containing 4 July 1850, 04/07/1850, July 4th, 1850 and 1850 is unusable for sorting or filtering, because the computer sees four unrelated strings. Normalising rewrites them into one comparable form so you can sort chronologically, filter to a decade, or plot a timeline. For historians the catch is that real sources are fuzzy — "circa 1850", "before Easter 1340" — and good normalisation respects that fuzziness instead of erasing it.
What format should normalised dates use?
Use ISO 8601: YYYY-MM-DD, with shorter forms YYYY-MM and YYYY when the day or month is unknown. It sorts correctly as plain text, is unambiguous worldwide, and every tool understands it.
text
4 July 1850 -> 1850-07-04
July 1850 -> 1850-07
1850 -> 1850Avoid local formats like 04/07/1850; they are ambiguous and sort wrongly.
The cardinal rule: never overwrite the original
Before any transformation, duplicate the source column: Edit column > Add column based on this column, naming it date_norm. You now have:
date_raw— exactly as written in the source, untouched.date_norm— your machine-readable best value.- optionally
date_flag—circa,before,range,uncertain.
This separation means a later reader can always check your interpretation against the original, which is non-negotiable in scholarly work.
How do you parse dates without month/day chaos?
OpenRefine's toDate() guesses formats and frequently swaps day and month. Always state the format explicitly inside Edit cells > Transform on date_norm:
text
toDate(value, "d MMMM yyyy").toString("yyyy-MM-dd")Here d MMMM yyyy matches 4 July 1850, and toString writes clean ISO text. For numeric input use toDate(value, "dd/MM/yyyy") so 03/04/1850 becomes 3 April, not 4 March.
A small worked example
Suppose date_raw mixes formats. Work in passes, each on a facet of one style:
- Text-facet
date_rawto see the distinct shapes. - For rows like
4 July 1850, applytoDate(value,"d MMMM yyyy").toString("yyyy-MM-dd"). - For
04/07/1850, applytoDate(value,"dd/MM/yyyy").toString("yyyy-MM-dd"). - For bare years, copy them through unchanged (already valid as
YYYY). - Facet
date_normby error or blank to find the rows no pass handled, and treat those by hand.
Working in passes beats one giant expression because each format is explicit and auditable.
How do you represent uncertainty?
ISO 8601 has no native "circa". Two pragmatic approaches:
| Source says | date_norm | date_start | date_end | flag |
|---|---|---|---|---|
1850 | 1850 | 1850-01-01 | 1850-12-31 | year |
circa 1850 | 1850 | 1845-01-01 | 1855-12-31 | circa |
19th century | (blank) | 1800-01-01 | 1899-12-31 | century |
before 1340 | (blank) | (blank) | 1340-12-31 | before |
A start/end pair captures the range a single field cannot.
What about calendar changes?
Dates before the Gregorian reform of 1582 (and later in some countries) sit in the Julian calendar, which differs from Gregorian by ten or more days. Do not silently convert. Keep date_raw verbatim, add a calendar column noting Julian or Gregorian, and document any conversion you make. Transparency matters more than a tidy single column.
Key Takeaways
- Normalise to ISO 8601 (
YYYY-MM-DD,YYYY-MM,YYYY) so dates sort and filter reliably. - Always keep the original date text in its own column; never overwrite it.
- Pass an explicit format to
toDate()to stop day/month swaps on numeric dates. - Clean in per-format passes using facets rather than one fragile mega-expression.
- Represent uncertainty with a start/end range and a flag, not a falsely precise single value.
- Record the calendar (Julian or Gregorian) for early dates and document any conversion.
Frequently Asked Questions
What does normalising a date actually mean?
It means rewriting dates in many human formats into one consistent machine-readable form, usually ISO 8601 (YYYY-MM-DD). A normalised column can be sorted, filtered by range and compared reliably, which mixed free-text dates cannot.
How do I keep an approximate or uncertain date?
Keep the original text in one column and add a separate normalised column for the best machine value, plus a flag column noting the uncertainty. Never overwrite circa or before/after information just to fit a single format.
Why do my dates come out with the wrong month?
Because OpenRefine's toDate guessed a US month-day order. Always pass the format explicitly, for example toDate(value, 'dd/MM/yyyy'), so a value like 03/04/1850 parses as the day you intend.
How should I handle year-only or century dates?
Store the year alone as YYYY, or use a start and end column to express a range such as 1800 to 1899 for the nineteenth century. ISO 8601 does not represent fuzziness, so a range pair is the practical solution.
What about dates before 1582 and calendar changes?
Record which calendar the source uses, because the switch from Julian to Gregorian shifts dates by several days. Keep the original verbatim and note the calendar; do not silently convert without documenting it.
Should I store dates as text or as OpenRefine date objects?
For most archival work store them as ISO 8601 text, which survives export to CSV cleanly. Use true date objects only when you need OpenRefine's timeline facet, then convert back to text before exporting.