Best Practices for Coding Categorical Historical Variables

To code categorical historical variables well, lean on an established classification scheme wherever one exists, write an explicit codebook with a definition and example for every category, and always preserve the verbatim source string alongside your code. Keep missing distinct from not-applicable, code to the meaning contemporary with the record, and test your coding for consistency. These habits make results reproducible and let you revise the scheme later without re-reading every source. Below is the full set of practices with examples.

Why not just invent your own categories?

Because a private scheme is incomparable, undocumented and usually inconsistent. Where an established standard exists, use it: HISCO (Historical International Standard Classification of Occupations) for occupations, standard taxonomies for religious affiliation, cause of death, or administrative status. A standard scheme is already defined, already used by others, and lets you join your data to wider datasets. Invent your own only when nothing fits, and then publish a codebook so others can replicate or critique it.

How do I keep coding consistent at scale?

Consistency comes from a codebook, not from memory. Every category needs a one-sentence definition and at least one worked example. If you cannot define a category that crisply, it will be applied differently on Tuesday than on Friday.

```yaml
# codebook excerpt: occupation_status
codes:
  master:
    def: "Independent practitioner with own workshop or holding"
    example: "master cordwainer, of his own shop"
  journeyman:
    def: "Skilled worker employed by a master, not yet independent"
    example: "journeyman tailor, in service to..."
  apprentice:
    def: "Bound learner under indenture"
    example: "apprentice to a wheelwright, term 7 years"
  other:
    def: "Documented status fitting none above"
  unclassifiable:
    def: "Status named but ambiguous or illegible"
```

Then test it: have a second person code a sample of 100 records and compute inter-coder agreement (Cohen's kappa). A kappa below about 0.6 means the scheme is too vague and needs tighter definitions, not more effort.
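As a rough illustration of the agreement check, the sketch below computes Cohen's kappa for two coders using only the Python standard library. The function name `cohens_kappa` and the ten-record sample are assumptions made for this example; a real test would use the full sample of double-coded records.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same records."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: share of records given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's marginal category frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical sample: ten records coded independently by two people.
coder_a = ["master", "master", "journeyman", "apprentice", "other",
           "master", "journeyman", "apprentice", "master", "other"]
coder_b = ["master", "journeyman", "journeyman", "apprentice", "other",
           "master", "journeyman", "master", "master", "unclassifiable"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

The two coders agree on seven of the ten records, but kappa comes out lower than 0.7 because some of that agreement is expected by chance given how often each coder used `master`.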

What do I do with values that fit nothing?

Never force a square peg. Use an explicit `other` for documented-but-unlisted values and `unclassifiable` for ambiguous or illegible ones, and always keep the verbatim original in its own column:

code	verbatim_source	meaning
`master`	"master smith"	fits cleanly
`other`	"common brewer"	documented, not yet a category
`unclassifiable`	"...wright (torn)"	illegible

A swelling `other` bucket is useful information: it tells you the scheme is missing a real category. Because you kept the verbatim string, you can split `other` into a new code later without touching the sources again.
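One way to act on this, sketched in Python under assumed names: measure how large the `other` bucket has grown, then recode part of it from the verbatim column. The records, the helper functions `other_share` and `split_other`, and the new `brewer` code are all hypothetical.

```python
from collections import Counter

# Hypothetical coded records; each keeps its verbatim source string.
records = [
    {"code": "master", "verbatim": "master smith"},
    {"code": "other", "verbatim": "common brewer"},
    {"code": "other", "verbatim": "brewer"},
    {"code": "other", "verbatim": "hawker of fish"},
    {"code": "unclassifiable", "verbatim": "...wright (torn)"},
]

def other_share(rows):
    """Fraction of rows currently coded 'other'."""
    return sum(r["code"] == "other" for r in rows) / len(rows)

def split_other(rows, keyword, new_code):
    """Return new rows, recoding 'other' rows whose verbatim text contains keyword."""
    return [
        {**r, "code": new_code}
        if r["code"] == "other" and keyword in r["verbatim"].lower()
        else r
        for r in rows
    ]

print(other_share(records))
recoded = split_other(records, "brewer", "brewer")
print(Counter(r["code"] for r in recoded))
```

The original list is left unchanged, so the recoding can be reviewed against the verbatim strings before it replaces the earlier version. A simple substring match is only a starting point; any new category still needs a codebook definition and example.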

Missing versus not-applicable: why the difference matters

These are genuinely different states and must have different codes. Missing means the information existed but was not recorded, was lost, or is illegible. Not-applicable means the category cannot apply: an infant has no occupation, and an unmarried person has no spouse's status. Collapsing them into one blank destroys information and biases any analysis that filters on the variable: you cannot tell "we do not know" from "there is nothing to know."

```text
occupation_code = "missing"   # entry blank or illegible
occupation_code = "n/a"       # record is an infant burial
```
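To make the bias concrete, here is a small Python sketch with invented burial records. It compares a rate computed as if both states were one blank with a rate that excludes not-applicable records first. The field names and the function `recording_rate` are assumptions for illustration.

```python
# Hypothetical burial records using the two distinct codes.
records = [
    {"id": 1, "age": 34, "occupation_code": "master"},
    {"id": 2, "age": 0, "occupation_code": "n/a"},       # infant burial
    {"id": 3, "age": 41, "occupation_code": "missing"},  # entry illegible
    {"id": 4, "age": 29, "occupation_code": "journeyman"},
    {"id": 5, "age": 1, "occupation_code": "n/a"},       # infant burial
]

def recording_rate(rows):
    """Share of records with a known occupation among those that could have one."""
    eligible = [r for r in rows if r["occupation_code"] != "n/a"]
    known = [r for r in eligible if r["occupation_code"] != "missing"]
    return len(known) / len(eligible)

# If both states were one blank, the only available denominator is all records.
naive_rate = sum(
    r["occupation_code"] not in ("n/a", "missing") for r in records
) / len(records)

print(naive_rate, recording_rate(records))
```

With the codes kept apart, the infants drop out of the denominator and the rate describes only the records where an occupation could have been written down.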

How do I handle meanings that drifted over time?

Code to the meaning contemporary with the record, not the modern sense, and stamp each record with a period or scheme-version column. "Merchant," "labourer" and "engineer" did not mean in 1700 what they meant in 1900; a single flat code silently merges different realities. With a version column, an analysis can either treat each period on its own terms or apply a documented crosswalk between scheme versions, and a reader can see which you chose.
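A documented crosswalk can be as simple as a lookup keyed on the scheme version and the code. The Python sketch below assumes a `scheme_version` column. The version labels and harmonised code names are placeholders, not a published standard or a claim about what the terms meant.

```python
# Hypothetical crosswalk: (scheme_version, code) -> harmonised code.
# All version labels and target codes are illustrative placeholders.
CROSSWALK = {
    ("v1_1700", "engineer"): "engineer_1700_sense",
    ("v2_1900", "engineer"): "engineer_1900_sense",
    ("v1_1700", "merchant"): "merchant_1700_sense",
    ("v2_1900", "merchant"): "merchant_1900_sense",
}

def harmonise(record):
    """Return a copy of the record with a harmonised code added.

    Unmapped (version, code) pairs raise KeyError so that gaps in the
    crosswalk are noticed rather than silently passed through.
    """
    key = (record["scheme_version"], record["code"])
    return {**record, "harmonised_code": CROSSWALK[key]}

early = {"verbatim": "engineer", "code": "engineer", "scheme_version": "v1_1700"}
late = {"verbatim": "engineer", "code": "engineer", "scheme_version": "v2_1900"}

print(harmonise(early)["harmonised_code"])
print(harmonise(late)["harmonised_code"])
```

Keeping the crosswalk as data rather than as scattered if-statements means it can be published alongside the codebook, so a reader can see exactly how the periods were reconciled.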

Should codes be numbers or labels?

Store the stable code as a short string or factor with a lookup table, never a bare integer whose meaning lives only in your head. Integers invite disaster when files are merged or reopened months later: `3` becomes `apprentice` in one file and `widow` in another. A self-describing string (`apprentice`) plus a codebook is unambiguous and survives spreadsheet round-trips.
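The sketch below, in Python with invented file contents, shows why string codes make a merge safer: a code that the lookup table does not define stands out at once, whereas a bare `3` from either file would pass unnoticed. The name `unknown_codes` and both sample files are assumptions for the example.

```python
# Hypothetical lookup table: stable string code -> definition.
LOOKUP = {
    "master": "Independent practitioner with own workshop or holding",
    "journeyman": "Skilled worker employed by a master, not yet independent",
    "apprentice": "Bound learner under indenture",
}

def unknown_codes(rows, lookup):
    """Return the sorted codes found in rows that the lookup table does not define."""
    return sorted({r["code"] for r in rows} - set(lookup))

# Two files coded separately; the second one uses a code from another scheme.
file_a = [{"id": 1, "code": "apprentice"}, {"id": 2, "code": "master"}]
file_b = [{"id": 3, "code": "widow"}, {"id": 4, "code": "journeyman"}]

merged = file_a + file_b
print(unknown_codes(merged, LOOKUP))  # codes needing a definition before analysis
```

Running a check like this after every merge turns a silent miscoding into a visible list of codes to resolve.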

A working checklist

Adopt an established scheme (HISCO, etc.) before inventing one.
Write a codebook: definition plus example per category.
Keep the verbatim source string in its own column, always.
Use explicit `other` and `unclassifiable` codes; never force a fit.
Code missing and not-applicable distinctly.
Code to the contemporary meaning; add a period/version column.
Store self-describing codes with a lookup table, not bare integers.
Test inter-coder reliability; tighten definitions if kappa is low.

Key Takeaways

Prefer established schemes like HISCO; they make your data comparable and are already documented.
A codebook with a definition and example per category is what makes coding reproducible.
Always preserve the verbatim source string so you can revise the scheme later.
Keep `other`, `unclassifiable`, missing and not-applicable as distinct codes.
Code to the meaning contemporary with the record and stamp a period or version.
Store self-describing string codes with a lookup table, never bare integers.
Test inter-coder reliability; low agreement means vague definitions, not careless coders.

Frequently Asked Questions

Should I code categories myself or use an existing classification scheme?

Use an established scheme like HISCO for occupations or a standard religious-affiliation taxonomy where one exists, because it makes your work comparable to others and is already documented. Invent your own only when no scheme fits, and then publish your codebook.

How do I keep coding consistent across thousands of records?

Write an explicit codebook with a definition and at least one example for every category, then test inter-coder reliability on a sample. A category you cannot define in a sentence with an example will be applied inconsistently.

What do I do with values that fit no category?

Never force them; use an explicit "other" or "unclassifiable" code and keep the verbatim original string in a separate column. A bloated "other" category is a signal that your scheme needs revision, which the verbatim column lets you do later.

Should missing and not-applicable be the same code?

No. Missing means the information existed but was not recorded or is illegible; not-applicable means the category cannot apply to this record. Collapsing them destroys information and biases any analysis that conditions on the variable.

How do I handle categories whose meaning changed over time?

Code to the meaning contemporary with the record, not the modern sense, and add a period or scheme-version column so analysis can account for the shift. A label like "merchant" or "labourer" did not mean the same thing in 1700 and 1900.

Should I store codes as numbers or as labels?

Store the stable code as a short string or factor, never a bare integer whose meaning lives only in your head, and keep a lookup table mapping codes to definitions. Bare integers invite silent miscoding when files are merged or reopened months later.
