Preservation

The single most consequential decision you make as an archaeological data producer is not which scanner you buy or which photogrammetry pipeline you run. It is which file format you write the result into. Hardware will be replaced every five years. Software stacks will be rebuilt. The bytes on disk are the only thing that has any chance of crossing the half-century horizon that our funders increasingly write into our grants.

This is a working note — not an exhaustive standards review — on the formats the Sydney lab currently writes to, the formats we have moved away from, and the small set of principles we apply when a new format proposal lands on the desk.

What “will still be readable” actually means

A format will be readable in fifty years if three conditions hold. There must be a published, open specification that does not depend on a vendor for interpretation. There must be more than one independent reference implementation, ideally in different programming languages and from different organisational maintainers. And the format must be in active scholarly or commercial use by a community large enough that a reader will continue to be maintained even if the original authors disappear.
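The three conditions can be written down as a small checklist. The sketch below is only an illustration: the class name, field names, and the two example entries are hypothetical, and the implementation counts are placeholders rather than survey results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatAssessment:
    """Hypothetical record of how one format scores on the three conditions."""
    name: str
    open_spec: bool                   # published spec, no vendor needed to interpret it
    independent_implementations: int  # distinct codebases from distinct maintainers
    active_community: bool            # in real scholarly or commercial use today

    def failed_conditions(self) -> list[str]:
        """Return the names of the conditions this format does not meet."""
        failures = []
        if not self.open_spec:
            failures.append("open specification")
        if self.independent_implementations < 2:
            failures.append("more than one independent implementation")
        if not self.active_community:
            failures.append("active community")
        return failures

    def passes(self) -> bool:
        """A format qualifies only when all three conditions hold."""
        return not self.failed_conditions()


# Illustrative inputs only: these are placeholders, not real assessments.
candidates = [
    FormatAssessment("example-boring-format", True, 5, True),
    FormatAssessment("example-new-format", True, 1, True),
]
for fmt in candidates:
    if fmt.passes():
        print(f"{fmt.name}: archive")
    else:
        print(f"{fmt.name}: watch, fails on " + ", ".join(fmt.failed_conditions()))
```

The point of returning the failed conditions rather than a bare yes or no is that the reason a format is on the watch list is the thing worth revisiting each quarter.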

PDF/A satisfies all three. JPEG (the original baseline JPEG, not the variants) satisfies all three. Plain UTF-8 text satisfies all three trivially. Most of the formats that arrived in the last decade satisfy at most two of the three, and the one they miss is the killer.

The current Sydney lab table

This is the table that lives on the wall above the lab’s photogrammetry workstation, updated quarterly. The status column is our own judgement and we change it when the evidence does.

| Data class | Working format | Archival format | Status notes |
| --- | --- | --- | --- |
| Raster image (photo) | RAW (camera native) | TIFF 6.0, uncompressed | Boring on purpose. TIFF will outlive us. |
| Raster image (multispectral) | TIFF (16-bit per band) | TIFF (16-bit per band) | Same file in and out. |
| Vector image | SVG 1.1 | SVG 1.1 + PDF/A-2 | We export both. |
| Point cloud | LAS 1.4 | LAS 1.4 + ASCII XYZ-RGB | The ASCII fallback is large and ugly but it will read. |
| Mesh (low poly) | PLY (binary little-endian) | PLY (ASCII) | The ASCII version is preserved as the canonical. |
| Mesh (textured) | glTF 2.0 + PNG textures | OBJ + MTL + PNG textures | We keep both. OBJ is older and less expressive but reads everywhere. |
| Tabular data | CSV (UTF-8, RFC 4180) | CSV (UTF-8, RFC 4180) | Plus a Datapackage manifest. |
| Time-series | Parquet | CSV | The CSV is the canonical archival. |
| GIS vector | GeoPackage | GeoPackage + Shapefile | We are uncomfortable with the Shapefile dependency but it remains the lingua franca. |
| GIS raster | GeoTIFF (Cloud Optimized) | GeoTIFF (uncompressed) | The COG is for working, the uncompressed is for the archive. |
| Document | Markdown + LaTeX | PDF/A-2 + Markdown source | Both, always. |
| Long-form prose | Markdown | PDF/A-2 + UTF-8 plain text | Both. |
| Audio (oral history) | WAV (PCM, 96 kHz/24-bit) | WAV (PCM, 96 kHz/24-bit) + FLAC | Boring on purpose, again. |
| Video (field documentation) | ProRes 422 HQ in MOV | FFV1 in MKV + JPEG 2000 stills | FFV1 is the only modern lossless codec we trust. |
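To make the point-cloud row concrete, here is a minimal sketch of what an ASCII XYZ-RGB fallback can look like and how little it takes to read one back. There is no single standard for this layout, so everything here is an assumption: one point per line, space-separated, columns in the order x y z r g b, coordinates written to four decimal places, and colour as 8-bit integers. The function names are hypothetical, and the sample point is made up.

```python
import io


def write_xyzrgb(points, stream):
    """Write one point per line as 'x y z r g b', space-separated.

    Assumed layout: float coordinates to 4 decimals, integer 0-255 colour.
    """
    for x, y, z, r, g, b in points:
        stream.write(f"{x:.4f} {y:.4f} {z:.4f} {r:d} {g:d} {b:d}\n")


def read_xyzrgb(stream):
    """Read the same layout back using nothing but str.split()."""
    points = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        x, y, z, r, g, b = line.split()
        points.append((float(x), float(y), float(z), int(r), int(g), int(b)))
    return points


# Round trip through an in-memory text buffer with one made-up point.
sample = [(1.5, 2.25, 3.0, 255, 128, 0)]
buffer = io.StringIO()
write_xyzrgb(sample, buffer)
print(buffer.getvalue(), end="")
restored = read_xyzrgb(io.StringIO(buffer.getvalue()))
print(restored == sample)
```

The fixed number of decimals is a real design decision in a fallback like this: it caps the precision of the archive copy, so it should be chosen against the survey accuracy and recorded alongside the file, together with the coordinate reference system, which the plain text does not carry.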

Four formats we have stopped using since 2023

E57 for point clouds. The format is technically open but the reference implementation has been functionally maintained by a single vendor for years, and our reading of the ASTM committee minutes suggests the maintenance burden is not stably shared. We still accept E57 from collaborators — the scanners that produce it are everywhere — but we convert to LAS the moment the data lands in the lab.

FBX for meshes. Proprietary, controlled by Autodesk, and the format has changed in ways that have broken our older readers more than once. glTF 2.0 has matured and we have no remaining reason to use FBX outside of a specific deliverable to a games-engine studio that requests it.

XLSX as a primary tabular store. We will still publish derivative spreadsheets in XLSX for collaborators who ask for them, but the canonical is always CSV plus a Frictionless Datapackage manifest. The CSV reads with any text editor in any decade. The XLSX may not.
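As an illustration of the CSV-plus-manifest pairing, the sketch below writes a UTF-8 CSV and a small `datapackage.json` descriptor next to it using only the Python standard library. It is a sketch under stated assumptions, not the lab's actual tooling: the function name, the dataset name `sherd_counts`, the column names, and the single data row are all hypothetical, and the descriptor carries only a handful of properties (name, path, format, mediatype, encoding, and a field list).

```python
import csv
import json
import tempfile
from pathlib import Path


def write_csv_with_manifest(directory, name, fields, rows):
    """Write <name>.csv plus a minimal datapackage.json in `directory`.

    `fields` is a list of (column_name, type_name) pairs; the type names
    are written into the manifest as given.
    """
    directory = Path(directory)
    csv_path = directory / f"{name}.csv"
    # newline="" lets the csv module control line endings itself; its
    # default dialect is comma-delimited with CRLF line endings.
    with csv_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column for column, _ in fields])
        writer.writerows(rows)

    manifest = {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": csv_path.name,
                "format": "csv",
                "mediatype": "text/csv",
                "encoding": "utf-8",
                "schema": {
                    "fields": [
                        {"name": column, "type": type_name}
                        for column, type_name in fields
                    ]
                },
            }
        ],
    }
    manifest_path = directory / "datapackage.json"
    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return csv_path, manifest_path


# Hypothetical example data, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_file, manifest_file = write_csv_with_manifest(
        tmp,
        "sherd_counts",
        [("context_id", "string"), ("sherd_count", "integer")],
        [("T1-001", 12)],
    )
    print(csv_file.read_text(encoding="utf-8"))
    print(manifest_file.read_text(encoding="utf-8"))
```

The manifest is what keeps the column types from being lost: the CSV alone cannot say whether `sherd_count` is a number or a string, and a reader in a later decade should not have to guess.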

HEIC for camera images. The compression is excellent and the format is technically standardised, but the patent and licensing situation is genuinely opaque and we do not want to bet a fifty-year archive on a format whose decoder may stop shipping in a future operating system. We shoot RAW and write to TIFF and we accept the storage cost.

On the temptation to use the newest thing

Every year a new format arrives that promises a meaningful improvement on one of the boring choices above. JPEG XL, Draco-compressed glTF, USDZ for the spatial-computing pipeline. Some of them are technically excellent. We watch them carefully and we will adopt them when they have crossed all three of the readability thresholds at the top of this post. Almost none of them have yet, and the cost of a wrong bet is not the storage we spend — it is the dataset we lose in 2055 because the decoder is no longer being maintained.

The boring choice almost always wins on the half-century horizon. That is the entire principle.

Two things we have not figured out

There are two open questions on the table at Sydney and we do not have a confident answer to either.

The first is interactive web experiences. Our VR walk-through of the Karnak hypostyle hall is built as a Unity binary. The binary will not run in 2076 in any meaningful sense. We currently archive the source mesh in PLY and the texture atlases as TIFF, and we accept that the experience is ephemeral while the data underneath the experience is preserved. We are not sure this is the right answer.

The second is machine-learning model weights. Our Linear B transformer is small enough that we can plausibly write the weights to ONNX, pickle and HDF5 in parallel, and one of them will probably still load in fifty years. But the training data and the training configuration matter as much as the weights for any future re-run. The formats we use to capture those (JSONL for the data manifest, YAML for the config) are stable; the dependency closure of the training runtime is not. We have no clean answer here either.

If you have a stronger view than we do on either question, write to me. I publish a corrections column when I am wrong.

— Elara