Which file formats are still going to be readable in 2076?

The single most consequential decision you make as an archaeological data producer is not which scanner you buy or which photogrammetry pipeline you run. It is which file format you write the result into. Hardware will be replaced every five years. Software stacks will be rebuilt. The bytes on disk are the only thing that has any chance of crossing the half-century horizon that our funders increasingly write into our grants.

This is a working note — not an exhaustive standards review — on the formats the Sydney lab currently writes to, the formats we have moved away from, and the small set of principles we apply when a new format proposal lands on the desk.

What “will still be readable” actually means

A format will be readable in fifty years if three conditions hold. There must be a published, open specification that does not depend on a vendor for interpretation. There must be more than one independent implementation, ideally in different programming languages and maintained by different organisations. And the format must be in active scholarly or commercial use by a community large enough that a reader will continue to be maintained even if the original authors disappear.

PDF/A satisfies all three. JPEG (the original baseline JPEG, not the variants) satisfies all three. Plain UTF-8 text satisfies all three trivially. Most of the formats that arrived in the last decade satisfy at most two of the three, and the missing third is the killer.
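The three conditions can be written down as a small checklist. What follows is a minimal sketch, not the lab's actual tooling: the class name `FormatAssessment`, its fields, and the example values are all hypothetical, and the judgement that goes into each field still has to be made by a person reading specifications and changelogs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatAssessment:
    """One reviewer's judgement of a format against the three conditions."""

    name: str
    open_spec: bool                    # published, vendor-independent specification
    independent_implementations: int   # readers maintained by separate groups
    active_community: bool             # large enough to keep a reader maintained

    def thresholds_met(self) -> int:
        """Count how many of the three conditions hold (0 to 3)."""
        return sum([
            self.open_spec,
            self.independent_implementations > 1,
            self.active_community,
        ])

    def archival_candidate(self) -> bool:
        """A format qualifies only when all three conditions hold."""
        return self.thresholds_met() == 3


# Illustrative values only; these are not assessments of any real format.
candidate = FormatAssessment("hypothetical-format", True, 1, True)
print(candidate.name, candidate.thresholds_met(), candidate.archival_candidate())
```

The point of the all-or-nothing `archival_candidate` check is that two out of three is not a partial pass: a format with an open specification and a large community but a single implementation still fails.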

The current Sydney lab table

This is the table that lives on the wall above the lab’s photogrammetry workstation, updated quarterly. The status column is our own judgement and we change it when the evidence does.

Data class	Working format	Archival format	Status notes
Raster image (photo)	RAW (camera native)	TIFF 6.0, uncompressed	Boring on purpose. TIFF will outlive us.
Raster image (multispectral)	TIFF (16-bit per band)	TIFF (16-bit per band)	Same file in and out.
Vector image	SVG 1.1	SVG 1.1 + PDF/A-2	We export both.
Point cloud	LAS 1.4	LAS 1.4 + ASCII XYZ-RGB	The ASCII fallback is large and ugly but it will read.
Mesh (low poly)	PLY (binary little-endian)	PLY (ASCII)	The ASCII version is preserved as the canonical.
Mesh (textured)	glTF 2.0 + PNG textures	OBJ + MTL + PNG textures	We keep both. OBJ is older and less expressive but reads everywhere.
Tabular data	CSV (UTF-8, RFC 4180)	CSV (UTF-8, RFC 4180)	Plus a Datapackage manifest.
Time-series	Parquet	CSV	The CSV is the canonical archival copy.
GIS vector	GeoPackage	GeoPackage + Shapefile	We are uncomfortable with the Shapefile dependency but it remains the lingua franca.
GIS raster	GeoTIFF (Cloud Optimized)	GeoTIFF (uncompressed)	The COG is for working, the uncompressed is for the archive.
Document	Markdown + LaTeX	PDF/A-2 + Markdown source	Both, always.
Long-form prose	Markdown	PDF/A-2 + UTF-8 plain text	Both.
Audio (oral history)	WAV (PCM, 96 kHz/24-bit)	WAV (PCM, 96 kHz/24-bit) + FLAC	Boring on purpose, again.
Video (field documentation)	ProRes 422 HQ in MOV	FFV1 in MKV + JPEG 2000 stills	FFV1 is the only modern lossless codec we trust.
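
The ASCII XYZ-RGB fallback in the point-cloud row is simple enough to sketch with the standard library alone. There is no single governing specification for "XYZ-RGB", so the column order (x y z r g b), the space delimiter, and the fixed decimal precision below are assumptions, as is the function name `write_xyz_rgb`. Reading points out of a LAS file is left out; the sketch takes any iterable of tuples.

```python
from pathlib import Path
from typing import Iterable, Tuple

# Assumed record layout: coordinates as floats, colour channels as integers.
Point = Tuple[float, float, float, int, int, int]


def write_xyz_rgb(points: Iterable[Point], path: Path, decimals: int = 4) -> int:
    """Write one 'x y z r g b' line per point and return the point count."""
    count = 0
    # Pure ASCII with explicit LF endings, so the file is byte-identical
    # regardless of the platform that wrote it.
    with open(path, "w", encoding="ascii", newline="\n") as fh:
        for x, y, z, r, g, b in points:
            fh.write(
                f"{x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f} "
                f"{r} {g} {b}\n"
            )
            count += 1
    return count
```

Fixed-precision output makes the file larger than it strictly needs to be, which matches the "large and ugly" trade-off: the aim is a file that any future text parser can read, not a compact one. The chosen number of decimals and the coordinate reference system would need to be recorded alongside the file.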

Four formats we have stopped using since 2023

E57 for point clouds. The format is technically open but the reference implementation has been functionally maintained by a single vendor for years, and our reading of the ASTM committee minutes suggests the maintenance burden is not stably shared. We still accept E57 from collaborators — the scanners that produce it are everywhere — but we convert to LAS the moment the data lands in the lab.

FBX for meshes. Proprietary, controlled by Autodesk, and the format has changed in ways that have broken our older readers more than once. glTF 2.0 has matured and we have no remaining reason to use FBX outside of a specific deliverable to a games-engine studio that requests it.

XLSX as a primary tabular store. We will still publish derivative spreadsheets in XLSX for collaborators who ask for them, but the canonical is always CSV plus a Frictionless Datapackage manifest. The CSV reads with any text editor in any decade. The XLSX may not.
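
As an illustration of what "CSV plus a Datapackage manifest" can look like, here is a minimal sketch that writes a `datapackage.json` descriptor next to a CSV using only the Python standard library. The function name `write_datapackage` is hypothetical, and typing every column as `string` is a deliberately conservative assumption; a real manifest would carry proper field types, licences, and provenance.

```python
import csv
import json
from pathlib import Path


def write_datapackage(csv_path: Path, package_name: str) -> Path:
    """Write a minimal Data Package descriptor beside a UTF-8 CSV file."""
    # Read only the header row to learn the column names.
    with open(csv_path, newline="", encoding="utf-8") as fh:
        header = next(csv.reader(fh))

    descriptor = {
        "name": package_name,
        "resources": [
            {
                # Assumes the file stem is already a valid resource name
                # (lowercase letters, digits, '.', '_' or '-').
                "name": csv_path.stem.lower(),
                "path": csv_path.name,
                "profile": "tabular-data-resource",
                "format": "csv",
                "mediatype": "text/csv",
                "encoding": "utf-8",
                # Every column typed as string: an assumption, not inference.
                "schema": {
                    "fields": [{"name": col, "type": "string"} for col in header]
                },
            }
        ],
    }

    out = csv_path.with_name("datapackage.json")
    out.write_text(
        json.dumps(descriptor, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return out
```

The manifest is itself plain UTF-8 JSON, so it inherits the same readability argument as the CSV it describes.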

HEIC for camera images. The compression is excellent and the format is technically standardised, but the patent and licensing situation is genuinely opaque and we do not want to bet a fifty-year archive on a format whose decoder may stop shipping in a future operating system. We shoot RAW and write to TIFF and we accept the storage cost.

On the temptation to use the newest thing

Every year a new format arrives that promises a meaningful improvement on one of the boring choices above. JPEG XL, Draco-compressed glTF, USDZ for the spatial-computing pipeline. Some of them are technically excellent. We watch them carefully and we will adopt them when they have crossed all three of the readability thresholds at the top of this post. Almost none of them have yet, and the cost of a wrong bet is not the storage we spend — it is the dataset we lose in 2055 because the decoder is no longer being maintained.

The boring choice almost always wins on the half-century horizon. That is the entire principle. The same logic shows up in totally unrelated fields — the finance literature on compounding is essentially the same argument expressed in monetary terms, and the working bias towards an index fund over the latest actively-managed product is the same bias towards the format that the largest readable community will still maintain in fifty years.

Two things we have not figured out

There are two open questions on the table at Sydney and we do not have a confident answer to either.

The first is interactive web experiences. Our VR walk-through of the Karnak hypostyle hall is built as a Unity binary. The binary will not run in 2076 in any meaningful sense. We currently archive the source mesh in PLY and the texture atlases as TIFF, and we accept that the experience is ephemeral while the data underneath the experience is preserved. We are not sure this is the right answer.

The second is machine-learning model weights. Our Linear B transformer is a small enough model that we can plausibly write the weights to ONNX, pickle, and HDF5 in parallel, and one of them will probably still load in fifty years. But the training data and the training configuration matter as much as the weights for any future re-run. The formats we use to capture those (JSONL for the data manifest, YAML for the config) are stable; the dependency closure of the training runtime is not. We have no clean answer here either. The pace at which the broader ML stack rotates underneath you is, frankly, well tracked by the AI/TLDR weekly digest: reading it for a month is the fastest way to internalise why “ML model archival” is currently a half-solved problem at best.
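
For the stable half of that problem, a JSONL data manifest can be produced with the standard library alone. This is a sketch under stated assumptions rather than a description of the lab's pipeline: the record fields (`path`, `bytes`, `sha256`) and the function names are hypothetical, and the checksum is included only so that a future reader can confirm the files are the ones the manifest describes.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest_path: Path) -> int:
    """Write one JSON object per file under data_dir; return the file count."""
    # List files before creating the manifest, and sort so that the
    # manifest is reproducible from the same directory contents.
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as out:
        for p in files:
            record = {
                "path": p.relative_to(data_dir).as_posix(),
                "bytes": p.stat().st_size,
                "sha256": sha256_of(p),
            }
            out.write(json.dumps(record, sort_keys=True) + "\n")
    return len(files)
```

A manifest like this says nothing about the training runtime, which is the unsolved half: it records what the data was, not how to rebuild the environment that consumed it.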

If you have a stronger view than we do on either question, write to me. I publish a corrections column when I am wrong.

— Elara
