Why a digital document is a piece of software, and what that means for your freedom

Most people, including many competent software developers, think of a digital document the way they think of a sheet of paper: an inert object that holds words and pictures, indifferent to the tool used to open it. This intuition is wrong, and the consequences of getting it wrong shape everything from vendor lock-in to cybersecurity to the long-term readability of public records.

A digital document is not paper. It is a piece of software.

The HTML parallel

The clearest way to see this is to think about a web page. When you visit a website, your browser receives a file – an HTML document – and executes it. It parses the markup, applies styling rules, runs embedded scripts, fetches additional resources, and assembles the result into something you can read. The page you see on screen is not a static image transmitted from the server; it is the output of a small program that your browser ran on your behalf.

Nobody disputes that a web browser is software. Yet the HTML file it consumes is also, in a meaningful sense, software: a set of instructions describing what should happen when the file is opened. Change the instructions, and the rendered page changes. Withhold the specification of how the instructions should be interpreted, and only the party holding the specification can guarantee a faithful rendering.

It is worth remembering that the openness of HTML did not happen by accident, and was nearly lost. In the early 2000s, Internet Explorer 6 commanded around ninety per cent of the browser market, and Microsoft used that dominance to push proprietary extensions to HTML, CSS, and the document model: non-standard tags, behaviours, and filters that worked only in their browser.

Web developers, desperate to reach users, began coding both to Internet Explorer and to the standard, carrying the cost of that double work themselves, while the vendor reaped the benefit of lock-in either way. The open web did not fragment, but only because developers absorbed the cost of holding it together. Had they stopped, HTML would have quietly become whatever Microsoft shipped next.

It took a sustained effort by the W3C, by competing browsers such as Firefox, and by the community of standards-conscious developers to pull the web back onto open ground. Had that effort failed, HTML today would not be a shared language, but a Microsoft product. The web survived because the standard was defended. Document formats have not always been so lucky.

An office document – a DOCX, an ODT, a PPTX, a PDF – works exactly the same way. It is a structured file containing instructions: this text in this font at this size, this image embedded here, this table laid out this way, this field recalculated automatically, this macro executed on opening. When you “open” the document, an application reads those instructions and runs them. The page you see on screen is the output of a program – the office suite – executing the instructions contained in the document.

The document is the code. The office suite is the interpreter. Together they are a software system, and the user is the one running it, usually without realising.
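To make the point concrete, here is a minimal Python sketch that builds a tiny ODT-like package in memory and lists what is inside it. It assumes the usual ODF packaging layout (a ZIP archive with a `mimetype` entry and a `content.xml`). The helper names `build_sample_odt` and `list_parts` are invented for illustration, and the package is deliberately incomplete (no manifest, no styles), so a real office suite may refuse to open it.

```python
import io
import zipfile

# MIME type that ODF text documents declare in their "mimetype" entry.
ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"

# A hand-written, heavily simplified content.xml: the "instructions" of the document.
CONTENT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:h text:outline-level="1">Invoice</text:h>
      <text:p>Amount due: 100 EUR</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""


def build_sample_odt() -> bytes:
    """Hypothetical helper: assemble a minimal ODT-like ZIP package in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", ODT_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", CONTENT_XML, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_parts(package: bytes) -> list[str]:
    """Hypothetical helper: return the names of the files inside the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


if __name__ == "__main__":
    for part in list_parts(build_sample_odt()):
        print(part)
```

What looks like a single "sheet of paper" is a small archive of markup that some program has to read and act upon before anything appears on screen.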

Why this matters: lock-in is a software property

Once you see a document as software, the question of file formats becomes the question of programming languages. A proprietary file format is a programming language whose specification is owned, controlled, and modifiable at will by a single vendor. The “programs” written in that language – your contracts, your invoices, your books, your public administration archives – can only be reliably executed by software that vendor authorises.

This is the structural mechanism of lock-in. It is not a side effect of user habit or training cost. It is the direct consequence of writing your documents in a language whose grammar belongs to someone else. The moment the vendor changes the grammar – and proprietary formats change constantly, at least with each new product release, but often even more frequently – your existing documents may render differently, lose features, or stop opening altogether. You do not own the language in which your own records are written.

Open standards such as ODF exist precisely to break this dependency. ODF is a publicly specified, independently maintained format whose grammar belongs to no single vendor. Any developer can build a faithful interpreter. Your documents, written in an open language, remain readable regardless of what any single company decides.
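As a rough illustration of what "any developer can build an interpreter" means in practice, the Python sketch below reads the headings and paragraphs out of an ODF text document using nothing but the standard library and the published XML namespaces. It is a toy under stated assumptions, not a conformant implementation: `make_sample` and `extract_text` are hypothetical names, the sample package is incomplete, and the code ignores styles, lists, tables, and everything else a real interpreter must handle.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces published in the ODF specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def make_sample() -> bytes:
    """Hypothetical helper: build a minimal ODT-like package for the demo."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        '<text:h text:outline-level="1">Quarterly report</text:h>'
        "<text:p>Revenue grew <text:span>modestly</text:span>.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", content)
    return buf.getvalue()


def extract_text(odt_bytes: bytes) -> list[str]:
    """Toy interpreter: return the text of each heading and paragraph, in order."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p"}
    # itertext() flattens nested spans into the paragraph's plain text.
    return ["".join(elem.itertext()) for elem in root.iter() if elem.tag in wanted]


if __name__ == "__main__":
    for line in extract_text(make_sample()):
        print(line)
```

The design point is that nothing here depends on any vendor's permission or tooling: the element names come from a public specification, so a third party can write, test, and share a reader like this one.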

Why this matters: attack surface is a software property

The second consequence is security. Software has vulnerabilities; paper does not. Once we admit that a document is software, the long catalogue of OOXML-related security advisories becomes unsurprising, indeed inevitable.

Office document formats are ferociously complex. OOXML in particular runs to thousands of pages of specification, with macro languages, embedded OLE objects, external references, conditional formatting logic, and a substantial layer of binary legacy compatibility. Each of these is a way in for an attacker. A document that arrives by email and “just opens” can run hidden code, download malicious content from the internet, exploit weaknesses in how the file is read, and from there take control of the computer itself. The pattern recurs year after year, vulnerability after vulnerability, because the document is doing what software does: running.
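The Python sketch below shows how two of these entry points can be spotted by inspecting the package rather than opening the document. It assumes the common OOXML packaging conventions: VBA macros stored in a part named `vbaProject.bin`, and links declared in `.rels` files with `TargetMode="External"`. The function names, the sample package, and the `example.invalid` URL are all made up for illustration, and this is a heuristic sketch, not a security scanner.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hand-written relationships part declaring one external target (hypothetical URL).
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/attachedTemplate" '
    'Target="https://example.invalid/template.dotm" TargetMode="External"/>'
    "</Relationships>"
)


def make_suspicious_package() -> bytes:
    """Hypothetical helper: a fake OOXML-like ZIP with a macro part and an external link."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/vbaProject.bin", b"\x00placeholder, not real VBA")
        zf.writestr("word/_rels/settings.xml.rels", SAMPLE_RELS)
    return buf.getvalue()


def audit_package(data: bytes) -> dict[str, list[str]]:
    """Report macro parts and external relationship targets found in the package."""
    findings: dict[str, list[str]] = {"macros": [], "external_targets": []}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if name.lower().endswith("vbaproject.bin"):
                findings["macros"].append(name)
            elif name.endswith(".rels"):
                root = ET.fromstring(zf.read(name))
                for rel in root.iter():
                    # TargetMode is an un-namespaced attribute on each Relationship.
                    if rel.get("TargetMode") == "External":
                        findings["external_targets"].append(rel.get("Target", ""))
    return findings


if __name__ == "__main__":
    print(audit_package(make_suspicious_package()))
```

A clean result from a check like this proves very little, since it looks at only two of the mechanisms listed above; the point is simply that a document has parts that behave like code and can be audited like code.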

A simpler, more rigorously specified format is harder to weaponise. This is not a guarantee – any sufficiently expressive format has risks – but the principle holds: complexity is the friend of the attacker, and proprietary complexity, never fully documented to outside parties, is the best friend of all.

Why this matters: freedom is a software property

If a digital document is software, then the framework we apply to software ethics applies to documents. The Free Software Foundation defines four freedoms: the freedom to use the program for any purpose, to study and modify it, to redistribute copies, and to distribute modified versions. The second and the fourth – Freedom 1 and Freedom 3 – require access to the source.

A document in a proprietary format violates these freedoms in exactly the way proprietary software does. You cannot fully study how it will be interpreted, because the specification of the format is secret, partial, or subject to unilateral change. You cannot reliably build or share modified tools to interpret it, because the format’s owner retains the right to declare your interpreter non-conformant. The “source code” of the document – the full and stable specification of what its instructions mean – is not in your hands.

This is not a metaphor. It is the same dependency, structurally, that makes proprietary software unacceptable for any organisation serious about digital sovereignty. The document, as software, inherits the politics of the format it is written in.

The conclusion is unavoidable

A digital document is a small program. It runs every time it is opened. The language it is written in determines who controls it, who can attack it, and whether its readers are free.

Treating documents as paper has allowed a generation of policymakers, public administrators, and even technologists to overlook the fact that the choice of document format is a choice of software dependency, and a choice of whose grammar governs our written record. There is no neutral format, just as there is no neutral programming language. There are only formats whose specifications are open, stable, and collectively governed, and formats that are not.

We have learned, slowly and at cost, to demand openness in our software. The document is software. The demand is the same.
