A Technical Dive into ODF

To write this article, I went beyond the limits of my technical knowledge, which is that of an advanced user who has studied standard formats and their characteristics in depth. I wanted to understand why most PC users do not see a problem in the difference between standard formats, one of the pillars of digital sovereignty, and proprietary formats, their opposite and one of the biggest obstacles to it. Those users continue to use Microsoft’s proprietary formats, and in doing so place access to their content, and its availability, in the hands of the US company.

To help remedy this problem, I will explain as simply as possible, using non-technical language (which may shock developers, but this article is not aimed at them), some technical features of the Open Document Format (ODF). These features make it the cornerstone of an open and vendor-independent ecosystem for office documents, one that defends the digital freedoms of all users and their control over their own content.

I will begin by explaining how to unpack an ODF file, which is nothing more than a set of XML files and other files (such as images and videos) contained within a ZIP archive, in order to examine its internal components and, in particular, the content.xml file, which contains the body of the document (i.e., the user’s intellectual property).

The aim is not so much to assess conformity (compliance with specifications) and interoperability (the ability to exchange files consistently between tools), as these aspects will always be dealt with by specialists, but rather to understand the advantages for the user of the open and standard format over the closed and proprietary format (which is falsely standard, since it was approved by ISO/IEC in defiance of their own definition of a standard).

For this reason, I will make a brief concluding digression on the characteristics of the OOXML (Office Open XML) format used by Microsoft Office and Microsoft 365, again to clarify to users the risks they face and the harm they do to themselves and other users when they use DOCX, XLSX and PPTX formats, as well as the ‘gift’ they are giving to Microsoft, to whom they are effectively entrusting the management and future of their content.

Analysing an ODF file

Take any document you have created with LibreOffice. For convenience, I recommend starting with a text document created with LibreOffice Writer, with the ODT extension. Before doing anything else, duplicate the file and move the original to another folder, because an error in the procedure could make the copy unreadable.

Rename the copy, replacing the ODT extension with the ZIP extension, without deleting the dot. The file icon will become that of a compressed file. If it becomes white or empty, you have done something wrong or deleted the dot. Check all the steps until the icon becomes that of a compressed file.

At this point, right-click on the icon and select “unzip” or “expand” to extract the contents of the compressed file into a folder with the same name as the file without the extension.

The folder will contain, among other things, the following items:

  • the META-INF folder, which will contain the manifest.xml file
  • the Thumbnails folder, which may contain the thumbnail.png file
  • the content.xml file, which contains the body of the document
  • the styles.xml file, which contains the style definitions
  • the meta.xml file, which contains the file metadata (author, creation date, last modification date, etc.)
  • the settings.xml file, which contains the application settings
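
The same inspection can also be done without renaming anything, because ZIP tools do not care about the file extension. The sketch below is a minimal illustration in Python that assumes only the standard zipfile module. The helper names build_sample_odt and list_package are hypothetical, and the sample package is a stand-in built in memory with placeholder content, not a complete ODT file.

```python
import io
import zipfile


def build_sample_odt() -> bytes:
    """Build a tiny ODF-like package in memory (placeholder content, not a valid ODT)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("META-INF/manifest.xml", "<manifest/>")
        package.writestr("Thumbnails/thumbnail.png", b"")
        package.writestr("content.xml", "<content/>")
        package.writestr("styles.xml", "<styles/>")
        package.writestr("meta.xml", "<meta/>")
        package.writestr("settings.xml", "<settings/>")
    return buffer.getvalue()


def list_package(data: bytes) -> list[str]:
    """Return the names of the files stored inside a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


if __name__ == "__main__":
    for name in list_package(build_sample_odt()):
        print(name)
```

With a real document, zipfile.ZipFile can be given the path of the ODT copy directly, and namelist() would return the same kind of listing.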

Each XML file within an ODF document must comply with the ODF schema, which is written in RELAX NG (REgular LAnguage for XML Next Generation), a schema language developed at OASIS in 2001 and 2002 that is simpler, and therefore more accessible to non-technical users, than other XML schema languages. The packaging rules are defined by the OpenDocument packaging specification.

In addition to passing schema validation, an ODF file must meet a number of further conditions:

  • Structural compliance: the layout of the ZIP package and the entries in the manifest.xml file
  • Functionality compliance: all standard and optional functionality (metadata, styles, tables, graphics, etc.)
  • Formula compliance: spreadsheet formulas must be compatible with OpenFormula semantics
  • Security compliance: ODF profiles, encryption, digital signature

The manifest.xml file contained in the META-INF folder must list all the files in the ZIP file, with their media type:

<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
     <manifest:file-entry manifest:full-path="/" manifest:media-type="application/vnd.oasis.opendocument.text"/>
     <manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>
     <manifest:file-entry manifest:full-path="styles.xml" manifest:media-type="text/xml"/>
     <!-- thumbnails, settings, etc. -->
</manifest:manifest>

Simply omitting a file or making an error in the description of its media type is enough to make the ODF file structurally non-compliant.
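
To make this rule concrete, here is a rough sketch in Python of how the manifest could be compared with the actual contents of the package. It is not a conformance checker. The function names and the sample package are hypothetical, and the exemption for the mimetype marker file and for files under META-INF reflects my reading of the packaging rules, so treat it as an assumption.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

# Hypothetical manifest that forgets to list styles.xml.
SAMPLE_MANIFEST = (
    f'<manifest:manifest xmlns:manifest="{MANIFEST_NS}">'
    '<manifest:file-entry manifest:full-path="/" '
    'manifest:media-type="application/vnd.oasis.opendocument.text"/>'
    '<manifest:file-entry manifest:full-path="content.xml" '
    'manifest:media-type="text/xml"/>'
    "</manifest:manifest>"
)


def build_sample_package() -> bytes:
    """Build an in-memory package whose manifest omits one stored file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("META-INF/manifest.xml", SAMPLE_MANIFEST)
        package.writestr("content.xml", "<content/>")
        package.writestr("styles.xml", "<styles/>")
    return buffer.getvalue()


def unlisted_files(data: bytes) -> set[str]:
    """Return the stored files that have no file-entry in the manifest."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        manifest = ET.fromstring(package.read("META-INF/manifest.xml"))
        stored = {name for name in package.namelist() if not name.endswith("/")}
    listed = {
        entry.get(f"{{{MANIFEST_NS}}}full-path")
        for entry in manifest.iter(f"{{{MANIFEST_NS}}}file-entry")
    }
    # Assumption: "mimetype" and files under META-INF/ are not expected in the manifest.
    exempt = {"mimetype"} | {name for name in stored if name.startswith("META-INF/")}
    return stored - listed - exempt


if __name__ == "__main__":
    print(unlisted_files(build_sample_package()))
```

In this sample the only file reported is styles.xml, which is stored in the package but missing from the manifest.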

ODF: the importance of the content.xml file

To understand the benefits for users of an open standard format such as ODF over a proprietary format, even one that is theoretically open such as OOXML, it is enough to take a quick look at the content.xml file of an ODF file and at its equivalent in OOXML files. That equivalent differs depending on the file type, and this alone is a sign that the development of OOXML did not take user needs into account at all, but focused on artificially increasing complexity.

Let’s take a first example, based on one of the most famous phrases in the history of world literature, namely “to be, or not to be, that is the question” uttered by the protagonist of William Shakespeare’s Hamlet.

The content.xml file of a text document containing only this sentence is 32 lines long: the first 18 provide references to all the standards used (such as XForms and MathML), list the fonts used in the document styles, and define the styles (in this case only one, given the brevity of the text and the absence of formatting).

The next 13 lines are as follows:

<office:body>
     <office:text>
          <office:forms form:automatic-focus="false" form:apply-design-mode="false"/>
          <text:sequence-decls>
               <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
               <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
               <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
               <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
               <text:sequence-decl text:display-outline-level="0" text:name="Figure"/>
          </text:sequence-decls>
          <text:p text:style-name="P1">To be, or not to be, that is the question</text:p>
     </office:text>
</office:body>

The first lines define the body of the document and the fact that it is a text. The following lines are declarations that, in this case, do not add anything, but in other contexts would provide information about other elements of the document.

The key line is this: <text:p text:style-name="P1">To be, or not to be, that is the question</text:p>, which defines a paragraph, declares its style (P1) and provides the content: To be, or not to be, that is the question. Clear and readable by any user, who now has the keys to access the document and manage its contents, i.e. the product of their brain.

Of course, more complex documents and contents would correspond to a more complex content.xml file, but always respecting the readability of the contents and the simplicity of the XML schema.
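
For readers who want to see how little is needed to get the text back out, here is a minimal sketch in Python using only the standard library. The function name paragraphs and the trimmed-down sample are mine, not part of any ODF tool; the sample keeps only the paragraph line discussed above and drops the declarations.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Trimmed-down stand-in for a content.xml file (illustrative only).
SAMPLE_CONTENT = f"""<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <office:body>
    <office:text>
      <text:p text:style-name="P1">To be, or not to be, that is the question</text:p>
    </office:text>
  </office:body>
</office:document-content>"""


def paragraphs(content_xml: str) -> list[str]:
    """Return the text of every text:p element, in document order."""
    root = ET.fromstring(content_xml)
    # itertext() also collects text nested inside child elements of the paragraph.
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


if __name__ == "__main__":
    for line in paragraphs(SAMPLE_CONTENT):
        print(line)
```

Each paragraph comes back as one string, because ODF stores it as one element.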

OOXML: what happens inside the file

Let’s see what happens in the case of the same document saved in DOCX format, closed and proprietary, and artificially complex. The file is called document.xml and not content.xml, and this – obviously – would not be significant if it were not a further sign of the complexity of the format, given that in the case of Excel the file is called workbook.xml and in the case of PowerPoint it is called slide1.xml, and so on.

The document.xml file of a text document containing only the phrase “To be, or not to be, that is the question” is 41 lines long: the first provides references to all the proprietary elements used (such as wordprocessingCanvas, VML and WordML), and all the subsequent lines relate to the content:

<w:body>
     <w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" wp14:paraId="2DC08235" wp14:textId="776AF5CB">
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t xml:space="preserve">To be, or </w:t>
          </w:r>
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t>not</w:t>
          </w:r>
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t xml:space="preserve"> to be, </w:t>
          </w:r>
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t>that</w:t>
          </w:r>
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t xml:space="preserve"> </w:t>
          </w:r>
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t>is</w:t>
          </w:r>
          <w:r w:rsidR="6B254FF6">
               <w:rPr/>
               <w:t xml:space="preserve"> the question</w:t>
          </w:r>
     </w:p>
     <w:sectPr>
          <w:pgSz w:w="11906" w:h="16838" w:orient="portrait"/>
          <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
          <w:cols w:space="720"/>
          <w:docGrid w:linePitch="360"/>
     </w:sectPr>
</w:body>

Obscure and unreadable. I challenge any user to reconstruct a text of any complexity from an XML document like this, if the original file is damaged. In the case of ODF, we were able to reconstruct even documents of hundreds of pages, or presentations of dozens of slides, because the content was readable by any user, even non-technical ones.
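
To make the fragmentation concrete, the sketch below counts the pieces into which the sentence has been split and stitches them back together. It is a minimal illustration in Python, assuming a simplified document.xml that contains only plain w:t runs like the ones shown above. The function name paragraph_runs is hypothetical, and this is in no way a recovery tool for real DOCX files.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified stand-in for the document.xml shown above (attributes trimmed).
SAMPLE_DOCUMENT = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:rPr/><w:t xml:space="preserve">To be, or </w:t></w:r>
      <w:r><w:rPr/><w:t>not</w:t></w:r>
      <w:r><w:rPr/><w:t xml:space="preserve"> to be, </w:t></w:r>
      <w:r><w:rPr/><w:t>that</w:t></w:r>
      <w:r><w:rPr/><w:t xml:space="preserve"> </w:t></w:r>
      <w:r><w:rPr/><w:t>is</w:t></w:r>
      <w:r><w:rPr/><w:t xml:space="preserve"> the question</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_runs(document_xml: str) -> list[tuple[int, str]]:
    """For each w:p, return (number of text pieces, reassembled text)."""
    root = ET.fromstring(document_xml)
    result = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        result.append((len(pieces), "".join(pieces)))
    return result


if __name__ == "__main__":
    for count, text in paragraph_runs(SAMPLE_DOCUMENT):
        print(count, repr(text))
```

The single sentence arrives in seven separate pieces, whereas the ODF version stores it as one element.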

Let’s try to imagine the size of the content.xml file and the document.xml file if, instead of Prince Hamlet’s sentence, there were all 5,566 lines of the entire tragedy, in the original version written by William Shakespeare. In this case, the difference speaks for itself: content.xml is 5,598 lines long (32 lines more than the text), document.xml is 93,289 lines long (87,723 lines more than the text).
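
The arithmetic behind these figures is simple enough to check by hand, but a few lines of Python make the proportions explicit. The constant and function names are mine; the numbers are the ones quoted above.

```python
TEXT_LINES = 5_566      # lines of the tragedy
ODF_LINES = 5_598       # lines in content.xml
OOXML_LINES = 93_289    # lines in document.xml


def overhead(total_lines: int, text_lines: int = TEXT_LINES) -> int:
    """Lines in the file that are not lines of the play."""
    return total_lines - text_lines


if __name__ == "__main__":
    print(overhead(ODF_LINES))                 # 32
    print(overhead(OOXML_LINES))               # 87723
    print(round(OOXML_LINES / ODF_LINES, 1))   # 16.7
```

In other words, document.xml is almost 17 times as long as content.xml for the same text.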

File complexity as the new lock-in strategy

This file complexity is intentionally hidden from the user, who sees a normal-looking document on the screen and has no idea that they are writing a file on their hard drive or in the cloud that has characteristics very similar to those of the proprietary files used in the last century, which are unreadable without the software with which they were written.

A user may believe they have made significant progress in terms of digital sovereignty because they use a format they believe to be open and standard. On the contrary, that format is even worse than the binary formats of the last century, which were nothing more than a dump of what was in memory. Being based on XML, it is the offspring of an algorithm that can be modified remotely with a routine update. This is what happens in reality: the same document is written in DOCX format with a completely different XML syntax each time, based on parameters known only to the vendor, i.e. Microsoft.

So, it is an even more closed and proprietary format than the binary formats it replaced in 2006. The latter, being the result of writing what was in memory to files, were predictable and could be emulated, while OOXML is unpredictable due to the algorithm, and therefore almost impossible to emulate without constant study of its many evolutions.

OOXML is a theoretically open and standard format, which in reality is closed and proprietary, and represents the latest evolution of the lock-in strategy that underpins all Microsoft products for individual productivity, defending an estimated turnover of over $25 billion per year, with an estimated net profit of over $20 billion per year (all figures are estimates, as analysts’ figures are no longer available and are probably lower than the actual figures).

Perhaps the time has come for supranational organisations, central and local governments, and probably also individual users, to open their eyes and take a simple step forward towards digital sovereignty, i.e. the governance of documents and their content independent of the commercial choices of a single company, by adopting ODF and abandoning OOXML.
