Blog Post

PDF vs. DOCX: How Textual Contents Can Be So Different

Illustration: PDF vs. DOCX: How Textual Contents Can Be So Different

In everyday business, it’s common for people to convert between Word documents and PDFs. And though PDFs can be very expressive and present far more flexibility than Word documents, many of the PDFs we see are originally converted from some type of Word format. This shows us that both formats have their place and need to be well understood.

In today’s blog post, I’ll explore the differences between the Office Open XML format (Microsoft Office format) and the PDF format. We’ll analyze a single text example, and in doing so, we’ll highlight the key differences and philosophies between the two formats.

But first, a history lesson.

What Is the Office Open XML Format?

Office Open XML (OOXML) is a format Microsoft began developing in 2000. Now, it’s pretty much the defacto standard when it comes to word processing, spreadsheets, and more.

When you see the X in DOCX extensions, that means it’s OOXML. You’ll also find XLSX and PPTX — among other extensions — and they all fall under the OOXML umbrella. But today, we’ll mainly be focusing on word processing, thus we’ll discuss the DOCX extension.

Even though OOXML has been around for more than two decades, it’s only been an ISO standard since 2008, and it has passed through a couple of revisions since. And these standards have been adopted by many software vendors other than Microsoft.

Behind the DOCX extension is technically just a ZIP file. This ZIP file contains many XML files describing the contents of a document. Without digging too deep into the file structure, you’ll find descriptions of the document body, header, and footer, along with resources such as images, fonts, and styles. These all come together to describe the contents of a document (note the use of the word contents, which is an important differentiator).

What Is PDF?

PDF is the Portable Document Format, and it targets a use case that’s slightly different to OOXML.

It was originally developed in 1992 by Adobe, and it has since been through many versions. The format is aimed at preserving the visual representation of text, images, and drawings, regardless of the environment a document is in (note the use of the words visual representation, and not contents, as with PDFs, it’s all about graphic fidelity).

PDFs are used in many situations due to the flexibility of the format, and many applications and businesses will export files to PDFs to transmit documents to their destinations. That’s because — with PDFs — it’s almost certain that what the sender sees in their environment will be perfectly replicated in the recipient’s environment.

The ability of PDFs to replicate this visual fidelity across different environments comes from the format’s descriptive makeup. Rather than utilizing a human-readable XML format like OOXML, a subset of PostScript — which is a visual representation descriptive language — is used. The PostScript language instructs the renderer how to draw the contents exactly, thereby lending itself to visual fidelity. In addition to PostScript, an object-based structural format is used to describe data and create relationships between objects to create a highly flexible expressive format. Such objects can describe images, form fields, and signatures, and even hold JavaScript to allow the document to be expressive and interactive.

Obviously, with all that flexibility comes complexity, and that’s why PSPDFKit is even a company! We solve the PDF space for you.

Exploring Different Representations

Now we have a basic understanding of how both the formats look under the hood. So, for the example, we’ll narrow the scope to focus on text in a Word processing-like document. Both the DOCX and PDF formats have the capability to express more complicated data, but for the purpose of exploring the differences, it’s easy to stick to a simple domain.

How Text Is Represented

In the following sections, I’ll show a simplistic example of how text is represented in OOXML and PDF. Each example will display “Hello World” in their respective formats and discuss how formatting and placement is handled.

OOXML

I won’t outline the entire OOXML format in this blog post; instead, I’ll just focus on the data that describes contents and representations.

In OOXML, most of the contents of a document are described in a single XML file, and within that file, you’ll find many nested elements, all of which inherit certain aspects from one another. Here, let’s outline the fundamentals of describing a text string:

<w:p>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica"/>
    </w:rPr>
    <w:t>Hello World</w:t>
  </w:r>
</w:p>

First, we see the w:p element, which denotes a paragraph section. This can contain attributes, but for a simple example, they’re omitted.

The paragraph element holds a child of w:r, commonly named a “run” — or, in other words, a text output. This element is the description of textual data and how it’s styled. w:rPr is the tag holding the styling information, and in our example, we can see that the font Helvetica is used — but more on that later.

Lastly, we get to the real contents of the text: w:t. This element holds the text “Hello World,” which is the content of this textual description.

The important part to note is that, other than the font information, there’s no information instructing the software on how to draw the contents. w:rPr, along with other styling directives, can give extra hints to the software, but it’s left to the application to interpret drawing commands.

PDF

Again, the following example is a small subsection of a PDF, but it provides a good example of how the OOXML and PDF formats differ:

BT
/F1 24 Tf
1 0 0 1 260 600 Tm
(Hello World)Tj
ET

The contents of the text data are wrapped in the BT and ET tags, which denote the beginning and end of a text command, respectively.

Next, we have a name, /F1, followed by a number and the operator Tf. This line describes the font to be used, which we’ll discuss later.

Then, we see a bunch of numbers followed by the Tm operator. This is a dimensional transformation — specifically a text transformation — that applies to the following contents. The six numbers represent a matrix, and for this simple example, 1 0 0 1 describes no transformation, and 260 600 instructs the render to draw at position 260 x and 600 y.

At this point, you may have realized a fundamental difference between OOXML and PDF: The latter is actually instructing the renderer where to draw the text. Notice that OOXML still left a lot of this up to the interpretation of the software, and although there are additional styling directives to give, fundamentally, OOXML doesn’t tell the software how to draw contents.

To round this out, we see the string “Hello World,” which is the text to draw.

How Fonts Are Defined

In both the OOXML and PDF sections, I mentioned that I’d be coming back to the subject of fonts. I wanted to highlight this point because, although both formats have similar capabilities, due to the domain the two formats work in, we often end up with different results.

For OOXML, we saw that it defined the font to be used as Helvetica. It’s a common font, and it’s often shipped with macOS/iOS, but it doesn’t mean that the font is available everywhere. Nor does it mean that Helvetica resolved in one environment is the same as the Helvetica resolved in another environment, because technically, I could change the Helvetica font in my environment.

Now extend that theory and say I wanted to use my custom new font called myAwesomeFont. If a Word document requested the use of this awesome font, it’s unlikely to be available on some other environment. What does the software reading this document do now? We’ll leave that question hanging for a moment.

Fonts in PDFs can have similar problems. In our example, we saw the font description /F1 24 Tf. This operation is instructing the software to resolve font /F1 — a font that will be described in the document — and apply a font size of 24. Without digging too deep into the syntax of font descriptions, /F1 can be described as a name with extra information to resolve the font. But critically, the PDF specification directs all PDF renderers to ship a set of standard fonts that can be used, one of which is Helvetica. This means any text string using Helvetica should be rendered the same in any PDF software.

Getting back to myAwesomeFont: How can we ensure different environments render text with this font? Well, for both formats, there’s the concept of font embedding. This means we can include the font as part of the document to reference when rendering.

With font embedding, we get into the tricky subject of font licensing. Do you have the right to distribute the given font? When it comes to OOXML, the majority of documents originate from Microsoft Office, where we have access to many fonts shipped along with the software. Therefore, font embedding usually isn’t required, because another installation of Microsoft Office will likely have access to the font used.

When it comes to PDF, what software will the end user have? Adobe? PSPDFKit? Chrome? And what OS will the software be running on? All of these questions means that font embedding decisions often push PDF software developers to lean on the side of embedding fonts, so as to ensure render fidelity across environments.

In conclusion, font resolution comes back to the philosophy of the original formats. PDF looks the same everywhere, whereas OOXML aides content production, but not fidelity.

Conclusion

Although we only looked at a simple text content example, I hope this blog post introduced you to the philosophies behind the PDF and OOXML formats. They both have their place in today’s use cases, and they’re both fundamental to a company’s operation.

If you’d like to learn more about how PSPDFKit could help in both domains, you can explore our PDF and Office (OOXML) solutions supported on multiple platforms and frameworks.

Related Products
Share Post
Free 60-Day Trial Try PSPDFKit in your app today.
Free Trial

Related Articles

Explore more
DEVELOPMENT  |  Web • C++

My Experience with Web Development from a Systems Programming Perspective

BLOG  |  Web • C++ • WebAssembly

Render Performance Improvements in PSPDFKit for Web

DEVELOPMENT  |  C++ • Office • Tips

Testing Subjective Office Conversion Results