Compare PDF Text Using JavaScript

Programmatic text comparison lets you analyze how the text content of two documents differs. It’s particularly useful for documents that have undergone edits, because it enables users to spot changes quickly.

Information

Comparing documents and text is available only when using PSPDFKit for Web in the Standalone operational mode, and it requires the corresponding license component. Contact Sales if you’re interested in this functionality.

To perform a text comparison operation, you need to provide two documents and a set of options. The options are used to configure the comparison operation.

Describing Your Documents

The PSPDFKit.DocumentDescriptor class is used to provide all the necessary details about your documents for comparison:

  • filePath — Path to the document or an ArrayBuffer.

  • password — Optional password if the document is encrypted.

  • pageIndexes — An array of page indexes, or an array of ranges, where each range is an array of the form [min, max]. If omitted, all pages will be staged for comparison.

const originalDocument = new PSPDFKit.DocumentDescriptor({
  filePath: "document-comparison/static/documentA.pdf",
  pageIndexes: [0]
});

const changedDocument = new PSPDFKit.DocumentDescriptor({
  filePath: "document-comparison/static/documentB.pdf",
  pageIndexes: [0]
});
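The optional fields can be combined with the required ones. The following is a minimal sketch, assuming an encrypted original document and a page range written in the [min, max] form described above. The describeDocuments function name, the password value, and the choice of pages are hypothetical, and PSPDFKit is passed in as a parameter only to keep the snippet free of hidden globals.

```javascript
// Sketch: build two descriptors that use the optional `password` and
// range-style `pageIndexes` fields. `describeDocuments` is a hypothetical
// helper name, not part of the PSPDFKit API.
function describeDocuments(PSPDFKit) {
  const originalDocument = new PSPDFKit.DocumentDescriptor({
    filePath: "document-comparison/static/documentA.pdf",
    // Hypothetical password; only needed if the document is encrypted.
    password: "example-password",
    // One range in the [min, max] form.
    pageIndexes: [[0, 2]]
  });

  const changedDocument = new PSPDFKit.DocumentDescriptor({
    filePath: "document-comparison/static/documentB.pdf",
    // `pageIndexes` omitted: all pages are staged for comparison.
  });

  return { originalDocument, changedDocument };
}
```

Calling describeDocuments(PSPDFKit) in a page where the SDK is loaded would return the pair of descriptors expected by the comparison call.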

Defining the Comparison Operation

The PSPDFKit.ComparisonOperation class outlines the comparison type and optional settings:

  • type — Type of comparison. The default is ComparisonOperationType.TEXT. Use PSPDFKit.ComparisonOperationType to check for available comparison types. As of now, only ComparisonOperationType.TEXT is supported.

  • options — The settings for the operation. Currently, the only supported setting is numberOfContextWords, which specifies the number of context words for the comparison.

const textComparisonOperation = new PSPDFKit.ComparisonOperation(
  PSPDFKit.ComparisonOperationType.TEXT,
  {
    numberOfContextWords: 2
  }
);

Text Comparison

The final step is to call the instance#compareDocuments method:

const comparisonResult = await instance.compareDocuments(
  { originalDocument, changedDocument },
  textComparisonOperation
);

console.log(comparisonResult);
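The three steps can also be wrapped in a single function. This is a sketch rather than a definitive implementation: compareFirstPages is a hypothetical helper name, and it assumes that a failed comparison surfaces as a rejected promise, which this guide doesn’t state explicitly.

```javascript
// Sketch: describe both documents, define the operation, and run the
// comparison. `compareFirstPages` is a hypothetical helper name.
async function compareFirstPages(PSPDFKit, instance, originalPath, changedPath) {
  const originalDocument = new PSPDFKit.DocumentDescriptor({
    filePath: originalPath,
    pageIndexes: [0]
  });
  const changedDocument = new PSPDFKit.DocumentDescriptor({
    filePath: changedPath,
    pageIndexes: [0]
  });
  const operation = new PSPDFKit.ComparisonOperation(
    PSPDFKit.ComparisonOperationType.TEXT,
    { numberOfContextWords: 2 }
  );

  try {
    return await instance.compareDocuments(
      { originalDocument, changedDocument },
      operation
    );
  } catch (error) {
    // Assumption: errors arrive as a promise rejection.
    console.error("Text comparison failed:", error);
    return null;
  }
}
```

Returning null on failure keeps the calling code simple; rethrowing the error is an equally reasonable choice if the caller handles errors centrally.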

Understanding the Comparison Result

The comparison returns a PSPDFKit.ComparisonResult, which contains:

  • type — The type of comparison (currently only ComparisonOperationType.TEXT is supported).

  • hunks — Hunks of detected text changes.

A hunk groups operations that describe how to transform the original text to the changed text. For instance, if a word is replaced, the hunk will include operations to delete the original word and insert the changed word. The structure of a hunk is:

  • originalRange — The range the hunk represents on the original page.

  • changedRange — The range the hunk represents on the changed page.

  • operations — The operations the hunk contains.

An operation represents a single insertion, a single deletion, or no change between the original and changed text. It’s composed of:

  • type — The operation type (“insert”, “delete”, or “equal”).

  • text — The text the operation is based upon.

  • originalTextBlocks — The text blocks the operation relates to in the original document.

  • changedTextBlocks — The text blocks the operation relates to in the changed document.

A text block relates text to a specific region in a document:

  • range — The range in the document page the text block relates to.

  • rects — The rectangles on the document page the text block refers to.
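To make the hunk and operation structure concrete, here is a small sketch that walks a list of hunks and groups the operation text by type. The summarizeHunks name and the sample data are hypothetical. The only assumptions about the input are that the hunks and their operations are iterable and expose the type and text fields described above.

```javascript
// Sketch: group the text of all operations by operation type.
// `summarizeHunks` is a hypothetical helper, not part of the PSPDFKit API.
function summarizeHunks(hunks) {
  const summary = { inserted: [], deleted: [], unchangedCount: 0 };

  for (const hunk of hunks) {
    for (const operation of hunk.operations) {
      if (operation.type === "insert") {
        summary.inserted.push(operation.text);
      } else if (operation.type === "delete") {
        summary.deleted.push(operation.text);
      } else if (operation.type === "equal") {
        summary.unchangedCount += 1;
      }
    }
  }

  return summary;
}

// Hypothetical sample: one word replaced, surrounded by unchanged context.
const sampleHunks = [
  {
    operations: [
      { type: "equal", text: "The" },
      { type: "delete", text: "quick" },
      { type: "insert", text: "slow" },
      { type: "equal", text: "fox" }
    ]
  }
];

console.log(summarizeHunks(sampleHunks));
```

A replaced word shows up as a delete followed by an insert within the same hunk, which is why the summary keeps the two lists separate.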

Example Result

The result will be structured similarly to the following:

[{
  "documentComparisonResults": [{
    "changedPageIndex": 1,
    "comparisonResults": [{
      "hunks": [{
        "changedRange": {
          "length": 1,
          "position": 1
        },
        "operations": [{
          "changedTextBlocks": {
            "range": {
              "length": 1,
              "position": 0
            },
            "rects": [
              [
                341.1,
                265.2,
                0,
                0
              ]
            ]
          },
          "originalTextBlocks": {
            "range": {
              "length": 1,
              "position": 1
            },
            "rects": [
              [
                341.1,
                265.2,
                74.4,
                288.0
              ]
            ]
          },
          "text": "1",
          "type": "delete"
        }],
        "originalRange": {
          "length": 1,
          "position": 1
        }
      }],
      "type": "text"
    }],
    "originalPageIndex": 0
  }]
}]
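A nested result like this is often easier to render once it’s flattened. The sketch below assumes the input is a plain array shaped like the example above, which may differ from the objects the SDK actually returns. The flattenOperations name is hypothetical, and the sample data is a trimmed copy of the example.

```javascript
// Sketch: flatten the nested example shape into one entry per operation.
// `flattenOperations` is a hypothetical helper, not part of the PSPDFKit API.
function flattenOperations(results) {
  const flat = [];

  for (const result of results) {
    for (const pagePair of result.documentComparisonResults) {
      for (const comparison of pagePair.comparisonResults) {
        for (const hunk of comparison.hunks) {
          for (const operation of hunk.operations) {
            flat.push({
              originalPageIndex: pagePair.originalPageIndex,
              changedPageIndex: pagePair.changedPageIndex,
              type: operation.type,
              text: operation.text
            });
          }
        }
      }
    }
  }

  return flat;
}

// Trimmed copy of the example result (text blocks and ranges omitted).
const exampleResult = [
  {
    documentComparisonResults: [
      {
        changedPageIndex: 1,
        comparisonResults: [
          {
            hunks: [{ operations: [{ text: "1", type: "delete" }] }],
            type: "text"
          }
        ],
        originalPageIndex: 0
      }
    ]
  }
];

console.log(flattenOperations(exampleResult));
```

Each flattened entry carries its page indexes, so a custom UI can list changes without walking the nested structure again.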

These steps allow you to pinpoint changes between documents with ease, and to build your own custom user interface (UI) to display the results, as demonstrated in this sample project. Refer to our public API documentation to read more technical details about the Text Comparison API and learn how to use it in your implementation.