Extract Text, Tables, and More from PDFs

Information

PSPDFKit Processor has been deprecated and replaced by PSPDFKit Document Engine. All PSPDFKit Processor licenses will work as before and be supported until 15 May 2024 (we will contact you about license migration). To start using Document Engine, refer to the migration guide. With Document Engine, you’ll have access to robust new capabilities (read the blog for more information).

This guide explains how to extract data from PDFs using Processor.

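You can extract the following pieces of information from a PDF document: plain text, structured text (characters, lines, paragraphs, and words), key-value pairs, and tables.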

Before you get started, make sure Processor is up and running.

You can download and use either of the following sample documents for the examples in this guide:

You’ll be sending multipart POST requests with instructions to Processor’s /build endpoint. To learn more about multipart requests, refer to our blog post on the topic, A Brief Tour of Multipart Requests.

Check out the API Reference to learn more about the /build endpoint and all the actions you can perform on PDFs with PSPDFKit Processor.

Sending the Request to Extract Data

To extract data from all pages of a document, send a multipart POST request to the /build API endpoint. In the instructions, specify the following output parameters:

  • type specifies the output type. Set this to json-content.

  • plainText is a Boolean value that determines whether to extract data as plain text.

  • structuredText is a Boolean value that determines whether to extract data as structured text. Enabling this option gives you information about characters, lines, paragraphs, and words.

  • keyValuePairs is a Boolean value that determines whether to extract key-value pairs.

  • tables is a Boolean value that determines whether to extract table data.

  • language specifies the language used for recognizing text with optical character recognition (OCR). Sometimes, text is stored in a PDF or an image in a way that prevents you from searching or copying it. PSPDFKit’s OCR engine recognizes this text and saves it in a separate file where you can search, copy, and paste it.
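The output parameters above can also be assembled programmatically. The following is a minimal sketch, not part of Processor itself: build_instructions is a hypothetical helper, and its defaults are illustrative choices. The parts entry mirrors the request used in this guide, where the PDF is uploaded under the multipart field name document.

```python
import json


def build_instructions(plain_text=True, structured_text=False,
                       key_value_pairs=False, tables=False,
                       language="english"):
    """Return /build instructions as a dict (hypothetical helper)."""
    return {
        # "document" is the multipart field name that carries the PDF.
        "parts": [{"file": "document"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
            "keyValuePairs": key_value_pairs,
            "tables": tables,
            "language": language,
        },
    }


if __name__ == "__main__":
    # Print instructions that request plain text and tables only.
    print(json.dumps(build_instructions(tables=True), indent=2))
```

Serializing the result with json.dumps gives a string you can pass as the instructions part of the request.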

curl -X POST http://localhost:5000/build \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "keyValuePairs": true,
    "tables": true,
    "language": "english"
  }
}' \
  -o result.json

POST /build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "keyValuePairs": true,
    "tables": true,
    "language": "english"
  }
}
--customboundary--

For more information on the /build instructions, refer to the API Reference.
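If you prefer to send the request from code, the multipart body can be built by hand. The sketch below uses only the Python standard library and assumes Processor is reachable at http://localhost:5000/build without authentication, as in the curl example. The encode_multipart and extract helpers are hypothetical names introduced for illustration.

```python
import json
import urllib.request
import uuid


def encode_multipart(pdf_bytes, filename, instructions, boundary=None):
    """Build a multipart/form-data body with document and instructions parts."""
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    body = b"\r\n".join([
        delimiter,
        f'Content-Disposition: form-data; name="document"; '
        f'filename="{filename}"'.encode("utf-8"),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="instructions"',
        b"Content-Type: application/json",
        b"",
        json.dumps(instructions).encode("utf-8"),
        delimiter + b"--",  # closing delimiter
        b"",
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def extract(pdf_path, instructions, url="http://localhost:5000/build"):
    """POST the PDF and instructions, then return the parsed JSON response."""
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    body, content_type = encode_multipart(
        pdf_bytes, "example-document.pdf", instructions)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract("/path/to/example-document.pdf", instructions) with the instructions shown above as a Python dict should return the parsed response, provided the server is running and accepts unauthenticated requests.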

Interpreting the Data Extraction Response

The API response contains, for each page, the data you requested in the output parameters, such as:

  • Plain text

  • Structured text with information about characters, lines, paragraphs, and words

  • Extracted key-value pairs

  • Tables
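As a quick check of what a response contains, you can list the sections present on each page and join the plain text across pages. This is a minimal sketch that assumes the response shape shown in this guide's example response (a top-level pages array whose entries carry pageIndex and one key per requested output). The function names are hypothetical.

```python
SECTIONS = ("plainText", "structuredText", "keyValuePairs", "tables")


def available_sections(page):
    """Return the names of the extraction sections present on a page."""
    return [name for name in SECTIONS if name in page]


def collect_plain_text(response):
    """Join the plain text of all pages, ordered by pageIndex."""
    pages = sorted(response.get("pages", []),
                   key=lambda p: p.get("pageIndex", 0))
    return "\n".join(p.get("plainText", "") for p in pages)
```

A section that was not requested is assumed to be absent from the page object, which is why the helpers use membership checks and defaults instead of direct indexing.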

Example Data Extraction Response

{
	"pages": [
		{
			"pageIndex": 0,
			"plainText": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.\n",
			"structuredText": {
				"characters": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"value": "T"
					}
				],
				"lines": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"firstWordIndex": 0,
						"isRTL": false,
						"isVertical": false,
						"wordCount": 5
					}
				],
				"paragraphs": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"firstLineIndex": 0,
						"lineCount": 3
					}
				],
				"words": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"characterCount": 4,
						"firstCharacterIndex": 0,
						"isFromDictionary": true,
						"value": "word"
					}
				]
			},
			"keyValuePairs": [
				{
					"confidence": 95.4,
					"key": {
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"content": "#"
					},
					"value": {
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"content": "€",
						"dataType": "Currency"
					}
				}
			],
			"tables": [
				{
					"confidence": 95.4,
					"bbox": {
						"left": 0,
						"top": 0,
						"width": 100,
						"height": 100
					},
					"cells": [
						{
							"bbox": {
								"left": 0,
								"top": 0,
								"width": 100,
								"height": 100
							},
							"rowIndex": 0,
							"columnIndex": 0,
							"isHeader": true,
							"text": "Invoice number"
						}
					],
					"columns": [
						{
							"bbox": {
								"left": 0,
								"top": 0,
								"width": 100,
								"height": 100
							}
						}
					],
					"lines": [
						{
							"bbox": {
								"left": 0,
								"top": 0,
								"width": 100,
								"height": 100
							},
							"isVertical": false,
							"thickness": 0
						}
					],
					"rows": [
						{
							"bbox": {
								"left": 0,
								"top": 0,
								"width": 100,
								"height": 100
							}
						}
					]
				}
			]
		}
	]
}
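
The example response above can be turned into more convenient structures with a few small helpers. The sketch below is based on assumptions read off the example rather than on a documented contract: firstWordIndex and wordCount are taken to index into the page's words array, rowIndex and columnIndex are taken to be zero-based, and confidence is taken to be on a 0 to 100 scale. All function names and the default threshold are hypothetical.

```python
def line_texts(structured_text):
    """Rebuild each line's text from its word range (assumed indexing)."""
    words = structured_text.get("words", [])
    texts = []
    for line in structured_text.get("lines", []):
        start = line["firstWordIndex"]
        end = start + line["wordCount"]
        # Joining with single spaces is a simplification of real spacing.
        texts.append(" ".join(w["value"] for w in words[start:end]))
    return texts


def table_grid(table):
    """Arrange a table's cells into a row-major grid of cell text."""
    cells = table.get("cells", [])
    if not cells:
        return []
    n_rows = max(c["rowIndex"] for c in cells) + 1
    n_cols = max(c["columnIndex"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["rowIndex"]][c["columnIndex"]] = c.get("text", "")
    return grid


def confident_pairs(page, min_confidence=90.0):
    """Return (key, value, dataType) tuples at or above the threshold."""
    result = []
    for pair in page.get("keyValuePairs", []):
        if pair.get("confidence", 0) >= min_confidence:
            value = pair["value"]
            result.append((pair["key"]["content"],
                           value["content"],
                           value.get("dataType")))
    return result
```

Cells missing from the response stay as empty strings in the grid, so merged or undetected cells do not shift the remaining columns.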