Extract Text from PDFs and Images

This guide explains how to extract text from PDF documents using PSPDFKit Document Engine.

Sending the Request to Extract Data

To extract text from a document, send a multipart POST request to the /api/build endpoint. In the instructions, specify the following output parameters:

  • type specifies the output type. Set this to json-content.

  • plainText is a Boolean value that determines whether to extract data as plain text.

  • structuredText is a Boolean value that determines whether to extract data as structured text. Enabling this option gives you information about characters, lines, paragraphs, and words.

  • language specifies the language used for recognizing text with optical character recognition (OCR). Text in a PDF or an image is sometimes stored in a way that prevents you from searching or copying it. PSPDFKit’s OCR engine recognizes such text and saves it in a separate file where you can search, copy, and paste it.
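
As a rough sketch of how these output parameters fit together, the following Python helper assembles the instructions object used in the request below. The function name build_instructions and its keyword arguments are hypothetical and exist only for illustration; only the JSON keys are taken from this guide.

```python
import json


def build_instructions(plain_text=True, structured_text=True, language="english"):
    """Return Build instructions for text extraction as a dict.

    Hypothetical helper: only the JSON keys mirror this guide.
    """
    return {
        # "document" is the name of the multipart form field that carries the file.
        "parts": [{"file": "document"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
            "language": language,
        },
    }


if __name__ == "__main__":
    # Serialize the dict for the "instructions" form field of the request.
    print(json.dumps(build_instructions(), indent=2))
```

Passing structured_text=False, for example, would request plain text only.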

curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "language": "english"
  }
}' \
  -o result.json

The same request as raw HTTP:

POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "language": "english"
  }
}
--customboundary--
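
For readers who prefer a script over curl, the raw request above can be approximated with Python's standard library alone. This is a minimal sketch, not an official client: the helper names encode_multipart and extract_text are hypothetical, the URL and token format are copied from the curl example, and it assumes the server answers with a JSON body.

```python
import json
import urllib.request
import uuid

# Instructions copied from the request shown in this guide.
INSTRUCTIONS = {
    "parts": [{"file": "document"}],
    "output": {
        "type": "json-content",
        "plainText": True,
        "structuredText": True,
        "language": "english",
    },
}


def encode_multipart(pdf_bytes, instructions, filename="example-document.pdf"):
    """Return (Content-Type header value, body bytes) for the two form parts."""
    boundary = uuid.uuid4().hex  # random boundary, assumed not to occur in the PDF
    delimiter = b"--" + boundary.encode()
    lines = [
        delimiter,
        f'Content-Disposition: form-data; name="document"; filename="{filename}"'.encode(),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="instructions"',
        b"Content-Type: application/json",
        b"",
        json.dumps(instructions).encode(),
        delimiter + b"--",  # closing boundary
        b"",
    ]
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


def extract_text(pdf_path, token, url="http://localhost:5000/api/build"):
    """POST the document and return the parsed JSON response (hypothetical helper)."""
    with open(pdf_path, "rb") as f:
        content_type, body = encode_multipart(f.read(), INSTRUCTIONS)
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": content_type,
            "Authorization": f"Token token={token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_text("/path/to/example-document.pdf", "<API token>") against a running instance would then return the extraction result as a Python dict.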

For more information on the Build instructions, refer to the API Reference.

Example Data Extraction Response

{
	"pages": [
		{
			"pageIndex": 0,
			"plainText": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.\n",
			"structuredText": {
				"characters": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"value": "T"
					}
				],
				"lines": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"firstWordIndex": 0,
						"isRTL": false,
						"isVertical": false,
						"wordCount": 5
					}
				],
				"paragraphs": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"firstLineIndex": 0,
						"lineCount": 3
					}
				],
				"words": [
					{
						"bbox": {
							"left": 0,
							"top": 0,
							"width": 100,
							"height": 100
						},
						"characterCount": 4,
						"firstCharacterIndex": 0,
						"isFromDictionary": true,
						"value": "word"
					}
				]
			}
		}
	]
}
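
The response above can be walked with a few lines of Python. The sketch below rebuilds the text of each line from the words array. It assumes that firstWordIndex and wordCount refer to positions in the same page's words array, which the field names suggest but this guide does not state; the function names and the sample data are made up for illustration.

```python
def words_in_line(structured_text, line):
    """Return the word values covered by one entry of "lines".

    Assumption: firstWordIndex/wordCount index into structured_text["words"].
    """
    start = line["firstWordIndex"]
    end = start + line["wordCount"]
    return [word["value"] for word in structured_text["words"][start:end]]


def page_lines(page):
    """Return one string per line of a page, or [] without structured text."""
    structured_text = page.get("structuredText") or {}
    return [
        " ".join(words_in_line(structured_text, line))
        for line in structured_text.get("lines", [])
    ]


if __name__ == "__main__":
    # Made-up sample in the shape of the response above (bbox fields omitted).
    sample = {
        "pages": [
            {
                "pageIndex": 0,
                "plainText": "Lorem ipsum\ndolor\n",
                "structuredText": {
                    "words": [
                        {"value": "Lorem"},
                        {"value": "ipsum"},
                        {"value": "dolor"},
                    ],
                    "lines": [
                        {"firstWordIndex": 0, "wordCount": 2},
                        {"firstWordIndex": 2, "wordCount": 1},
                    ],
                },
            }
        ]
    }
    for page in sample["pages"]:
        print(page["pageIndex"], page_lines(page))
```

If only plainText was requested, page_lines returns an empty list and the plainText field of each page can be read directly.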