OCR a PDF in Linux

Information

PSPDFKit Processor has been deprecated and replaced by PSPDFKit Document Engine. All PSPDFKit Processor licenses will work as before and be supported until 15 May 2024 (we will contact you about license migration). To start using Document Engine, refer to the migration guide. With Document Engine, you’ll have access to robust new capabilities (read the blog for more information).

This guide provides an overview of the OCR API and how to use it. For information on what OCR can do, please see the OCR overview guide.

Before you get started, make sure Processor is up and running.

You can download and use either of the following sample documents for the examples in this guide:

You’ll be sending multipart POST requests with instructions to Processor’s /build endpoint. To learn more about multipart requests, refer to our blog post on the topic, A Brief Tour of Multipart Requests.

Check out the API Reference to learn more about the /build endpoint and all the actions you can perform on PDFs with PSPDFKit Processor.

Running OCR on All Pages

To perform OCR on all pages of a document, post a multipart request to the /build API endpoint, applying the ocr action to the document. Learn more about the schema for /build instructions in our API Reference.

curl -X POST http://localhost:5000/build \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
  "parts": [
    {
      "file": "document"
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}' \
  -o result.pdf

The same request as raw HTTP:

POST /build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}
--customboundary--
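
When the request above runs from a script, it can help to fail loudly if Processor returns an error body instead of a PDF. The following is a minimal sketch, not an official client: it assumes Processor is reachable at http://localhost:5000 as in the example, and that a successful response starts with the standard %PDF- header. The function names is_pdf and ocr_all_pages are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if the file starts with the PDF header "%PDF-".
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Hypothetical wrapper around the all-pages OCR request shown in this guide.
# Usage: ocr_all_pages input.pdf output.pdf
ocr_all_pages() {
  local input="$1" output="$2"
  # --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
  # --silent --show-error hides the progress meter but still prints errors.
  curl --fail --silent --show-error -X POST http://localhost:5000/build \
    -F document=@"$input" \
    -F instructions='{"parts":[{"file":"document"}],"actions":[{"type":"ocr","language":"english"}]}' \
    -o "$output" || return 1
  # Double-check that what was saved looks like a PDF.
  is_pdf "$output"
}
```

A call such as ocr_all_pages /path/to/example-document.pdf result.pdf then returns a non-zero status if the request fails or the saved file does not look like a PDF.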

Running OCR on Specific Pages of a Document

Running OCR only on the relevant pages of a large document, rather than on the entire document, can significantly speed up processing.

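To perform OCR on the second and third pages (indexes 1 and 2) of a document, split the document into parts by page index and apply the ocr action only to the part that needs it. Page indexes are zero-based, and the start and end values are inclusive. The output of the /build request is the result of merging all the parts in the order they're listed in the instructions. The example below assumes a document with at least eight pages (indexes 0 to 7).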

To learn more about instructions, go to the API Reference.

curl -X POST http://localhost:5000/build \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
  "parts": [
    {
      "file": "document",
      "pages": {
        "start": 0,
        "end": 0
      }
    },
    {
      "file": "document",
      "pages": {
        "start": 1,
        "end": 2
      },
      "actions": [
        {
          "type": "ocr",
          "language": "english"
        }
      ]
    },
    {
      "file": "document",
      "pages": {
        "start": 3,
        "end": 7
      }
    }
  ]
}' \
  -o result.pdf

The same request as raw HTTP:

POST /build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document",
      "pages": {
        "start": 0,
        "end": 0
      }
    },
    {
      "file": "document",
      "pages": {
        "start": 1,
        "end": 2
      },
      "actions": [
        {
          "type": "ocr",
          "language": "english"
        }
      ]
    },
    {
      "file": "document",
      "pages": {
        "start": 3,
        "end": 7
      }
    }
  ]
}
--customboundary--
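
The page ranges in the example above are hardcoded. As a rough sketch, the same instructions can be generated for any range. The helper name ocr_range_instructions is hypothetical, and it assumes the caller already knows the document's total page count and passes non-negative integers.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print /build instructions that run OCR on pages
# START..END (zero-based, inclusive) of a document with TOTAL pages and
# pass the remaining pages through unchanged.
# Usage: ocr_range_instructions START END TOTAL
ocr_range_instructions() {
  local start="$1" end="$2" total="$3" parts=""

  # Reject ranges that fall outside the document.
  if [ "$start" -lt 0 ] || [ "$end" -lt "$start" ] || [ "$end" -ge "$total" ]; then
    echo "invalid page range" >&2
    return 1
  fi

  # Part 1: pages before the OCR range, if any.
  if [ "$start" -gt 0 ]; then
    parts="{\"file\":\"document\",\"pages\":{\"start\":0,\"end\":$((start - 1))}},"
  fi

  # Part 2: the pages to OCR.
  parts="${parts}{\"file\":\"document\",\"pages\":{\"start\":${start},\"end\":${end}},\"actions\":[{\"type\":\"ocr\",\"language\":\"english\"}]}"

  # Part 3: pages after the OCR range, if any.
  if [ "$end" -lt $((total - 1)) ]; then
    parts="${parts},{\"file\":\"document\",\"pages\":{\"start\":$((end + 1)),\"end\":$((total - 1))}}"
  fi

  printf '{"parts":[%s]}\n' "$parts"
}
```

For an eight-page document, passing -F instructions="$(ocr_range_instructions 1 2 8)" to curl sends the same three parts as the example above, in compact form.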

Running OCR on a Document from a URL

To run OCR on a document that Processor fetches from a URL, set the part's file to an object with a url key instead of uploading the document:

curl -X POST http://localhost:5000/build \
  -F instructions='{
  "parts": [
    {
      "file": {
        "url": "https://pspdfkit.com/downloads/examples/paper.pdf"
      }
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}' \
  -o result.pdf

The same request as raw HTTP:

POST /build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary

--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": {
        "url": "https://pspdfkit.com/downloads/examples/paper.pdf"
      }
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}
--customboundary--
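
If the URL comes from a variable rather than a literal, it has to be embedded in the instructions as a valid JSON string. Here is a small sketch of one way to do that. The helper name ocr_url_instructions is hypothetical, and it escapes only backslashes and double quotes, which is assumed to be enough for ordinary URLs.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print /build instructions that run OCR on a remote document.
# Usage: ocr_url_instructions URL
ocr_url_instructions() {
  local url
  # Escape backslashes first, then double quotes, so the URL stays a valid JSON string.
  url="$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '{"parts":[{"file":{"url":"%s"}}],"actions":[{"type":"ocr","language":"english"}]}\n' "$url"
}
```

The output can be passed to curl with -F instructions="$(ocr_url_instructions "$DOCUMENT_URL")", where DOCUMENT_URL is a placeholder for a variable holding the document's address.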