OCR a PDF in Java
This guide covers how to use OCR with the PSPDFKit Java Library. If you haven’t done so already, please read and follow the steps of the OCR Integration guide to enable OCR functionality in the PSPDFKit Java Library. For more information about what OCR is and how it fits into PSPDFKit, please refer to our OCR Overview guide.
Performing OCR on a document with the PSPDFKit Java Library is straightforward. All OCR functionality is controlled via the OcrProcessor class, as the following example shows:
    // Specify the file to perform OCR on and open it.
    File file = new File("ocrExample.pdf");
    PdfDocument pdfDocument = PdfDocument.open(new FileDataProvider(file));

    // Create a file to write the new document that will contain the OCR data
    // extracted from the source document.
    File outputFile = File.createTempFile("ocrOutput", ".pdf");
    FileDataProvider outputDataProvider = new FileDataProvider(outputFile);

    // Create the processor with the open document using the processor `Builder` and perform OCR.
    OcrProcessor ocrProcessor = new OcrProcessor.Builder(pdfDocument).build();
    ocrProcessor.performOcr(outputDataProvider);
The code above performs OCR on ocrExample.pdf and writes the result to a new temporary file. Because File.createTempFile generates a unique name, the output file name starts with ocrOutput and ends with .pdf rather than being exactly ocrOutput.pdf. Text in the document that is part of an image (for example, all text found in a scanned document) is now fully searchable and annotatable.
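If a predictable output location is preferable to a temporary file, one option is to derive the output path from the input path. The following is a minimal sketch using only the Java standard library; the OcrOutputFiles class, the siblingFor method, and the "-ocr" suffix are hypothetical names chosen for illustration and are not part of the PSPDFKit API.

```java
import java.io.File;

/** Hypothetical helper for naming OCR output files; not part of PSPDFKit. */
final class OcrOutputFiles {
    private OcrOutputFiles() {}

    /**
     * Returns a file in the same directory as the input whose name is the
     * input's base name followed by "-ocr.pdf" (an assumed naming scheme).
     */
    static File siblingFor(File input) {
        String name = input.getName();
        int dot = name.lastIndexOf('.');
        // Strip the extension if there is one; keep dotfiles such as ".pdf" whole.
        String base = dot > 0 ? name.substring(0, dot) : name;
        // A null parent is allowed here and yields a path relative to the working directory.
        return new File(input.getParentFile(), base + "-ocr.pdf");
    }
}
```

The resulting File could then be wrapped in a FileDataProvider in place of the temporary file used in the example above.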
Because it can take a long time to process a document for OCR, it can be useful to apply domain-specific knowledge, such as limiting the pages where OCR is performed:
    // Create the processor with the open document and perform OCR only on page `0`.
    Set<Integer> pages = new HashSet<>();
    pages.add(0);
    OcrProcessor ocrProcessor = new OcrProcessor.Builder(pdfDocument)
        .setPages(pages)
        .build();
    ocrProcessor.performOcr(outputDataProvider);
The code above restricts OCR to page index 0. Page indices are zero-based, so this instructs the OCR Processor to analyze and perform OCR on the first page only. This can reduce processing time considerably.
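Because page indices are zero-based while page numbers shown in a PDF viewer usually start at 1, it can help to convert between the two in one place. The sketch below uses only the Java standard library; the PageIndices class and its fromPageNumbers method are hypothetical names introduced for illustration and are not part of the PSPDFKit API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for building page index sets; not part of PSPDFKit. */
final class PageIndices {
    private PageIndices() {}

    /**
     * Converts an inclusive range of one-based page numbers into the
     * corresponding set of zero-based page indices.
     */
    static Set<Integer> fromPageNumbers(int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException(
                "Expected 1 <= firstPage <= lastPage, got " + firstPage + " and " + lastPage);
        }
        Set<Integer> indices = new LinkedHashSet<>();
        for (int page = firstPage; page <= lastPage; page++) {
            indices.add(page - 1); // Page number 1 corresponds to page index 0.
        }
        return indices;
    }
}
```

For example, PageIndices.fromPageNumbers(1, 3) yields the indices 0, 1, and 2, and the returned set could be passed to setPages in place of the hand-built set shown above.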
Because the PSPDFKit Java Library cannot know the language of your document, this information needs to be passed to the OCR Processor. For more information on all the languages PSPDFKit supports, please refer to our OCR Language Support guide.
By default, the PSPDFKit Java Library uses English. You can set a different OCR language with the setLanguage method of the processor Builder:
    // Create the processor with the open document and perform OCR using the Finnish language.
    OcrProcessor ocrProcessor = new OcrProcessor.Builder(pdfDocument)
        .setLanguage(OcrLanguage.Finnish)
        .build();
    ocrProcessor.performOcr(outputDataProvider);