Usage

This guide covers how to use OCR with the PSPDFKit .NET Library. If you haven’t done so already, please read and follow the steps of the OCR Integration guide to enable OCR functionality in the PSPDFKit .NET Library. For more information about what OCR is and how it fits into PSPDFKit, please refer to our OCR Overview guide.

Perform OCR on a PDF

Performing OCR on a document with the PSPDFKit .NET Library takes only a few lines of code. All OCR functionality is controlled via the OcrProcessor class:

using System.IO; // Provides Path, used for the temporary output file below.

// Open the PDF you want to perform OCR on.
var document = new Document(new FileDataProvider("Assets/ocrExample.pdf"));

// Create a data provider for a temporary file that will receive the new document containing the OCR data extracted from the source document.
var outputDataProvider = new FileDataProvider(Path.GetTempFileName());

// Create the processor with the open document and perform OCR.
var ocrProcessor = new OcrProcessor(document);
ocrProcessor.PerformOcr(outputDataProvider);

The code above performs OCR on Assets/ocrExample.pdf and writes the result to a new file. Any text in the document that is part of an image (for example, all text in a scanned document) is now fully searchable and annotatable.

Limit to a Page Range

Because it can take a long time to process a document for OCR, it can be useful to apply domain-specific knowledge, such as limiting the pages where OCR is performed:

// Create the processor with the open document and perform OCR only on page `0`.
var ocrProcessor = new OcrProcessor(document) {
  Pages = new List<int> {0}
};
ocrProcessor.PerformOcr(outputDataProvider);

In the code above, the public member Pages is set to a list containing only index 0. Page indices are zero-based, so this instructs the OCR Processor to analyze and perform OCR on the first page only. Doing so can reduce processing time considerably.
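
When more than one page is needed, the index list can be generated rather than typed by hand. The following is a minimal sketch, assuming the pages are given as a one-based, inclusive range of the kind a PDF viewer displays. PageRangeHelper and ToZeroBasedIndices are hypothetical names introduced for illustration and are not part of the PSPDFKit API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of PSPDFKit): converts a one-based,
// inclusive page range into a list of zero-based page indices.
static class PageRangeHelper
{
    public static List<int> ToZeroBasedIndices(int firstPage, int lastPage)
    {
        // Reject ranges that start before page 1 or that end before they start.
        if (firstPage < 1)
            throw new ArgumentOutOfRangeException(nameof(firstPage), "Pages are numbered from 1.");
        if (lastPage < firstPage)
            throw new ArgumentOutOfRangeException(nameof(lastPage), "The range must not end before it starts.");

        // Enumerable.Range takes a start value and a count, not an end value.
        return Enumerable.Range(firstPage - 1, lastPage - firstPage + 1).ToList();
    }
}
```

With this helper, pages 1 through 3 of a document could be selected by writing Pages = PageRangeHelper.ToZeroBasedIndices(1, 3) in the object initializer shown above, which yields the indices 0, 1, and 2.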

Language Selection

Because the PSPDFKit .NET Library cannot know the language of your document, this information needs to be passed into the OCR Processor. For more information on all the languages PSPDFKit supports, please refer to our OCR Language Support guide.

By default, the PSPDFKit .NET Library uses English. To change this, you can set the language you require via a public member in the OcrProcessor:

// Create the processor with the open document and perform OCR using the Finnish language.
var ocrProcessor = new OcrProcessor(document) {
  Language = OcrLanguage.Finnish
};
ocrProcessor.PerformOcr(outputDataProvider);