Introducing the OCR Component

Today we’re proud to release the newest addition to the PSPDFKit component library: OCR. OCR, which stands for Optical Character Recognition, can be used to enhance raster and vector PDFs to unlock previously inaccessible text and make it available for text selection, annotation, search, accessibility, and more. For the first time, we’re releasing a component on five PSPDFKit platforms at the same time: OCR is available on iOS, Android, Web, Java, and .NET.

Optical Character Recognition (OCR)

When working with PDFs, you might encounter documents that contain pages without accessible text. This is especially common when dealing with scanned documents or documents that contain photographed pages. In these kinds of documents, the text is visible on the PDF page, but it cannot be selected. It’s also not possible to search for this kind of text content or use PDF markup tools to annotate it. Since PDF editors simply don’t have access to the textual data, it’s also not possible to leverage accessibility features — such as screen readers — or to develop automation workflows that would extract textual data for future processing.

If your workflows include such documents, then OCR is the right tool for you. PSPDFKit’s OCR processor can detect typed and printed text on PDF pages that is not already represented by machine-readable real text objects and add the necessary text data as an overlay on top of the existing content. This unlocks the full suite of PDF tools available for working with text while preserving the original text appearance.

Integration with PSPDFKit

There’s no lack of choice when it comes to OCR solutions, which range from low-level libraries to dedicated SaaS products. But when it comes to ease of integration, our OCR component has a key advantage over other solutions: It comes pre-bundled in your PSPDFKit Server Docker image and is an easy drop-in addition on our other platforms. This makes adding OCR to your existing PSPDFKit setup a breeze.

Since OCR integrates into existing PSPDFKit document processing functionality, it’s also easy to get started with the API. You’ll be able to leverage the familiar environment and make OCR just another step in your PSPDFKit document processing pipeline.

To learn more about how to integrate and use OCR with PSPDFKit, check out our integration guides: iOS, Android, Web, Java, and .NET.

Languages

PSPDFKit’s OCR component uses a machine-learning technique, a long short-term memory (LSTM) neural network, to detect and recognize inaccessible typed text in a PDF document. To perform this operation, the OCR processor requires data models trained to identify typed text in a particular language. As part of the OCR component, we provide data models for 21 languages.

Should your needs extend outside of our list of supported languages, we’re more than happy to receive requests for additional languages through our support form.

Benefits

Documents processed with OCR have many benefits over their original counterparts. Here are just a few:

Improved Accessibility — All text in your PDF documents becomes machine readable.
Text Selection — Text can be copied and pasted into a different application.
Powerful Annotations — PDF text markup annotations can be used for highlighting, underlining, and striking through text.
Search — Search for text in the currently displayed document, or use our full text engine to index and find text across documents.
Automation and Extraction — Use our model API to extract text and perform structural detection and other text-processing operations.
Redaction — Integrate OCR with our Redaction component to automate redaction of sensitive information.

Accuracy

As with most machine-learning technologies, the quality of OCR results can vary. Input documents differ in so many ways (size, color, noise, font, and so on) that it’s impossible to train a perfect recognition model for all documents.

That said, PDF quality correlates strongly with OCR accuracy. If the PDF is pixelated, is noisy, or contains camera images with glare, OCR accuracy will naturally be reduced. Take the example of a perfectly scanned document: Performing OCR may result in a 100 percent success rate. But take a physically scanned book, with noise introduced by the book binding, and it may result in an 80 percent success rate.
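
The success rates above aren’t tied to a specific metric in this post. One common way to put a number on OCR quality is character accuracy: compare the recognized text against a known-correct transcription and count the single-character edits needed to turn one into the other. The sketch below is a generic illustration of that idea and is not part of the PSPDFKit API; the function names `edit_distance` and `character_accuracy` are made up for this example.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions between two strings."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of reference characters recognized correctly, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Two misread characters ("O" as "0", "l" as "1") in a 10-character string.
print(f"{character_accuracy('OCR result', '0CR resu1t'):.0%}")  # prints "80%"
```

Running a check like this on a handful of representative pages, with transcriptions you trust, gives a rough sense of how your own documents fare before you roll OCR out more widely.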

PSPDFKit employs various techniques to preprocess input pages to improve recognition accuracy. These include noise reduction, contrast enhancement, text alignment, orientation detection, and more. If you’re in charge of PDF creation, then ensuring good-quality page capture will often be the best thing you can do to get great OCR results.

PSPDFKit also detects which parts of a PDF page already have real text objects and which don’t. This means that processing a document that mixes inaccessible text with real, selectable text won’t degrade the existing real text data in the document. Instead, it will enhance only the parts with previously inaccessible text.
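
To make the mixed-content behavior concrete, here is a small conceptual sketch of the selection step: only areas that visibly contain text but have no real text objects are passed on to recognition, and everything else is left alone. This is an illustration of the idea only, not PSPDFKit’s implementation or API; the `PageRegion` type, its fields, and `regions_needing_ocr` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class PageRegion:
    """Hypothetical description of one rectangular area of a PDF page."""
    label: str
    has_text_objects: bool  # True if the PDF already stores real text here
    looks_like_text: bool   # True if the pixels appear to contain glyphs


def regions_needing_ocr(regions: list[PageRegion]) -> list[PageRegion]:
    """Keep only regions that show text visually but lack real text objects."""
    return [r for r in regions if r.looks_like_text and not r.has_text_objects]


page = [
    PageRegion("typed header", has_text_objects=True, looks_like_text=True),
    PageRegion("scanned letter body", has_text_objects=False, looks_like_text=True),
    PageRegion("photo", has_text_objects=False, looks_like_text=False),
]

for region in regions_needing_ocr(page):
    print(region.label)  # prints "scanned letter body"
```

The typed header is skipped because its text is already machine readable, and the photo is skipped because there’s nothing to recognize, so existing text data is never overwritten.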

Try It!

We encourage you to give OCR a try and see if it can be a beneficial addition to your application. It’s now available for you to evaluate as part of our free trial. Please feel free to reach out to us if you have any further questions about the component, or ping our sales team to receive a quote if you would like to add OCR to your PSPDFKit license.

Author

Matej Bukovinski, CTO

Matej is a software engineering leader from Slovenia. He began his career freelancing and contributing to open source software. Later, he joined PSPDFKit, where he played a key role in creating its initial products and teams, eventually taking over as the company’s Chief Technology Officer. Outside of work, Matej enjoys playing tennis, skiing, and traveling.