Scan and Convert to Searchable PDFs on Android

PSPDFKit ships with advanced OCR capabilities.

When working with PDFs, you might encounter documents that contain pages with inaccessible text. This is especially common when dealing with scanned documents or documents that contain photographed pages. With our OCR component, you can convert the content of those raster and vector PDFs into interactive text, which unlocks PDF text functionality such as text markup annotations, text selection, text extraction, and search.

OCR is an additional component that can be added to your license. Please reach out to us if you’re interested in adding this to your license, if you want to learn more about the roadmap for OCR, or if you want to provide feedback and feature requests related to your use case.

OCR supports detecting text written in many different languages. For an extensive list, see the supported languages guide.

Before following the next steps, please make sure you’ve set up OCR correctly, as described in the getting started guide.

OCR builds on top of the PdfProcessor APIs, which offer a range of input and output sources to work with.

Performing OCR

To perform OCR on a document, create a new PdfProcessorTask and use its performOcrOnPages() method to specify the zero-based indexes of the pages that OCR should be performed on, along with the language that should be used:

// Set up a set of all pages that should be processed.
val allPages: Set<Int> = (0 until document.pageCount).toSet()
// Create a task and configure it for OCR processing. Here, we'll detect English text.
val task = PdfProcessorTask.fromDocument(document)
    .performOcrOnPages(allPages, OcrLanguage.ENGLISH)

// Set up a set of all pages that should be processed.
final Set<Integer> allPages = new HashSet<>();
for (int pageIndex = 0; pageIndex < document.getPageCount(); pageIndex++) {
    allPages.add(pageIndex);
}
// Create a task and configure it for OCR processing. Here, we'll detect English text.
final PdfProcessorTask task = PdfProcessorTask.fromDocument(document)
    .performOcrOnPages(allPages, OcrLanguage.ENGLISH);
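OCR doesn't have to cover every page. As a minimal sketch, assuming you only want to process a contiguous range of pages, a small helper can build the index set that is passed to performOcrOnPages(). The names OcrPageSelection and pageRange are hypothetical and aren't part of PSPDFKit:

```java
import java.util.LinkedHashSet;
import java.util.Set;

class OcrPageSelection {
    /**
     * Hypothetical helper: returns the zero-based page indexes in
     * [firstPage, lastPageExclusive), in ascending order.
     */
    static Set<Integer> pageRange(int firstPage, int lastPageExclusive) {
        if (firstPage < 0 || lastPageExclusive < firstPage) {
            throw new IllegalArgumentException("Invalid page range");
        }
        final Set<Integer> pages = new LinkedHashSet<>();
        for (int pageIndex = firstPage; pageIndex < lastPageExclusive; pageIndex++) {
            pages.add(pageIndex);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Pages 3 to 5 of a document (indexes 2, 3, and 4).
        System.out.println(pageRange(2, 5)); // prints [2, 3, 4]
    }
}
```

The resulting set can be used in place of allPages in the task setup above. When doing so, make sure the upper bound doesn't exceed document.getPageCount().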

After setting up the processor task, start processing the document by passing the task to one of the document processing methods of PdfProcessor.

OCR processing speed depends on factors such as the size of the document, the number of processed pages, and the device that processing is performed on. Make sure to run processing off the main thread, either by using processDocumentAsync() or by calling any of the blocking processor methods on a background thread:

val outputFile = context.filesDir.resolve("processed-document.pdf")
val disposable = PdfProcessor.processDocumentAsync(task, outputFile).subscribe()

final File outputFile = new File(context.getFilesDir(), "processed-document.pdf");
final Disposable disposable = PdfProcessor.processDocumentAsync(task, outputFile).subscribe();
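If you prefer a blocking processor method over processDocumentAsync(), the call needs to be moved to a background thread yourself. The following is a minimal sketch of one way to do that with a dedicated worker thread. BackgroundOcr and processInBackground are hypothetical names, the blocking call is represented by a placeholder Consumer, and CompletableFuture is assumed to be available on your minimum API level:

```java
import java.io.File;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class BackgroundOcr {
    // A single daemon worker thread, so OCR jobs run one at a time off the calling thread.
    private static final ExecutorService OCR_EXECUTOR =
        Executors.newSingleThreadExecutor(runnable -> {
            final Thread thread = new Thread(runnable, "ocr-worker");
            thread.setDaemon(true);
            return thread;
        });

    /**
     * Hypothetical helper: runs the given blocking processing step on the worker
     * thread and completes with the output file once the step has returned.
     */
    static CompletableFuture<File> processInBackground(File outputFile,
                                                       Consumer<File> blockingProcessor) {
        return CompletableFuture.supplyAsync(() -> {
            blockingProcessor.accept(outputFile);
            return outputFile;
        }, OCR_EXECUTOR);
    }

    public static void main(String[] args) {
        final File outputFile = new File("processed-document.pdf");
        // Placeholder for the real work. In an app, this lambda would call one of
        // the blocking processor methods of PdfProcessor with the task and the file.
        final File result = processInBackground(outputFile,
            file -> System.out.println("Processing on " + Thread.currentThread().getName()))
            .join();
        System.out.println("Finished: " + result.getName());
    }
}
```

The main() method calls join() only to keep the demo alive until processing finishes. On Android, avoid blocking the main thread like this and instead react to completion with a callback such as thenAccept(), posting any UI updates back to the main thread.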

💡 Tip: The OCR processor is also capable of extracting text from partially detected pages. When processing pages that contain text streams for only parts of the visible text, the OCR processor will detect and embed text for the missing areas while leaving existing text streams untouched.

Language Selection

The OCR processor supports 21 different languages, each of which comes as a separate downloadable language pack. When calling performOcrOnPages(), pass it the OcrLanguage value that corresponds to the language of the text you want to detect.

ℹ️ Note: When performing OCR processing for a particular language for the first time, PSPDFKit will extract the language pack data for that language from the app’s assets and copy it into the app’s private directory. This is done on the fly, and it’s only done once per language.