Text Extraction

Extracting text from a PDF can be a complex task, so we offer several abstractions to make this simpler. In a PDF, text usually consists of glyphs that are absolutely positioned. PSPDFKit heuristically splits these glyphs up into words and blocks of text. Our text selection component leverages this information to allow users to select and annotate text.
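To make the idea of splitting positioned glyphs into words more concrete, here is a minimal sketch of one possible gap-based heuristic. It is an illustration only and not PSPDFKit's actual algorithm. The `Glyph` class, the `groupIntoWords` helper, and the `maxGap` threshold are all hypothetical names introduced for this example, and it assumes the glyphs sit on a single line and are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

class GlyphGrouping {
    /** Hypothetical positioned glyph: its character, left edge, and width in page units. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs (assumed to be on one line, sorted left to right) into words.
     * A new word starts whenever the horizontal gap to the previous glyph exceeds maxGap.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                double gap = g.x - (previous.x + previous.width);
                if (gap > maxGap && current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
                new Glyph('H', 0, 5), new Glyph('i', 5.5, 2),
                new Glyph('P', 12, 5), new Glyph('D', 17.5, 5), new Glyph('F', 23, 5));
        System.out.println(groupIntoWords(line, 2.0)); // prints [Hi, PDF]
    }
}
```

A real implementation would also have to deal with multiple lines, rotated text, varying font sizes, and right-to-left scripts, which is why a ready-made abstraction is useful here.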

Reading the Text

PdfDocument offers methods that allow you to access text from a given PDF page. It also ensures that the extracted text follows any logical structure defined in the PDF (see section 14.7 of the PDF Specification), which lets it handle different text-writing directions correctly. To get all text found on a single page, call PdfDocument#getPageText() with the page index:

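// Kotlin
val document: PdfDocument = ...
// Page indices are zero-based: 0 is the first page.
val pageText = document.getPageText(0)

// Java
PdfDocument document = ...
// Page indices are zero-based: 0 is the first page.
String pageText = document.getPageText(0);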
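
The single-page call above extends naturally to a whole document by looping over the page indices. The following is a minimal sketch under stated assumptions, not a definitive implementation. `PageTextSource` is a hypothetical stand-in interface introduced here so the example runs without the SDK. It assumes the document exposes a page count through a `getPageCount()` method alongside the `getPageText()` method shown above.

```java
import java.util.ArrayList;
import java.util.List;

class AllPagesText {
    /** Hypothetical stand-in for the two document calls this sketch relies on. */
    interface PageTextSource {
        int getPageCount();
        String getPageText(int pageIndex);
    }

    /** Collects the text of every page, in page order. */
    static List<String> extractAllPages(PageTextSource source) {
        List<String> pages = new ArrayList<>();
        for (int pageIndex = 0; pageIndex < source.getPageCount(); pageIndex++) {
            pages.add(source.getPageText(pageIndex));
        }
        return pages;
    }

    public static void main(String[] args) {
        // In-memory fake so the sketch runs without a PDF file.
        final String[] fakePages = {"First page", "Second page"};
        PageTextSource fake = new PageTextSource() {
            @Override
            public int getPageCount() {
                return fakePages.length;
            }

            @Override
            public String getPageText(int pageIndex) {
                return fakePages[pageIndex];
            }
        };
        System.out.println(String.join("\n---\n", extractAllPages(fake)));
    }
}
```

In an app, the fake would be replaced by a thin adapter that forwards both calls to a loaded PdfDocument. For large documents, it may be preferable to run the loop off the main thread.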