Text Extraction

Extracting text from a PDF can be a complex task, so we offer several abstractions to make this simpler. In a PDF, text usually consists of glyphs that are absolutely positioned. PSPDFKit heuristically groups these glyphs into words and blocks of text. Our user interface leverages this information to allow users to select and annotate text. You can read more about this in our Text Selection guide.

Text Parser

TextParser offers a simple API to get the text, glyphs (Glyph), words (Word), text blocks (TextBlock), and even images (ImageInfo) from a given PDF page. Every page of a PDF has a text parser that returns information about the text on that page:

// Swift
let document: Document = ...
// Page indexes are zero-based, so this is the parser for the first page.
let textParser = document.textParserForPage(at: 0)!
let glyphs = textParser.glyphs

// Objective-C
PSPDFDocument *document = ...;
PSPDFTextParser *textParser = [document textParserForPageAtIndex:0];
NSArray<PSPDFGlyph *> *glyphs = textParser.glyphs;

TextParser also ensures that the text it extracts from a PDF follows any logical structure defined in the PDF (see section 14.7 of the PDF specification), thereby enabling support for correctly handling different text-writing directions.

Glyphs, Text Blocks, Words, and Images

A PDF page with annotations for classes relevant to text extraction.

Glyphs

Glyph is the building block for all text extraction in PSPDFKit. It represents a single glyph on a PDF page. Its frame property specifies, in PDF coordinates, where it is located on the page, and its contents property returns the text it contains. The indexOnPage property specifies the index of the glyph on the page, in reading order. Consider a page with the following text:

The quick brown fox jumps over the lazy dog.
--------------------------^

The Glyph that represents the o in over will have an indexOnPage of 26. This index is unique to this glyph, and it can be used to directly access it from the glyphs array of a TextParser:

// Swift
let document: Document = ...
let textParser = document.textParserForPage(at: 0)!
let glyphs = textParser.glyphs
let glyph = glyphs[26]

// Guaranteed to be `true`.
let indexEqualTo26 = (glyph.indexOnPage == 26)

// Objective-C
PSPDFDocument *document = ...;
PSPDFTextParser *textParser = [document textParserForPageAtIndex:0];
NSArray<PSPDFGlyph *> *glyphs = textParser.glyphs;
PSPDFGlyph *glyph = glyphs[26];
// Guaranteed to be `YES`.
BOOL indexEqualTo26 = (glyph.indexOnPage == 26);

This makes getting a particular glyph (and glyphs near it) much faster, as you already know the index. Ordering glyphs correctly is important if, for example, you wish to combine multiple glyphs and display something to the user.
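
To illustrate why index-based access and reading order matter, here is a minimal, self-contained sketch. It does not use PSPDFKit: `SketchGlyph`, `makeGlyphs(from:)`, `neighbors(of:radius:in:)`, and `combinedText(of:)` are hypothetical names invented for this example, and the sketch assumes one glyph per character, which real PDFs do not guarantee.

```swift
import Foundation

// Hypothetical stand-in for a glyph; NOT the PSPDFKit `Glyph` class.
struct SketchGlyph {
    let contents: String   // the text the glyph contains
    let indexOnPage: Int   // position of the glyph in reading order
}

// Builds one stand-in glyph per character, indexed in reading order.
// Assumption: one glyph per character.
func makeGlyphs(from text: String) -> [SketchGlyph] {
    text.enumerated().map {
        SketchGlyph(contents: String($0.element), indexOnPage: $0.offset)
    }
}

// Returns the glyphs within `radius` positions of `index`,
// clamped to the bounds of the array.
func neighbors(of index: Int, radius: Int, in glyphs: [SketchGlyph]) -> [SketchGlyph] {
    guard glyphs.indices.contains(index), radius >= 0 else { return [] }
    let lower = max(glyphs.startIndex, index - radius)
    let upper = min(glyphs.endIndex - 1, index + radius)
    return Array(glyphs[lower...upper])
}

// Joins glyph contents in reading order, whatever order they arrive in.
func combinedText(of glyphs: [SketchGlyph]) -> String {
    glyphs.sorted { $0.indexOnPage < $1.indexOnPage }
        .map(\.contents)
        .joined()
}

let glyphs = makeGlyphs(from: "The quick brown fox jumps over the lazy dog.")
let nearby = neighbors(of: 26, radius: 3, in: glyphs)

// Shuffling simulates glyphs collected out of order; sorting by
// `indexOnPage` restores the reading order before joining.
print(combinedText(of: nearby.shuffled()))  // prints "ps over"
```

Because the index doubles as the array position, the lookup is a constant-time subscript rather than a search, and sorting by the same index is enough to restore reading order before display.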

Text Blocks

A TextBlock returned from the TextParser represents a contiguous group of glyphs, usually in a single line. For PDFs with multiple columns of text, a text block is a single line in a given column. TextBlock is backed by an NSRange (TextBlock.range) that describes the range of glyphs in TextParser.glyphs that the block represents. The same information is available for a Word via Word.range.

To fetch the glyphs associated with a given text block, pass the block's range to the text parser:

// Swift
// `block` must be non-optional (or unwrapped) before `range` can be read.
let block: TextBlock = ...
let parser: TextParser = ...
let glyphs: [Glyph] = parser.glyphs(in: block.range)

// Objective-C
PSPDFTextBlock *block = ...;
PSPDFTextParser *parser = ...;
NSArray<PSPDFGlyph *> *glyphsInBlock = [parser glyphsInRange:block.range];
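
The relationship between a block's range and the page's glyph array can be sketched without the framework. The following is an illustration only: `SketchGlyph`, `SketchTextBlock`, and the free function `glyphs(in:of:)` are hypothetical stand-ins, not PSPDFKit API, and the bounds handling shown here is an assumption about what a defensive caller might want rather than a description of what the real parser does.

```swift
import Foundation

// Hypothetical stand-ins; NOT the PSPDFKit `Glyph` and `TextBlock` classes.
struct SketchGlyph {
    let contents: String
    let indexOnPage: Int
}

struct SketchTextBlock {
    let range: NSRange  // range of glyph indexes that the block covers
}

// Returns the glyphs covered by `range`, or an empty array if the range
// is invalid or extends past the end of `allGlyphs`.
func glyphs(in range: NSRange, of allGlyphs: [SketchGlyph]) -> [SketchGlyph] {
    guard range.length >= 0,
          let indexes = Range<Int>(range),  // nil when location is NSNotFound
          indexes.lowerBound >= 0,
          indexes.upperBound <= allGlyphs.count else { return [] }
    return Array(allGlyphs[indexes])
}

// Assumption: one glyph per character of the line.
let line = "The quick brown fox"
let pageGlyphs = line.enumerated().map {
    SketchGlyph(contents: String($0.element), indexOnPage: $0.offset)
}

// A block covering glyph indexes 4 through 8.
let block = SketchTextBlock(range: NSRange(location: 4, length: 5))
let blockText = glyphs(in: block.range, of: pageGlyphs)
    .map(\.contents)
    .joined()
print(blockText)  // prints "quick"
```

Storing only a range keeps the block lightweight: the glyphs themselves live once in the parser's array, and blocks and words simply point into it.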

Words

A Word, as the name suggests, represents a single word in a PDF. TextParser automatically generates these words when parsing the text blocks, and they can be retrieved via the words property. You can also access the words in a particular text block via the TextBlock.words property.
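
Since word ranges index into the same glyph array, one possible use is mapping a glyph index back to the word that contains it. The sketch below models this with stand-in types: `SketchWord`, `makeWords(from:)`, and `word(containingGlyphAt:in:)` are hypothetical names, and splitting on spaces is a deliberate simplification that does not reproduce PSPDFKit's heuristics.

```swift
import Foundation

// Hypothetical stand-in for a word; NOT the PSPDFKit `Word` class.
struct SketchWord {
    let text: String
    let range: NSRange  // range of glyph indexes that the word covers
}

// Splits a line into words on spaces, recording each word's glyph range.
// Assumption: one glyph per character; punctuation stays attached to its word.
func makeWords(from line: String) -> [SketchWord] {
    var words: [SketchWord] = []
    var current = ""
    var start = 0
    for (index, character) in line.enumerated() {
        if character == " " {
            if !current.isEmpty {
                words.append(SketchWord(
                    text: current,
                    range: NSRange(location: start, length: current.count)))
                current = ""
            }
        } else {
            if current.isEmpty { start = index }
            current.append(character)
        }
    }
    if !current.isEmpty {
        words.append(SketchWord(
            text: current,
            range: NSRange(location: start, length: current.count)))
    }
    return words
}

// Finds the word whose range contains the glyph at `index`, if any.
func word(containingGlyphAt index: Int, in words: [SketchWord]) -> SketchWord? {
    words.first { NSLocationInRange(index, $0.range) }
}

let words = makeWords(from: "The quick brown fox jumps over the lazy dog.")
if let hit = word(containingGlyphAt: 26, in: words) {
    print(hit.text)  // prints "over"
}
```

In this model, a glyph index that falls on a space (such as 25) belongs to no word, so the lookup returns `nil`.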

Images

Embedded images in a PDF page are represented by the ImageInfo class and can be retrieved via the images property on TextParser. ImageInfo also provides methods to extract an image from a PDF as a UIImage.