Extracting text from a PDF can be a complex task, so we offer several abstractions to make this simpler. In a PDF, text usually consists of glyphs that are absolutely positioned. PSPDFKit heuristically splits these glyphs up into words and blocks of text. Our user interface leverages this information to allow users to select and annotate text. You can read more about this in our Text Selection guide.
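To make the idea of heuristically grouping positioned glyphs more concrete, here is a minimal sketch in plain Swift. It is not PSPDFKit's algorithm. `PositionedGlyph`, `groupIntoWords`, and the `maxGap` threshold are hypothetical names introduced purely for illustration, and the sketch assumes that all glyphs sit on a single horizontal line.

```swift
// Hypothetical stand-in for an absolutely positioned glyph.
// This is NOT PSPDFKit's Glyph type.
struct PositionedGlyph {
    let character: Character
    let x: Double      // Left edge, in page units.
    let width: Double
}

// Groups the glyphs of one line into words. A new word starts at a
// whitespace glyph, or when the horizontal gap to the previous glyph
// is larger than `maxGap` (an assumed, illustrative threshold).
func groupIntoWords(_ glyphs: [PositionedGlyph], maxGap: Double) -> [String] {
    var words: [String] = []
    var current = ""
    var previousRightEdge: Double? = nil

    for glyph in glyphs.sorted(by: { $0.x < $1.x }) {
        let gapIsLarge = previousRightEdge.map { glyph.x - $0 > maxGap } ?? false
        if glyph.character.isWhitespace || gapIsLarge {
            if !current.isEmpty { words.append(current) }
            current = ""
        }
        if !glyph.character.isWhitespace {
            current.append(glyph.character)
        }
        previousRightEdge = glyph.x + glyph.width
    }
    if !current.isEmpty { words.append(current) }
    return words
}

// "Hi" followed by a wide gap and then "you", with no space glyph in between.
let sampleGlyphs = [
    PositionedGlyph(character: "H", x: 0, width: 5),
    PositionedGlyph(character: "i", x: 5, width: 2),
    PositionedGlyph(character: "y", x: 12, width: 5),
    PositionedGlyph(character: "o", x: 17, width: 5),
    PositionedGlyph(character: "u", x: 22, width: 5),
]
let sampleWords = groupIntoWords(sampleGlyphs, maxGap: 2)
```

Real PDFs often omit space glyphs entirely, which is why the sketch also treats a wide gap as a word boundary. A production heuristic would additionally need to handle multiple lines, columns, and different writing directions.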
TextParser offers a simple API to get the glyphs (Glyph), words (Word), text blocks (TextBlock), and even images (ImageInfo) from a given PDF page. Every page of a PDF has a text parser that returns information about the text on that page:
Swift:
let document: Document = ...
let textParser = document.textParserForPage(at: 0)!
let glyphs = textParser.glyphs

Objective-C:
PSPDFDocument *document = ...;
PSPDFTextParser *textParser = [document textParserForPageAtIndex:0];
NSArray<PSPDFGlyph *> *glyphs = textParser.glyphs;
TextParser also ensures that the text it extracts from a PDF follows any logical structure defined in the PDF (see section 14.7 of the PDF specification), thereby enabling support for correctly handling different text-writing directions.
Glyphs, Text Blocks, Words, and Images
Glyph is the building block for all text extraction in PSPDFKit. It represents a single glyph on a PDF page. Its frame property specifies, in PDF coordinates, where it is located on the page, and its contents property returns the text it contains. The indexOnPage property specifies the index of the glyph on the page, in reading order. Consider a page with the following text:
The quick brown fox jumps over the lazy dog.
--------------------------^
Swift:
let document: Document = ...
let textParser = document.textParserForPage(at: 0)!
let glyphs = textParser.glyphs
let glyph = glyphs[26]
// Guaranteed to be `true`.
let indexEqualTo26 = (glyph.indexOnPage == 26)

Objective-C:
PSPDFDocument *document = ...;
PSPDFTextParser *textParser = [document textParserForPageAtIndex:0];
NSArray<PSPDFGlyph *> *glyphs = textParser.glyphs;
PSPDFGlyph *glyph = glyphs[26];
// Guaranteed to be `YES`.
BOOL indexEqualTo26 = (glyph.indexOnPage == 26);
This makes getting a particular glyph (and glyphs near it) much faster, as you already know the index. Ordering glyphs correctly is important if, for example, you wish to combine multiple glyphs and display something to the user.
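As a rough illustration of why a known index makes nearby glyphs cheap to reach, the sketch below slices a window around a given index. It uses a hypothetical stand-in type, `IndexedGlyph`, instead of PSPDFKit's Glyph, and it assumes that the glyph at array position i has an indexOnPage of i, as in the example above. The helper name `neighborhood(of:radius:in:)` is likewise made up for this sketch.

```swift
// Hypothetical stand-in that models only the two properties used here.
// This is NOT PSPDFKit's Glyph type.
struct IndexedGlyph {
    let contents: String
    let indexOnPage: Int
}

// Returns the glyph at `index` plus up to `radius` glyphs on either side,
// clamped to the bounds of the array.
func neighborhood(of index: Int, radius: Int, in glyphs: [IndexedGlyph]) -> ArraySlice<IndexedGlyph> {
    precondition(glyphs.indices.contains(index), "index out of bounds")
    precondition(radius >= 0, "radius must not be negative")
    let lower = max(glyphs.startIndex, index - radius)
    let upper = min(glyphs.endIndex - 1, index + radius)
    return glyphs[lower...upper]
}

// Build one stand-in glyph per character of the sample sentence.
let pageText = "The quick brown fox jumps over the lazy dog."
let pageGlyphs = pageText.enumerated().map { offset, character in
    IndexedGlyph(contents: String(character), indexOnPage: offset)
}

// The glyph at index 26 is the "o" in "over"; take two glyphs on each side.
let nearby = neighborhood(of: 26, radius: 2, in: pageGlyphs)
let nearbyText = nearby.map { $0.contents }.joined()
```

Because the slice is taken directly by index, no search over the page's glyphs is needed, and joining the slice in order preserves the reading order.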
A TextBlock returned from the TextParser represents a contiguous group of glyphs, usually in a single line. For PDFs with multiple columns of text, a text block is a single line in a given column.
TextBlock is backed by a range (TextBlock.range) that describes the range of glyphs in TextParser.glyphs that the block represents. The same information is available for a Word.
To fetch the glyphs associated with a given text block, retrieve them from the TextParser:
Swift:
let block: TextBlock = ...
let parser: TextParser = ...
let glyphs: [Glyph] = parser.glyphs(in: block.range)

Objective-C:
PSPDFTextBlock *block = ...;
PSPDFTextParser *parser = ...;
NSArray<PSPDFGlyph *> *glyphsInBlock = [parser glyphsInRange:block.range];
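The range-backed design can be sketched in plain Swift as follows. `RangedGlyph`, `RangedBlock`, and `text(of:in:)` are hypothetical stand-ins rather than PSPDFKit types, and the use of `Range<Int>` for the range is an assumption made to keep the example self-contained; the real type of TextBlock.range may differ.

```swift
// Hypothetical stand-ins; these are NOT PSPDFKit's Glyph and TextBlock types.
struct RangedGlyph {
    let contents: String
}

struct RangedBlock {
    // Indices into the page's glyph array that this block covers.
    let range: Range<Int>
}

// Rebuilds the text of a block by joining the contents of the glyphs
// that its range points at.
func text(of block: RangedBlock, in glyphs: [RangedGlyph]) -> String {
    // Clamp so that an oversized range cannot index out of bounds.
    let safeRange = block.range.clamped(to: glyphs.indices)
    return glyphs[safeRange].map { $0.contents }.joined()
}

// One stand-in glyph per character of a short sample line.
let lineGlyphs = "Hello world".map { RangedGlyph(contents: String($0)) }
let firstBlock = RangedBlock(range: 0..<5)
let secondBlock = RangedBlock(range: 6..<11)

let firstText = text(of: firstBlock, in: lineGlyphs)
let secondText = text(of: secondBlock, in: lineGlyphs)
```

Storing only a range keeps a block lightweight: the glyphs live in one array on the parser, and blocks (or words) simply point into it.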
Word, as the name suggests, represents a single word in a PDF. TextParser automatically generates these words when parsing the text blocks, and they can be retrieved via the words property. You can also access the words in a particular text block via the corresponding property on TextBlock.
Embedded images in a PDF page are represented by the ImageInfo class and can be retrieved via the images property on TextParser. ImageInfo also provides methods to extract an image from a PDF.