Extracting text from a PDF can be a complex task, so we offer several abstractions to make this simpler. In a PDF, text usually consists of glyphs that are absolutely positioned. PSPDFKit heuristically splits these glyphs up into words and blocks of text. Our PdfView leverages this information to allow users to select and annotate text.
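
    // Get the text parser for the page at index 0.
    var textParser = await doc.GetTextParserAsync(0);
    // `rects` is a collection of page regions defined elsewhere.
    var text = await textParser.GetTextForRectsAsync(rects);
    // All glyphs on the page, and the words built from them.
    var glyphs = await textParser.GetGlyphsAsync();
    var words = TextParser.WordsFromGlyphs(glyphs);
    // The page's text, split into blocks.
    var textBlocks = await textParser.GetTextAsync();
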
TextParser also ensures that the text it extracts from a PDF follows any logical structure defined in the PDF (see section 14.7 of the PDF Specification), thereby enabling support for correctly handling different text-writing directions.
Glyph is the building block for all text extraction in PSPDFKit. It represents a single glyph on a PDF page. Its Rect property specifies, in PDF coordinates, where it is located on the page, and its Contents property returns the text it contains. The Index property specifies the index of the glyph on the page in reading order. Consider a page with the following text:

    The quick brown fox jumps over the lazy dog.
    --------------------------^


    var textParser = await doc.GetTextParserAsync(0);
    var glyphs = await textParser.GetGlyphsAsync();
    // The glyph marked with ^ (the "o" in "over") is the glyph at position 26.
    // Guaranteed to be `true`.
    var indexEqualTo26 = glyphs[26].Index == 26;

This makes getting a particular glyph (and glyphs near it) much faster, as you already know the index. Ordering glyphs correctly is important if, for example, you wish to combine multiple glyphs and display something to the user.
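
As a minimal sketch of why ordering matters, the standalone program below combines glyphs that arrive out of order into a readable string by sorting on their index. It does not call PSPDFKit: GlyphInfo and TextHelpers.CombineInReadingOrder are hypothetical names introduced for illustration, and GlyphInfo only mirrors the Index and Contents properties described above. It assumes a .NET 6 or later console project (C# 9 records and top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Glyphs in arbitrary order, for example after filtering or merging selections.
// The indices match the word "over" in the sample sentence above.
var unordered = new List<GlyphInfo>
{
    new(28, "e"), new(26, "o"), new(29, "r"), new(27, "v"),
};

Console.WriteLine(TextHelpers.CombineInReadingOrder(unordered)); // prints "over"

// Hypothetical stand-in for the two Glyph properties used here.
public record GlyphInfo(int Index, string Contents);

public static class TextHelpers
{
    // Sorts by Index (reading order), then concatenates each glyph's text.
    public static string CombineInReadingOrder(IEnumerable<GlyphInfo> glyphs) =>
        string.Concat(glyphs.OrderBy(g => g.Index).Select(g => g.Contents));
}
```

With real Glyph objects, the same OrderBy and Select calls should apply to their Index and Contents properties, although that has not been verified against the SDK here.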

    var textParser = await doc.GetTextParserAsync(0);
    var glyphs = await textParser.GetGlyphsAsync();
    var words = TextParser.WordsFromGlyphs(glyphs);

Word has a Frame that describes the area the word covers on the page. It also has a Range that describes the range within the provided glyphs that make up the word, and a Contents, which is the string of text that