Text Extraction

Extracting text from a PDF can be a complex task. In our library products, we have distilled text extraction into a single function that returns the information needed for most text processing tasks.

Text can be extracted on a per-page basis using the library page API. Each page’s text is split into individual text lines as TextBlock objects that contain the string and page coordinates for each line. The following code shows how you might concatenate all the text on the first page of a document and print it to the console:

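using System;
using System.Text;
// Plus the using directive for the library's own namespace.

// Open the document and read the text lines on the first page (index 0).
var document = new Document(new FileDataProvider("documentWithText.pdf"));
var textLines = document.GetPage(0).GetTextLines();

// Concatenate the text of every line, one line of output per text line.
var pageTextBuilder = new StringBuilder("The text on the first page reads:\n");
foreach (var textLine in textLines)
{
    pageTextBuilder.AppendLine(textLine.GetText());
}
Console.WriteLine(pageTextBuilder.ToString());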

Each text line also exposes its location and size on the page as a Rect:

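// Open the document and read the text lines on the first page (index 0).
var document = new Document(new FileDataProvider("documentWithText.pdf"));
var textLines = document.GetPage(0).GetTextLines();
// Bounding rectangle of the first line; this assumes the page has at least one text line.
var firstLineRect = textLines[0].GetRect();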
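
The text and the rect of a line can be used together, for example to find where a keyword appears on a page. The following is a minimal sketch rather than part of the library: TextLineSearch and FindLines are hypothetical names introduced here for illustration. The helper reads each line's text through a delegate, so it assumes nothing about the library's types beyond the GetText() call shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the library API.
public static class TextLineSearch
{
    // Returns every line whose text contains the keyword, ignoring case.
    // getText reads the string from a line, e.g. line => line.GetText().
    public static List<TLine> FindLines<TLine>(
        IEnumerable<TLine> lines,
        Func<TLine, string> getText,
        string keyword)
    {
        return lines
            .Where(line => (getText(line) ?? string.Empty)
                .IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}
```

Assuming GetTextLines() returns a generic collection, a call such as TextLineSearch.FindLines(textLines, line => line.GetText(), "Invoice") would return the matching lines, and calling GetRect() on each result would give its position on the page.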

See our API Reference for more details on GetTextLines.