OCR PDF in C#

This guide explains how to convert a PDF file to a searchable PDF. GdPicture.NET’s optical character recognition (OCR) engine recognizes the text in the source document and saves it in a separate PDF in which you can search, copy, and paste the text.

To convert a PDF file to a searchable PDF, follow the steps outlined below.

  1. Create a GdPicturePDF object.

  2. Load the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.

  3. Determine the number of pages with the GetPageCount method of the GdPicturePDF object.

  4. Loop through the pages of the source document.

  5. For each page, run the OCR process with the OcrPage method of the GdPicturePDF object. Configure the OCR process by passing the following parameters to the OcrPage method:

    1. Set the code of the language that GdPicture.NET uses to recognize text in the source document. To specify several languages, separate the language codes with the + character — for example, eng+fra.

    2. Set the path to the OCR resource folder. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    3. Set the character allowlist. When scanning the document, the OCR engine only recognizes the characters included in the allowlist. When you pass an empty string (""), all characters are recognized.

    4. Set the dots-per-inch (DPI) resolution the OCR engine uses. A value of 300 is recommended for the best combination of speed and accuracy.

  6. Save the result in a new PDF document.
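
The language parameter described in step 5 is a plain string, so it can be assembled from a list of codes. The sketch below is a minimal illustration in plain C#; the OcrArguments class and its JoinLanguages method are hypothetical helpers introduced here and are not part of GdPicture.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of GdPicture.NET.
static class OcrArguments
{
    // Joins language codes with the + character, skipping empty entries,
    // so that { "eng", "fra" } becomes "eng+fra".
    public static string JoinLanguages(IEnumerable<string> codes) =>
        string.Join("+", codes.Select(c => c.Trim()).Where(c => c.Length > 0));
}
```

The resulting string can then be passed as the first argument of OcrPage. Similarly, passing an allowlist such as "0123456789" as the third argument would restrict recognition to digits, while "" keeps all characters.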

The example below converts a PDF file to a searchable PDF:

using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Determine the number of pages.
int pageCount = gdpicturePDF.GetPageCount();
// Loop through the pages of the source document.
for (int i = 1; i <= pageCount; i++)
{
    // Select a page and run the OCR process on it.
    gdpicturePDF.SelectPage(i);
    gdpicturePDF.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
}
// Save the result in a new PDF document.
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
gdpicturePDF.CloseDocument();

The same example in VB.NET:

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Determine the number of pages.
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    ' Loop through the pages of the source document.
    For i = 1 To pageCount
        ' Select a page and run the OCR process on it.
        gdpicturePDF.SelectPage(i)
        gdpicturePDF.OcrPage("eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300)
    Next
    ' Save the result in a new PDF document.
    gdpicturePDF.SaveToFile("C:\temp\output.pdf")
    gdpicturePDF.CloseDocument()
End Using
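
The examples above ignore the values returned by each call. The sketch below shows one way the same steps could be written with status checks. It assumes that the GdPicture14 namespace is used and that LoadFromFile, SelectPage, OcrPage, and SaveToFile each return a GdPictureStatus value, with GdPictureStatus.OK indicating success; verify these details against the API reference for your version. The SearchablePdf class and its Convert method are hypothetical names introduced for illustration.

```csharp
using System;
using GdPicture14; // Assumed namespace for GdPicture.NET 14.

// Hypothetical wrapper, not part of GdPicture.NET.
static class SearchablePdf
{
    // Runs OCR on every page and saves the result.
    // Returns false as soon as any step reports a status other than OK.
    public static bool Convert(string sourcePath, string outputPath, string ocrResourcePath)
    {
        using GdPicturePDF pdf = new GdPicturePDF();

        // Load the source document and stop if loading fails.
        GdPictureStatus status = pdf.LoadFromFile(sourcePath);
        if (status != GdPictureStatus.OK)
        {
            Console.WriteLine($"Loading failed: {status}");
            return false;
        }

        int pageCount = pdf.GetPageCount();
        for (int i = 1; i <= pageCount; i++)
        {
            // Select the page, then run OCR on it only if the selection succeeded.
            status = pdf.SelectPage(i);
            if (status == GdPictureStatus.OK)
            {
                status = pdf.OcrPage("eng", ocrResourcePath, "", 300);
            }
            if (status != GdPictureStatus.OK)
            {
                Console.WriteLine($"OCR failed on page {i}: {status}");
                pdf.CloseDocument();
                return false;
            }
        }

        // Save the result in a new PDF document and release the source.
        status = pdf.SaveToFile(outputPath);
        pdf.CloseDocument();
        if (status != GdPictureStatus.OK)
        {
            Console.WriteLine($"Saving failed: {status}");
            return false;
        }
        return true;
    }
}
```

A call such as SearchablePdf.Convert(@"C:\temp\source.pdf", @"C:\temp\output.pdf", @"C:\GdPicture.NET 14\Redist\OCR") would then mirror the example above while reporting the first step that fails.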