C# OCR Scanning to Searchable PDFs

This guide explains how to scan a physical document with a scanner and then save the scanned image in a searchable PDF. GdPicture.NET’s optical character recognition (OCR) engine allows you to recognize text in an image and then save the text in a PDF. This guide uses the TWAIN protocol.

Information

Printing and scanning aren’t supported in the cross-platform .NET 6.0 assembly. For more information, see the system compatibility guide.

To get an image from a scanner and then save it in a searchable PDF, follow these steps.

  1. Create a GdPictureImaging object and a GdPicturePDF object.

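  2. Store the window handle that the TWAIN methods require in a variable. This example uses the IntPtr.Zero field, which represents a null handle.

  3. Select and open the scanner. Pass the handle to the TwainSelectSource method of the GdPictureImaging object to display the source selection dialog, and then pass it to the TwainOpenDefaultSource method to open the selected source.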

  4. Optional: Hide the scanning user interface with the TwainSetHideUI method of the GdPictureImaging object. Use this setting when your application cannot communicate with the scanner.

  5. Create a new PDF document with the NewPDF method of the GdPicturePDF object. The parameter of this method sets the conformance level of the PDF document and is a member of the PdfConformance enumeration. For example, use PdfConformance.PDF to create a common PDF document.

  6. Get the image from the scanner by passing the handle to the TwainAcquireToGdPictureImage method of the GdPictureImaging object.

  7. Add the scanned image to a new page in the destination document with the AddImageFromGdPictureImage method of the GdPicturePDF object.

  8. Run the OCR process with the OcrPage method of the GdPicturePDF object. Pass the following parameters:

    1. Set the code of the language that GdPicture.NET uses to recognize text in the source document. To specify several languages, separate the language codes with the + character. For example, eng+fra.

    2. Set the path to the OCR resource folder. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    3. Set the character allowlist. During recognition, the OCR engine only recognizes the characters included in the allowlist. If you pass an empty string (""), all characters are recognized.

    4. Set the dots-per-inch (DPI) resolution the OCR engine uses. A value of 300 is recommended for the best combination of speed and accuracy.

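  9. Save the result in a PDF document with the SaveToFile method of the GdPicturePDF object.

  10. Release the scanned image with the ReleaseGdPictureImage method and close the TWAIN source with the TwainCloseSource method of the GdPictureImaging object.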
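
The four OCR parameters from step 8 can be prepared before the call. The following is a minimal sketch, not part of GdPicture.NET: OcrArguments and JoinLanguages are hypothetical names introduced here for illustration, and the digits-only allowlist is just an example value. It only builds the values; the commented line shows how they would be passed to OcrPage in the same order as in this guide's example.

```csharp
using System;
using System.Linq;

// Hypothetical helper (an assumption for illustration, not a GdPicture.NET type).
public static class OcrArguments
{
    // Joins language codes with the + character, for example "eng+fra".
    public static string JoinLanguages(params string[] codes)
    {
        if (codes == null || codes.Length == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));
        return string.Join("+", codes.Select(code => code.Trim()));
    }
}

public static class Program
{
    public static void Main()
    {
        // 1. Language codes, separated with +.
        string languages = OcrArguments.JoinLanguages("eng", "fra");
        // 2. Path to the OCR resource folder (the default location named in this guide).
        string resourcePath = @"C:\GdPicture.NET 14\Redist\OCR";
        // 3. Character allowlist: digits only here; "" would allow all characters.
        string allowlist = "0123456789";
        // 4. Resolution used by the OCR engine.
        float dpi = 300f;

        Console.WriteLine($"{languages} | {resourcePath} | {allowlist} | {dpi}");

        // With the gdpicturePDF object from this guide, the call would be:
        // gdpicturePDF.OcrPage(languages, resourcePath, allowlist, dpi);
    }
}
```

A restricted allowlist such as the one above can be useful for documents that contain only known characters, such as numeric forms. For general text, pass "".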

The example below gets an image from a scanner and then saves it in a searchable PDF:

using System;
using GdPicture14;

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Store the window handle that the TWAIN methods require (IntPtr.Zero is a null handle).
IntPtr WINDOW_HANDLE = IntPtr.Zero;
// Select the scanner.
gdpictureImaging.TwainSelectSource(WINDOW_HANDLE);
gdpictureImaging.TwainOpenDefaultSource(WINDOW_HANDLE);
// (Optional) Hide the scanning user interface.
gdpictureImaging.TwainSetHideUI(true);
// Create the destination PDF document.
gdpicturePDF.NewPDF(PdfConformance.PDF);
// Get the image from the scanner.
int imageID = gdpictureImaging.TwainAcquireToGdPictureImage(WINDOW_HANDLE);
// Add the scanned image to a new page in the destination document.
gdpicturePDF.AddImageFromGdPictureImage(imageID, false, true);
// Run the OCR process.
gdpicturePDF.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
// Save the result in a PDF document.
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageID);
gdpictureImaging.TwainCloseSource();
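
The same example in VB.NET:

' Requires "Imports GdPicture14" at the top of the file.
Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Store the window handle that the TWAIN methods require (IntPtr.Zero is a null handle).
    Dim WINDOW_HANDLE As IntPtr = IntPtr.Zero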
    ' Select the scanner.
    gdpictureImaging.TwainSelectSource(WINDOW_HANDLE)
    gdpictureImaging.TwainOpenDefaultSource(WINDOW_HANDLE)
    ' (Optional) Hide the scanning user interface.
    gdpictureImaging.TwainSetHideUI(True)
    ' Create the destination PDF document.
    gdpicturePDF.NewPDF(PdfConformance.PDF)
    ' Get the image from the scanner.
    Dim imageID As Integer = gdpictureImaging.TwainAcquireToGdPictureImage(WINDOW_HANDLE)
    ' Add the scanned image to a new page in the destination document.
    gdpicturePDF.AddImageFromGdPictureImage(imageID, False, True)
    ' Run the OCR process.
    gdpicturePDF.OcrPage("eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300)
    ' Save the result in a PDF document.
    gdpicturePDF.SaveToFile("C:\temp\output.pdf")
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageID)
    gdpictureImaging.TwainCloseSource()
End Using
End Using
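
The examples above don't check whether each step succeeded. The following is a hedged sketch of the same workflow with basic checks added. It assumes that TwainOpenDefaultSource returns a Boolean value, that TwainAcquireToGdPictureImage returns 0 when no image is acquired, and that NewPDF, OcrPage, SaveToFile, and GetStat return a GdPictureStatus value. Verify these assumptions against the API reference for your version. ScanToSearchablePdf and Run are hypothetical names.

```csharp
using System;
using GdPicture14;

public static class ScanToSearchablePdf
{
    // Returns true when the scanned page was saved as a searchable PDF.
    public static bool Run(string ocrResourcePath, string outputPath)
    {
        using GdPictureImaging imaging = new GdPictureImaging();
        using GdPicturePDF pdf = new GdPicturePDF();
        IntPtr windowHandle = IntPtr.Zero;

        // Assumption: the method returns false when the source cannot be opened.
        if (!imaging.TwainOpenDefaultSource(windowHandle))
        {
            Console.Error.WriteLine("Cannot open the default TWAIN source.");
            return false;
        }

        int imageId = 0;
        try
        {
            if (pdf.NewPDF(PdfConformance.PDF) != GdPictureStatus.OK)
                return false;

            // Assumption: an identifier of 0 means that no image was acquired.
            imageId = imaging.TwainAcquireToGdPictureImage(windowHandle);
            if (imageId == 0)
            {
                Console.Error.WriteLine("No image was acquired from the scanner.");
                return false;
            }

            pdf.AddImageFromGdPictureImage(imageId, false, true);
            if (pdf.GetStat() != GdPictureStatus.OK)
                return false;

            GdPictureStatus status = pdf.OcrPage("eng", ocrResourcePath, "", 300);
            if (status != GdPictureStatus.OK)
            {
                Console.Error.WriteLine($"OCR failed: {status}");
                return false;
            }

            status = pdf.SaveToFile(outputPath);
            return status == GdPictureStatus.OK;
        }
        finally
        {
            // Release the image and close the source even if a step failed.
            if (imageId != 0)
                imaging.ReleaseGdPictureImage(imageId);
            imaging.TwainCloseSource();
        }
    }
}
```

A call such as ScanToSearchablePdf.Run(@"C:\GdPicture.NET 14\Redist\OCR", @"C:\temp\output.pdf") would reproduce the example above. The finally block ensures the cleanup from the last step runs on every exit path.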