GdPicture.NET.14
OcrPages(String,Int32,String,String,String,Single,OCRMode,Int32,Boolean) Method
Runs the optical character recognition (OCR) on the specified page range of the loaded PDF document using a defined number of threads. You can also set other parameters according to your preferences. The recognized text is added as invisible text on each processed page. The page orientation is automatically detected for each page as well.

This method rasterizes each processed page, so any existing visible text becomes part of the page image before the OCR process starts. Existing invisible text is not kept either: rasterization removes it from the processed pages before the OCR process starts.

This method runs asynchronously, so you have to wait for the OCR process to finish before manipulating the document further. Several OCR-related events, such as BeforePageOcr, OcrPagesProgress and OcrPagesDone, let you follow the process.

Syntax
'Declaration
 
Public Overloads Function OcrPages( _
   ByVal PageRange As String, _
   ByVal ThreadCount As Integer, _
   ByVal Dictionary As String, _
   ByVal DictionaryPath As String, _
   ByVal CharWhiteList As String, _
   ByVal DPI As Single, _
   ByVal OcrMode As OCRMode, _
   ByVal TimeoutMillisec As Integer, _
   ByVal Sync As Boolean _
) As GdPictureStatus
public GdPictureStatus OcrPages( 
   string PageRange,
   int ThreadCount,
   string Dictionary,
   string DictionaryPath,
   string CharWhiteList,
   float DPI,
   OCRMode OcrMode,
   int TimeoutMillisec,
   bool Sync
)
public function OcrPages( 
    PageRange: String;
    ThreadCount: Integer;
    Dictionary: String;
    DictionaryPath: String;
    CharWhiteList: String;
    DPI: Single;
    OcrMode: OCRMode;
    TimeoutMillisec: Integer;
    Sync: Boolean
): GdPictureStatus; 
public function OcrPages( 
   PageRange : String,
   ThreadCount : int,
   Dictionary : String,
   DictionaryPath : String,
   CharWhiteList : String,
   DPI : float,
   OcrMode : OCRMode,
   TimeoutMillisec : int,
   Sync : boolean
) : GdPictureStatus;
public: GdPictureStatus OcrPages( 
   string* PageRange,
   int ThreadCount,
   string* Dictionary,
   string* DictionaryPath,
   string* CharWhiteList,
   float DPI,
   OCRMode OcrMode,
   int TimeoutMillisec,
   bool Sync
) 
public:
GdPictureStatus OcrPages( 
   String^ PageRange,
   int ThreadCount,
   String^ Dictionary,
   String^ DictionaryPath,
   String^ CharWhiteList,
   float DPI,
   OCRMode OcrMode,
   int TimeoutMillisec,
   bool Sync
) 

Parameters

PageRange
The page range to be processed, for example, "1;4;5" to process pages 1, 4 and 5 or "1-5;10" to process pages from 1 to 5 and page 10. Set this parameter to "*" to process all pages of the current document.
ThreadCount
The number of threads to use, asynchronously. Set this parameter to 0 to let the engine choose the thread count automatically for the best performance.
Dictionary
The prefix of the dictionary file to use, for example, "spa" for Spanish, "eng" for English, "fra" for French, etc.

The name of each dictionary file follows the predefined format [LANGUAGE].traineddata, where [LANGUAGE] identifies the language. You can usually find these files within your standard installation in the directory @\GdPicture.Net 14\Redist\OCR. Additional language dictionary files are available for download.

You can also combine multiple dictionaries with the "+" separator, for instance English with French is "eng+fra".

DictionaryPath
The path to the directory containing the dictionary files the OCR engine will use. This is usually a directory within your standard installation, such as @\GdPicture.Net 14\Redist\OCR, but you can specify your own path as well.
CharWhiteList
The so-called whitelist of characters, that is, the set of characters to which recognition is restricted. The engine returns only the specified characters when processing. For example, to recognize only numeric characters, set this parameter to "0123456789". To recognize only uppercase letters, set it to "ABCDEFGHIJKLMNOPQRSTUVWXYZ". Set this parameter to the empty string to recognize all characters.
DPI
The resolution, in DPI, that the OCR engine will use. The recommended default is 300.

A value between 200 and 300 should give optimal results on A4-sized documents. Values over 300 generally cause excessive memory usage.

OcrMode
The mode to be used during processing. You can choose between speed and accuracy.
TimeoutMillisec
The timeout, in milliseconds: the maximum time allowed for the whole OCR process before it is automatically interrupted. Use 0 to specify no timeout.
Sync
In a multithreaded context, this parameter specifies whether the method must terminate only when all threads are done.
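
The PageRange syntax described above can be illustrated with a short sketch. The code below is not part of GdPicture: the class name PageRangeSketch and the method Expand are hypothetical, and the validation rules (rejecting items outside 1..pageCount) are an assumption made for the example, not documented behavior of OcrPages. It only shows how a string such as "1-5;10" or "*" maps to page numbers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the GdPicture API.
public static class PageRangeSketch
{
    // Expands a range string such as "1;4;5" or "1-5;10" into page numbers.
    // "*" expands to every page from 1 to pageCount.
    public static List<int> Expand(string pageRange, int pageCount)
    {
        var pages = new List<int>();
        string trimmed = pageRange.Trim();

        if (trimmed == "*")
        {
            for (int p = 1; p <= pageCount; p++) pages.Add(p);
            return pages;
        }

        foreach (string part in trimmed.Split(';'))
        {
            string item = part.Trim();
            if (item.Length == 0) continue; // tolerate a trailing ';'

            int dash = item.IndexOf('-');
            int first, last;
            if (dash < 0)
            {
                // Single page, for example "10".
                first = last = int.Parse(item);
            }
            else
            {
                // Inclusive interval, for example "1-5".
                first = int.Parse(item.Substring(0, dash));
                last = int.Parse(item.Substring(dash + 1));
            }

            // Assumed validation: pages are 1-based and must exist.
            if (first < 1 || last > pageCount || first > last)
                throw new ArgumentException("Invalid page range item: " + item);

            for (int p = first; p <= last; p++) pages.Add(p);
        }
        return pages;
    }
}
```

For a 12-page document, Expand("1-5;10", 12) yields pages 1, 2, 3, 4, 5 and 10, matching the example given for the PageRange parameter.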

Return Value

A member of the GdPictureStatus enumeration. If the method completes successfully, the return value is GdPictureStatus.OK.

We strongly recommend always checking this status first.

Remarks
This method is only allowed for use with non-encrypted documents. At the same time, be aware that this method runs asynchronously.

This method uses the GdPicture OCR engine.

This method requires the OCR component to run.
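
Because the Dictionary parameter names files of the form [LANGUAGE].traineddata located in DictionaryPath, it can be useful to verify that those files exist before starting a long OCR run. The following is a minimal sketch under that assumption. The class DictionaryCheckSketch and its methods are hypothetical helpers and not part of GdPicture, and the sketch does not reproduce how the OCR engine itself resolves or reports missing dictionaries.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the GdPicture API.
public static class DictionaryCheckSketch
{
    // Returns the [LANGUAGE].traineddata file names implied by a
    // Dictionary value such as "eng" or "eng+fra".
    public static List<string> ExpectedFiles(string dictionary)
    {
        var files = new List<string>();
        foreach (string lang in dictionary.Split('+'))
        {
            string prefix = lang.Trim();
            if (prefix.Length > 0) files.Add(prefix + ".traineddata");
        }
        return files;
    }

    // Returns the expected files that are not present in dictionaryPath.
    public static List<string> MissingFiles(string dictionary, string dictionaryPath)
    {
        var missing = new List<string>();
        foreach (string file in ExpectedFiles(dictionary))
        {
            if (!File.Exists(Path.Combine(dictionaryPath, file)))
                missing.Add(file);
        }
        return missing;
    }
}
```

A caller could run MissingFiles("eng+fra", path) before OcrPages and report any missing file names to the user, while still checking the GdPictureStatus value returned by OcrPages itself.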

Example
How to process OCR on your scanned document using different OCR modes.
Dim caption As String = "OcrPages"
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    'Expecting that the input pdf document includes scanned pages.
    If gdpicturePDF.LoadFromFile("test.pdf", False) = GdPictureStatus.OK Then
        If gdpicturePDF.OcrPages("*", 0, "eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300, OCRMode.FavorAccuracy, 30000, True) = GdPictureStatus.OK Then
            'All threads are done.
            If gdpicturePDF.SaveToFile("test_accuracy.pdf") = GdPictureStatus.OK Then
                MessageBox.Show("Done!", caption)
            Else
                MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), caption)
            End If
        Else
            MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption)
        End If
        gdpicturePDF.CloseDocument()
    Else
        MessageBox.Show("The file can't be loaded. Status: " + gdpicturePDF.GetStat().ToString(), caption)
    End If
            
    If gdpicturePDF.LoadFromFile("test.pdf", False) = GdPictureStatus.OK Then
        If gdpicturePDF.OcrPages("*", 0, "eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300, OCRMode.FavorSpeed, 30000, True) = GdPictureStatus.OK Then
            'All threads are done.
            If gdpicturePDF.SaveToFile("test_speed.pdf") = GdPictureStatus.OK Then
                MessageBox.Show("Done!", caption)
            Else
                MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), caption)
            End If
        Else
            MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption)
        End If
        gdpicturePDF.CloseDocument()
    Else
        MessageBox.Show("The file can't be loaded. Status: " + gdpicturePDF.GetStat().ToString(), caption)
    End If
End Using
string caption = "OcrPages";
using (GdPicturePDF gdpicturePDF = new GdPicturePDF())
{
    //Expecting that the input pdf document includes scanned pages.
    if (gdpicturePDF.LoadFromFile("test.pdf", false) == GdPictureStatus.OK)
    {
        if (gdpicturePDF.OcrPages("*", 0, "eng", "C:\\GdPicture.NET 14\\Redist\\OCR", "", 300, OCRMode.FavorAccuracy, 30000, true) == GdPictureStatus.OK)
        {
            //All threads are done.
            if (gdpicturePDF.SaveToFile("test_accuracy.pdf") == GdPictureStatus.OK)
                MessageBox.Show("Done!", caption);
            else
                MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
        else
            MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        gdpicturePDF.CloseDocument();
    }
    else
        MessageBox.Show("The file can't be loaded. Status: " + gdpicturePDF.GetStat().ToString(), caption);
            
    if (gdpicturePDF.LoadFromFile("test.pdf", false) == GdPictureStatus.OK)
    {
        if (gdpicturePDF.OcrPages("*", 0, "eng", "C:\\GdPicture.NET 14\\Redist\\OCR", "", 300, OCRMode.FavorSpeed, 30000, true) == GdPictureStatus.OK)
        {
            //All threads are done.
            if (gdpicturePDF.SaveToFile("test_speed.pdf") == GdPictureStatus.OK)
                MessageBox.Show("Done!", caption);
            else
                MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
        else
            MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        gdpicturePDF.CloseDocument();
    }
    else
        MessageBox.Show("The file can't be loaded. Status: " + gdpicturePDF.GetStat().ToString(), caption);
}