Extract Tables from PDFs and Images Using C#

GdPicture.NET’s table extraction engine is a native SDK that recognizes tables in an unstructured document or image, parses their contents, and exports them to an external destination such as a spreadsheet. It detects and extracts bordered, semi-bordered, and borderless tables in images, scanned PDFs, and born-digital PDFs. Because it is a native SDK, it can be deployed on-premises or embedded in your application, and it works offline, without internet access.

Extracting Table Data from a PDF to an Excel Spreadsheet

To read and extract table data from a PDF document to an Excel spreadsheet, follow these steps:

  1. Create a GdPictureOCR object and a GdPicturePDF object.

  2. Select the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.

  3. Select the page from which to extract the table data with the SelectPage method of the GdPicturePDF object.

  4. Render the selected page to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method of the GdPicturePDF object.

  5. Pass the image to the GdPictureOCR object with the SetImage method.

  6. Configure the table extraction process with the GdPictureOCR object in the following way:

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

    For more optional configuration parameters, see the GdPictureOCR class.

  7. Run the table extraction process with the RunOCR method of the GdPictureOCR object, and save the result ID in a list.

  8. Create a GdPictureOCR.SpreadsheetOptions object and configure the output spreadsheet. By default, tables from the same OCR result are saved in the same sheet. To save each table in a different sheet, set the SeparateTables property of the GdPictureOCR.SpreadsheetOptions object to true. For more optional configuration parameters, see the GdPictureOCR.SpreadsheetOptions class.

  9. Save the output in an Excel spreadsheet with the SaveAsXLSX method of the GdPictureOCR object. This method takes the following parameters:

    • The list containing the OCR result ID.

    • The path to the output file.

    • The GdPictureOCR.SpreadsheetOptions object.

  10. Release the resources that are no longer needed.

The example below extracts table data from the first page of a document and saves the output in an Excel spreadsheet:

using System.Collections.Generic;
using GdPicture14;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the table extraction process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the table extraction process and save the result ID in a list.
string result = gdpictureOCR.RunOCR();
List<string> resultsList = new List<string>() { result };
// Configure the output spreadsheet.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
    {
        SeparateTables = true
    };
// Save the output in an Excel spreadsheet.
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();

The same example in VB.NET:

Imports System.Collections.Generic
Imports GdPicture14
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the table extraction process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the table extraction process and save the result ID in a list.
    Dim result As String = gdpictureOCR.RunOCR()
    Dim resultsList As List(Of String) = New List(Of String)()
    resultsList.Add(result)
    ' Configure the output spreadsheet.
    Dim spreadsheetOptions As GdPictureOCR.SpreadsheetOptions = New GdPictureOCR.SpreadsheetOptions() With {
        .SeparateTables = True
    }
    ' Save the output in an Excel spreadsheet.
    gdpictureOCR.SaveAsXLSX(resultsList, "C:\temp\output.xlsx", spreadsheetOptions)
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
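
Because the SaveAsXLSX method takes a list of OCR result IDs, the same steps can plausibly be repeated for every page of a document before saving once. The following is a minimal sketch of that idea, not a verified implementation. It assumes that the GetPageCount method of the GdPicturePDF object returns the number of pages, that LoadFromFile returns a GdPictureStatus value, and that earlier OCR results stay available until ReleaseOCRResults is called. The images are disposed of only after saving, which mirrors the order used in the single-page example.

```csharp
using System;
using System.Collections.Generic;
using GdPicture14;

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Assumption: LoadFromFile returns a GdPictureStatus value.
if (gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf") != GdPictureStatus.OK)
{
    Console.WriteLine("The source document could not be loaded.");
    return;
}
// Configure the table extraction process once for all pages.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
List<string> resultsList = new List<string>();
List<int> imageIds = new List<int>();
// Assumption: GetPageCount returns the number of pages; pages are numbered from 1.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    // Render the current page to a 300 DPI image and run the extraction on it.
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    imageIds.Add(imageId);
    gdpictureOCR.SetImage(imageId);
    resultsList.Add(gdpictureOCR.RunOCR());
}
// Save the tables from all collected OCR results in one Excel spreadsheet.
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", new GdPictureOCR.SpreadsheetOptions());
// Release the OCR results and the rendered images only after saving.
gdpictureOCR.ReleaseOCRResults();
foreach (int imageId in imageIds)
{
    GdPictureDocumentUtilities.DisposeImage(imageId);
}
gdpicturePDF.CloseDocument();
```

Keeping every rendered page in memory until the end is the conservative choice here. For long documents, it may be worth checking whether an image can be disposed of right after its RunOCR call.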

Extracting Table Data from a PDF to JSON Format

To read and extract table data from a PDF document to JSON format, follow these steps:

  1. Import the GdPicture14 and the Newtonsoft.Json.Linq namespaces.

  2. Create a GdPictureOCR object and a GdPicturePDF object.

  3. Select the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.

  4. Select the page from which to extract the table data with the SelectPage method of the GdPicturePDF object.

  5. Render the selected page to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method of the GdPicturePDF object.

  6. Pass the image to the GdPictureOCR object with the SetImage method.

  7. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

  8. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  9. Get the number of tables detected during the OCR process with the GetTableCount method of the GdPictureOCR object.

  10. Create the JSON object that contains the tables on the page and loop through the tables.

  11. For each table, get the number of columns and rows with the GetTableColumnCount and GetTableRowCount methods of the GdPictureOCR object.

  12. Create the JSON object that contains the rows in the table and loop through the rows.

  13. Create the JSON object that contains the cells in the row and loop through the cells.

  14. Get the detected value for each cell with the GetTableCellText method of the GdPictureOCR object and save it in the JSON object.

  15. Print the tables to the console in JSON format.

  16. Release the resources that are no longer needed.

The example below extracts table data from the first page of a document and prints the output to the console in JSON format:

using GdPicture14;
using Newtonsoft.Json.Linq;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
// Create the JSON object that contains the tables on the page and loop through the tables.
int tableCount = gdpictureOCR.GetTableCount(ocrResultId);
dynamic[] tables = new JObject[tableCount];
for (int tableIndex = 0; tableIndex < tableCount; tableIndex++)
{
    int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
    int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
    // Create the JSON object that contains the rows in the table and loop through the rows.
    dynamic[] rows = new JObject[rowCount];
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        // Create the JSON object that contains the cells in the row and loop through the cells.
        dynamic[] cells = new JObject[columnCount];
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            cells[columnIndex] = new JObject();
            cells[columnIndex].RowIndex = rowIndex;
            cells[columnIndex].ColumnIndex = columnIndex;
            // Read the content of the cell and save it in the JSON object.
            cells[columnIndex].Text = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex);
        }
        rows[rowIndex] = new JObject();
        rows[rowIndex].Cells = new JArray(cells);
    }
    tables[tableIndex] = new JObject();
    tables[tableIndex].Rows = new JArray(rows);
}
dynamic tablesOnPage = new JObject();
tablesOnPage.Tables = new JArray(tables);
// Print the tables to the console in JSON format.
Console.WriteLine(tablesOnPage.ToString());
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();

The same example in VB.NET:

Imports GdPicture14
Imports Newtonsoft.Json.Linq
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    ' Create the JSON object that contains the tables on the page and loop through the tables.
    Dim tableCount As Integer = gdpictureOCR.GetTableCount(ocrResultId)
    Dim tables As Object() = New JObject(tableCount - 1) {}
    For tableIndex = 0 To tableCount - 1
        Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
        Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
        ' Create the JSON object that contains the rows in the table and loop through the rows.
        Dim rows As Object() = New JObject(rowCount - 1) {}
        For rowIndex = 0 To rowCount - 1
            ' Create the JSON object that contains the cells in the row and loop through the cells.
            Dim cells As Object() = New JObject(columnCount - 1) {}
            For columnIndex = 0 To columnCount - 1
                cells(columnIndex) = New JObject()
                cells(columnIndex).RowIndex = rowIndex
                cells(columnIndex).ColumnIndex = columnIndex
                ' Read the content of the cell and save it in the JSON object.
                cells(columnIndex).Text = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex)
            Next
            rows(rowIndex) = New JObject()
            rows(rowIndex).Cells = New JArray(cells)
        Next
        tables(tableIndex) = New JObject()
        tables(tableIndex).Rows = New JArray(rows)
    Next
    Dim tablesOnPage As Object = New JObject()
    tablesOnPage.Tables = New JArray(tables)
    ' Print the tables to the console in JSON format.
    Console.WriteLine(tablesOnPage.ToString())
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
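
The C# example above builds the JSON with dynamic objects from Newtonsoft.Json. As one possible alternative, the sketch below builds the same Tables, Rows, and Cells structure from typed records and serializes it with System.Text.Json. The Cell, Row, Table, and TableJson names and the readCell delegate are hypothetical and not part of GdPicture.NET. The delegate stands in for a call to the GetTableCellText method and takes the column index before the row index, as that method does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical record types that mirror the property names used in the JSON example.
public record Cell(int RowIndex, int ColumnIndex, string Text);
public record Row(List<Cell> Cells);
public record Table(List<Row> Rows);

public static class TableJson
{
    // readCell(columnIndex, rowIndex) stands in for a call such as
    // gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).
    public static Table BuildTable(int rowCount, int columnCount, Func<int, int, string> readCell)
    {
        var rows = new List<Row>();
        for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
        {
            var cells = new List<Cell>();
            for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
            {
                cells.Add(new Cell(rowIndex, columnIndex, readCell(columnIndex, rowIndex)));
            }
            rows.Add(new Row(cells));
        }
        return new Table(rows);
    }

    // Wraps the tables in an object with a single Tables property and indents the output.
    public static string Serialize(IEnumerable<Table> tables) =>
        JsonSerializer.Serialize(new { Tables = tables }, new JsonSerializerOptions { WriteIndented = true });
}
```

In the loop over the detected tables, each table could be built with `TableJson.BuildTable(rowCount, columnCount, (column, row) => gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, column, row))` and the collected tables passed to `TableJson.Serialize`.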

Extracting Table Data from a PDF to Markdown Format

To read and extract table data from a PDF document and print it to the console in Markdown syntax, follow these steps:

  1. Create a GdPictureOCR object and a GdPicturePDF object.

  2. Select the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.

  3. Select the page from which to extract the table data with the SelectPage method of the GdPicturePDF object.

  4. Render the selected page to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method of the GdPicturePDF object.

  5. Pass the image to the GdPictureOCR object with the SetImage method.

  6. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

  7. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  8. Get the number of tables detected during the OCR process with the GetTableCount method of the GdPictureOCR object, and loop through them.

  9. For each table, get the number of columns and rows with the GetTableColumnCount and GetTableRowCount methods of the GdPictureOCR object, and loop through them.

  10. Get the detected value for each cell with the GetTableCellText method of the GdPictureOCR object, and print it to the console.

  11. Release the resources that are no longer needed.

The example below extracts table data from the first page of a document and prints the output to the console in Markdown syntax:

using System;
using GdPicture14;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
for (int tableIndex = 0; tableIndex < gdpictureOCR.GetTableCount(ocrResultId); tableIndex++)
{
    int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
    int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);

    // Print the table to the console.
    Console.Write($"\nTable {tableIndex}");
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        Console.Write("\n| ");
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            string cellContent = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "");
            Console.Write($" {cellContent} |");
        }
    }
    Console.WriteLine("");
}
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();

The same example in VB.NET:

Imports System
Imports GdPicture14
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    For tableIndex As Integer = 0 To gdpictureOCR.GetTableCount(ocrResultId) - 1
        Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
        Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
        ' Print the table to the console.
        Console.Write(vbLf & $"Table {tableIndex}")
        For rowIndex = 0 To rowCount - 1
            Console.Write(vbLf & "| ")
            For columnIndex = 0 To columnCount - 1
                Dim cellContent As String = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "")
                Console.Write($" {cellContent} |")
            Next
        Next
        Console.WriteLine("")
    Next
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
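
The examples above remove only Environment.NewLine from each cell. On Windows that is a carriage return followed by a line feed, so a bare line feed would stay in the text, and a literal pipe character would split the cell in two when the row is rendered. Whether the recognized text contains such characters depends on the source document, so the following is only a sketch of a more defensive cleanup. The MarkdownCell class and its Clean method are hypothetical helpers, not part of GdPicture.NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkdownCell
{
    // Collapses every run of whitespace (including \r\n and a bare \n) into one space,
    // trims the result, and escapes pipe characters so they cannot end the cell early.
    public static string Clean(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }
        string singleLine = Regex.Replace(text, @"\s+", " ").Trim();
        return singleLine.Replace("|", @"\|");
    }
}
```

With this helper, the cell text in the inner loop could be passed through `MarkdownCell.Clean` in place of the `Replace(Environment.NewLine, "")` call.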

Extracting Table Data from an Image

To read and extract table data from an image, follow these steps:

  1. Create a GdPictureOCR object and a GdPictureImaging object.

  2. Select the image of the table by passing its path to the CreateGdPictureImageFromFile method of the GdPictureImaging object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the image of the table with the SetImage method.

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

  4. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  5. Get the number of tables detected during the OCR process with the GetTableCount method of the GdPictureOCR object, and loop through them.

  6. For each table, get the number of columns and rows with the GetTableColumnCount and GetTableRowCount methods of the GdPictureOCR object, and loop through them.

  7. Get the detected value for each cell with the GetTableCellText method of the GdPictureOCR object, and print it to the console.

  8. Release the resources that are no longer needed.

The example below extracts data from the following table and prints the output to the console in Markdown syntax.

Sample table

Download the sample table and run the code below, or check out our demo.

using System;
using GdPicture14;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
// Load the source image.
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
gdpictureOCR.SetImage(imageId);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
for (int tableIndex = 0; tableIndex < gdpictureOCR.GetTableCount(ocrResultId); tableIndex++)
{
    int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
    int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);

    // Print the table to the console.
    Console.Write($"\nTable {tableIndex}");
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        Console.Write("\n| ");
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            string cellContent = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "");
            Console.Write($" {cellContent} |");
        }
    }
    Console.WriteLine("");
}
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);
gdpictureOCR.ReleaseOCRResults();

The same example in VB.NET:

Imports System
Imports GdPicture14
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    ' Load the source image.
    Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    gdpictureOCR.SetImage(imageId)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    For tableIndex As Integer = 0 To gdpictureOCR.GetTableCount(ocrResultId) - 1
        Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
        Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
        ' Print the table to the console.
        Console.Write(vbLf & $"Table {tableIndex}")
        For rowIndex = 0 To rowCount - 1
            Console.Write(vbLf & "| ")
            For columnIndex = 0 To columnCount - 1
                Dim cellContent As String = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "")
                Console.Write($" {cellContent} |")
            Next
        Next
        Console.WriteLine("")
    Next
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
    gdpictureOCR.ReleaseOCRResults()
End Using
End Using

Format the output to obtain the following table:

| No. | Museum Name | Location | Visits in 2021 | Change Since 2020 |
| --- | --- | --- | --- | --- |
| 1. | Louvre | France, Paris | 2,825,000 | +5 |
| 2. | Russian Museum | Russia, Saint Petersburg | 2,260,231 | +88% |
| 3. | Multimedia Art Museum | Russia, Moscow | 2,242,405 | +421% |
| 4. | Metropolitan Museum of Art | United States, New York | 1,958,000 | +84% |
| 5. | National Gallery of Art | United States, Washington, D.C. | 1,704,606 | +133% |
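
The console output of the example has no separator row after the header, which Markdown renderers that support pipe tables generally require before they draw a table. One way to format the output is sketched below. It assumes that the first extracted row is the header row. The MarkdownTable class and its Format method are hypothetical helpers, not part of GdPicture.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MarkdownTable
{
    // Treats the first row as the header, adds a separator row after it,
    // and joins all rows with line feeds.
    public static string Format(IReadOnlyList<string[]> rows)
    {
        if (rows.Count == 0)
        {
            return string.Empty;
        }
        var lines = new List<string>
        {
            ToRow(rows[0]),
            ToRow(rows[0].Select(_ => "---")),
        };
        lines.AddRange(rows.Skip(1).Select(row => ToRow(row)));
        return string.Join("\n", lines);
    }

    // Surrounds the cells of one row with pipe characters.
    private static string ToRow(IEnumerable<string> cells) =>
        "| " + string.Join(" | ", cells) + " |";
}
```

The rows can be collected in the extraction loop by storing the text of each cell in a `string[]` per row instead of writing it to the console, and then passing the collected rows to `MarkdownTable.Format`.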