Extract Tables from PDFs and Images Using C#
GdPicture.NET’s table extraction engine is a native SDK that enables you to recognize tables in an unstructured document or image, parse the information, and export the tables to an external destination such as a spreadsheet. It can detect and extract bordered, semi-bordered, and borderless tables in images, scanned PDFs, and born-digital PDFs. As a native SDK, it can be deployed on-premises or embedded in your application, and it works offline, without internet access.
Extracting Table Data from a PDF to an Excel Spreadsheet
To read and extract table data from a PDF document to an Excel spreadsheet, follow these steps:
- Create a `GdPictureOCR` object and a `GdPicturePDF` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Select the page from which to extract the table data with the `SelectPage` method of the `GdPicturePDF` object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method.
- Configure the table extraction process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.

  For more optional configuration parameters, see the `GdPictureOCR` class.
- Run the table extraction process with the `RunOCR` method of the `GdPictureOCR` object, and save the result ID in a list.
- Create a `GdPictureOCR.SpreadsheetOptions` object and configure the output spreadsheet. By default, tables from the same OCR result are saved in the same sheet. To save each table in a different sheet, set the `SeparateTables` property of the `GdPictureOCR.SpreadsheetOptions` object to `true`. For more optional configuration parameters, see the `GdPictureOCR.SpreadsheetOptions` class.
- Save the output in an Excel spreadsheet with the `SaveAsXLSX` method of the `GdPictureOCR` object. This method takes the following parameters:
  - The list containing the OCR result ID.
  - The path to the output file.
  - The `GdPictureOCR.SpreadsheetOptions` object.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and saves the output in an Excel spreadsheet:

C#

    using GdPicture14;
    using System.Collections.Generic;
    ...
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    // Load the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Select the first page.
    gdpicturePDF.SelectPage(1);
    // Render the first page to a 300 DPI image.
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    // Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId);
    // Configure the table extraction process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Run the table extraction process and save the result ID in a list.
    string result = gdpictureOCR.RunOCR();
    List<string> resultsList = new List<string>() { result };
    // Configure the output spreadsheet.
    GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
    {
        SeparateTables = true
    };
    // Save the output in an Excel spreadsheet.
    gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);
    // Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults();
    GdPictureDocumentUtilities.DisposeImage(imageId);
    gdpicturePDF.CloseDocument();


VB.NET

    Imports GdPicture14
    Imports System.Collections.Generic
    ...
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        ' Load the source document.
        gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
        ' Select the first page.
        gdpicturePDF.SelectPage(1)
        ' Render the first page to a 300 DPI image.
        Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
        ' Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId)
        ' Configure the table extraction process.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Run the table extraction process and save the result ID in a list.
        Dim result As String = gdpictureOCR.RunOCR()
        Dim resultsList As List(Of String) = New List(Of String)()
        resultsList.Add(result)
        ' Configure the output spreadsheet.
        Dim spreadsheetOptions As GdPictureOCR.SpreadsheetOptions = New GdPictureOCR.SpreadsheetOptions() With {
            .SeparateTables = True
        }
        ' Save the output in an Excel spreadsheet.
        gdpictureOCR.SaveAsXLSX(resultsList, "C:\temp\output.xlsx", spreadsheetOptions)
        ' Release unnecessary resources.
        gdpictureOCR.ReleaseOCRResults()
        GdPictureDocumentUtilities.DisposeImage(imageId)
        gdpicturePDF.CloseDocument()
    End Using
    End Using

Extracting Table Data from a PDF to JSON Format
To read and extract table data from a PDF document to JSON format, follow these steps:
- Import the `GdPicture14` and `Newtonsoft.Json.Linq` namespaces.
- Create a `GdPictureOCR` object and a `GdPicturePDF` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Select the page from which to extract the table data with the `SelectPage` method of the `GdPicturePDF` object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the number of tables detected during the OCR process with the `GetTableCount` method of the `GdPictureOCR` object.
- Create the JSON object that contains the tables on the page and loop through the tables.
- For each table, get the number of columns and rows with the `GetTableColumnCount` and `GetTableRowCount` methods of the `GdPictureOCR` object.
- Create the JSON object that contains the rows in the table and loop through the rows.
- Create the JSON object that contains the cells in the row and loop through the cells.
- Get the detected value for each cell with the `GetTableCellText` method of the `GdPictureOCR` object and save it in the JSON object.
- Print the tables to the console in JSON format.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and prints the output to the console in JSON format:

C#

    using GdPicture14;
    using Newtonsoft.Json.Linq;
    ...
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    // Load the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Select the first page.
    gdpicturePDF.SelectPage(1);
    // Render the first page to a 300 DPI image.
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    // Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId);
    // Configure the OCR process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Run the OCR process.
    string ocrResultId = gdpictureOCR.RunOCR();
    // Create the JSON object that contains the tables on the page and loop through the tables.
    int tableCount = gdpictureOCR.GetTableCount(ocrResultId);
    dynamic[] tables = new JObject[tableCount];
    for (int tableIndex = 0; tableIndex < tableCount; tableIndex++)
    {
        int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
        int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
        // Create the JSON object that contains the rows in the table and loop through the rows.
        dynamic[] rows = new JObject[rowCount];
        for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
        {
            // Create the JSON object that contains the cells in the row and loop through the cells.
            dynamic[] cells = new JObject[columnCount];
            for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
            {
                cells[columnIndex] = new JObject();
                cells[columnIndex].RowIndex = rowIndex;
                cells[columnIndex].ColumnIndex = columnIndex;
                // Read the content of the cell and save it in the JSON object.
                cells[columnIndex].Text = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex);
            }
            rows[rowIndex] = new JObject();
            rows[rowIndex].Cells = new JArray(cells);
        }
        tables[tableIndex] = new JObject();
        tables[tableIndex].Rows = new JArray(rows);
    }
    dynamic tablesOnPage = new JObject();
    tablesOnPage.Tables = new JArray(tables);
    // Print the tables to the console in JSON format.
    Console.WriteLine(tablesOnPage.ToString());
    // Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults();
    GdPictureDocumentUtilities.DisposeImage(imageId);
    gdpicturePDF.CloseDocument();


VB.NET

    Imports GdPicture14
    Imports Newtonsoft.Json.Linq
    ...
    ' The late-bound member access on the JObject instances below requires Option Strict Off.
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        ' Load the source document.
        gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
        ' Select the first page.
        gdpicturePDF.SelectPage(1)
        ' Render the first page to a 300 DPI image.
        Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
        ' Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId)
        ' Configure the OCR process.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Run the OCR process.
        Dim ocrResultId As String = gdpictureOCR.RunOCR()
        ' Create the JSON object that contains the tables on the page and loop through the tables.
        Dim tableCount As Integer = gdpictureOCR.GetTableCount(ocrResultId)
        Dim tables As Object() = New JObject(tableCount - 1) {}
        For tableIndex = 0 To tableCount - 1
            Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
            Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
            ' Create the JSON object that contains the rows in the table and loop through the rows.
            Dim rows As Object() = New JObject(rowCount - 1) {}
            For rowIndex = 0 To rowCount - 1
                ' Create the JSON object that contains the cells in the row and loop through the cells.
                Dim cells As Object() = New JObject(columnCount - 1) {}
                For columnIndex = 0 To columnCount - 1
                    cells(columnIndex) = New JObject()
                    cells(columnIndex).RowIndex = rowIndex
                    cells(columnIndex).ColumnIndex = columnIndex
                    ' Read the content of the cell and save it in the JSON object.
                    cells(columnIndex).Text = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex)
                Next
                rows(rowIndex) = New JObject()
                rows(rowIndex).Cells = New JArray(cells)
            Next
            tables(tableIndex) = New JObject()
            tables(tableIndex).Rows = New JArray(rows)
        Next
        Dim tablesOnPage As Object = New JObject()
        tablesOnPage.Tables = New JArray(tables)
        ' Print the tables to the console in JSON format.
        Console.WriteLine(tablesOnPage.ToString())
        ' Release unnecessary resources.
        gdpictureOCR.ReleaseOCRResults()
        GdPictureDocumentUtilities.DisposeImage(imageId)
        gdpicturePDF.CloseDocument()
    End Using
    End Using

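
The example above builds the JSON with Newtonsoft's `dynamic` `JObject` API. If you would rather avoid `dynamic`, the same Tables, Rows, Cells shape can be sketched with `System.Text.Json` instead. This is a different library from the one the guide uses, not a GdPicture.NET feature. The sketch assumes .NET 6 or later (top-level statements) and that the cell text of each table has already been copied into a `string[][]`, for example from `GetTableCellText`. `ToTableModel` and `TablesToJson` are hypothetical helper names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical helper: converts one table (rows of cell text) into an
// object with a Rows array, where each row holds a Cells array.
static object ToTableModel(string[][] grid) => new
{
    Rows = grid.Select((row, rowIndex) => new
    {
        Cells = row.Select((text, columnIndex) => new
        {
            RowIndex = rowIndex,
            ColumnIndex = columnIndex,
            Text = text
        }).ToArray()
    }).ToArray()
};

// Hypothetical helper: serializes all tables of a page as indented JSON.
static string TablesToJson(IEnumerable<string[][]> tables) =>
    JsonSerializer.Serialize(
        new { Tables = tables.Select(t => ToTableModel(t)).ToArray() },
        new JsonSerializerOptions { WriteIndented = true });

// Sample data standing in for text read from the OCR result.
string[][][] sampleTables =
{
    new[]
    {
        new[] { "No.", "Museum Name" },
        new[] { "1.", "Louvre" }
    }
};
Console.WriteLine(TablesToJson(sampleTables));
```

Collecting the text first and serializing afterward keeps the OCR calls separate from the output format, so the same arrays could feed a different writer later.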
Extracting Table Data from a PDF to Markdown Format
To read and extract table data from a PDF document and print it to the console in Markdown syntax, follow these steps:
- Create a `GdPictureOCR` object and a `GdPicturePDF` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Select the page from which to extract the table data with the `SelectPage` method of the `GdPicturePDF` object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the number of tables detected during the OCR process with the `GetTableCount` method of the `GdPictureOCR` object, and loop through them.
- For each table, get the number of columns and rows with the `GetTableColumnCount` and `GetTableRowCount` methods of the `GdPictureOCR` object, and loop through them.
- Get the detected value for each cell with the `GetTableCellText` method of the `GdPictureOCR` object, and print it to the console.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and prints the output to the console in Markdown syntax:

C#

    using GdPicture14;
    using System;
    ...
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    // Load the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Select the first page.
    gdpicturePDF.SelectPage(1);
    // Render the first page to a 300 DPI image.
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    // Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId);
    // Configure the OCR process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Run the OCR process.
    string ocrResultId = gdpictureOCR.RunOCR();
    // Loop through the detected tables.
    for (int tableIndex = 0; tableIndex < gdpictureOCR.GetTableCount(ocrResultId); tableIndex++)
    {
        int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
        int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
        // Print the table to the console.
        Console.Write($"\nTable {tableIndex}");
        for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
        {
            Console.Write("\n| ");
            for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
            {
                // Read the cell text and remove line breaks so that each row stays on one line.
                string cellContent = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "");
                Console.Write($" {cellContent} |");
            }
        }
        Console.WriteLine("");
    }
    // Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults();
    GdPictureDocumentUtilities.DisposeImage(imageId);
    gdpicturePDF.CloseDocument();


VB.NET

    Imports GdPicture14
    Imports System
    ...
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        ' Load the source document.
        gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
        ' Select the first page.
        gdpicturePDF.SelectPage(1)
        ' Render the first page to a 300 DPI image.
        Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
        ' Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId)
        ' Configure the OCR process.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Run the OCR process.
        Dim ocrResultId As String = gdpictureOCR.RunOCR()
        ' Loop through the detected tables.
        For tableIndex As Integer = 0 To gdpictureOCR.GetTableCount(ocrResultId) - 1
            Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
            Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
            ' Print the table to the console.
            Console.Write(vbLf & $"Table {tableIndex}")
            For rowIndex = 0 To rowCount - 1
                Console.Write(vbLf & "| ")
                For columnIndex = 0 To columnCount - 1
                    ' Read the cell text and remove line breaks so that each row stays on one line.
                    Dim cellContent As String = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "")
                    Console.Write($" {cellContent} |")
                Next
            Next
            Console.WriteLine("")
        Next
        ' Release unnecessary resources.
        gdpictureOCR.ReleaseOCRResults()
        GdPictureDocumentUtilities.DisposeImage(imageId)
        gdpicturePDF.CloseDocument()
    End Using
    End Using

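
The examples above remove `Environment.NewLine` from each cell before printing it. A slightly more defensive variant is sketched below. It rests on an assumption that the guide does not confirm: that recognized text may also contain lone `\n` or `\r` characters, or literal `|` characters, either of which would break a Markdown row. `SanitizeCell` is a hypothetical helper, not part of GdPicture.NET, and the sketch assumes .NET 6 or later (top-level statements).

```csharp
using System;

// Hypothetical helper: makes a piece of cell text safe to place inside
// a single Markdown table cell.
static string SanitizeCell(string text)
{
    if (string.IsNullOrEmpty(text)) return string.Empty;

    // Replace every kind of line break with a space so that words on
    // separate lines are not glued together.
    string singleLine = text
        .Replace("\r\n", " ")
        .Replace('\n', ' ')
        .Replace('\r', ' ');

    // Escape pipes, which would otherwise start a new table cell.
    return singleLine.Replace("|", "\\|").Trim();
}

Console.WriteLine(SanitizeCell("Saint\nPetersburg | Russia"));
// prints: Saint Petersburg \| Russia
```

Unlike the `Replace(Environment.NewLine, "")` call in the examples, this sketch substitutes a space for each line break. That is a design choice for readability and changes the output slightly.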
Extracting Table Data from an Image
To read and extract table data from an image, follow these steps:
- Create a `GdPictureOCR` object and a `GdPictureImaging` object.
- Select the image of the table by passing its path to the `CreateGdPictureImageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the image of the table with the `SetImage` method.
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the number of tables detected during the OCR process with the `GetTableCount` method of the `GdPictureOCR` object, and loop through them.
- For each table, get the number of columns and rows with the `GetTableColumnCount` and `GetTableRowCount` methods of the `GdPictureOCR` object, and loop through them.
- Get the detected value for each cell with the `GetTableCellText` method of the `GdPictureOCR` object, and print it to the console.
- Release unnecessary resources.
The example below extracts data from the following table and prints the output to the console in Markdown syntax.
Download the sample table and run the code below, or check out our demo.

C#

    using GdPicture14;
    using System;
    ...
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    // Load the source image.
    int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
    // Configure the OCR process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    gdpictureOCR.SetImage(imageId);
    // Run the OCR process.
    string ocrResultId = gdpictureOCR.RunOCR();
    // Loop through the detected tables.
    for (int tableIndex = 0; tableIndex < gdpictureOCR.GetTableCount(ocrResultId); tableIndex++)
    {
        int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
        int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
        // Print the table to the console.
        Console.Write($"\nTable {tableIndex}");
        for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
        {
            Console.Write("\n| ");
            for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
            {
                // Read the cell text and remove line breaks so that each row stays on one line.
                string cellContent = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "");
                Console.Write($" {cellContent} |");
            }
        }
        Console.WriteLine("");
    }
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);
    gdpictureOCR.ReleaseOCRResults();


VB.NET

    Imports GdPicture14
    Imports System
    ...
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        ' Load the source image.
        Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
        ' Configure the OCR process.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        gdpictureOCR.SetImage(imageId)
        ' Run the OCR process.
        Dim ocrResultId As String = gdpictureOCR.RunOCR()
        ' Loop through the detected tables.
        For tableIndex As Integer = 0 To gdpictureOCR.GetTableCount(ocrResultId) - 1
            Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
            Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
            ' Print the table to the console.
            Console.Write(vbLf & $"Table {tableIndex}")
            For rowIndex = 0 To rowCount - 1
                Console.Write(vbLf & "| ")
                For columnIndex = 0 To columnCount - 1
                    ' Read the cell text and remove line breaks so that each row stays on one line.
                    Dim cellContent As String = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "")
                    Console.Write($" {cellContent} |")
                Next
            Next
            Console.WriteLine("")
        Next
        ' Release unnecessary resources.
        gdpictureImaging.ReleaseGdPictureImage(imageId)
        gdpictureOCR.ReleaseOCRResults()
    End Using
    End Using

Format the output to obtain the following table:
No. | Museum Name | Location | Visits in 2021 | Change Since 2020 |
---|---|---|---|---|
1. | Louvre | France, Paris | 2,825,000 | +5% |
2. | Russian Museum | Russia, Saint Petersburg | 2,260,231 | +88% |
3. | Multimedia Art Museum | Russia, Moscow | 2,242,405 | +421% |
4. | Metropolitan Museum of Art | United States, New York | 1,958,000 | +84% |
5. | National Gallery of Art | United States, Washington, D.C. | 1,704,606 | +133% |
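
The loops in the examples above print each row between pipes but do not emit the header separator row (`---|---`) that a Markdown table needs. One possible way to produce a table like the one above is sketched here. `ToMarkdownTable` is a hypothetical helper, not part of GdPicture.NET. It assumes that the cell text has already been collected into arrays (for example with `GetTableCellText`) and that the first row of the table is its header. The sketch assumes .NET 6 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: formats rows of cell text as a Markdown table,
// treating the first row as the header.
static string ToMarkdownTable(IReadOnlyList<string[]> rows)
{
    if (rows.Count == 0) return string.Empty;
    int columnCount = rows.Max(row => row.Length);

    // Pads short rows with empty cells so that every line has the same
    // number of columns.
    string FormatRow(IEnumerable<string> cells) =>
        "| " + string.Join(" | ",
            cells.Concat(Enumerable.Repeat("", columnCount)).Take(columnCount)) + " |";

    var lines = new List<string>
    {
        FormatRow(rows[0]),
        // Separator row that marks the first row as the header.
        FormatRow(Enumerable.Repeat("---", columnCount))
    };
    lines.AddRange(rows.Skip(1).Select(row => FormatRow(row)));
    return string.Join("\n", lines);
}

// Sample data standing in for text read from the OCR result.
string[][] sampleRows =
{
    new[] { "No.", "Museum Name", "Visits in 2021" },
    new[] { "1.", "Louvre", "2,825,000" },
    new[] { "2.", "Russian Museum", "2,260,231" }
};
Console.WriteLine(ToMarkdownTable(sampleRows));
```

Whether the first detected row really is a header depends on the source table, so you may need to adjust that assumption for your documents.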