Blog Post

Extracting Key-Value Pairs Using GdPicture.NET SDK

Illustration: Extracting Key-Value Pairs Using GdPicture.NET SDK

This article will cover what key-value pairs (KVPs) are, how they’re used, and how GdPicture.NET’s KVP extraction engine extracts them from documents that contain unstructured data.

What Are Key-Value Pairs?

In the context of a document, key-value pairs are a way of organizing data or information. These pairs are made up of keys and values. Depending on the type of document, the key-value pairs can be different.

For example, key-value pairs on invoices can be the following:

Key Value
Invoice Number No 00162
Billing Date 20/09/2022
Total 1,165.10€

Here’s an example of key-value pair fields on a government form:

Key Value
Company Name PSPDFKit GmbH
Registration Number FN 548939p
Date of Incorporation 04/10/2013

Extracting Key-Value Pairs

It’s easy to get key-value pairs from structured documents like Excel files because the values are all named. However, a majority of documents have unstructured data. For these documents, a KVP extraction tool is required to retrieve the information. Intelligent document processing (IDP) extracts data from unstructured and semi-structured documents using optical character recognition (OCR) and artificial intelligence (AI) technologies.

In such cases, the extraction of key-value pairs involves two tasks:

  • Using OCR technology to recognize unstructured information and text in a document.

  • Using machine learning (ML) and deep learning (DL) to make sense of the unstructured information by composing links between different parts of the extracted text.

The next sections will talk about why these two approaches have limitations when used separately, as well as why they should be combined for the most reliable results.

The Disadvantages of Only Using Traditional OCR

Extracting data with the traditional OCR approach is based on heuristics. The most important limitation of the traditional OCR approach is that it needs to use a different template for each document type. This works well for simple documents with structured data. However, extracting data with the traditional OCR approach doesn’t perform well with unstructured or semi-structured documents.

Extracting data with this approach suffers from the same limitations as traditional OCR engines that have difficulties recognizing text in the following contexts:

  • Colored backgrounds

  • Glaring

  • Skew

  • Text in tables and graphics

  • Handwritten text

The Disadvantages of Only Using Machine Learning and Deep Learning

Data extraction solutions that leverage ML and deep learning use AI technologies to mitigate the traditional OCR limitations. These deep learning approaches are usually a combination of different techniques such as convolutional neural networks, long short-term memory layers, transformers, and graph neural networks.

However, this approach has a few drawbacks compared to traditional OCR:

  • Time — You’ll need to spend a lot of time teaching ML/DL how to perform under the specific parameters you need.

  • Fixing errors — In the same vein, it can be hard and take a long time to “unteach” incorrect results.

  • Speed and resource usage — Traditional OCR can work more quickly on smaller systems with fewer resources needed compared to ML/DL approaches.

Additionally, data extraction relying only on machine and deep learning often fails for documents with a lot of noise and dotted lines.

GdPicture.NET’s Key-Value Pair Extraction Engine

A combination of the approaches above is necessary to achieve the best results in data extraction. For this reason, PSPDFKit GdPicture.NET recognizes text and key-value pairs based on a hybrid approach of the following methods:

  • Heuristics

  • Mathematics

  • Machine learning

This approach produces superior results compared to traditional OCR and pure ML approaches.

GdPicture.NET’s key-value pair extraction engine enables you to recognize related data items — such as IBANs and addresses — in a document and export them to an external destination like a spreadsheet. GdPicture.NET automatically recognizes the document type, such as a bank statement, and adapts to the context and determines the extraction approach that makes the best use of available resources. It also recognizes the document type based on adaptive layout understanding and natural language processing (NLP) technologies.

GdPicture.NET Data Model

The GdPicture.NET data model enables you to extract data from documents with excellent results. GdPicture.NET’s hybrid approach performs better than traditional OCR and pure ML engines, especially for documents with the following features:

  • Noise

  • Dotted lines

  • Broken characters

  • Text on colored backgrounds

  • Underlined text

  • Skewed text

  • Text in graphics and tables

Confidence Score — How We Ensure Extraction Accuracy

PSPDFKit GdPicture.NET’s key-value pair extraction engine calculates a confidence score, which expresses how confident the engine is in the accuracy of the extracted data.

The confidence score is calculated by considering the following factors, among others:

  • The confidence in the OCR result at the character level. Some characters are more difficult to recognize than others.

  • The confidence in the OCR result at the word level. Some words are more difficult to recognize than others.

  • The data type of the key. Some data types are more difficult to recognize than others. For example, dates and IBANs are relatively easy to recognize, while phone numbers and addresses are generally more difficult.

The confidence score enables you to filter results based on their assumed accuracy. For example, you can disregard data extraction results with a low confidence score, or flag them as data items that require manual checks.

Data Types We Can Categorize

PSPDFKit GdPicture.NET’s KVP extraction engine automatically detects the data type of values. The following data types are supported:

  • Business Identifier Code (BIC)

  • Credit card number

  • Currency

  • Date and time

  • Email address

  • International Bank Account Number (IBAN)

  • Number

  • Percentage

  • Phone number

  • Postal address

  • Postal code

  • String

  • Symbol

  • Time period

  • Unique Identifier (UID)

  • URL

  • Vat ID

How to Extract Data from Invoices

To extract data items from an invoice, follow these steps:

  1. Create a GdPictureOCR object and a GdPictureImaging object.

  2. Select the invoice by passing its path to the CreateGdPictureImageFromFile method of the GdPictureImaging object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the invoice with the SetImage method.

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

  4. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  5. Get the number of key-value pairs detected during the OCR process with the GetKeyValuePairCount method of the GdPictureOCR object, and loop through them.

  6. Get the key-value pairs, the data types, and the confidence scores with the following methods:

  7. Write the output to the console.

  8. Release unnecessary resources.

The example below retrieves key-value pairs from the following invoice.

Sample invoice

Download the sample invoice and run the code below, or check out our demo.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
// Load the source document.
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
gdpictureOCR.SetImage(imageId);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
string keyValuePairsData = "";
for (int pairIndex = 0; pairIndex < gdpictureOCR.GetKeyValuePairCount(ocrResultId); pairIndex++)
{
    keyValuePairsData += $"| Key: {gdpictureOCR.GetKeyValuePairKeyString(ocrResultId, pairIndex)} | " +
                         $"Value: {gdpictureOCR.GetKeyValuePairValueString(ocrResultId, pairIndex)} | " +
                         $"Document Type: {gdpictureOCR.GetKeyValuePairDataType(ocrResultId, pairIndex).ToString()} | " + 
                         $"Confidence Level: {Math.Round(gdpictureOCR.GetKeyValuePairConfidence(ocrResultId, pairIndex), 1).ToString()}% |\n";
}
// Write the output to the console.
Console.WriteLine(keyValuePairsData);
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);
gdpictureOCR.ReleaseOCRResults();
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    ' Load the source document.
    Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    gdpictureOCR.SetImage(imageId)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    Dim keyValuePairsData = ""
    For pairIndex As Integer = 0 To gdpictureOCR.GetKeyValuePairCount(ocrResultId) - 1
        keyValuePairsData += $"| Key: {gdpictureOCR.GetKeyValuePairKeyString(ocrResultId, pairIndex)} | Value: {gdpictureOCR.GetKeyValuePairValueString(ocrResultId, pairIndex)} | Document Type: {gdpictureOCR.GetKeyValuePairDataType(ocrResultId, pairIndex).ToString()} | Confidence Level: {Math.Round(gdpictureOCR.GetKeyValuePairConfidence(CStr(ocrResultId), CInt(pairIndex)), CInt(1)).ToString()}% |" & vbLf
    Next
    ' Write the output to the console.
    Console.WriteLine(keyValuePairsData)
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
    gdpictureOCR.ReleaseOCRResults()
End Using
End Using
Used Methods and Properties

Related Topics

Format the output to obtain the following table:

Key Value Document Type Confidence Level
Billing date 20/09/2022 DateTime 100%
Order date 20/09/2022 DateTime 100%
Republic of PDF +100 847 738 227 PhoneNumber 77.2%
IBAN AT13 2060 4236 6111 5994 IBAN 100%
Customer Vandelay Industries Around the Corner 13 NBC City String 69.8%
Delivery address Vandelay Industries Around the Corner 13 NBC City String 69.9%
Invoice number No 00162 String 70.9%
Ref. number 34751 Number 92.9%
No 00162 Number 100%
Reference P00201 UID 100%
Quantity Total (excl. VAT) 320.00€ Currency 59%
Subtotal 1,220.00€ Currency 100%
Discount (10%) -122.00€ Currency 90.6%
VAT (5.5%) +6710€ Currency 66.9%
Shipping cost 0.00€ Currency 75%
TOTAL 1,165.10€ Currency 100%
Description Lake Mirror String 99.6%
VAT 5.5% Percentage 66.6%
Price per unit (excl. VAT) 320.00€ Currency 80%
Tax No. AT98765321 UID 73.8%
# info@bruuuk.com EmailAddress 65.6%
# www.bruuuk.com URL 65.6%

When the engine doesn’t recognize the data type of a value, it categorizes the value as a string, as shown in the table above.

This table also contains information about the data type and the confidence level for each key-value pair:

  • The data type describes the nature of the content. In this example, the engine recognizes the value info@bruuuk.com as an email address and the value +100 847 738 227 as a phone number.

  • The confidence level describes how confident the KVP engine is in the accuracy of the data extraction.

In this example, the KVP engine automatically detected all key-value pairs in the document with minimal code and without any preconfiguration. The engine supports more than 100 formats and languages, and it has no dependencies to external models, resources, and databases.

How to Extract Data from Bank Statements

To extract data items from a bank statement, follow these steps:

  1. Create a GdPictureOCR object and a GdPictureImaging object.

  2. Select the bank statement by passing its path to the CreateGdPictureImageFromFile method of the GdPictureImaging object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the bank statement with the SetImage method.

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

  4. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  5. Get the number of key-value pairs detected during the OCR process with the GetKeyValuePairCount method of the GdPictureOCR object, and loop through them.

  6. Get the key-value pairs, the data types, and the confidence scores with the following methods:

  7. Write the output to the console.

  8. Release unnecessary resources.

The example below retrieves key-value pairs from the following bank statement.

Sample bank statement

Download the sample bank statement and run the code below, or check out our demo.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
// Load the source document.
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
gdpictureOCR.SetImage(imageId);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
string keyValuePairsData = "";
for (int pairIndex = 0; pairIndex < gdpictureOCR.GetKeyValuePairCount(ocrResultId); pairIndex++)
{
    keyValuePairsData += $"| Key: {gdpictureOCR.GetKeyValuePairKeyString(ocrResultId, pairIndex)} | " +
                         $"Value: {gdpictureOCR.GetKeyValuePairValueString(ocrResultId, pairIndex)} | " +
                         $"Document Type: {gdpictureOCR.GetKeyValuePairDataType(ocrResultId, pairIndex).ToString()} | " + 
                         $"Confidence Level: {Math.Round(gdpictureOCR.GetKeyValuePairConfidence(ocrResultId, pairIndex), 1).ToString()}% |\n";
}
// Write the output to the console.
Console.WriteLine(keyValuePairsData);
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);
gdpictureOCR.ReleaseOCRResults();
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    ' Load the source document.
    Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    gdpictureOCR.SetImage(imageId)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    Dim keyValuePairsData = ""
    For pairIndex As Integer = 0 To gdpictureOCR.GetKeyValuePairCount(ocrResultId) - 1
        keyValuePairsData += $"| Key: {gdpictureOCR.GetKeyValuePairKeyString(ocrResultId, pairIndex)} | Value: {gdpictureOCR.GetKeyValuePairValueString(ocrResultId, pairIndex)} | Document Type: {gdpictureOCR.GetKeyValuePairDataType(ocrResultId, pairIndex).ToString()} | Confidence Level: {Math.Round(gdpictureOCR.GetKeyValuePairConfidence(CStr(ocrResultId), CInt(pairIndex)), CInt(1)).ToString()}% |" & vbLf
    Next
    ' Write the output to the console.
    Console.WriteLine(keyValuePairsData)
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
    gdpictureOCR.ReleaseOCRResults()
End Using
End Using
Used Methods and Properties

Related Topics

Format the output to obtain the following table:

Key Value Document Type Confidence Level
IBAN FR7611808009101234567890147 IBAN 100%
Phone 786-315-0313 PhoneNumber 100%
BIC 12345678901 Number 66.4%
Bank Code 11808 Number 99.4%
Counter Code 00914 Number 100%
Number Account 12345678901 Number 99.3%
Bank Key 47 Number 74.2%
River Bank 100 Number 74%
Account Owner David Bricklane String 100%
Domiciliation East Bank Summerfield String 97.5%

Conclusion

You should now have a basic understanding of how GdPicture.NET’s key-value pair (KVP) extraction engine works, and how it ensures accurate and reliable extraction results. If you have any questions or want to discuss how to implement it in your workflows and projects, you can reach out to our team.

Related Products
PSPDFKit GdPicture.NET

Product Page
Guides
API Docs
Example Projects

Share Post
Free 60-Day Trial Try PSPDFKit in your app today.
Free Trial

Related Articles

Explore more
PRODUCTS  |  .NET

PDF-to-Office Conversion with GdPicture.NET SDK

TUTORIALS  |  .NET • How To

How to Manipulate and Edit PDFs in .NET with PdfSharpCore

DEVELOPMENT  |  .NET • MAUI

Choosing Your .NET MAUI Document Strategy