Build a Custom Template

This guide describes the steps required to create a custom document data extraction template.

Example Use Case

A renowned company hosted a talent contest and awarded certificates of completion to all successful participants. According to company policies, it’s essential to maintain records of all rewards in the rewards archive system.

The large number of participants produced a substantial volume of documents, and the HR department is concerned about how labor-intensive it would be to process them manually.

In response, the engineering department has been tasked with quickly building a data processing system that captures and manages this data.

The most direct solution is to use XtractFlow to create a custom data extraction template that captures all the necessary information.

An example of a certificate of completion

Download the example input image.

Building the Document Template

Start by creating a DocumentTemplate object.

This object represents a document template, which is the complete definition of a specific type of document.

It must define a unique identifier and a public name, provide a semantic description, and list the fields to extract. In this use case, the following information needs to be extracted:

The year of certificate delivery.
The person who received the certificate.
The mentor of the student.
The member of the jury.
The achievement of the student.
The postal address of the organization.

static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organization address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organization",
                StandardValidationMethods = new[] { new StandardFieldValidationMethod(StandardFieldValidation.PostalAddressIntegrity) }
            }
        }
    };
}
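
The Identifier value in the template above has the shape of a GUID. As a small sketch, assuming XtractFlow accepts any unique string here (this guide does not state the required format), a new identifier in the same style could be generated with the standard .NET Guid type:

```csharp
using System;

// Generate a new identifier in the same uppercase, hyphenated style as the
// example template. Whether XtractFlow requires this format is an assumption.
string identifier = Guid.NewGuid().ToString("D").ToUpperInvariant();

Console.WriteLine(identifier);
```

Because the example hard-codes its identifier, it is probably best to generate the value once and paste it into the template rather than calling Guid.NewGuid() on every run.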

Building the Component

Next, create a ProcessorComponent object, which the processor requires.

This object encapsulates the logic of the document processing workflow:

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = false, // Classification isn't required because only a single class of documents is processed.
        EnableFieldsExtraction = true, // Extract the fields specified in the previously defined template.
        Templates = new DocumentTemplate[]
        {
            buildOrpalisCertificateTemplate()
        }
    };
}

Processing a Document and Analyzing Results

Now instantiate a DocumentProcessor object and call its Process method to start inference.

The method returns a ProcessorResult object that contains the processing outcome:

// Building the component.
ProcessorComponent component = buildComponent();
// Processing the document.
ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);
// Analyzing results.
if (result.ExtractedFields != null)
{
    foreach (var item in result.ExtractedFields)
    {
        Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
    }
}

Results Output

Field name: 'Year' - Field value: '2023' - Validation state: (Undefined)
Field name: 'Student' - Field value: 'Fabio Escobar' - Validation state: (Undefined)
Field name: 'Mentor' - Field value: 'Loïc Carrère' - Validation state: (Undefined)
Field name: 'Jury member' - Field value: 'Olivier Houssin' - Validation state: (Undefined)
Field name: 'Achievement' - Field value: 'Successfully juggled with 3 bananas' - Validation state: (Undefined)
Field name: 'Organization address' - Field value: '52 Rue de Marclan, 31600 MURET, France' - Validation state: (Valid)
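
In the output above, only the field that declares a validation method reports a state other than Undefined. The following is a minimal sketch of how a caller might separate validated fields from the rest and flag any that did not come back as Valid. The tuples are hypothetical stand-ins built from the printed output, not the XtractFlow result type, and validation states are compared as text (an assumption about how item.ValidationState prints) to avoid guessing at enum members:

```csharp
using System;
using System.Linq;

// Hypothetical stand-in rows copied from the printed output above
// (field name, value, validation state as text); not the XtractFlow result type.
var fields = new (string FieldName, string Value, string ValidationState)[]
{
    ("Year", "2023", "Undefined"),
    ("Student", "Fabio Escobar", "Undefined"),
    ("Mentor", "Loïc Carrère", "Undefined"),
    ("Jury member", "Olivier Houssin", "Undefined"),
    ("Achievement", "Successfully juggled with 3 bananas", "Undefined"),
    ("Organization address", "52 Rue de Marclan, 31600 MURET, France", "Valid"),
};

// Fields whose state is not Undefined went through a validation method.
var validated = fields.Where(f => f.ValidationState != "Undefined").ToArray();

// Of those, anything not reported as Valid is flagged for manual review.
var needsReview = validated.Where(f => f.ValidationState != "Valid").ToArray();

Console.WriteLine($"Validated: {validated.Length}, needing review: {needsReview.Length}");
// prints "Validated: 1, needing review: 0"
```

With real results, the same two filters could run over result.ExtractedFields by comparing item.ValidationState.ToString() against the same strings.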

The Complete Solution

static void runExtraction()
{
    // Replace the placeholders with your GdPicture license key and OpenAI API key.
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider("OPENAI_KEY"));
    Configuration.ResourcesFolder = "resources";
    // Building the component.
    ProcessorComponent component = buildComponent();
    // Processing the document.
    ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);
    // Analyzing results.
    if (result.ExtractedFields != null)
    {
        foreach (var item in result.ExtractedFields)
        {
            Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = false, // Classification isn't required because only a single class of documents is processed.
        EnableFieldsExtraction = true, // Extract the fields specified in the previously defined template.
        Templates = new DocumentTemplate[]
        {
            buildOrpalisCertificateTemplate()
        }
    };
}

static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organization address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organization",
                StandardValidationMethods = new[] { new StandardFieldValidationMethod(StandardFieldValidation.PostalAddressIntegrity) }
            }
        }
    };
}