Blog Post

How to Combine OCR with Redaction in .NET

Now that our OCR feature is out in the wild, we can start doing powerful things with our document processing engine. OCR complements Redaction nicely, and our Libraries products are great for exploring these features.

For an introduction to OCR and how we’ve integrated it into our products, check out our OCR introduction blog post. You can also read up on Redaction on our feature page.

In this blog post, we’ll walk through the process of performing OCR on a scanned document and then using Redaction to remove text from that document. For this guide, we’ll use the PSPDFKit .NET Library. More detailed information on the .NET OCR API can be found in our .NET OCR usage guide. For information on how to perform a similar task with our Java library, head over to the corresponding Java blog post.

ℹ️ Note: Feel free to follow along with this post. A guide on how to integrate your .NET project with our recent OCR release is available here.

Use Case

For this example, we’re working with a scanned image of a credit card application containing some personal information that needs to be removed. We’ll use OCR to expose the text on the document, subsequently allowing us to apply redaction annotations by searching for particular patterns and text. To perform these redactions, we’ll use our RedactionRegEx and RedactionPreset components.

The credit card application (shown below) has already been scanned and saved as a PDF.

Pre-processed application

Performing OCR

First we need to grab the preprocessed document from our file system and create a Document object. We also want to prepare a path for the OCR processed document:

var document = new Document(new FileDataProvider("/Path/To/application-pre-process.pdf"));
var ocrProcessedFilePath = Path.GetTempFileName();

With our Document object, we can use the OcrProcessor to set the processing language and process the document. We know from the initial image that we’re dealing with a document in English:

var ocrProcessor = new OcrProcessor(document)
{
    Language = OcrLanguage.English
};
ocrProcessor.PerformOcr(new FileDataProvider(ocrProcessedFilePath));

If we had a larger document with multiple pages, we’d be able to specify which pages needed processing, which can be useful for optimizing performance. We could also choose from a multitude of languages supported in our shipped language data packs. See our API guides for more information on what can be done with OcrProcessor.

And that’s it! We’ve now processed the scanned document and should be able to extract the text from it in order to use the Redaction feature for our next step.

Redaction

Using RedactionProcessor, we can specify some details and search presets that we want to remove from the document. Let’s try to remove any references to the applicant’s name, “Joanna R. Simonitti.” To do this well, we need a pattern that’ll match combinations of her first name, last name, and initial. Here’s a quick-and-dirty™ regular expression that should do the trick:

// The following pattern will match the following combinations and is case insensitive:
// Joanna R. Simonitti
// Joanna R.
// Joanna
// Joanna Simonitti
// Simonitti
string nameRegex = @"(?i)(Joanna(\sR)?[.]?(\sSimonitti)?|Simonitti)";

The regular expression isn’t very expressive! We can make our code more readable with a function that takes a first name, last name, and initial, and outputs the regular expression string above:

/// <summary>
/// Build a regex pattern based on a first name, last name, and optional middle initial.
/// Searches either first and last name, or only last name.
/// </summary>
/// <param name="first">First name.
/// <param name="last">Last name.
/// <param name="initial">Optional middle initial.
/// <param name="caseInsensitive">Decides whether or not to search case insensitive. Defaults to `true`.
/// <returns>A regex pattern string.</returns>
string getNameRegex(string first, string last, string initial = "", bool caseInsensitive = true) {
    if (string.IsNullOrEmpty(first) || string.IsNullOrEmpty(last))
    {
        return "";
    }
    var builder = new StringBuilder();
    if (caseInsensitive)
    {
        builder.Append("(?i)");
    }
    builder.Append("(" + first);
    if (!string.IsNullOrEmpty(initial))
    {
        // Searches space character and initial with an optional period character appending.
        // The question marks mean zero or one occurrence(s).
        builder.Append(@"(\s" + initial + ")?[.]?");
    }
    builder.Append(@"(\s" + last + ")?");
    // Adding an OR operator so we also search the document for the last name only.
    builder.Append("|" + last + ")");

    return builder.ToString();
}

Using this function, we can hide away the nasty regex details!

But it’s not all tricky regex construction with PSPDFKit. We also have a bunch of handy regular expression presets which can be used to easily redact common patterns without needing to know what they are beforehand. In this next example, let’s remove any email addresses and phone numbers that might also be present:

var emailPreset = new RedactionPreset {Preset = RedactionPreset.Type.EmailAddress};
var phonePreset = new RedactionPreset {Preset = RedactionPreset.Type.NorthAmericanPhoneNumber};

Combining all this regex goodness, we can complete our redactions:

// First let’s load the OCR-processed document from the OCR step.
var ocrProcessedDocument = new Document(new FileDataProvider(ocrProcessedFilePath));

// And build our name regex.
var caseInsensitive = true;
var nameRegex = getNameRegex("Joanna", "Simonitti", "R", caseInsensitive);

// Create a redaction processor, and search for and redact all instances of the name, email addresses, and US phone numbers in the document.
RedactionProcessor.Create()
    .AddRedactionTemplates(new[] {new RedactionRegEx {Pattern = nameRegex}})
    .AddRedactionTemplates(new[] {new RedactionPreset {Preset = RedactionPreset.Type.EmailAddress}})
    .AddRedactionTemplates(new[] {new RedactionPreset {Preset = RedactionPreset.Type.NorthAmericanPhoneNumber}})
    .Redact(document);

Redaction complete! We can be sure that the underlying data relating to this information has been removed from the document, but it’s always good to check the result visually, which we’ll do after we’ve put everything together.

Putting It All Together

Let’s combine what we did above into an executable that does it all. Let’s not forget to import the relevant components as well:

using System;
using System.Text;
using PSPDFKit;
using PSPDFKit.Ocr;
using PSPDFKit.Redaction;
using PSPDFKit.Redaction.Description;

static void Main()
{
    var document = new Document(new FileDataProvider("/Path/To/application-pre-process.pdf"));
    var ocrProcessedFilePath = "/tmp/application-post-process.pdf";

    var ocrProcessor = new OcrProcessor(document)
    {
        Language = OcrLanguage.English
    };
    ocrProcessor.PerformOcr(new FileDataProvider(ocrProcessedFilePath));

    var processedDocument = new Document(new FileDataProvider(ocrProcessedFilePath));
    var nameRegex = getNameRegex("Joanna", "Simonitti", "R");

    RedactionProcessor.Create()
        .AddRedactionTemplates(new[] {new RedactionRegEx {Pattern = nameRegex}})
        .AddRedactionTemplates(new[] {new RedactionPreset {Preset = RedactionPreset.Type.EmailAddress}})
        .AddRedactionTemplates(new[] {new RedactionPreset {Preset = RedactionPreset.Type.NorthAmericanPhoneNumber}})
        .Redact(processedDocument);
}

Yay! That’s all there is to it: a few lines of code to do some document processing magic.

Let’s have a look at the final application once the processing is complete.

Post-processed application

Conclusion

In this blog post, we saw how easy it is to do powerful document processing using the PSPDFKit .NET Library. Being able to extract and redact text from a complex document that was scanned with a picture taken on a smart phone is a truly phenomenal achievement (really it is, when you stop and think about it). We showed how using our library can make such a task relatively easy, and hopefully we got you thinking about the countless other possibilities these features may provide.

If you think this would be useful and would like to experiment for yourself, head over to our trial page and grab a copy of the PSPDFKit .NET Library today.

How to Combine OCR with Redaction in .NET

Use Case

Performing OCR

Redaction

Putting It All Together

Conclusion

Related Products

PSPDFKit for .NET

Share Post

Related Articles

What Is PDF/A? A Guide to PDF/A and How to Convert to Any Type of PDF/A Programmatically

How to Edit PDFs in an iOS Application Using a PDF Library

PDF SDK: Build vs. Buy