Automated Document Redaction in Java

PSPDFKit lets you search the document for text matching predefined patterns and then create redactions on top of the matching text. After the redaction is applied, the text will be permanently and irreversibly removed.

Note that, by design, some of the preset patterns are deliberately broad and might return false positives. We favor catching every true match over the risk of missing sensitive data, so make sure to review the matches found before applying the redactions.

| Pattern Name | Description |
| --- | --- |
| CREDIT_CARD_NUMBER | Matches credit card numbers that begin with a digit from 1 to 6 and are 13 to 19 digits long. Spaces and - are allowed anywhere in the number. |
| DATE | Matches date formats such as mm/dd/yyyy, mm/dd/yy, dd/mm/yyyy, and dd/mm/yy. It rejects any days/months greater than 31 and matches whether or not a leading zero is used for a single-digit day or month. The delimiter can be -, ., or /. |
| TIME | Matches time formats such as 00:00:00, 00:00, and 00:00 PM. 12- and 24-hour formats are allowed. Seconds and the 12-hour AM/PM designation are both optional. |
| EMAIL_ADDRESS | Matches an email address with the format of, where xyz can be any alphanumeric character or a dot. Find out more about the email pattern. |
| INTERNATIONAL_PHONE_NUMBER | Matches international-style phone numbers with a prefix of + or 00, containing between 7 and 15 digits, with spaces or - occurring anywhere within the number. |
| IP_V4 | Matches an IPv4 address limited to number ranges of 0-255, with an optional mask. |
| IP_V6 | Matches full and compressed IPv6 addresses as defined in RFC 2373. |
| MAC_ADDRESS | Matches a MAC address with delimiters of either - or :. |
| NORTH_AMERICAN_PHONE_NUMBER | Matches a NANP-style phone number. In general, this matches numbers from the US, Canada, and various Caribbean countries. The pattern also matches an optional international prefix of +1. |
| SOCIAL_SECURITY_NUMBER | Matches a US social security number (SSN). The format of the number should be either XXX-XX-XXXX or XXXXXXXXX, with X denoting [0-9]. We expect the number to have word boundaries on either side, or to be at the start/end of the string. |
| URL | Matches a URL with a prefix of http. |
| US_ZIP_CODE | Matches a US-style zip code. The expected format is 00000 or 00000-0000, where the delimiter can be either - or /. |
| VIN | Matches US and ISO 3779 standard VINs. The format expects 17 characters, with the last 5 characters being numeric. The characters I, O, Q, and _ are not allowed in upper or lower case. |
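
To illustrate why reviewing matches matters, here is a minimal sketch in plain Java that approximates the CREDIT_CARD_NUMBER description with a regular expression and then uses a Luhn checksum to flag likely false positives. The regular expression, the class name `CardMatchReview`, and its methods are assumptions made for this example. They are not PSPDFKit's actual pattern or API, and the sketch allows at most one separator between digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CardMatchReview {
    // Assumed approximation of the CREDIT_CARD_NUMBER description, not the
    // library's real expression: a first digit from 1 to 6, followed by 12 to
    // 18 more digits, each optionally preceded by one space or hyphen.
    static final Pattern CARD_LIKE =
        Pattern.compile("(?<!\\d)[1-6](?:[ -]?\\d){12,18}(?!\\d)");

    // Collects every substring of the text that looks like a card number.
    static List<String> findCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher matcher = CARD_LIKE.matcher(text);
        while (matcher.find()) {
            candidates.add(matcher.group());
        }
        return candidates;
    }

    // Luhn checksum: starting from the rightmost digit, double every second
    // digit, subtract 9 from results above 9, and require the total to be
    // divisible by 10.
    static boolean passesLuhn(String candidate) {
        String digits = candidate.replaceAll("[ -]", "");
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        String text = "Card 4111 1111 1111 1111, order 1234567890123.";
        for (String candidate : findCandidates(text)) {
            String verdict = passesLuhn(candidate)
                ? "passes Luhn, likely a card number"
                : "fails Luhn, review as a possible false positive";
            System.out.println(candidate + " -> " + verdict);
        }
    }
}
```

In this sample, both strings match the broad pattern, but only the first one passes the checksum. A failed checksum suggests a false positive such as an order number, while a passed checksum does not prove that the digits belong to a real card, so a manual review is still advisable.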

To apply a redaction to a document, create a redaction processor, add the redaction preset for the pattern you’d like to remove, and redact the document:

    .addRedactionTemplates(new RedactionPreset.Builder(RedactionPreset.Type.EMAIL_ADDRESS).build())