Server API for Document Search

You can search for text in documents by providing the parameters below.

  • type (default: text) — determines the type of search, which can be one of the following:

    • text — searches for the provided piece of text. By default, the search is case insensitive. You can change this by setting the caseSensitive parameter to true.

    • regex — searches using the provided regular expression. The regular expression needs to comply with the ICU regex standard. By default, regular expressions are case sensitive. You can change this by setting the caseSensitive parameter to false.

    • preset — searches using one of the predefined patterns. When the preset type is used, the q parameter must be one of credit-card-number, date, email-address, international-phone-number, ipv4, ipv6, mac-address, north-american-phone-number, social-security-number, time, url, us-zip-code, or vin. Each pattern is described in the Search Presets section below.

  • q — the search query, which is either a piece of text, a regular expression, or a preset name. Make sure the query is URL encoded.

  • (optional) start — the index of the page at which the search starts (default: 0).

  • (optional) limit — the number of pages to search, counted from start (default: the page count of the document).

  • (optional) caseSensitive — overrides the default search case sensitivity: case insensitive for text search and case sensitive for regex/preset search.
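The parameters above can be assembled into a request URL as in the following minimal sketch. The function name build_search_url and the base URL are assumptions made for illustration; only the path and the parameter names come from this page. Percent-encoding is applied to every value, since regular expressions in particular tend to contain characters that are not URL safe.

```python
from urllib.parse import quote, urlencode


def build_search_url(base_url, document_id, q, search_type="text",
                     start=None, limit=None, case_sensitive=None):
    """Return a search URL with every query parameter percent-encoded.

    Hypothetical helper: only the path and parameter names are taken
    from the documentation; optional parameters are omitted when unset
    so that the server-side defaults apply.
    """
    params = {"q": q, "type": search_type}
    if start is not None:
        params["start"] = start
    if limit is not None:
        params["limit"] = limit
    if case_sensitive is not None:
        # Sent as lowercase true/false, as the parameter description suggests.
        params["caseSensitive"] = "true" if case_sensitive else "false"
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return f"{base_url}/api/documents/{document_id}/search?{query}"


url = build_search_url("http://localhost:5000", "abc", r"\d{3}-\d{4}",
                       search_type="regex", start=0, limit=10)
print(url)
```

Leaving start, limit, and caseSensitive unset keeps the URL short and avoids hard-coding defaults that the server already provides.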

Request

GET /api/documents/abc/search?q=email-address&type=preset&start=0&limit=10
Authorization: Token token="<secret token>"

$ curl "http://localhost:5000/api/documents/:document_id/search?q=email-address&type=preset&start=0&limit=10" \
   -H "Authorization: Token token=<secret token>"
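The same request can be issued from a script. The sketch below uses only the Python standard library; the helper names make_search_request and run_search are hypothetical, and the Authorization header follows the form used in the curl example above. Note that urllib raises urllib.error.HTTPError for 4xx responses instead of returning them.

```python
import json
import urllib.request


def make_search_request(url, token):
    """Build (but do not send) a GET request carrying the API token."""
    # Header format assumed from the curl example: Token token=<secret token>
    headers = {"Authorization": f"Token token={token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def run_search(url, token):
    """Send the request and decode the JSON body.

    Requires a reachable server; 4xx and 5xx statuses surface as
    urllib.error.HTTPError.
    """
    with urllib.request.urlopen(make_search_request(url, token)) as response:
        return json.load(response)


request = make_search_request(
    "http://localhost:5000/api/documents/abc/search?q=email-address&type=preset",
    "secret",
)
print(request.get_method(), request.full_url)
```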

Response

If a document with the given ID exists, the server will return an HTTP response with the status 200 and the following JSON payload:

HTTP/1.1 200 OK
Content-Type: application/json
{
  "data": [
    {
      "pageIndex": 0,
      "previewText": "support@pspdfkit.com",
      "rangeInPreview": [2, 3],
      "rectsOnPage": [[48.45750427246094, 23.53656005859375, 26.207992553710938, 18.5250244140625]],
      "isAnnotation": false
    }
  ]
}

The data is structured as follows:

  • pageIndex — This is the page index where the text is.

  • previewText — This is the matched text together with the text surrounding it.

  • rangeInPreview — This is the location and length of the match in the preview text. The first element is the location (character position) within the preview text, and the second element is the length of the matched text itself.

  • rectsOnPage — This is the position and bounding box of the text in page coordinates [left, top, width, height].

  • isAnnotation — This is always false, for now. Searching for annotation content is not yet supported.
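As an illustration of how these fields fit together, the following sketch extracts the matched text from each result by slicing previewText with rangeInPreview and labels the rectangle components. The function name summarize_hits and the sample payload are made up for this example. It also assumes that the character position can be used directly as a Python string index, which may not hold for text outside the Basic Multilingual Plane.

```python
def summarize_hits(payload):
    """Turn a decoded search response into a list of plain dicts."""
    hits = []
    for item in payload["data"]:
        # rangeInPreview is [location, length] within previewText.
        location, length = item["rangeInPreview"]
        hits.append({
            "page": item["pageIndex"],
            "match": item["previewText"][location:location + length],
            # Each rect is [left, top, width, height] in page coordinates.
            "rects": [
                {"left": left, "top": top, "width": width, "height": height}
                for left, top, width, height in item["rectsOnPage"]
            ],
        })
    return hits


# Hypothetical payload in the documented shape (not real server output).
sample = {
    "data": [
        {
            "pageIndex": 2,
            "previewText": "Contact support@example.com today",
            "rangeInPreview": [8, 19],
            "rectsOnPage": [[48.0, 23.5, 120.0, 18.5]],
            "isAnnotation": False,
        }
    ]
}

hits = summarize_hits(sample)
print(hits[0]["match"])
```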

When there is an issue with the request (for example, when q is missing), you will receive a JSON response with an error:

HTTP/1.1 400 Bad Request
Content-Type: application/json
{
  "error": {
    "reason": "Missing required URL parameter 'q'."
  }
}

If there’s no such document, an HTTP response with the status 404 will be returned:

HTTP/1.1 404 Not Found
Content-Type: application/json
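A client typically needs to distinguish the three outcomes described above. The following sketch shows one way to do so. The function name interpret_response and the choice of exception types are assumptions, not part of the API. Because the body of the 404 response is not shown on this page, the sketch does not attempt to parse it.

```python
import json


def interpret_response(status, body):
    """Map a status code and raw body to search results or an exception."""
    if status == 200:
        return json.loads(body)["data"]
    if status == 400:
        # The error payload carries a human-readable reason.
        raise ValueError(json.loads(body)["error"]["reason"])
    if status == 404:
        raise LookupError("No document with the given ID exists.")
    raise RuntimeError(f"Unexpected HTTP status {status}")


ok_body = '{"data": []}'
bad_body = '{"error": {"reason": "Missing required URL parameter \'q\'."}}'

print(interpret_response(200, ok_body))
try:
    interpret_response(400, bad_body)
except ValueError as error:
    print("Bad request:", error)
```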

Search Presets

The following list describes the search patterns accepted by PSPDFKit Server in redaction creation and search APIs when the preset strategy is used:

  • credit-card-number — matches a number with 13 to 19 digits that begins with 1–6. Spaces and - are allowed anywhere in the number.

  • date — matches date formats such as mm/dd/yyyy, mm/dd/yy, dd/mm/yyyy, and dd/mm/yy. It rejects any days greater than 31 or months greater than 12 and accepts a leading 0 in front of a single-digit day or month. The delimiter can be -, ., or /.

  • email-address — matches an email address as defined here.

  • international-phone-number — matches international phone numbers. The number can have 7 to 15 digits with spaces or - occurring anywhere within the number, and it must have a prefix of + or 00.

  • ipv4 — matches an IPv4 address with an optional mask at the end.

  • ipv6 — matches full and compressed IPv6 addresses as defined in RFC 2373.

  • mac-address — matches a MAC address with either - or : as a delimiter.

  • north-american-phone-number — matches North American-style phone numbers. NANPA standardization is used with international support.

  • social-security-number — matches a valid social security number. Expects the format of XXX-XX-XXXX or XXXXXXXXX, with X denoting digits.

  • time — matches time formats such as 00:00:00, 00:00, and 00:00 PM. 12- and 24-hour formats are allowed. Seconds and AM/PM denotation are both optional.

  • url — matches a URL with a prefix of http or https, with an optional subdomain.

  • us-zip-code — matches a USA-style zip code. The format expected is XXXXX or XXXXX-XXXX, where the delimiter can either be - or /.

  • vin — matches US and ISO Standard 3779 VINs. The format expects 17 characters, with the last 5 characters being numeric. I, i, O, o, Q, q, and _ characters are not allowed.
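To get a rough sense of what two of these presets accept, the formats described above can be approximated locally. The patterns below are not the server's actual patterns: they are written for Python's re module rather than ICU, they check only the layout stated in the descriptions, and they match a whole string instead of searching inside document text. In particular, the social-security-number preset is described as matching valid numbers, and any validity rules beyond the layout are not reproduced here.

```python
import re

# Approximations derived from the prose descriptions, for illustration only.
SSN_FORMAT = re.compile(r"\d{3}-\d{2}-\d{4}|\d{9}")   # XXX-XX-XXXX or XXXXXXXXX
US_ZIP_FORMAT = re.compile(r"\d{5}(?:[-/]\d{4})?")    # XXXXX or XXXXX-XXXX, - or /


def looks_like_ssn(text):
    """Check only the digit layout, not whether the number is valid."""
    return SSN_FORMAT.fullmatch(text) is not None


def looks_like_us_zip(text):
    """Check a five-digit code with an optional four-digit extension."""
    return US_ZIP_FORMAT.fullmatch(text) is not None


print(looks_like_ssn("123-45-6789"), looks_like_us_zip("94107/1234"))
```

A local check of this kind can help when preparing test documents, but the server's results remain the reference for what a preset matches.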