Using Custom Tokenizers

PSPDFKit uses SQLite to build the full-text index used in PDFLibrary and PDFDocumentPickerController, and also for various other data-saving operations (like the image cache metadata). PSPDFKit doesn’t ship with its own SQLite version, and instead uses the one that is already in iOS. PSPDFKit also supports custom SQLite builds.

By default, PDFLibrary uses its own tokenizer, which works well for many languages, including Chinese, Japanese, and Korean (CJK). It also supports stemming, which means a search matches related word forms, like finding “dependencies” when searching for “depending.” This tokenizer is referenced by the PSPDFLibraryPorterTokenizerName identifier.

When should you ship your own build of SQLite?

  • When you want better indexing performance.
  • When you need features only available in a newer version of SQLite.
  • When you need better performance for exact word or phrase matches.

If you rely heavily on exact word or phrase matches, the default tokenizer set by PDFLibrary might not be optimal, because it also matches related word forms. In that case, consider switching to a custom one.

As noted above, the default tokenizer used for building the full-text search (FTS) index can also deal with CJK characters. Alternatively, we ship another custom tokenizer, referenced by the PDFLibrary.UnicodeTokenizerName identifier. This tokenizer is a wrapper around SQLite’s unicode61 tokenizer that additionally performs full case folding. This is useful when the document being indexed contains text like Straße and you’d like it to match a search for strasse.
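To make the difference between plain lowercasing and full case folding concrete, here is a small sketch. It does not show how the tokenizer is implemented. The helper approximateFullCaseFold is a hypothetical name, and the uppercase/lowercase round trip is only an approximation that happens to cover the ß case.

```swift
import Foundation

// Illustration only: approximates full case folding for "ß" by
// round-tripping through uppercase. This is NOT the tokenizer's code.
func approximateFullCaseFold(_ text: String) -> String {
    // uppercased() applies full case mapping ("ß" becomes "SS"),
    // so lowercasing afterwards yields "ss".
    return text.uppercased().lowercased()
}

let indexed = "Straße"
let query = "strasse"

// Plain lowercasing keeps "ß", so the two strings still differ.
let simpleMatch = indexed.lowercased() == query.lowercased()

// After the approximate fold, both sides compare equal.
let foldedMatch = approximateFullCaseFold(indexed) == approximateFullCaseFold(query)

print("simple lowercase match:", simpleMatch)
print("folded match:", foldedMatch)
```

A tokenizer that only lowercases its input would behave like simpleMatch here, which is why full case folding matters for this kind of text.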

You can also use the tokenizers that ship with SQLite itself, such as unicode61 or icu.

Tokenizer                          Minimum FTS Version   Minimum SQLite Version
PSPDFLibraryPorterTokenizerName    FTS4                  3.7.4
PDFLibrary.UnicodeTokenizerName    FTS5                  3.9.0
unicode61                          FTS4                  3.7.13

Note that simply linking the correct SQLite version with your application is not enough: You must ensure that the linked SQLite is built with the correct flags to enable FTS4 or FTS5. Trying to enable a tokenizer on an unsupported FTS version will result in the initialization of PDFLibrary failing:

// Swift
do {
    let library = try PDFLibrary(path: PDFLibrary.defaultLibraryPath(), tokenizer: "unicode61")
    let documentPicker = PDFDocumentPickerController(directory: "/path/to/files", includeSubdirectories: true, library: library)
} catch {
    // Handle error.
}

// Objective-C
// Pass an NSError ** instead of NULL to inspect why initialization failed.
PSPDFLibrary *library = [PSPDFLibrary libraryWithPath:PSPDFLibrary.defaultLibraryPath tokenizer:@"unicode61" error:NULL];
PSPDFDocumentPickerController *documentPicker = [[PSPDFDocumentPickerController alloc] initWithDirectory:@"/path/to/files" includeSubdirectories:YES library:library];
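
Before choosing a tokenizer, it can help to check at runtime which SQLite the app is linked against and whether it was built with FTS support. The following is a minimal sketch using only the standard SQLite C API. It assumes the linked build exposes that API under its usual symbol names and has compile-option diagnostics enabled. The helper name sqliteHasCompileOption and the variable names are hypothetical.

```swift
import SQLite3

// Returns true if the linked SQLite reports the given compile-time option.
// The "SQLITE_" prefix is optional, so "ENABLE_FTS5" works as well.
func sqliteHasCompileOption(_ option: String) -> Bool {
    return sqlite3_compileoption_used(option) != 0
}

// Version of the SQLite library the app is actually linked against.
let version = String(cString: sqlite3_libversion())
let versionNumber = sqlite3_libversion_number() // 3.9.0 is reported as 3009000

// Assumption: in standard builds, FTS4 is available when either
// ENABLE_FTS3 or ENABLE_FTS4 was set at compile time.
let hasFTS4 = sqliteHasCompileOption("ENABLE_FTS3") || sqliteHasCompileOption("ENABLE_FTS4")
let hasFTS5 = sqliteHasCompileOption("ENABLE_FTS5")

// Minimum versions taken from the table above.
let canUseUnicode61 = hasFTS4 && versionNumber >= 3_007_013
let canUseFTS5Tokenizer = hasFTS5 && versionNumber >= 3_009_000

print("SQLite \(version): FTS4 \(hasFTS4), FTS5 \(hasFTS5)")
```

If a check like this reports that FTS5 is missing, falling back to a tokenizer that only needs FTS4 avoids a failing PDFLibrary initialization.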

Optionally, you can also ship your own version of SQLite. In the PSPDFKit.dmg you downloaded, you will find a current version of SQLite in the Extras folder, already prepared to be linked. Add SQLite.xcodeproj to your Xcode project, and then add libSQLite.a both as a Target Dependency and under Link Binary with Libraries. Make sure that you don’t also link the libsqlite3.tbd library.

After setting a different tokenizer, you will have to delete your app, or at least the library file, so that the index is fully rebuilt.
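
During development, removing the library file can be done in code instead of reinstalling the app. The following is a minimal sketch using only Foundation. The helper removeLibraryIndex is a hypothetical name, and it assumes the path you pass in is the one the library was created with, for example the value returned by PDFLibrary.defaultLibraryPath().

```swift
import Foundation

// Hypothetical helper: removes the on-disk index at `path` if it exists.
// Returns true if something was removed, false if nothing was there.
@discardableResult
func removeLibraryIndex(atPath path: String) throws -> Bool {
    let fileManager = FileManager.default
    guard fileManager.fileExists(atPath: path) else { return false }
    // removeItem(atPath:) handles both a single file and a directory.
    try fileManager.removeItem(atPath: path)
    return true
}

// Assumed usage: call this before creating the library with the new tokenizer,
// passing the same path you pass to the PDFLibrary initializer.
```

Run this before the library is created, so that no open database handle points at the removed file.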