Custom Tokenizers for PDF Search in iOS

PSPDFKit uses SQLite to build the full-text index behind PDFLibrary and PDFDocumentPickerController, and for various other data-saving operations (like the image cache metadata). PSPDFKit doesn’t ship with its own SQLite version; instead, it uses the one that’s already in iOS. Custom SQLite builds are also supported.

By default, PDFLibrary uses its own tokenizer, which works well for many languages, including Chinese, Japanese, and Korean (CJK). It also applies stemming, so searches find related words, e.g. “dependencies” when searching for “depending.” This tokenizer is identified by PSPDFLibraryPorterTokenizerName.

When should you ship your own build of SQLite?

  • When you want better indexing performance

  • When you need features only available in a newer version of SQLite

  • When you need better performance for exact word or phrase matches

If you rely a lot on exact word or phrase matches, the default tokenizer set by PDFLibrary might not be optimal and you should consider switching to a custom one.

As described above, the default tokenizer used to build the full-text search (FTS) index also handles CJK characters. Alternatively, PSPDFKit ships a second custom tokenizer, which is referenced by the PDFLibrary.UnicodeTokenizerName identifier. This tokenizer is a wrapper around SQLite’s unicode61 tokenizer that additionally performs full case folding. This is useful when the document being indexed contains text like Straße and you’d like it to match a search for strasse.
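To illustrate why full case folding matters for the Straße example, here is a minimal Swift sketch. It isn’t PSPDFKit’s implementation: approximateCaseFold is a hypothetical helper that only approximates case folding by uppercasing and then lowercasing a string with the Swift standard library.

```swift
/// Hypothetical helper: approximates full case folding by uppercasing
/// and then lowercasing. Uppercasing expands "ß" to "SS", which plain
/// lowercasing alone does not do.
func approximateCaseFold(_ text: String) -> String {
    return text.uppercased().lowercased()
}

let indexedWord = "Straße"
let query = "strasse"

// Plain lowercasing keeps the "ß", so the two strings still differ.
let simpleMatch = indexedWord.lowercased() == query.lowercased()

// After folding, both sides become "strasse" and compare equal.
let foldedMatch = approximateCaseFold(indexedWord) == approximateCaseFold(query)

print("simple lowercase match: \(simpleMatch), folded match: \(foldedMatch)")
```

A tokenizer that only lowercases its input would miss this match, which is the gap the case-folding wrapper is meant to close.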

You can also use the tokenizers built into SQLite itself, like the unicode61 or icu tokenizers.

Tokenizer                         Minimum FTS Version   Minimum SQLite Version
PSPDFLibraryPorterTokenizerName   FTS4                  3.7.4
PDFLibrary.UnicodeTokenizerName   FTS5                  3.9.0
unicode61                         FTS4                  3.7.13

Note that simply linking the correct SQLite version with your application isn’t enough: You must also ensure that the linked SQLite is built with the flags that enable FTS4 or FTS5. Trying to use a tokenizer on an unsupported FTS version causes the initialization of PDFLibrary to fail:

// Swift
do {
    let library = try PDFLibrary(path: PDFLibrary.defaultLibraryPath(), tokenizer: "unicode61")
    let documentPicker = PDFDocumentPickerController(directory: "/path/to/files", includeSubdirectories: true, library: library)
} catch {
    // Initialization failed, e.g. the linked SQLite lacks the required FTS version.
}

// Objective-C
NSError *error = nil;
PSPDFLibrary *library = [PSPDFLibrary libraryWithPath:PSPDFLibrary.defaultLibraryPath tokenizer:@"unicode61" error:&error];
if (library != nil) {
    PSPDFDocumentPickerController *documentPicker = [[PSPDFDocumentPickerController alloc] initWithDirectory:@"/path/to/files" includeSubdirectories:YES library:library];
} else {
    // Initialization failed; inspect `error` for details.
}
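Before choosing a tokenizer, it can help to check whether the linked SQLite meets the version and FTS requirements listed in the table above. The following is a minimal sketch using SQLite’s C API (sqlite3_libversion_number and sqlite3_compileoption_used) from Swift. TokenizerRequirement, encodedVersion, and linkedSQLiteSupports are hypothetical names introduced for illustration and aren’t part of PSPDFKit. It assumes the SQLite3 module you import matches the library that’s actually linked, and that the library wasn’t built with compile-option diagnostics omitted.

```swift
import SQLite3

/// Hypothetical description of one row of the requirements table.
struct TokenizerRequirement {
    let minimumVersion: Int32   // Encoded like SQLITE_VERSION_NUMBER.
    let needsFTS5: Bool         // false means FTS4 is sufficient.
}

/// Encodes a version the way SQLite does: major * 1000000 + minor * 1000 + patch.
func encodedVersion(_ major: Int32, _ minor: Int32, _ patch: Int32) -> Int32 {
    return major * 1_000_000 + minor * 1_000 + patch
}

/// Returns true if the linked SQLite is new enough and reports the needed FTS flag.
func linkedSQLiteSupports(_ requirement: TokenizerRequirement) -> Bool {
    guard sqlite3_libversion_number() >= requirement.minimumVersion else {
        return false
    }
    if requirement.needsFTS5 {
        return sqlite3_compileoption_used("ENABLE_FTS5") != 0
    }
    // Either of these build flags adds FTS3 and FTS4 to the library.
    return sqlite3_compileoption_used("ENABLE_FTS3") != 0
        || sqlite3_compileoption_used("ENABLE_FTS4") != 0
}

// Requirements taken from the table: unicode61 needs FTS4 and SQLite 3.7.13,
// PDFLibrary.UnicodeTokenizerName needs FTS5 and SQLite 3.9.0.
let unicode61 = TokenizerRequirement(minimumVersion: encodedVersion(3, 7, 13), needsFTS5: false)
let unicodeWrapper = TokenizerRequirement(minimumVersion: encodedVersion(3, 9, 0), needsFTS5: true)

print("Linked SQLite version: \(String(cString: sqlite3_libversion()))")
print("unicode61 supported: \(linkedSQLiteSupports(unicode61))")
print("FTS5-based tokenizer supported: \(linkedSQLiteSupports(unicodeWrapper))")
```

A check like this only reports what the linked SQLite claims about itself; initializing PDFLibrary inside a do/catch block remains the definitive test.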

Optionally, you can also ship your own version of SQLite. The PSPDFKit.dmg you downloaded contains a current version of SQLite in the Extras folder that’s already prepared to be linked. Add SQLite.xcodeproj to your Xcode project, and then add libSQLite.a both as a Target Dependency and under Link Binary With Libraries. Make sure you don’t also link the libsqlite3.tbd library.

ℹ️ Note: After setting a different tokenizer, you’ll have to delete your app, or at least the library file, so that the index is fully rebuilt.