Indexed Full-Text Search (FTS)

PSPDFKit supports efficient and fast full-text search in PDF documents through PDFLibrary. This document describes how to get started with PDFLibrary.

Getting Started

PDFLibrary relies on a data source to retrieve information about the documents that are to be indexed. The LibraryDataSource protocol specifies the methods the data source needs to implement. Generally, you will not need to implement your own data source, but instead use the LibraryFileSystemDataSource class provided to you. You use it as follows:

guard let library = PSPDFKit.SDK.shared.library else {
    // FTS feature isn't enabled in your license.
    return
}

// Assume that you have a directory of PDF documents you want to index.
let directoryURL = ...

let dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL) { document, stopPointer in
    // If you want to skip a specific document, return `false` here.
    // If you want to stop the directory enumeration, set `stopPointer.pointee` to `true`.
    return true
}

library.dataSource = dataSource // Note that `PDFLibrary` holds the data source with a strong reference.

// Begins the indexing operation. This method performs some initial work synchronously and then starts the indexing, which is asynchronous.
// For large amounts of documents, even the initial work could be slow, which is why this should always be called on a background queue.
DispatchQueue.global(qos: .background).async {
    library.updateIndex()
}

PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;
if (!library) {
    // FTS feature isn't enabled in your license.
    return;
}

// Assume that you have a directory of PDF documents you want to index.
NSURL *directoryURL = ...;
PSPDFLibraryFileSystemDataSource *fileDataSource = [[PSPDFLibraryFileSystemDataSource alloc] initWithLibrary:library documentsDirectoryURL:directoryURL documentHandler:^(PSPDFDocument *document, BOOL *stop) {
    // If you want to skip a specific document, return `NO` here.
    // If you want to stop the directory enumeration, set `*stop` to `YES`.
    return YES;
}];

library.dataSource = fileDataSource; // Note that `PSPDFLibrary` holds the data source with a strong reference.

// Begins the indexing operation. This method performs some initial work synchronously and then starts the indexing, which is asynchronous.
// For large amounts of documents, even the initial work could be slow, which is why this should always be called on a background queue.
dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
    [library updateIndexWithCompletionHandler:nil];
});

Note that you should always set the library’s data source, and not just when you want to update the index. A good place to do this is your app delegate’s application(_:willFinishLaunchingWithOptions:).
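
As a rough sketch of that advice, the data source can be attached during launch, before any indexing is requested. This reuses only the calls shown earlier in this guide; the choice of the app's Documents directory is an assumption made for illustration, and the class is an ordinary app delegate rather than anything PSPDFKit provides.

```swift
import UIKit
import PSPDFKit

class AppDelegate: UIResponder, UIApplicationDelegate {
    func application(_ application: UIApplication, willFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]? = nil) -> Bool {
        // The library is nil if FTS isn't enabled in your license.
        if let library = PSPDFKit.SDK.shared.library {
            // Assumption: the PDFs to index live in the app's Documents directory.
            let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
            // Passing `nil` as the document handler indexes every document found.
            library.dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: documentsURL, documentHandler: nil)
        }
        return true
    }
}
```

Setting the data source here does not start indexing by itself; that still happens when updateIndex() is called, ideally on a background queue as shown above.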

PDFLibrary posts notifications as the index status changes. The following notifications are available:

You’ll usually observe PSPDFLibraryDidFinishIndexingDocument to perform a search as more and more documents become available:

// Assume that `libraryDidFinishIndexing(_:)` has been registered with `NotificationCenter.default`.
@objc func libraryDidFinishIndexing(_ notification: Notification) {
    guard let library = PSPDFKit.SDK.shared.library else {
        // FTS feature isn't enabled in your license.
        return
    }
    if !library.isIndexing {
        // All documents have been indexed.
    }
    library.documentUIDs(matching: "PSPDFKit", options: nil) { searchString, resultSet in
        for (UID, indexSet) in resultSet {
            print("Found the following matches in document \(UID): \(indexSet)")
        }
    }
}

// Assume that `libraryDidFinishIndexing:` has been registered with `NSNotificationCenter.defaultNotificationCenter`.
- (void)libraryDidFinishIndexing:(NSNotification *)notification {
    PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;

    if (!library.isIndexing) {
        // All documents have been indexed.
    }

    [library documentUIDsMatchingString:@"PSPDFKit"
                                options:nil
                      completionHandler:^(NSString *searchString, NSDictionary *resultSet) {
        for (NSString *UID in resultSet) {
            NSIndexSet *indexSet = resultSet[UID];
            NSLog(@"Found the following matches in document %@: %@", UID, indexSet);
        }
    }];
}
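
The handlers above assume they have already been registered with the notification center. A minimal, self-contained sketch of such a registration follows. It uses a locally defined stand-in notification name so that it runs without the framework; in an app, pass the notification name constant that PSPDFKit exports instead. IndexingObserver and finishedCount are hypothetical names used only for this illustration.

```swift
import Foundation

// Stand-in name so the sketch runs without PSPDFKit (an assumption, not the framework's constant).
let didFinishIndexingDocument = Notification.Name("PSPDFLibraryDidFinishIndexingDocument")

final class IndexingObserver {
    private(set) var finishedCount = 0
    private let center: NotificationCenter
    private var token: NSObjectProtocol?

    init(center: NotificationCenter = .default) {
        self.center = center
        // With `queue: nil`, the block runs synchronously on the posting thread.
        token = center.addObserver(forName: didFinishIndexingDocument, object: nil, queue: nil) { [weak self] _ in
            self?.finishedCount += 1
            // In an app, this is where the current search would be repeated.
        }
    }

    deinit {
        // Block-based observers must be removed explicitly.
        if let token = token { center.removeObserver(token) }
    }
}

let observer = IndexingObserver()
NotificationCenter.default.post(name: didFinishIndexingDocument, object: nil)
print(observer.finishedCount) // prints 1
```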

You can choose to query the library only once all documents have been indexed by checking isIndexing. You can also check the current status of individual documents by using indexStatus(forUID:withProgress:). The results are delivered in a Dictionary that maps each document’s UID (a String) to an IndexSet. An index is set in a document’s IndexSet if the search string occurs on that page.
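
To make the shape of the result concrete, here is a small self-contained sketch that turns such a dictionary into readable lines. The sample data and the summarize function are hypothetical, and the sketch assumes page indexes are zero-based, so it adds 1 to show page numbers.

```swift
import Foundation

/// Builds one line per document, listing the pages that contain a match.
func summarize(_ resultSet: [String: IndexSet]) -> [String] {
    // Sort the UIDs so the output order is stable.
    return resultSet.keys.sorted().map { uid in
        // `IndexSet` iterates in ascending order.
        let pages = resultSet[uid, default: IndexSet()]
            .map { String($0 + 1) }
            .joined(separator: ", ")
        return "\(uid): pages \(pages)"
    }
}

// Hypothetical result: matches on pages 0 and 4 of one document, page 2 of another.
let sample: [String: IndexSet] = [
    "doc-a": IndexSet([0, 4]),
    "doc-b": IndexSet(integer: 2)
]

for line in summarize(sample) {
    print(line)
}
// prints "doc-a: pages 1, 5" and "doc-b: pages 3"
```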

Preview Generation

PDFLibrary can also generate a small snippet of text around each match it finds. When calling documentUIDs(matching:options:completionHandler:previewTextHandler:), if the previewTextHandler parameter is not nil, PDFLibrary will also generate previews. Note that preview generation has a performance impact, as additional information needs to be extracted from the database. PDFDocumentPickerController uses preview generation when displaying results.

Advanced Matching Options

PDFLibrary offers advanced matching options. Pass these options in a Dictionary when calling documentUIDs(matching:options:completionHandler:previewTextHandler:).

Name | Type | Description
.maximumSearchResultsTotal | UInt | The maximum number of search results across all documents. Defaults to 500.
.maximumSearchResultsPerDocument | UInt | The maximum number of search results per document.
.maximumPreviewResultsTotal | UInt | The maximum number of preview search results across all documents. Defaults to 500.
.maximumPreviewResultsPerDocument | UInt | The maximum number of preview search results per document.
.matchExactWordsOnly | Bool | Only matches exact words. For example, searching for “some” would not match “something.”
.matchExactPhrasesOnly | Bool | Only matches exact phrases. For example, “this is a test” would not match “this is a quick test.”
.excludeAnnotations | Bool | Excludes annotations from the search. By default, indexed annotations are searched.
.excludeDocumentText | Bool | Excludes document text from the search. By default, indexed document text is searched.
.previewRange | NSRange | The range of the preview string. Defaults to 20/160.
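
As an illustration, a search that combines a few of these options might look like the sketch below. It uses only the option names and the method listed in this guide; the function name searchExactWords(for:) and the chosen values are hypothetical, and the exact dictionary types should be checked against the API reference for your PSPDFKit version.

```swift
import PSPDFKit

func searchExactWords(for term: String) {
    guard let library = PSPDFKit.SDK.shared.library else {
        // FTS feature isn't enabled in your license.
        return
    }
    library.documentUIDs(matching: term, options: [
        .maximumSearchResultsTotal: 100, // Cap the results across all documents.
        .matchExactWordsOnly: true,      // Match whole words only.
        .excludeAnnotations: true        // Search document text only.
    ], completionHandler: { _, resultSet in
        print("Documents with matches: \(resultSet.keys.sorted())")
    }, previewTextHandler: nil) // `nil` skips preview generation.
}
```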

Advanced Configuration

You can configure PDFLibrary to match your needs. The following properties on PDFLibrary are available.

Property | Type | Default | Description
tokenizer | String? | nil | The tokenizer used by the library. nil means PSPDFKit’s Porter tokenizer is used. You can learn more about this advanced topic by reading Using Custom Tokenizers.
saveReversedPageText | Bool | true | Indicates whether the reversed text of a PDF document should be saved. This increases the size of the cache by a factor of about two, but it allows for “ends with” searches.
shouldIndexAnnotations | Bool | true | Specifies whether the contents of annotations in documents should be indexed as well.
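
The reason stored reversed text helps with “ends with” searches can be shown with a small self-contained sketch: once the text is reversed, a suffix query becomes a prefix query, which an index can answer efficiently. This only illustrates the general idea and is not a description of how PSPDFKit implements it; the word list and the wordsEnding(with:) function are hypothetical.

```swift
import Foundation

let words = ["indexing", "search", "matching", "document"]

// Store each word reversed, the way a reversed-text index would.
let reversedWords = words.map { String($0.reversed()) }

/// Finds the words that end with `suffix` by running a prefix match on the reversed words.
func wordsEnding(with suffix: String) -> [String] {
    let reversedSuffix = String(suffix.reversed())
    return reversedWords
        .filter { $0.hasPrefix(reversedSuffix) }
        .map { String($0.reversed()) } // Reverse again to recover the original word.
}

print(wordsEnding(with: "ing")) // prints ["indexing", "matching"]
```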

You can also create your own instance of PDFLibrary by calling PDFLibrary(path:). path must be the path to an empty directory. If path does not yet exist, the library will create it for you. The SQLite database cache will be stored there. Subsequent calls to PDFLibrary(path:) with the same path return the same object.

Indexing Priority

You can also specify the priority of the background queue used for indexing. This can only be changed on creation of the library, and it defaults to PDFLibrary.IndexingPriority.low. If you require faster indexing, you can do one of two things:

SQLite FTS Version

The default PDFLibrary (available via PSPDFKit.SDK.shared.library) uses the highest version of SQLite’s full-text search available. The version of SQLite shipping with iOS 9 and 10 does not have FTS5 enabled, so the library only uses FTS4 there. FTS5 is enabled automatically if you use a custom version of SQLite with the correct compile flags. You can also specify which version of FTS to use by creating a new instance with the PDFLibrary(path:ftsVersion:tokenizer:) method.

File System Data Source with Encrypted or Locked Documents

If you need your locked documents to be indexed, you can set the file system data source’s documentProvider property to an object that implements the LibraryFileSystemDataSourceDocumentProvider protocol. You can then use it as follows:

class LibraryDocumentProvider: NSObject, LibraryFileSystemDataSourceDocumentProvider {
    public func dataSource(_ dataSource: LibraryFileSystemDataSource, documentWithUID UID: String?, at fileURL: URL) -> Document? {
        // Create the document as required, ensuring it is decrypted and unlocked and ready to index.
        let document = ...
        if document.isLocked {
            // Unlock document as required.
        }
        return document
    }
}

let library = PSPDFKit.SDK.shared.library! // Replace this with your custom library, if you use one.
let directoryURL = ... // The directory containing the documents you want to index.
let dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL, documentHandler: nil)
self.libraryDocumentProvider = LibraryDocumentProvider()
dataSource.documentProvider = libraryDocumentProvider
library.dataSource = dataSource
library.updateIndex()

@interface LibraryDocumentProvider : NSObject <PSPDFLibraryFileSystemDataSourceDocumentProvider>
@end

@implementation LibraryDocumentProvider

- (PSPDFDocument *)dataSource:(PSPDFLibraryFileSystemDataSource *)dataSource documentWithUID:(NSString *)UID atURL:(NSURL *)fileURL {
    // Create the document as required, ensuring it is decrypted and unlocked and ready to index.
    PSPDFDocument *document = ...;
    if (document.isLocked) {
        // Unlock document as required.
    }
    return document;
}

@end

self.documentProvider = [LibraryDocumentProvider new];
PSPDFLibraryFileSystemDataSource *dataSource = ...;
dataSource.documentProvider = self.documentProvider;

PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library; // Replace this with your custom library, if you use one.
library.dataSource = dataSource;
[library updateIndexWithCompletionHandler:nil];

File System Data Source Performance

In most cases, LibraryFileSystemDataSource is fast enough, and it automatically detects changes to the file system when requested by a PDFLibrary. However, each call to PDFLibrary.updateIndex(completionHandler:) makes the data source traverse its documents directory to detect changes. If this method is called in quick succession and the directory contains a large number of files, this can cause a slowdown. If your app is responsible for changes in the directory, you can manually specify these changes to the LibraryFileSystemDataSource object by enabling Explicit Mode (starting with PSPDFKit 6.2.2 for iOS). This can be done as follows:

let dataSource = ...
dataSource.isExplicitModeEnabled = true

// Consider a case where you know that a document has been added to or changed in the data source's documents directory and already have the location.
let addedDocumentURL = ...
dataSource.didAddOrModifyDocument(at: addedDocumentURL)

// Similarly, if a document has been removed:
let removedDocumentURL = ...
dataSource.didRemoveDocument(at: removedDocumentURL)

PSPDFLibraryFileSystemDataSource *dataSource = ...;
dataSource.explicitModeEnabled = YES;

// Consider a case where you know that a document has been added to or changed in the data source's documents directory and already have the location.
NSURL *addedDocumentURL = ...;
[dataSource didAddOrModifyDocumentAtURL:addedDocumentURL];

// Similarly, if a document has been removed:
NSURL *removedDocumentURL = ...;
[dataSource didRemoveDocumentAtURL:removedDocumentURL];

Note that using these methods on the data source does not automatically add or remove documents from the library. The data source records the changes and reports them to the PDFLibrary object during the next call to PDFLibrary.updateIndex(completionHandler:).

Explicit mode should only be enabled in cases where you need to call PDFLibrary.updateIndex(completionHandler:) multiple times in a short period of time, and where you also know the changes being made on the file system. In all other cases, let the data source handle the change detection, and keep isExplicitModeEnabled set to false.

Spotlight Indexing

PDFLibrary can also optionally index documents with Spotlight, so the user can search for documents (and their text) from the native device search. To enable this, set up PDFLibrary as described above, and set the spotlightIndexingType property before calling PDFLibrary.updateIndex(completionHandler:). This can be set to one of the following:

  • .disabled — documents are not indexed in Spotlight
  • .enabled — documents are indexed in Spotlight, but their text is not
  • .enabledWithFullText — documents are indexed in Spotlight with their full text
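
A minimal sketch of enabling Spotlight indexing, assuming the library’s data source has already been set as described in Getting Started. The function name enableSpotlightIndexing() is hypothetical; the property, the enum case, and updateIndex() are the ones named in this guide.

```swift
import PSPDFKit

func enableSpotlightIndexing() {
    guard let library = PSPDFKit.SDK.shared.library else {
        // FTS feature isn't enabled in your license.
        return
    }
    // Set the indexing type before the index is updated.
    library.spotlightIndexingType = .enabledWithFullText

    // As with regular indexing, start the update on a background queue.
    DispatchQueue.global(qos: .background).async {
        library.updateIndex()
    }
}
```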

Retrieving Documents from Spotlight

When the user taps on a searchable item from your app in Spotlight search results, your app delegate’s application(_:continue:restorationHandler:) method is called. In your implementation of this method, call PDFLibrary.fetchSpotlightIndexedDocument(for:completionHandler:) to retrieve the document, if any:

func application(_ application: UIApplication, continue userActivity: NSUserActivity, restorationHandler: @escaping ([Any]?) -> Void) -> Bool {
    guard let library = PSPDFKit.SDK.shared.library else {
        print("Unable to get the shared PDFLibrary instance to continue the user activity.")
        return false
    }
    library.fetchSpotlightIndexedDocument(for: userActivity) {
        guard let document = $0 else { return }
        // Open the document in a `PDFViewController`.
    }

    return true
}

- (BOOL)application:(UIApplication *)application continueUserActivity:(NSUserActivity *)userActivity restorationHandler:(void (^)(NSArray *restorableObjects))restorationHandler {
    PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;
    if (!library) {
        // FTS feature isn't enabled in your license.
        return NO;
    }
    [library fetchSpotlightIndexedDocumentForUserActivity:userActivity completionHandler:^(PSPDFDocument *document) {
        if (!document) {
            return;
        }
        // Open the document in a `PSPDFViewController`.
    }];
    return YES;
}