Indexed Full-Text Search (FTS)

PSPDFKit supports efficient and fast full-text search in PDF documents through PSPDFLibrary. This document describes how to get started with PSPDFLibrary.

Getting Started

PSPDFLibrary relies on a data source to retrieve information about the documents that are to be indexed. The PSPDFLibraryDataSource protocol specifies the methods the data source needs to implement. Generally, you will not need to implement your own data source, but instead use the PSPDFLibraryFileSystemDataSource class provided to you. You use it as follows:

Swift:
guard let library = PSPDFKit.sharedInstance.library else {
    // FTS feature isn't enabled in your license
    return
}

// Assume that you have a directory of PDF documents that you want to index:
let directoryURL = ...

let dataSource = PSPDFLibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL) { (document, stopPointer) in
    // If you want to skip a specific document, return false here.
    // If you want to stop the directory enumeration, set stopPointer.pointee to true
    return true
}

library.dataSource = dataSource

// Begins the indexing operation. This method performs some initial work synchronously, and then starts the indexing, which is asynchronous.
// For large amounts of documents, even the initial work could be slow, which is why this should always be called on a background queue.
DispatchQueue.global(qos: .background).async {
    library.updateIndex()
}

Objective-C:
PSPDFLibrary *library = PSPDFKit.sharedInstance.library;
if (!library) {
    // FTS feature isn't enabled in your license
    return;
}

// Assume that you have a directory of PDF documents that you want to index:
NSURL *directoryURL = ...;
PSPDFLibraryFileSystemDataSource *fileDataSource = [[PSPDFLibraryFileSystemDataSource alloc] initWithLibrary:library documentsDirectoryURL:directoryURL documentHandler:^(PSPDFDocument *document, BOOL *stop) {
    // If you want to skip a specific document, return NO here.
    // If you want to stop the directory enumeration, set *stop to YES.
    return YES;
}];

library.dataSource = fileDataSource; // Note that PSPDFLibrary holds the data source with a strong reference.

// Begins the indexing operation. This method performs some initial work synchronously, and then starts the indexing, which is asynchronous.
// For large amounts of documents, even the initial work could be slow, which is why this should always be called on a background queue.
dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
    [library updateIndexWithCompletionHandler:nil];
});

Note that you should always set the library's data source, not just when you want to update the index. A good place to do this is your app delegate's -application:willFinishLaunchingWithOptions:.

PSPDFLibrary posts notifications as the index status changes. The following notifications are available:

You'll usually observe the PSPDFLibraryDidFinishIndexingDocumentNotification to perform a search as more and more documents become available.
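
The handler in the snippets that follow has to be registered with the notification center first. A minimal sketch of one way to do that with the block-based Foundation API is shown here. The IndexingObserver class is hypothetical, and the notification name is built from a string literal on the assumption that it mirrors the Objective-C constant; in an app, use the constant that PSPDFKit exports instead.

```swift
import Foundation

// Assumption: this raw name mirrors the Objective-C constant. In an app,
// use the constant exported by PSPDFKit rather than a string literal.
let didFinishIndexingDocument = Notification.Name("PSPDFLibraryDidFinishIndexingDocumentNotification")

/// Hypothetical helper (not a PSPDFKit type) that counts finished documents
/// and runs a callback each time the notification arrives.
final class IndexingObserver {
    private let center: NotificationCenter
    private var token: NSObjectProtocol?
    private(set) var finishedDocumentCount = 0

    init(center: NotificationCenter = .default, onDocumentIndexed: @escaping () -> Void) {
        self.center = center
        // Passing `queue: nil` runs the block synchronously on the posting thread.
        token = center.addObserver(forName: didFinishIndexingDocument, object: nil, queue: nil) { [weak self] _ in
            self?.finishedDocumentCount += 1
            onDocumentIndexed()
        }
    }

    deinit {
        // Block-based observers must be removed explicitly.
        if let token = token {
            center.removeObserver(token)
        }
    }
}
```

Keep the observer alive for as long as you want to receive updates, for example in a property of the object that runs the search; the callback is where you would query the library again.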

Swift:
// Assume that `libraryDidFinishIndexing(_:)` has been registered with `NotificationCenter.default`
@objc func libraryDidFinishIndexing(_ notification: Notification) {
    guard let library = PSPDFKit.sharedInstance.library else {
        // FTS feature isn't enabled in your license
        return
    }
    if !library.isIndexing {
        // All documents have been indexed.
    }
    library.documentUIDs(matching: "PSPDFKit", options: nil) { searchString, resultSet in
        for (UID, indexSet) in resultSet {
            print("Found the following matches in document \(UID): \(indexSet)")
        }
    }
}

Objective-C:
// Assume that `libraryDidFinishIndexing:` has been registered with `NSNotificationCenter.defaultCenter`
- (void)libraryDidFinishIndexing:(NSNotification *)notification {
    PSPDFLibrary *library = PSPDFKit.sharedInstance.library;

    if (!library.isIndexing) {
        // All documents have been indexed.
    }

    [library documentUIDsMatchingString:@"PSPDFKit"
                                options:nil
                      completionHandler:^(NSString *searchString, NSDictionary *resultSet) {
        for (NSString *UID in resultSet) {
            NSIndexSet *indexSet = resultSet[UID];
            NSLog(@"Found the following matches in document %@: %@", UID, indexSet);
        }
    }];
}

You can decide to query the library only once all documents have been indexed by checking isIndexing. You can also check the current status of individual documents by using -indexStatusForUID:withProgress:. The results are delivered to you in an NSDictionary that maps each document's UID (an NSString) to an NSIndexSet. Each index in a document's NSIndexSet identifies a page on which the search string occurs.
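
As an illustration of that result shape, here is a small Foundation-only sketch that turns such a dictionary into page numbers for display. The function names, UIDs, and page indexes are made up for the example, and it assumes page indexes are zero-based.

```swift
import Foundation

/// Converts page indexes into one-based page numbers per document UID.
/// Assumption: the indexes in each IndexSet are zero-based page indexes.
func pageNumbers(from resultSet: [String: IndexSet]) -> [String: [Int]] {
    // IndexSet iterates in ascending order, so each list comes out sorted.
    return resultSet.mapValues { indexSet in indexSet.map { $0 + 1 } }
}

/// Counts the pages that contain a match across all documents.
func totalMatchingPages(in resultSet: [String: IndexSet]) -> Int {
    return resultSet.values.reduce(0) { $0 + $1.count }
}

// Made-up UIDs and page indexes, standing in for what the completion handler receives.
let sampleResults: [String: IndexSet] = [
    "doc-a": IndexSet([0, 4, 5]),
    "doc-b": IndexSet(integer: 2),
]
```

With the sample data, pageNumbers(from:) maps "doc-a" to pages 1, 5, and 6, and "doc-b" to page 3.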

Advanced Matching Options

PSPDFLibrary offers advanced matching options. You pass those options in an NSDictionary when calling -documentUIDsMatchingString:options:completionHandler:.

The following options are available:

  • PSPDFLibraryMaximumSearchResultsTotalKey (NSUInteger) - The maximum number of search results across all documents.
  • PSPDFLibraryMaximumSearchResultsPerDocumentKey (NSUInteger) - The maximum number of search results per document.
  • PSPDFLibraryMaximumPreviewResultsTotalKey (NSUInteger) - The maximum number of preview search results across all documents.
  • PSPDFLibraryMaximumPreviewResultsPerDocumentKey (NSUInteger) - The maximum number of preview search results per document.
  • PSPDFLibraryMatchExactWordsOnlyKey (BOOL) - Only matches exact words. For example, "something" would not match "some".
  • PSPDFLibraryMatchExactPhrasesOnlyKey (BOOL) - Only matches exact phrases. For example, "this is a test" would not match "this is a quick test".
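
To make the difference between the two matching flags concrete, here is a toy model in plain Swift. It is only an illustration under stated assumptions: the real matching is done by the library's index, not by code like this, and the assumption that the default mode behaves like a prefix match on each word is mine. All function names are hypothetical.

```swift
import Foundation

/// Splits text into lowercase words on anything that is not a letter or digit.
func words(in text: String) -> [String] {
    return text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
}

/// Assumed default behavior: every query word is a prefix of some word in the text.
func matchesLoosely(_ query: String, in text: String) -> Bool {
    let textWords = words(in: text)
    return words(in: query).allSatisfy { queryWord in
        textWords.contains { $0.hasPrefix(queryWord) }
    }
}

/// Exact words: every query word must equal a whole word in the text.
func matchesExactWords(_ query: String, in text: String) -> Bool {
    let textWords = Set(words(in: text))
    return words(in: query).allSatisfy { textWords.contains($0) }
}

/// Exact phrase: the query words must appear consecutively and in order.
func matchesExactPhrase(_ query: String, in text: String) -> Bool {
    let queryWords = words(in: query)
    let textWords = words(in: text)
    guard !queryWords.isEmpty, queryWords.count <= textWords.count else { return false }
    return (0...(textWords.count - queryWords.count)).contains { start in
        Array(textWords[start..<(start + queryWords.count)]) == queryWords
    }
}
```

In this model, "some" matches "something happened" loosely but not as an exact word, and "this is a test" passes the exact-word check against "this is a quick test" but fails the exact-phrase check.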

Advanced Configuration

You can configure PSPDFLibrary to match your needs. The following properties on PSPDFLibrary are available:

  • tokenizer (NSString, default nil) - The tokenizer used by the library. nil means the custom PSPDFKit one is used. You can learn more about this advanced topic in "Enabling the unicode61 tokenizer".
  • saveReversedPageText (BOOL, default YES) - Indicates if the reversed text of a PDF document should be saved. This increases the size of the cache by about 2x, but allows for ends-with searches.
  • shouldIndexAnnotations (BOOL, default YES) - Specifies whether contents of annotations in documents should be indexed as well.

You can also create your own instance of PSPDFLibrary. Simply use +libraryWithPath: to create a new instance. path must be the path to an empty directory. If path does not exist yet, the library will create it for you. The SQLite database cache will be stored there. Subsequent calls to +libraryWithPath: will always return the same object.

Indexing Priority

You can also specify the priority of the background queue used for indexing. This can only be changed on creation of the library, and defaults to PSPDFLibraryIndexingPriorityLow. If you require faster indexing, you can:

  • Create your own PSPDFLibrary instance as described above, or
  • Specify a PSPDFLibraryIndexingPriorityKey in the options passed into setLicenseKey:options: to change the priority used by the default library.

SQLite FTS Version

The default PSPDFLibrary (available via PSPDFKit.sharedInstance) uses the highest version of SQLite's Full Text Search available. The version of SQLite shipping with iOS 9 and 10 does not have FTS5 enabled, so the library will only use FTS4 there. FTS5 will be automatically enabled if you use a custom version of SQLite with the correct compile flags. You can also specify which version of FTS to use by using the +libraryWithPath:ftsVersion:tokenizer:error: method to create a new instance.

File System Data Source With Encrypted Or Locked Documents

If you need your locked documents to be indexed, you can set the file system data source's documentProvider property to an object that implements the PSPDFLibraryFileSystemDataSourceDocumentProvider protocol. You can then use it as follows:

Swift:
class LibraryDocumentProvider: NSObject, PSPDFLibraryFileSystemDataSourceDocumentProvider {
    public func dataSource(_ dataSource: PSPDFLibraryFileSystemDataSource, documentWithUID UID: String?, at fileURL: URL) -> PSPDFDocument? {
        // Create the document as required: ensuring it is decrypted and unlocked, ready to index.
        let document = ...
        if document.isLocked {
            // Unlock document as required
        }
        return document
    }
}

let library = PSPDFKit.sharedInstance.library! // replace this with your custom library, if you use one
let directoryURL = ... // the directory that contains the documents to index
let dataSource = PSPDFLibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL, documentHandler: nil)
self.libraryDocumentProvider = LibraryDocumentProvider()
dataSource.documentProvider = libraryDocumentProvider
library.dataSource = dataSource
library.updateIndex()

Objective-C:
@interface LibraryDocumentProvider : NSObject <PSPDFLibraryFileSystemDataSourceDocumentProvider>
@end

@implementation LibraryDocumentProvider

- (PSPDFDocument *)dataSource:(PSPDFLibraryFileSystemDataSource *)dataSource documentWithUID:(NSString *)UID atURL:(NSURL *)fileURL {
    // Create the document as required: ensuring it is decrypted and unlocked, ready to index.
    PSPDFDocument *document = ...;
    if (document.isLocked) {
        // Unlock document as required
    }
    return document;
}

@end

self.documentProvider = [LibraryDocumentProvider new];
PSPDFLibraryFileSystemDataSource *dataSource = ...;
dataSource.documentProvider = self.documentProvider;

PSPDFLibrary *library = PSPDFKit.sharedInstance.library; // replace this with your custom library, if you use one
library.dataSource = dataSource;
[library updateIndexWithCompletionHandler:nil];

File System Data Source Performance

In most cases PSPDFLibraryFileSystemDataSource is fast enough, and automatically detects changes to the file system when requested by a PSPDFLibrary. However, each call to -[PSPDFLibrary updateIndexWithCompletionHandler:] makes the data source traverse its documents directory to detect changes. If this is called rapidly, this could result in a slowdown if the number of files in the directory is large. If your app is responsible for changes in the directory, you can manually specify these changes to the PSPDFLibraryFileSystemDataSource object by enabling Explicit Mode (starting with PSPDFKit 6.2.2 for iOS). This can be done as follows:

Swift:
let dataSource = ...
dataSource.isExplicitModeEnabled = true

// Consider a case where you know that a document has been added to or changed in the data source's documents directory, and have the location.
let addedDocumentURL = ...
dataSource.didAddOrModifyDocument(at: addedDocumentURL)

// Similarly, if a document has been removed:
let removedDocumentURL = ...
dataSource.didRemoveDocument(at: removedDocumentURL)

Objective-C:
PSPDFLibraryFileSystemDataSource *dataSource = ...;
dataSource.explicitModeEnabled = YES;

// Consider a case where you know that a document has been added to or changed in the data source's documents directory, and have the location.
NSURL *addedDocumentURL = ...;
[dataSource didAddOrModifyDocumentAtURL:addedDocumentURL];

// Similarly, if a document has been removed:
NSURL *removedDocumentURL = ...;
[dataSource didRemoveDocumentAtURL:removedDocumentURL];

Note that using these methods on the data source does not automatically add or remove documents from the library. The data source notes the changes made, and then specifies them to the PSPDFLibrary object when requested during the next call to -[PSPDFLibrary updateIndexWithCompletionHandler:].
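
The record-now, apply-later behavior described above can be pictured with a small model. This is a hypothetical stand-in written in plain Swift, not the data source's actual implementation; the PendingChanges type and its drain() method are invented for illustration, and the rule that a later removal cancels an earlier addition of the same file is an assumption.

```swift
import Foundation

/// Hypothetical model of explicit-mode bookkeeping; not a PSPDFKit type.
struct PendingChanges {
    private(set) var addedOrModified = Set<URL>()
    private(set) var removed = Set<URL>()

    /// Records a change only. Nothing is indexed at this point.
    mutating func didAddOrModifyDocument(at url: URL) {
        removed.remove(url)
        addedOrModified.insert(url)
    }

    /// Records a removal and cancels any pending addition of the same file (assumed rule).
    mutating func didRemoveDocument(at url: URL) {
        addedOrModified.remove(url)
        removed.insert(url)
    }

    /// Hands the recorded changes over and resets the record,
    /// the way the next index update would consume them.
    mutating func drain() -> (addedOrModified: Set<URL>, removed: Set<URL>) {
        defer {
            addedOrModified.removeAll()
            removed.removeAll()
        }
        return (addedOrModified, removed)
    }
}
```

The point of the model is that calling the two recording methods has no visible effect until something drains the record, which is why an index update is still needed after reporting changes.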

Explicit mode should only be enabled in cases where you need to call -[PSPDFLibrary updateIndexWithCompletionHandler:] multiple times in a short period of time, and you also know the changes being made on the file system. In all other cases, let the data source handle the change detection, and keep explicitModeEnabled set to NO.

Spotlight Indexing

PSPDFLibrary can also optionally index documents with Spotlight, so the user can search for documents (and their text) right from the native device search. To enable this, set up PSPDFLibrary as described above, and set the spotlightIndexingType property before calling updateIndexWithCompletionHandler:. This can be set to:

  • PSPDFLibrarySpotlightIndexingDisabled - Documents are not indexed in Spotlight.
  • PSPDFLibrarySpotlightIndexingEnabled - Documents are indexed in Spotlight, but their text is not.
  • PSPDFLibrarySpotlightIndexingEnabledWithFullText - Documents are indexed in Spotlight with their full text.

Retrieving Documents From Spotlight

When the user taps on a searchable item from your app in Spotlight search results, your app delegate’s application:continueUserActivity:restorationHandler: method is called. In your implementation of this method, call -[PSPDFLibrary fetchSpotlightIndexedDocumentForUserActivity:completionHandler:] to retrieve the document, if any.

Swift:
func application(_ application: UIApplication, continue userActivity: NSUserActivity, restorationHandler: @escaping ([Any]?) -> Void) -> Bool {
    guard let library = PSPDFKit.sharedInstance.library else {
        // FTS feature isn't enabled in your license
        print("Unable to get shared PSPDFLibrary instance to continue user activity.")
        return false
    }
    library.fetchSpotlightIndexedDocument(for: userActivity) { document in
        guard let document = document else { return }
        // Open the document in a PSPDFViewController
    }

    return true
}

Objective-C:
- (BOOL)application:(UIApplication *)application continueUserActivity:(NSUserActivity *)userActivity restorationHandler:(void (^)(NSArray *restorableObjects))restorationHandler {
    PSPDFLibrary *library = PSPDFKit.sharedInstance.library;
    if (!library) {
        // FTS feature isn't enabled in your license
        return NO;
    }
    [library fetchSpotlightIndexedDocumentForUserActivity:userActivity completionHandler:^(PSPDFDocument *document) {
        if (!document) {
            return;
        }
        // Open the document in a PSPDFViewController
    }];
    return YES;
}