Indexing PDF Documents on iOS

PSPDFKit supports fast full-text search across PDF documents through PDFLibrary. This guide describes how to get started with PDFLibrary.

Getting Started

PDFLibrary relies on a data source to retrieve information about the documents that are to be indexed. The LibraryDataSource protocol specifies the methods the data source needs to implement. Generally, you won’t need to implement your own data source; you can use the provided LibraryFileSystemDataSource class instead, as follows:

guard let library = PSPDFKit.SDK.shared.library else {
    // FTS feature isn't enabled in your license.
    return
}

// Assume that you have a directory of PDF documents you want to index.
let directoryURL = ...

let dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL) { document, stopPointer in
    // If you want to skip a specific document, return `false` here.
    // If you want to stop the directory enumeration, set `stopPointer.pointee` to `true`.
    return true
}

library.dataSource = dataSource // Note that `PDFLibrary` holds the data source with a strong reference.

// Begins the indexing operation. This method performs some initial work synchronously and then starts the indexing, which is asynchronous.
// For a large number of documents, even the initial work could be slow, which is why this should always be called on a background queue.
DispatchQueue.global(qos: .background).async {
    library.updateIndex()
}

PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;
if (!library) {
    // FTS feature isn't enabled in your license.
    return;
}

// Assume that you have a directory of PDF documents you want to index.
NSURL *directoryURL = ...;
PSPDFLibraryFileSystemDataSource *fileDataSource = [[PSPDFLibraryFileSystemDataSource alloc] initWithLibrary:library documentsDirectoryURL:directoryURL documentHandler:^(PSPDFDocument *document, BOOL *stop) {
    // If you want to skip a specific document, return `NO` here.
    // If you want to stop the directory enumeration, set `*stop` to `YES`.
    return YES;
}];

library.dataSource = fileDataSource; // Note that `PSPDFLibrary` holds the data source with a strong reference.

// Begins the indexing operation. This method performs some initial work synchronously and then starts the indexing, which is asynchronous.
// For a large number of documents, even the initial work could be slow, which is why this should always be called on a background queue.
dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
    [library updateIndexWithCompletionHandler:nil];
});

Note that you should always set the library’s data source, and not just when you want to update the index. A good place to do this is your app delegate’s application(_:willFinishLaunchingWithOptions:).

PDFLibrary posts notifications as the index status changes.

You’ll usually observe PSPDFLibraryDidFinishIndexingDocument to perform a search as more and more documents become available:

// Assume that `libraryDidFinishIndexing(_:)` has been registered with `NotificationCenter.default`.
@objc func libraryDidFinishIndexing(_ notification: Notification) {
    guard let library = PSPDFKit.SDK.shared.library else {
        // FTS feature isn't enabled in your license.
        return
    }
    if !library.isIndexing {
        // All documents have been indexed.
    }
    library.documentUIDs(matching: "PSPDFKit", options: nil) { searchString, resultSet in
        for (UID, indexSet) in resultSet {
            print("Found the following matches in document \(UID): \(indexSet)")
        }
    }
}

// Assume that `libraryDidFinishIndexing:` has been registered with `NSNotificationCenter.defaultCenter`.
- (void)libraryDidFinishIndexing:(NSNotification *)notification {
    PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;

    if (!library.isIndexing) {
        // All documents have been indexed.
    }

    [library documentUIDsMatchingString:@"PSPDFKit"
                                options:nil
                      completionHandler:^(NSString *searchString, NSDictionary *resultSet) {
        for (NSString *UID in resultSet) {
            NSIndexSet *indexSet = resultSet[UID];
            NSLog(@"Found the following matches in document %@: %@", UID, indexSet);
        }
    }];
}
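
The handlers above assume that registration has already happened. As a minimal sketch of that step, the following uses only Foundation and a block-based observer. The notification name `exampleDidFinishIndexingDocument` and the `SearchCoordinator` class are hypothetical stand-ins for illustration; in an app, you would observe the notification constant that PSPDFKit exports.

```swift
import Foundation

// Hypothetical stand-in name, used only so this sketch runs without the SDK.
let exampleDidFinishIndexingDocument = Notification.Name("ExampleLibraryDidFinishIndexingDocument")

final class SearchCoordinator {
    private(set) var indexedDocumentCount = 0
    private var token: NSObjectProtocol?
    private let center: NotificationCenter

    init(center: NotificationCenter = .default) {
        self.center = center
    }

    /// Registers the observer. Passing `queue: nil` runs the block on the posting thread.
    func startObserving() {
        token = center.addObserver(forName: exampleDidFinishIndexingDocument, object: nil, queue: nil) { [weak self] notification in
            self?.libraryDidFinishIndexing(notification)
        }
    }

    /// Called once per indexed document. A real handler would query the library here.
    func libraryDidFinishIndexing(_ notification: Notification) {
        indexedDocumentCount += 1
    }

    deinit {
        // Block-based observers must be removed using the returned token.
        if let token = token {
            center.removeObserver(token)
        }
    }
}
```

Keeping the returned token and removing it in `deinit` avoids leaving a stale observer behind when the coordinator goes away.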

You can choose to query the library only once all documents have been indexed by checking isIndexing. You can also check the current status of individual documents by using indexStatus(forUID:withProgress:). Search results are delivered to you in a Dictionary that maps the UID of each document (a String) to an IndexSet. The IndexSet of a given document contains the index of every page on which the search string occurs.
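
To illustrate the shape of the result dictionary, here is a small Foundation-only sketch that turns a `[String: IndexSet]` into readable lines. The helper name `describeMatches(in:)` is hypothetical, and the sketch assumes page indexes are zero-based, so it adds 1 to produce page numbers for display.

```swift
import Foundation

/// Formats a search result dictionary as one line per document, sorted by UID.
/// Assumption: page indexes are zero-based, so 1 is added for display.
func describeMatches(in resultSet: [String: IndexSet]) -> [String] {
    return resultSet
        .sorted { $0.key < $1.key }
        .map { entry -> String in
            // `IndexSet` iterates in ascending order.
            let pages = entry.value.map { String($0 + 1) }.joined(separator: ", ")
            return "\(entry.key): pages \(pages)"
        }
}
```

Sorting by UID is only there to make the output deterministic, since dictionaries have no defined order.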

Indexing Priority

You can also specify the priority of the background queue used for indexing. This can only be changed on creation of the library, and it defaults to PDFLibrary.IndexingPriority.low. If you require faster indexing, you can do one of two things:

SQLite FTS Version

The default PDFLibrary (available via PSPDFKit.SDK.shared.library) uses the highest version of SQLite’s full-text search (FTS) extension available. The version of SQLite shipping with iOS 9 and 10 doesn’t have FTS5 enabled, so the library will only use FTS4 there. FTS5 will be automatically enabled if you use a custom version of SQLite with the correct compile flags. You can also specify which version of FTS to use by creating a new instance with the PDFLibrary(path:ftsVersion:tokenizer:) initializer.
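
If you want to see whether the SQLite build linked into your app was compiled with FTS5, one option is to ask SQLite directly through its C API. This is a sketch, assuming the `SQLite3` module is importable (as it is on Apple platforms); the helper name `sqliteWasCompiled(with:)` is hypothetical. It reports on the SQLite your app links against, which is an assumption about what the library ends up using.

```swift
import SQLite3

/// Returns `true` if the linked SQLite library was compiled with the given option.
/// The `SQLITE_` prefix of the option name may be omitted.
func sqliteWasCompiled(with option: String) -> Bool {
    return sqlite3_compileoption_used(option) != 0
}

let hasFTS5 = sqliteWasCompiled(with: "ENABLE_FTS5")
print("FTS5 compiled into linked SQLite: \(hasFTS5)")
```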

File System Data Source with Encrypted or Locked Documents

If you need encrypted or locked documents to be indexed, you can set the file system data source’s documentProvider property to an object that implements the LibraryFileSystemDataSourceDocumentProvider protocol. You can then use it as follows:

class LibraryDocumentProvider: NSObject, LibraryFileSystemDataSourceDocumentProvider {
    public func dataSource(_ dataSource: LibraryFileSystemDataSource, documentWithUID UID: String?, at fileURL: URL) -> Document? {
        // Create the document as required, ensuring it is decrypted and unlocked and ready to index.
        let document = ...
        if document.isLocked {
            // Unlock document as required.
        }
        return document
    }
}

let library = PSPDFKit.SDK.shared.library! // Replace this with your custom library, if you use one.
let directoryURL = ... // The directory of PDF documents you want to index.
let dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL, documentHandler: nil)
self.libraryDocumentProvider = LibraryDocumentProvider()
dataSource.documentProvider = libraryDocumentProvider
library.dataSource = dataSource
library.updateIndex()

@interface LibraryDocumentProvider : NSObject <PSPDFLibraryFileSystemDataSourceDocumentProvider>
@end

@implementation LibraryDocumentProvider

- (PSPDFDocument *)dataSource:(PSPDFLibraryFileSystemDataSource *)dataSource documentWithUID:(NSString *)UID atURL:(NSURL *)fileURL {
    // Create the document as required, ensuring it is decrypted and unlocked and ready to index.
    PSPDFDocument *document = ...;
    if (document.isLocked) {
        // Unlock document as required.
    }
    return document;
}

@end

self.documentProvider = [LibraryDocumentProvider new];
PSPDFLibraryFileSystemDataSource *dataSource = ...;
dataSource.documentProvider = self.documentProvider;

PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library; // Replace this with your custom library, if you use one.
library.dataSource = dataSource;
[library updateIndexWithCompletionHandler:nil];

File System Data Source Performance

In most cases, LibraryFileSystemDataSource is fast enough, and it automatically detects changes to the file system when requested by a PDFLibrary. However, each call to PDFLibrary.updateIndex(completionHandler:) makes the data source traverse its documents directory to detect changes. If this method is called in rapid succession and the directory contains a large number of files, the repeated traversal can slow things down. If your app is responsible for changes in the directory, you can manually specify these changes to the LibraryFileSystemDataSource object by enabling Explicit Mode (starting with PSPDFKit 6.2.2 for iOS). This can be done as follows:

let dataSource = ...
dataSource.isExplicitModeEnabled = true

// Consider a case where you know that a document has been added to or changed in the data source's documents directory and already have the location.
let addedDocumentURL = ...
dataSource.didAddOrModifyDocument(at: addedDocumentURL)

// Similarly, if a document has been removed:
let removedDocumentURL = ...
dataSource.didRemoveDocument(at: removedDocumentURL)

PSPDFLibraryFileSystemDataSource *dataSource = ...;
dataSource.explicitModeEnabled = YES;

// Consider a case where you know that a document has been added to or changed in the data source's documents directory and already have the location.
NSURL *addedDocumentURL = ...;
[dataSource didAddOrModifyDocumentAtURL:addedDocumentURL];

// Similarly, if a document has been removed:
NSURL *removedDocumentURL = ...;
[dataSource didRemoveDocumentAtURL:removedDocumentURL];

Note that using these methods on the data source doesn’t automatically add or remove documents from the library. The data source notes the changes made, and it then specifies them to the PDFLibrary object when requested during the next call to PDFLibrary.updateIndex(completionHandler:).
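
The record-now, apply-later behavior described above can be pictured with a small Foundation-only model. This is a conceptual sketch, not PSPDFKit’s implementation: the `PendingChanges` type and its `drain()` method are hypothetical, and only the two `did…` method names mirror the data source API shown earlier.

```swift
import Foundation

/// Conceptual model: explicit changes are recorded and handed over on the next index update.
struct PendingChanges {
    private(set) var addedOrModified: Set<URL> = []
    private(set) var removed: Set<URL> = []

    /// Records an addition or modification; cancels an earlier removal of the same file.
    mutating func didAddOrModifyDocument(at url: URL) {
        removed.remove(url)
        addedOrModified.insert(url)
    }

    /// Records a removal; cancels an earlier addition of the same file.
    mutating func didRemoveDocument(at url: URL) {
        addedOrModified.remove(url)
        removed.insert(url)
    }

    /// Returns the recorded changes and clears them, as an index update would consume them.
    mutating func drain() -> (addedOrModified: Set<URL>, removed: Set<URL>) {
        defer {
            addedOrModified = []
            removed = []
        }
        return (addedOrModified, removed)
    }
}
```

In this model, nothing reaches the index until `drain()` is called, which corresponds to the next index update consuming the changes.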

Explicit mode should only be enabled in cases where you need to call PDFLibrary.updateIndex(completionHandler:) many times in quick succession, and where you also know the changes being made on the file system. In all other cases, let the data source handle the change detection, and keep isExplicitModeEnabled set to false.