Indexed Full-Text PDF Search in UWP
PSPDFKit supports fast and efficient full-text search in PDF documents through PSPDFKit.Search.Library. This document describes how to get started.
Getting Started
To start indexing, create a Library and give it a name. You can then add folders that contain PDF files to this named library. The Library will index all the PDFs in those folders.
Here’s a simple example of how to create or open a library and start indexing PDFs in a directory:
// Opening a library creates one if it doesn't already exist.
var library = await Library.OpenLibraryAsync("MyLibrary");

// Find a folder containing PDFs.
var folderPicker = new Windows.Storage.Pickers.FolderPicker();
folderPicker.SuggestedStartLocation = Windows.Storage.Pickers.PickerLocationId.Desktop;
folderPicker.FileTypeFilter.Add("*");

Windows.Storage.StorageFolder folder = await folderPicker.PickSingleFolderAsync();
if (folder != null)
{
    // Queue up the PDFs in the folder for indexing.
    await library.EnqueueDocumentsInFolderAsync(folder);
}
The documents will now be indexed in the background.
Alternatively, you can enqueue a List of IDataProvider objects with the EnqueueDocumentsFromProviderAsync method.
Then, you can choose to start querying documents right away or wait until every document in the indexing queue has been processed.
Here’s an example of how to wait and then get the list of indexed documents:
// Wait for indexing to finish.
await library.WaitForAllIndexingTasksToFinishAsync();

// Get the list of indexed documents.
var documentUIDs = await library.GetIndexedUidsAsync();
Identifying Documents
Each document in the list returned by GetIndexedUidsAsync is represented by a unique ID (UID). When using StorageFile objects, this UID is a string composed of a future access token identifying the folder containing the PDF and the file name of the PDF within that folder. For IDataProvider objects, the indexed UID is a string simply containing the IDataProvider UID.
Due to the unique restrictions of UWP, when using StorageFile objects, it’s essential that you don’t clear the application’s FutureAccessList if you wish to retain your libraries, as this is the only place where the future access token is recorded.
Moreover, when using DataProvider objects, neither the streams nor the providers themselves are tracked internally, so your own application needs to manage them.
You can create a PSPDFKit.Document.DocumentSource object for a given document UID using either DocumentSource.CreateFromStorageFileUidAsync or DocumentSource.CreateFromDataProvider, both of which are static methods.
A StorageFile object for the file can be accessed by calling GetFile on the created DocumentSource object. Note that the method will throw an exception if the document referred to can no longer be located.
Here’s an example:
// Get the list of indexed documents.
var documentUIDs = await library.GetIndexedUidsAsync();

foreach (var uid in documentUIDs)
{
    try
    {
        // Resolve the UID back to a document source, and then to its file.
        var documentSource = await DocumentSource.CreateFromStorageFileUidAsync(uid);
        StorageFile file = documentSource.GetFile();
    }
    catch (Exception e)
    {
        // Examine the exception.
    }
}
Both the StorageProvider and DataProvider implementations can be used side by side. Because StorageProvider UIDs contain the file name while DataProvider UIDs are merely numeric, you can easily tell the two apart when needed. Note that you need to maintain a list of all relevant providers yourself:
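if (uid.EndsWith(".pdf"))
{
    // The UID contains a file name, so it refers to a StorageFile.
    document = await DocumentSource.CreateFromStorageFileUidAsync(uid);
}
else
{
    // Otherwise, look up the matching provider in the list kept by the app.
    document = DocumentSource.CreateFromDataProvider(_providers.Find(provider => provider.Uid == uid));
}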
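The check above is case sensitive, so a UID ending in ".PDF" would be treated as a data provider UID. The following is a minimal sketch of a case-insensitive variant. UidKind and UidClassifier are hypothetical names that aren't part of PSPDFKit, and the sketch assumes that file-based UIDs always end with a .pdf extension and that data provider UIDs never do.

```csharp
using System;

// Hypothetical helper types; they aren't part of the PSPDFKit API.
public enum UidKind
{
    StorageFile,
    DataProvider
}

public static class UidClassifier
{
    // Classifies a library UID by its suffix.
    // Assumption: file-based UIDs end with ".pdf" in any letter case,
    // and data provider UIDs never do.
    public static UidKind Classify(string uid)
    {
        if (uid == null)
        {
            throw new ArgumentNullException(nameof(uid));
        }

        return uid.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)
            ? UidKind.StorageFile
            : UidKind.DataProvider;
    }
}
```

With this helper, the branch condition becomes UidClassifier.Classify(uid) == UidKind.StorageFile. The exact UID layout is not specified here, so verify the suffix assumption against the UIDs your own library returns.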
Index and Document Status
Library allows you to query for the current indexing state. To query the library only after all queued documents have been indexed, check IsIndexingAsync() first. You may also check the current status of individual documents by using GetIndexDocumentStatusAsync().
Querying the Library
To query the library, use the SearchAsync method, supplying it with a LibraryQuery object.
Here’s an example:
// Search all documents in the library for the text "Acme."
var succeeded = await library.SearchAsync(new LibraryQuery("Acme"));
The results of the query are sent to a query result handler, which you must provide to the library.
Here’s an example:
library.OnSearchComplete += MyOnSearchCompleteMethod;
The OnSearchComplete event handler receives a reference to the originating library, along with a dictionary mapping a document UID to a LibraryQueryResult object. Each result object also contains the UID as a property and a list of the page indexes where matching results were found.
If you wish to show preview snippets, you should set the GenerateTextPreviews property in the query object to true. Then, preview text snippets will be delivered to you via the OnSearchPreviewComplete event handler.
Here’s an example:
// Register the preview handler before starting the search.
library.OnSearchPreviewComplete += MyOnSearchPreviewCompleteMethod;

var query = new LibraryQuery("Acme")
{
    GenerateTextPreviews = true
};
var succeeded = await library.SearchAsync(query);
The OnSearchPreviewComplete event handler receives a reference to the originating library, along with a list of LibraryPreviewResult objects, one for each match. Each of these objects contains a UID identifying the document, a page index where the matching text is located, a snippet of text surrounding the match, the range of the matched text within the preview snippet, and the page text. Each object also has an annotation ID indicating whether or not the match was found in an annotation.
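To illustrate how the range of the matched text relates to the preview snippet, here is a minimal sketch that wraps the matched portion of a snippet in brackets. SnippetFormatter and MarkMatch are hypothetical names, not PSPDFKit API. The sketch assumes the range is available as a zero-based start offset and a length in characters; how LibraryPreviewResult actually exposes the range isn't described in this guide, so check the API reference before wiring it up.

```csharp
using System;

// Hypothetical helper; it isn't part of the PSPDFKit API.
public static class SnippetFormatter
{
    // Wraps the matched part of a preview snippet in square brackets.
    // Assumption: the range is a zero-based start offset plus a length,
    // both counted in UTF-16 characters.
    public static string MarkMatch(string snippet, int start, int length)
    {
        if (snippet == null)
        {
            throw new ArgumentNullException(nameof(snippet));
        }
        if (start < 0 || length < 0 || start > snippet.Length - length)
        {
            throw new ArgumentOutOfRangeException(nameof(start));
        }

        return snippet.Substring(0, start)
            + "[" + snippet.Substring(start, length) + "]"
            + snippet.Substring(start + length);
    }
}
```

For example, SnippetFormatter.MarkMatch("visit Acme today", 6, 4) returns "visit [Acme] today". In a real app, you'd more likely apply a highlight style to the same range than insert brackets.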
Advanced Matching Options
Library offers advanced matching options. You can set these options in a LibraryQuery object.
Password-Protected Documents
When indexing documents, it’s possible you might come across a password-protected document.
You can unlock a password-protected document with an event handler, which is fired every time a password is required. The following example shows how this is possible:
private Library _library;

internal async void Initialize(PdfView pdfView)
{
    _library = await Library.OpenLibraryAsync("catalog");
    _library.OnPasswordRequested += Library_OnPasswordRequested;
}

private void Library_OnPasswordRequested(Library sender, PasswordRequest passwordRequest)
{
    // Supply the password only for the document that needs it.
    if (passwordRequest.Uid.Contains("Password.pdf"))
    {
        passwordRequest.Password = "test123";
    }

    // Always complete the deferral so that indexing can continue.
    passwordRequest.Deferral.Complete();
}
PasswordRequest will always have the UID populated with the path being indexed (note that this consists of the future access token and the file name). Check against this string to determine which document requires a password and populate the Password member of PasswordRequest to unlock the document. Ensure the Deferral is completed, as per the last line of Library_OnPasswordRequested; otherwise, indexing will fail and throw an exception.
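When more than one protected document is indexed, the UID check can be moved into a small lookup. The sketch below is one possible approach under stated assumptions: PasswordLookup and TryGetPassword are hypothetical names, the table entry is a placeholder, and it assumes the UID ends with the file name. Because it matches on the suffix, a file named MyPassword.pdf would also match the Password.pdf entry.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; it isn't part of the PSPDFKit API.
public static class PasswordLookup
{
    // Placeholder table keyed by file name. A real app would load
    // passwords from secure storage rather than hard-coding them.
    private static readonly Dictionary<string, string> PasswordsByFileName =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Password.pdf", "test123" },
        };

    // Returns true and sets the password if the UID ends with a known file name.
    // Assumption: the UID ends with the file name of the document.
    public static bool TryGetPassword(string uid, out string password)
    {
        password = null;
        if (string.IsNullOrEmpty(uid))
        {
            return false;
        }

        foreach (var entry in PasswordsByFileName)
        {
            if (uid.EndsWith(entry.Key, StringComparison.OrdinalIgnoreCase))
            {
                password = entry.Value;
                return true;
            }
        }

        return false;
    }
}
```

Inside the password handler, you would call TryGetPassword with the request's UID, assign the result to the Password member when the call returns true, and complete the Deferral in either case.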
Example Code
You’ll find a complete working code example in the Catalog app provided with the SDK.