Read Text from PDFs Using JavaScript

Text is usually one of the most important kinds of data a document contains. PSPDFKit for Web's API includes methods for reading page information, page text, form fields and their values, annotation text, bookmarks, and digital signature information, so each kind of content can be extracted and processed separately.

Page Information

It’s possible to retrieve basic information from a specific page — like page dimensions, rotation, and labels. A call to Instance#pageInfoForIndex can return that information for you in a PSPDFKit.PageInfo object:

const {
	width,
	height,
	index,
	label,
	rotation
} = instance.pageInfoForIndex(0);
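
As a minimal sketch of how these properties might be used, the helper below builds a one-line summary from a PageInfo-like object. The name describePage is hypothetical and not part of PSPDFKit. It assumes that the call can return nothing for a page index that doesn't exist, and it derives the orientation from the reported width and height only:

```javascript
// Hypothetical helper, not part of PSPDFKit: summarize a PageInfo-like object.
// Assumption: a missing page is reported as null or undefined.
function describePage(pageInfo) {
	if (!pageInfo) {
		return null;
	}
	const { width, height, index, label, rotation } = pageInfo;
	// Orientation is derived from the reported dimensions only.
	let orientation = "square";
	if (width > height) {
		orientation = "landscape";
	} else if (width < height) {
		orientation = "portrait";
	}
	return `Page ${index} ("${label}"): ${width}x${height}, ${orientation}, rotation ${rotation}`;
}
```

With a loaded instance, this could be called as describePage(instance.pageInfoForIndex(0)).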

Page Text

Retrieving the text of a page can be done using Instance#textLinesForPageIndex, which returns a Promise resolving to a PSPDFKit.Immutable.List of PSPDFKit.TextLine. In turn, this can be traversed to parse the content of each line:

// Retrieve and log text lines for page 0.
const textLines = await instance.textLinesForPageIndex(0);
textLines.forEach((textLine, textLineIndex) => {
	console.log(`Content for text line ${textLineIndex}`);
	console.log(`Text: ${textLine.contents}`);
	console.log(`Id: ${textLine.id}`);
	console.log(`Page index: ${textLine.pageIndex}`);
	console.log(`Bounding box: ${JSON.stringify(textLine.boundingBox.toJS())}`);
});
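
The same call can be repeated for every page to collect the text of a whole document. The following is a sketch rather than a definitive implementation: joinTextLines and extractDocumentText are hypothetical helper names, and the loop assumes the instance exposes the number of pages as instance.totalPageCount, which is worth verifying in the API reference for your version:

```javascript
// Hypothetical helper: join the contents of a list of text lines into one string.
// Works with anything that offers forEach, such as an array or an immutable list.
function joinTextLines(textLines) {
	const parts = [];
	textLines.forEach(textLine => {
		parts.push(textLine.contents);
	});
	return parts.join("\n");
}

// Hypothetical helper: collect the text of every page, in page order.
// Assumption: instance.totalPageCount holds the number of pages.
async function extractDocumentText(instance) {
	const pages = [];
	for (let pageIndex = 0; pageIndex < instance.totalPageCount; pageIndex++) {
		const textLines = await instance.textLinesForPageIndex(pageIndex);
		pages.push(joinTextLines(textLines));
	}
	return pages;
}
```

Pages are requested one at a time here to keep the sketch simple. For large documents, it may be worth extracting only the pages you need.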

Form Fields

It’s possible to retrieve detailed information about each form field in a document with Instance#getFormFields:

const formFields = await instance.getFormFields();

You can check each form field type’s properties in the corresponding API reference section.

Form Field Values

Similarly to form fields, form field values can be retrieved with Instance#getFormFieldValues:

const values = instance.getFormFieldValues();

The returned object includes each form field value indexed by the form field name.
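
Because the result is a plain object keyed by form field name, it can be processed with standard JavaScript. The sketch below keeps only the fields that have been filled in. The helper name filledFormFieldValues is hypothetical, and it assumes each value is null, a string, or an array of strings:

```javascript
// Hypothetical helper: keep only the form fields that have a non-empty value.
// Assumption: each value is null, a string, or an array of strings.
function filledFormFieldValues(values) {
	const filled = {};
	for (const [name, value] of Object.entries(values)) {
		const isEmptyList = Array.isArray(value) && value.length === 0;
		if (value !== null && value !== undefined && value !== "" && !isEmptyList) {
			filled[name] = value;
		}
	}
	return filled;
}
```

With a loaded instance, this could be called as filledFormFieldValues(instance.getFormFieldValues()).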

Annotation Text

Some annotation types can include text as one of their properties:

// Retrieve annotations from page 0.
const annotations = await instance.getAnnotations(0);
// Find the first text annotation, if there is one.
const textAnnotation = annotations.find(annotation => annotation instanceof PSPDFKit.Annotations.TextAnnotation);
// Log its text. find() returns undefined when the page has no text annotation.
if (textAnnotation) {
	console.log(textAnnotation.text);
}

Note annotations can also include text as one of their properties.
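
To gather the text of several annotation types at once, the annotations of a page can be filtered by class. The sketch below uses hypothetical helper names (readAnnotationText, collectAnnotationText) and takes the annotation classes as a parameter. It also assumes the text property is either a string or an object with a string value field, since the shape may differ between versions:

```javascript
// Hypothetical helper: read the text of one annotation.
// Assumption: text is either a string or an object with a string `value` field.
function readAnnotationText(annotation) {
	const text = annotation.text;
	if (typeof text === "string") {
		return text;
	}
	if (text && typeof text.value === "string") {
		return text.value;
	}
	return "";
}

// Hypothetical helper: collect the non-empty text of all annotations that are
// instances of one of the given classes.
function collectAnnotationText(annotations, annotationClasses) {
	const texts = [];
	annotations.forEach(annotation => {
		const matches = annotationClasses.some(cls => annotation instanceof cls);
		if (!matches) {
			return;
		}
		const text = readAnnotationText(annotation);
		if (text.length > 0) {
			texts.push(text);
		}
	});
	return texts;
}
```

For the annotations of page 0, the second argument could be an array containing PSPDFKit.Annotations.TextAnnotation and PSPDFKit.Annotations.NoteAnnotation.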

Text under an Annotation

Markup annotations can be used to highlight or draw attention to some text in the document. That text isn’t part of the annotation’s properties, but it can be obtained by mapping the annotation’s bounding box to the bounding boxes of the text lines of the page.

PSPDFKit for Web makes that operation easy by providing Instance#getMarkupAnnotationText and Instance#getTextFromRects:

// Retrieve annotations from page 0.
const annotations = await instance.getAnnotations(0);
// Find the first highlight annotation, if there is one.
const highlightAnnotation = annotations.find(annotation => annotation instanceof PSPDFKit.Annotations.HighlightAnnotation);
// Log the text behind the highlight annotation.
// find() returns undefined when the page has no highlight annotation.
if (highlightAnnotation) {
	console.log(await instance.getMarkupAnnotationText(highlightAnnotation));
}
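
Instance#getTextFromRects can be used in a similar way when you want to work with rectangles directly. The following sketch collects the text under every matching annotation on a page. The helper name textUnderAnnotations and its isMatch parameter are hypothetical, and the code assumes that getTextFromRects takes a page index and a list of rectangles, and that markup annotations expose their rectangles as a rects property. Both are worth checking against the API reference:

```javascript
// Hypothetical helper: return the text under every matching annotation on a page.
// Assumptions: getTextFromRects(pageIndex, rects) resolves to a string, and
// matching annotations expose their rectangles as `rects`.
async function textUnderAnnotations(instance, pageIndex, isMatch) {
	const annotations = await instance.getAnnotations(pageIndex);
	const results = [];
	for (const annotation of annotations) {
		if (!isMatch(annotation)) {
			continue;
		}
		const text = await instance.getTextFromRects(annotation.pageIndex, annotation.rects);
		results.push({ id: annotation.id, text });
	}
	return results;
}
```

For highlights, isMatch could be a function such as annotation => annotation instanceof PSPDFKit.Annotations.HighlightAnnotation.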

Bookmarks

Bookmark information can be extracted with PSPDFKit for Web’s Instance#getBookmarks method:

const bookmarks = await instance.getBookmarks();
bookmarks.forEach(bookmark => {
	console.log(bookmark.toJS());
});

Digital Signatures

When your license includes the Digital Signatures component, you can extract digital signature information from any digitally signed document. This is done through Instance#getSignaturesInfo, which resolves to a PSPDFKit.SignaturesInfo record:

const signaturesInfo = await instance.getSignaturesInfo();