Create text highlight annotations from text extraction

Q: How do I create text highlight annotations from extracted text?

A: Extracting text from a PDF file is a common task, but, as you might have noticed, it isn’t always as straightforward as it should be. For that reason, PSPDFKit offers APIs to retrieve text from a document. On PSPDFKit for Web, you can extract the text of a page using textLinesForPageIndex.

So the first step is to extract the text from a page of the PDF document:

// Getting all text lines from page 0
const textLines = await instance.textLinesForPageIndex(0);
textLines.forEach(textLine => console.log(textLine.contents));
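
If only some of the lines should be highlighted, the extracted lines can be narrowed down by their contents before moving on. The following is a minimal sketch: linesContaining is a hypothetical helper, not part of PSPDFKit, and it assumes each text line exposes its text as a string in contents, as logged above. The sample lines are made up for illustration.

```javascript
// Hypothetical helper (not a PSPDFKit API): keeps only the lines whose
// text contains the given term, ignoring case. It relies only on
// `filter`, so it accepts a plain array or any list type that has `filter`.
function linesContaining(textLines, term) {
  const needle = term.toLowerCase();
  return textLines.filter(line =>
    line.contents.toLowerCase().includes(needle)
  );
}

// Made-up sample data standing in for extracted text lines.
const sampleLines = [
  { contents: "Invoice number 1234" },
  { contents: "Thank you for your purchase" },
  { contents: "INVOICE total: 99.00" },
];

const matches = linesContaining(sampleLines, "invoice");
matches.forEach(line => console.log(line.contents));
```

The filtered result can then be used in place of textLines in the steps that follow.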

Then we can retrieve each text line's bounding box via PSPDFKit.TextLine#boundingBox:

const boundingBoxes = textLines.map(textLine => textLine.boundingBox);

This returns a PSPDFKit.Geometry.Rect record for each text line on that page. In our case, the result is a PSPDFKit.Immutable.List of two records, because there are two lines of text on page 0 of the document. The final step is to create a highlight annotation from those bounding boxes:

// Create one highlight annotation covering every text line on page 0
instance.create(
  new PSPDFKit.Annotations.HighlightAnnotation({
    pageIndex: 0,
    rects: boundingBoxes,
    boundingBox: PSPDFKit.Geometry.Rect.union(boundingBoxes)
  })
);
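
The annotation's boundingBox is the union of the individual line rects, that is, the smallest rectangle that encloses all of them. To make that concrete, here is a plain JavaScript sketch of what a union of rects computes. unionRects is a hypothetical stand-in written for illustration, not the PSPDFKit implementation, and it assumes rects are plain objects with left, top, width, and height fields. The coordinates are made up.

```javascript
// Hypothetical stand-in for illustration; not the PSPDFKit implementation.
// Each rect is a plain object: { left, top, width, height }.
function unionRects(rects) {
  if (rects.length === 0) {
    throw new Error("Cannot compute the union of zero rects");
  }
  // The union spans from the smallest left/top edge
  // to the largest right/bottom edge.
  const left = Math.min(...rects.map(r => r.left));
  const top = Math.min(...rects.map(r => r.top));
  const right = Math.max(...rects.map(r => r.left + r.width));
  const bottom = Math.max(...rects.map(r => r.top + r.height));
  return { left, top, width: right - left, height: bottom - top };
}

// Two made-up line rects, one below the other.
const lineRects = [
  { left: 50, top: 100, width: 300, height: 12 },
  { left: 50, top: 116, width: 180, height: 12 },
];

const bounds = unionRects(lineRects);
console.log(bounds); // { left: 50, top: 100, width: 300, height: 28 }
```

Because the union of an empty set of rects is undefined in this sketch, a page without any text lines would need to be handled before creating the annotation.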