How to Extract Text from PDF in Python

Oghenerukevwe Henrietta Kofi

In this post, you’ll learn how to extract text from PDF documents using Python. It’ll cover open source solutions for extracting text, as well as how to use PSPDFKit’s Python PDF API.

It’s important to add that there are two different types of text extraction:

Extracting text that’s already selectable in a PDF viewer. These PDFs usually contain text authored in a word processing program.
Extracting text from an image-based PDF document. These PDFs typically consist of images of scanned documents, so the text has to be recognized with optical character recognition (OCR) first.

This post will focus on extracting text that’s already selectable.
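
If you aren’t sure which kind of PDF you have, one rough check is whether a page yields any text at all. The sketch below only assumes page objects that expose an extract_text() method, as the pages of the pypdf library used later in this post do. The helper name pages_without_text is made up for this example, and an empty result is a hint that a page is image-based, not proof.

```python
def pages_without_text(pages):
    """Return the indexes of pages whose extracted text is empty or whitespace."""
    empty = []
    for index, page in enumerate(pages):
        # Treat a missing result as an empty string.
        text = page.extract_text() or ""
        if not text.strip():
            empty.append(index)
    return empty


# Usage with pypdf (not executed here):
#   from pypdf import PdfReader
#   reader = PdfReader("example.pdf")
#   print(pages_without_text(reader.pages))
```

Pages reported by this helper are candidates for OCR rather than for the extraction approach covered here.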

Requirements

This tutorial will make use of Python version 3.12.3, but it should work with most 3.x Python versions. Create a new folder and a Python file to store all the code from this tutorial:

$ mkdir text_extract_pdf
$ cd text_extract_pdf
$ touch app.py

You’ll also need to install pypdf. You’ll rely on this library to read a PDF file and extract data from it. It can easily be installed using pip:

$ pip install pypdf

The tutorial will make use of two example PDF files to demonstrate the code, but you can use whichever PDF file you prefer while following along: file 1 and file 2. Just make sure to save the PDF file next to the app.py file and replace the file names in the rest of this tutorial appropriately.

Extracting Text from a PDF Using an Open Source Library

Open the app.py file and type the following code:

from pypdf import PdfReader

reader = PdfReader("compressed.tracemonkey-pldi-09.pdf")
for page in reader.pages:
    print(page.extract_text())

When you save and run the code, it’ll print all the text from the PDF file in the terminal. The code creates a PdfReader object. Then it loops over all the pages in the PDF using the .pages property and prints the text from each page using the .extract_text method.
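
If you’d rather keep the text than print it, the same loop can collect every page into one string. This is a minimal sketch: join_page_text and its separator parameter are names invented for this example, and the function only assumes that each page has an extract_text() method.

```python
def join_page_text(pages, separator="\n\n"):
    """Extract text from every page and join the results into one string."""
    chunks = []
    for page in pages:
        # Treat a missing result as an empty string so join() never fails.
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


# Usage with pypdf (not executed here):
#   from pypdf import PdfReader
#   reader = PdfReader("compressed.tracemonkey-pldi-09.pdf")
#   with open("output.txt", "w", encoding="utf-8") as out:
#       out.write(join_page_text(reader.pages))
```

Keeping a blank line between pages makes it easier to see where one page ends and the next begins in the saved file.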

Skipping Headers and Footers

pypdf allows you to use visitor functions that get called with each operator or text fragment. The visitor function receives five arguments: the text, the current transformation matrix, the text matrix, the font dictionary, and the font size. You can make use of the text matrix to figure out the x/y coordinates of the text fragment and decide if you want to skip it or extract it.

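In the following example, pypdf will skip the header and footer of this PDF document, as they fall outside the acceptable y coordinate range. The y coordinate is the last of the six values in the text matrix, so the code reads it from tm[5]: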

from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []

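def visitor_body(text, cm, tm, font_dict, font_size):
    # tm[5] is the vertical position of the text fragment.
    y = tm[5]
    # Keep only fragments above the footer (y <= 50) and below the header (y >= 720).
    if 50 < y < 720:
        parts.append(text)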

page.extract_text(visitor_text=visitor_body)
print("".join(parts))
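
The limits of 50 and 720 fit this particular document, and other PDFs will likely need different values. One way to make them adjustable is to build the visitor from a small factory function. The sketch below is an illustration only: make_body_visitor, y_min, and y_max are names invented here and aren’t part of pypdf.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Build a visitor that keeps only text whose y coordinate lies between y_min and y_max."""
    def visitor_body(text, cm, tm, font_dict, font_size):
        y = tm[5]  # vertical position taken from the text matrix
        if y_min < y < y_max:
            parts.append(text)
    return visitor_body


# Usage with a pypdf page (not executed here):
#   parts = []
#   page.extract_text(visitor_text=make_body_visitor(parts, y_min=60, y_max=700))
#   print("".join(parts))
```

Passing the parts list in explicitly also avoids relying on a module-level variable, which helps when you process several pages with different limits.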

Decrypting the PDF Files

The PDF files you’re working with may be encrypted. Luckily, you don’t have to look anywhere else for a solution, as pypdf supports encryption and decryption of PDF files as well.

To work with encrypted documents, you’ll need to install the cryptography package:

$ pip install cryptography

Use the .decrypt method to decrypt a PDF file before extracting text from it:

from pypdf import PdfReader

reader = PdfReader("encrypted-pdf.pdf")

if reader.is_encrypted:
    reader.decrypt("password")

# extract text from all pages
for page in reader.pages:
    print(page.extract_text())
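
The example above doesn’t check whether the password was accepted. The following sketch wraps the same calls in a helper that raises an error on failure. It relies on decrypt() returning a value equal to 0 when the password doesn’t match, which is how recent pypdf releases appear to behave, so treat that as an assumption to verify against your installed version. The unlock function name is made up for this example.

```python
def unlock(reader, password):
    """Decrypt reader in place if needed; raise ValueError when the password is rejected."""
    if reader.is_encrypted:
        result = reader.decrypt(password)
        # Assumption: a failed attempt is reported with a value equal to 0.
        if int(result) == 0:
            raise ValueError("Could not decrypt the PDF with the given password.")
    return reader


# Usage with pypdf (not executed here):
#   from pypdf import PdfReader
#   reader = unlock(PdfReader("encrypted-pdf.pdf"), "password")
#   for page in reader.pages:
#       print(page.extract_text())
```

Failing early with a clear message is usually easier to debug than an error raised later, in the middle of text extraction.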

Extracting Text from a PDF Using Python and PSPDFKit API

This section will cover how you can extract text with PSPDFKit API.

First, go to our website and create your free account. You’ll see the page below.

Free account PSPDFKit API

After you’ve verified your email, you’ll have access to your API key. Navigate to the Overview page to get started, or go to API Keys to retrieve your key.

Image showing navigation to API keys on PSPDFKit API’s dashboard

To work with PSPDFKit API, you’ll need to install the requests package:

$ pip install requests

After installing the package, you can create a Python script to perform text extraction using the API’s /build endpoint:

import json
import requests

file = "./example.pdf"

url = "https://api.pspdfkit.com/build"

# The instructions name the uploaded part to process and the output format to return.
payload = {
    "instructions": json.dumps({
        "parts": [
            {
                "file": "file"
            }
        ],
        "output": {
            "type": "json-content",
            "plainText": True,
            "structuredText": True,
        }
    })
}

headers = {
    "Authorization": "Bearer <API-KEY>"
}

# Open the PDF in a context manager so the file handle is closed after the upload.
with open(file, "rb") as pdf:
    files = [
        ("file", ("file.pdf", pdf, "application/pdf")),
    ]
    response = requests.post(url, headers=headers, data=payload, files=files)

if response.status_code == 200:
    # Print the JSON body as text rather than as raw bytes.
    print(response.text)
else:
    print(
        f"Request to PSPDFKit API failed with status code {response.status_code}: '{response.text}'."
    )

Be sure to replace <API-KEY> in the code above with your key from the PSPDFKit API dashboard. Also ensure that an actual PDF file is present at the path specified by the file variable on line 4.
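
To keep the key out of your source code, you could read it from an environment variable instead of pasting it into the script. The sketch below is one way to do that. The variable name PSPDFKIT_API_KEY and the helpers auth_headers and pretty_json are assumptions made for this example, not names defined by PSPDFKit API.

```python
import json
import os


def auth_headers(env_var="PSPDFKIT_API_KEY"):
    """Build the Authorization header from an environment variable (hypothetical name)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": f"Bearer {api_key}"}


def pretty_json(raw):
    """Re-indent a JSON string, such as response.text, for easier reading."""
    return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)


# Usage with the script above (not executed here):
#   headers = auth_headers()
#   ...
#   print(pretty_json(response.text))
```

With this in place, the same script can run on different machines without edits, as long as each one sets the environment variable.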

You can perform many operations using PSPDFKit API, including text extraction, Office conversion, and OCR. Learn more by reading our documentation.

Conclusion

This tutorial covered the basics of extracting text from a PDF file using Python and pypdf. It also showed how to skip headers and footers and how to extract text from an encrypted PDF file.

The second part of the tutorial introduced PSPDFKit API as an alternative solution for extracting text from a PDF. Leveraging the power of PSPDFKit API, you can efficiently and easily extract meaningful text from PDF files while ensuring high extraction speed and quality.

While pypdf and other open source libraries are suitable for basic text extraction needs, PSPDFKit API offers advanced features, SOC 2-compliant security, easy integration, versatile actions, and transparent pricing.

Author

Oghenerukevwe Henrietta Kofi, Server and Services Engineer

Rukky joined PSPDFKit as an intern in 2022 and is currently a software engineer on the Server and Services Team. She’s passionate about building great software, and in her spare time, she enjoys reading cheesy novels, watching films, and playing video games.
