How to Extract Text from PDF in Python
![](/assets/images/people/oghenerukevwe-henrietta-kofi-f2b77d0a.jpg)
![Illustration: How to Extract Text from PDF in Python](/assets/images/blog/2024/extract-text-from-pdf-using-python/article-header-3931f331.png)
In this post, you’ll learn how to extract text from PDF documents using Python. It’ll cover open source solutions for extracting text, as well as how to use PSPDFKit’s Python PDF API.
It’s important to add that there are two different types of text extraction:
- Extracting text that’s already selectable in a PDF viewer. These PDFs are usually made up of text authored in a word processing program.
- Extracting text from an image-based PDF document. These PDFs are typically made up of images of scanned documents.
This post will focus on extracting text that’s already selectable.
Requirements
This tutorial will make use of Python version 3.12.3, but it should work with most 3.x Python versions. Create a new folder and a Python file to store all the code from this tutorial:
$ mkdir text_extract_pdf
$ cd text_extract_pdf
$ touch app.py
You’ll also need to install pypdf. You’ll rely on this library to read a PDF file and extract data from it. You can install it with pip:
$ pip install pypdf
The tutorial uses two example PDF files, file 1 and file 2, to demonstrate the code, but you can use whichever PDF file you prefer while following along. Just make sure to save the PDF file next to the app.py file and replace the file names in the rest of this tutorial accordingly.
Extracting Text from a PDF Using an Open Source Library
Open the app.py file and type the following code:
from pypdf import PdfReader

reader = PdfReader("compressed.tracemonkey-pldi-09.pdf")
for page in reader.pages:
    print(page.extract_text())
When you save and run the code, it’ll print all the text from the PDF file in the terminal. The code creates a PdfReader object. Then it loops over all the pages in the PDF using the .pages property and prints the text from each page using the .extract_text method.
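If you’d rather keep the text per page than print it, the same loop can be wrapped in a small helper. The following is a minimal sketch and not part of pypdf: `extract_pages` and `pages_without_text` are hypothetical names, and treating a page with no extracted text as likely image-based is only a heuristic assumption.

```python
def extract_pages(reader):
    """Return a list of (page_number, text) tuples, numbering pages from 1.

    `reader` is expected to behave like pypdf's PdfReader: it exposes a
    `.pages` sequence whose items have an `.extract_text()` method.
    """
    results = []
    for number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string if a page yields no text at all.
        text = page.extract_text() or ""
        results.append((number, text))
    return results


def pages_without_text(page_texts):
    """Return the page numbers whose extracted text is empty or whitespace.

    Assumption: such pages are often scanned images that would need OCR.
    """
    return [number for number, text in page_texts if not text.strip()]
```

With pypdf installed, you could pass `PdfReader("compressed.tracemonkey-pldi-09.pdf")` to `extract_pages` and then hand the result to `pages_without_text` to see which pages may not contain selectable text.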
Skipping Headers and Footers
pypdf allows you to use visitor functions that get called with each operator or text fragment. The visitor function receives five arguments: the text, the current transformation matrix, the text matrix, the font dictionary, and the font size. You can make use of the text matrix to figure out the x/y coordinates of the text fragment and decide if you want to skip it or extract it.
In the following example, pypdf will skip the header and footer of this PDF document, as they fall outside the acceptable y coordinate range:
from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []

def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
print("".join(parts))
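The limits 50 and 720 are specific to that document, so it can help to make the range configurable. Here’s a minimal sketch of one way to do that. `make_band_visitor` is a hypothetical helper, not a pypdf function, and it follows the example above in reading the y coordinate from `tm[5]`. Depending on the document, the position may also be affected by the current transformation matrix, so check the values on your own files.

```python
def make_band_visitor(y_min, y_max):
    """Build a visitor that keeps text fragments with y_min < y < y_max.

    Returns the visitor function and the list it appends fragments to.
    """
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # As in the example above, take the y coordinate from the last
        # entry of the text matrix.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor, parts
```

You’d call `visitor, parts = make_band_visitor(50, 720)`, pass `visitor` as the `visitor_text` argument of `page.extract_text`, and then join `parts` as before.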
Decrypting the PDF Files
The PDF files you’re working with may be encrypted. Luckily, you don’t have to look anywhere else for a solution, as pypdf supports encryption and decryption of PDF files as well.
To work with encrypted documents, you’ll need to install the cryptography package:
$ pip install cryptography
Use the .decrypt method to decrypt a PDF file before extracting text from it:
from pypdf import PdfReader

reader = PdfReader("encrypted-pdf.pdf")

if reader.is_encrypted:
    reader.decrypt("password")

# extract text from all pages
for page in reader.pages:
    print(page.extract_text())
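The example above doesn’t check whether the password actually worked. Here’s a minimal sketch of a guard for that. `unlock` is a hypothetical helper, and it assumes that `decrypt` returns a falsy value (0) when the password doesn’t match, so confirm this against the pypdf version you use.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if it is encrypted, then return it.

    Raises ValueError when the password does not unlock the document.
    `reader` is expected to behave like pypdf's PdfReader.
    """
    if reader.is_encrypted:
        # Assumption: decrypt() returns a falsy value on a wrong password.
        result = reader.decrypt(password)
        if not result:
            raise ValueError("The password did not decrypt this PDF.")
    return reader
```

With pypdf installed, `unlock(PdfReader("encrypted-pdf.pdf"), "password")` would return the reader ready for the extraction loop, or fail early instead of at the first `extract_text` call.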
Extracting Text from a PDF Using Python and PSPDFKit API
This section will cover how you can extract text with PSPDFKit API.
First, go to our website and create your free account.
After you’ve verified your email, you’ll have access to your API key. Navigate to the Overview page to get started, or go to API Keys to retrieve your key.
To work with PSPDFKit API, you’ll need to install the requests package:
$ pip install requests
After installing the package, you can create a Python script to perform text extraction using the API’s /build endpoint:
import json
import requests

file = "./example.pdf"
url = "https://api.pspdfkit.com/build"

payload = {
    "instructions": json.dumps({
        "parts": [
            {"file": "file"}
        ],
        "output": {
            "type": "json-content",
            "plainText": True,
            "structuredText": True,
        }
    })
}

files = [
    ("file", ("file.pdf", open(file, "rb"), "application/pdf")),
]

headers = {
    "Authorization": "Bearer <API-KEY>"
}

response = requests.post(url, headers=headers, data=payload, files=files)

if response.status_code == 200:
    print(response.content)
else:
    print(
        f"Request to PSPDFKit API failed with status code {response.status_code}: '{response.text}'."
    )
Be sure to replace <API-KEY> in the code above with your key from the PSPDFKit API dashboard. Also ensure that an actual PDF file is present at the path specified by the file variable near the top of the script.
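Rather than pasting the key into the script, you might prefer to read it from the environment. The following is a minimal sketch of that idea. The helper names and the `PSPDFKIT_API_KEY` variable name are assumptions made for illustration, while the header and instruction fields mirror the request shown above.

```python
import json
import os


def api_key_from_env(name="PSPDFKIT_API_KEY"):
    """Read the API key from an environment variable (hypothetical name)."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


def build_headers(api_key):
    """Build the Authorization header used in the request above."""
    return {"Authorization": f"Bearer {api_key}"}


def build_payload(plain_text=True, structured_text=True):
    """Build the form payload with the same instructions as above."""
    instructions = {
        "parts": [{"file": "file"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
        },
    }
    return {"instructions": json.dumps(instructions)}
```

The results can be passed to `requests.post` as `headers=build_headers(api_key_from_env())` and `data=build_payload()`, alongside the same `files` list as before.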
You can perform many operations using PSPDFKit API, including text extraction, Office conversion, and OCR. Learn more by reading our documentation.
Conclusion
This tutorial covered the basics of extracting text from a PDF file using Python and pypdf. It also showed how to extract text from an encrypted PDF file.
The second part of the tutorial introduced PSPDFKit API as an alternative solution for extracting text from a PDF. Leveraging the power of PSPDFKit API, you can efficiently and easily extract meaningful text from PDF files while ensuring high extraction speed and quality.
While pypdf and other open source libraries are suitable for basic text extraction needs, PSPDFKit API offers advanced features, SOC 2-compliant security, easy integration, versatile actions, and transparent pricing.
Rukky joined PSPDFKit as an intern in 2022 and is currently a software engineer on the Server and Services Team. She’s passionate about building great software, and in her spare time, she enjoys reading cheesy novels, watching films, and playing video games.