How to extract text from a PDF file via python? [closed]

Question 1

I'm extracting this PDF's text using the PyPDF2 Python package (version 1.27.2):

import PyPDF2
with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

I get this output which is different from the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

How can I extract the text as is in the PDF document?

Question 2

Copy the text using a good PDF viewer - Adobe's canonical Acrobat Reader, if possible. Do you get the same result? The difference is not that the text is different, but the font is - the character codes map to other values. Not all PDFs contain the correct data to restore this.

Question 3

I tried another document and it worked. Yes, it seems the issue is with the PDF itself

Question 4

That PDF contains a character CMap table, so the restrictions and work-arounds discussed in this thread are relevant - stackoverflow.com/questions/4203414/….

Question 5

The PDF indeed contains a correct CMAP so it is trivial to convert the ad hoc character mapping to plain text. However, it takes additional processing to retrieve the correct order of text. Mac OS X's Quartz PDF renderer is a nasty piece of work! In its original rendering order I get "m T’h iuss iisn ga tosam fopllloew DalFo dnogc wumithe ntht eI tutorial"... Only after sorting by x coordinates I get a far more likely correct result: "This is a sample PDF document I’m using to follow along with the tutorial".
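
The sorting by x coordinates mentioned above can be sketched roughly like this. It is only an illustration: the (x, y, text) tuples, the function name reading_order and the line_tolerance parameter are invented for the example, and y is assumed to grow downward. Real libraries expose fragment positions in their own formats.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Rebuild lines from (x, y, text) fragments drawn in arbitrary order.

    Hypothetical input format: x is the left edge, y the vertical position
    (assumed to grow downward), text the decoded string of the fragment.
    """
    lines = []  # list of (y_of_first_fragment, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))  # same visual line
        else:
            lines.append((y, [(x, text)]))  # start a new line
    # Within each line, order the fragments from left to right.
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )


fragments = [(50, 10, "is"), (10, 10, "This"), (70, 24.5, "PDF"),
             (80, 10, "a"), (10, 25, "sample")]
print(reading_order(fragments))
```

Grouping by y first matters: sorting by x alone would interleave fragments from different lines.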

Question 6

stackoverflow.com/questions/32667398/…

Question 7

I was looking for a simple solution for Python 3.x on Windows. textract doesn't seem to support that, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika
raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.

Question 8

I tested pypdf2, tika and tried and failed to install textract and pdftotext. Pypdf2 returned 99 words while tika returned all 858 words from my test invoice. So I ended up going with tika.

Question 9

If you need to run this on all the PDF files in a directory (recursively), take this script
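
The script linked in this comment is not reproduced here, so the following is only a rough sketch of the idea, not that script. The function names extract_tree and tika_extract are made up for the example, and any callable that turns a PDF path into text can be passed in.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_tree(root: str, extract: Callable[[Path], str]) -> Dict[Path, str]:
    """Apply `extract` to every *.pdf file below `root`, recursively."""
    results = {}
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        try:
            results[pdf_path] = extract(pdf_path)
        except Exception as exc:  # one broken PDF should not stop the run
            print(f"skipped {pdf_path}: {exc}")
    return results


def tika_extract(pdf_path: Path) -> str:
    """Extractor based on the tika snippet from the answer."""
    from tika import parser  # pip install tika; needs a Java runtime
    return parser.from_file(str(pdf_path))["content"] or ""


# Usage: texts = extract_tree("path/to/folder", tika_extract)
```

Note that the "*.pdf" pattern is case-sensitive on most Linux file systems, so files ending in ".PDF" would need a second pattern.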

Question 10

This is very slow as it runs a Java REST web server on localhost port 9998 under the hood.

Question 11

It downloads a tika-server.jar 76 MB file into C:\Users\User\AppData\Local\Temp. Is there a way to make this permanent if I clean temp later? It also requires a JAVA vm installed, is that right?
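
One possible way to keep the jar out of the temp folder, as a sketch only: to my knowledge tika-python reads an environment variable named TIKA_PATH for the directory where it stores and looks up the server jar, but treat that as an assumption and check the tika-python README for your version. The helper names below are made up. As the answer says, a Java runtime is still required.

```python
import os
from pathlib import Path


def use_persistent_tika_dir(directory: str) -> Path:
    """Point tika-python at a folder that survives cleaning the temp dir."""
    jar_dir = Path(directory).expanduser()
    jar_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: tika-python reads TIKA_PATH when it is first imported.
    os.environ["TIKA_PATH"] = str(jar_dir)
    return jar_dir


def parse_pdf(path: str) -> str:
    """Parse a PDF with tika after the directory has been configured."""
    from tika import parser  # imported late so TIKA_PATH is already set
    return parser.from_file(path)["content"] or ""


# Usage:
# use_persistent_tika_dir("~/tika-cache")
# print(parse_pdf("sample.pdf"))
```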

Question 12

@Stian PyPDF2 improved a lot. Could you please check again + update your comment?

Question 13

pypdf recently improved a lot. Depending on the data, it is on-par or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference became rather small - (mostly when to set a new line). The core part is that they are way faster. But they are not pure-Python which can mean that you cannot execute it. And some might have too restrictive licenses so that you may not use it.

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

Anything special regarding tables (just that the text is there, not about the formatting)
Arabic text (RTL languages)
Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022 were shown as two charts, Quality and Speed (images not included).

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! 😁 The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

from pypdf import PdfReader
reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
 text += page.extract_text() + "\n"

Please note that those packages are not maintained:

PyPDF2, PyPDF3, PyPDF4
pdfminer (without .six)

pymupdf

import fitz  # install using: pip install PyMuPDF
with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
print(text)

Other PDF libraries

pikepdf does not support text extraction (source)

Question 14

However, there seems to be a problem with the order of the text from the PDF. Intuitively the text would read from top to bottom and left to right, but here it seems to show up in another order.

Question 15

Except, it occasionally just can't find the text in a page...

Question 16

@Raf If you have an example PDF, please go ahead and create an issue: github.com/pymupdf/PyMuPDF/issues - the developer behind it is pretty active

Question 17

This is the most light-weight answer I've seen so far. No java server necessary!

Question 18

This is the latest working solution as of 23 Jan 2022.

Question 19

Use textract.

It supports many types of files including PDFs

import textract
text = textract.process("path/to/file.extension")

Question 20

Works for PDFs, epubs, etc - processes PDFs that even PDFMiner fails on.

Question 21

How can I use it in AWS Lambda? I tried this, but an import error occurred for textract.

Question 22

textract is a wrapper for Poppler:pdftotext (among others).

Question 23

@ArunKumar: To use anything in AWS Lambda that's not built-in, you have to include it and all extra dependencies, in your bundle.

Question 24

textract seems to be dead (source). Use either pdfminer.six directly or pymupdf

Question 25

Look at this code for PyPDF2<=1.26.0:

import PyPDF2
with open('sample.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
print(page_content)

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read 201308FCR.pdf, the output is normal.

Its documentation explains why:

def extractText(self):
 """
 Locate all text drawing commands, in the order they are provided in the
 content stream, and extract the text. This works well for some PDF
 files, but poorly for others, depending on the generator used. This will
 be refined in the future. Do not rely on the order of text coming out of
 this function, as it will change if this function is made more
 sophisticated.
 :return: a unicode string object.
 """

Question 26

@VineeshTP: Are you getting anything for page_content? If yes, then see if it helps by using a different encoding other than (utf-8)

Question 27

Best library I found for reading the pdf using python is 'tika'

Question 28

201308FCR.pdf not found.

Question 29

@Martin Thoma Is it possible to preserve the formatting when extracting, say, Python code from a PDF?

Question 30

After trying textract (which seemed to have too many dependencies) and pypdf2 (which could not extract text from the pdfs I tested with) and tika (which was too slow) I ended up using pdftotext from xpdf (as already suggested in another answer) and just called the binary from python directly (you may need to adapt the path to pdftotext):

import os, subprocess
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
args = ["/usr/local/bin/pdftotext",
 '-enc',
 'UTF-8',
 "{}/my-pdf.pdf".format(SCRIPT_DIR),
 '-']
res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output = res.stdout.decode('utf-8')

There is pdftotext which does basically the same but this assumes pdftotext in /usr/local/bin whereas I am using this in AWS lambda and wanted to use it from the current directory.

Btw: For using this on lambda you need to put the binary and the dependency to libstdc++.so into your lambda function. I personally needed to compile xpdf. As instructions for this would blow up this answer I put them on my personal blog.
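
A slightly more reusable variant of the snippet above, as a sketch. The function names are invented for the example. The binary is looked up on PATH when no explicit location is given, and -layout is the pdftotext option that keeps the physical layout of the page.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str, binary: Optional[str] = None,
                        layout: bool = False) -> List[str]:
    """Build the pdftotext command line that writes UTF-8 text to stdout."""
    exe = binary or shutil.which("pdftotext")
    if exe is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    cmd = [exe, "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    return cmd + [pdf_path, "-"]  # "-" sends the text to stdout


def pdf_to_text(pdf_path: str, **kwargs) -> str:
    """Run pdftotext and return the extracted text."""
    result = subprocess.run(build_pdftotext_cmd(pdf_path, **kwargs),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            check=True)
    return result.stdout.decode("utf-8")


# Usage: text = pdf_to_text("my-pdf.pdf", binary="./pdftotext", layout=True)
```

Passing binary="./pdftotext" covers the Lambda case where the executable is shipped next to the script.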

Question 31

Oh my god, it works!! Finally, a solution that extracts the text in the correct order! I want to hug you for this answer! (Or if you don't like hugs, here's a virtual coffee/beer/...)

Question 32

Please give PyPDF2 another chance. We've improved it a lot :-)

Question 33

I've tried many Python PDF converters, and I'd like to update this review. Tika is one of the best, but PyMuPDF, suggested by user @ehsaneha, is good news.

I wrote some code to compare them at https://github.com/erfelipe/PDFtextExtraction and hope it helps you.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser
raw = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
raw = str(raw)
safe_text = raw.encode('utf-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

Question 34

special thanks for .encode('utf-8', errors='ignore')

Question 35

AttributeError: module 'os' has no attribute 'setsid'

Question 36

This worked for me when opening the file in 'rb' mode:

with open('../path/to/pdf', 'rb') as pdf:
    raw = str(parser.from_file(pdf))
text = raw.encode('utf-8', errors='ignore')

Question 37

You may want to use the time-proven xPDF and derived tools to extract text instead, as pyPDF2 still seems to have various issues with text extraction.

The long answer is that there are a lot of variations in how text is encoded inside a PDF: it may require decoding the PDF string itself, then mapping it with a CMap, then analyzing the distance between words and letters, etc.

In case the PDF is damaged (i.e. it displays the correct text but copying it gives garbage) and you really need to extract the text, you may want to consider converting the PDF into an image (using ImageMagick) and then using Tesseract to get the text from the image via OCR.
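
To decide automatically when to fall back to OCR, a crude check on the extracted text may be enough. This is only a heuristic sketch: the function name and the 0.5 threshold are arbitrary choices for the example, not something any of the libraries provide.

```python
def probably_needs_ocr(text: str, min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether extracted text is garbage (or missing) and OCR is needed."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing extracted at all: likely a scanned page
    alnum = sum(ch.isalnum() for ch in visible)
    # Garbled extractions tend to consist mostly of punctuation and symbols.
    return alnum / len(visible) < min_alnum_ratio
```

Output like the first line of the garbled text in the question is flagged, while ordinary sentences are not. Documents that legitimately consist mostly of symbols would be misclassified.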

Question 38

-1 because the OP is asking for reading pdfs in Python, and although there is an xpdf wrapper for python it is poorly maintained.

Question 39

You might want to give PyPDF2 another shot (also mind the capitalization)

Question 40

I found a solution here PDFLayoutTextStripper

It's good because it can keep the layout of the original PDF.

It's written in Java but I have added a Gateway to support Python.

Sample code:

from py4j.java_gateway import JavaGateway
gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')
# result is a dict of {
# 'success': 'true' or 'false',
# 'payload': pdf file content if 'success' is 'true'
# 'error': error message if 'success' is 'false'
# }
print(result['payload'])

Sample output from PDFLayoutTextStripper: (image not included)

You can see more details here Stripper with Python

Question 41

The best feature of this library is definitely its ability to (mostly) preserve the layout. The worst is that you need to standup a gateway service in Java.

Question 42

The below code is a solution to the question in Python 3. Before running the code, make sure you have installed the pypdf library in your environment. If not installed, open the command prompt and run the following command (instead of pip you might need pip3):

pip install pypdf --upgrade

Solution Code using pypdf > 3.0.0:

import pypdf
reader = pypdf.PdfReader('sample.pdf')
for page in reader.pages:
    print(page.extract_text())

Question 43

How would you save all the content in one text file and use it for further analysis?
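
One way to do that, as a sketch: iter_page_text and save_text are names made up for this example, and the pypdf calls are the same ones used in the answer.

```python
from typing import Iterable, Iterator


def iter_page_text(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page, using pypdf as in the answer."""
    from pypdf import PdfReader  # pip install pypdf
    for page in PdfReader(pdf_path).pages:
        yield page.extract_text() or ""


def save_text(pages: Iterable[str], out_path: str) -> int:
    """Write the pages to one UTF-8 file, blank-line separated; return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for count, text in enumerate(pages, start=1):
            out.write(text.rstrip() + "\n\n")
    return count


# Usage: save_text(iter_page_text("sample.pdf"), "sample.txt")
```

The resulting file can then be read back with open("sample.txt", encoding="utf-8") for further analysis.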

Question 44

pdftotext is the best and simplest one! pdftotext also preserves the structure.

I tried PyPDF2, PDFMiner and a few others but none of them gave a satisfactory result.

Question 45

I get the following message when installing pdf2text: Collecting PDFMiner (from pdf2text). So I don't understand this answer now.

Question 46

pdf2text and pdftotext are different. You can use the link from the answer.

Question 47

OK. That's a little bit confusing.

Question 48

You might want to give PyPDF2 another shot. We've improved it a lot.

Question 49

In 2020 the solutions above were not working for the particular pdf I was working with. Below is what did the trick. I am on Windows 10 and Python 3.8

Test pdf file: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

# pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text
    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()
    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()
                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)
        return retstr.getvalue()
if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

Question 50

Excellent answer. There's an anaconda install as well. I had it installed and had extracted text in < 5 minutes. (Note: tika also worked, but pdfminer.six was much faster.)

Question 51

You are a lifesaver!

Question 52

In 2023, 3 lines of pypdf do the same: extract text with pypdf

Question 53

In 2024, many libraries can extract the text, but depending upon the original structure of the PDF -- particularly the use of tables -- the result will vary dramatically. 3 lines of code does not imply that the output from a given PDF will be coherent or useful.

Question 54

I tested Jortega's code above, and it really struggled with data in tables, especially when there was a blank cell.

Question 55

pdfplumber is one of the better libraries to read and extract data from pdf. It also provides ways to read table data and after struggling with a lot of such libraries, pdfplumber worked best for me.

Mind you, it works best for machine-written pdf and not scanned pdf.

import pdfplumber
with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

Question 56

This is nice, but I have a question on the format of the output. I want to save the result of the print into a pandas dataframe. Is that possible?

Question 57

I've got a better workaround than OCR, one that maintains the page alignment while extracting the text from a PDF. It should be of help:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
text = convert_pdf_to_txt('test.pdf')
print(text)

Question 58

NB: The latest version no longer uses the codec arg. I fixed this by removing it, i.e. device = TextConverter(rsrcmgr, retstr, laparams=laparams)

Question 59

A multi-page PDF can be extracted as text in a single pass, instead of passing individual page numbers, using the code below:

import PyPDF2
pdf_file = open('samples.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
for i in range(number_of_pages):
    page = read_pdf.getPage(i)
    page_content = page.extractText()
    print(page_content)

Question 60

The only problem here is that the content of each new page overwrites the last one.

Question 61

As of 2021 I would like to recommend pdfreader due to the fact that PyPDF2/3 seems to be troublesome now and tika is actually written in java and needs a jre in the background. pdfreader is pythonic, currently well maintained and has extensive documentation here.

Installation as usual: pip install pdfreader

Short example of usage:

from pdfreader import PDFDocument, SimplePDFViewer
# get raw document
fd = open(file_name, "rb")
doc = PDFDocument(fd)
# there is an iterator for pages
page_one = next(doc.pages())
all_pages = [p for p in doc.pages()]
# and even a viewer
fd = open(file_name, "rb")
viewer = SimplePDFViewer(fd)

Question 62

On a side note, installing pdfreader on Windows requires Microsoft C++ Build Tools on your system, whereas pymupdf, recommended in the answer below, installed directly using pip without any extra requirement.

Question 63

I couldn't use it in a Jupyter notebook; it keeps crashing the kernel.

Question 64

If wanting to extract text from a table, I've found tabula to be easily implemented, accurate, and fast:

to get a pandas dataframe:

import tabula
df = tabula.read_pdf('your.pdf')
df

By default, it ignores page content outside of the table. So far, I've only tested on a single-page, single-table file, but there are kwargs to accommodate multiple pages and/or multiple tables.

install via:

pip install tabula-py
# or
conda install -c conda-forge tabula-py

In terms of straight-up text extraction see: https://stackoverflow.com/a/63190886/9249533

Accepted answer: the tika answer above (Question 7), by DJK, answered Feb 7, 2018 at 21:43 and edited Jun 20, 2023 at 21:36 by Benjamin Loison. Score: 320.

Comments on the accepted answer (their text appears above as Questions 8 to 12): Stian (2018-06-19), Hope (2019-04-19), andruso (2019-10-03), Basj (2019-11-15), Martin Thoma (2022-07-30).
