Python3 - Determining if a PDF is scanned or "searchable"

Question 1

I am in a situation where I need to determine if a PDF file is either:

Scanned
Searchable (contains text)

To do this, I am simply running the command pdffonts on the PDF file:

import subprocess

# Check if the resource definition indicates this is a scanned PDF
cmd = ['pdffonts', pdf_file]
proc = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, bufsize=0, text=True, shell=False)
out, err = proc.communicate()
scanned = True
for idx, line in enumerate(out.splitlines()):
    if idx == 2:
        scanned = False

Imagine I have two PDF files:

scanned.pdf
notscanned.pdf

The results of the above:

scanned.pdf:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------

notscanned.pdf:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TNNLVN+Calibri                       TrueType          WinAnsi          yes yes yes      9  0

As you can see, the notscanned.pdf file contains font information.

In my Python script, I iterate over each line in the command line output:

for idx, line in enumerate(out.splitlines()):
    if idx == 2:
        scanned = False

If there is a line at index 2 (that is, a third line after the two header lines), then the output contains font information.

I am wondering if this approach is viable? I've been searching for a solution for quite a bit, and it seems there is no way to know 100%.

Question 2

See How can I distinguish a digitally-created PDF from a searchable PDF?

Question 3

I agree that there's no way to know 100% if a PDF contains proper text or an image of a scanned hard copy. I've seen PDFs that weren't scanned, but which contained a single JPG per page, and I've seen PDFs that were scanned but had an OCR version of the text seamlessly* underlaid.

Your existing code can be improved a little. You don't need to set scanned and then update it inside an if inside a for; you can check the condition you're interested in directly. When no fonts are listed, the output has only the two header lines:

scanned = len(out.splitlines()) <= 2

I'll be honest, I don't like the simplicity of that. We're writing code that relies on the incidental formatting of a table that was designed for human reading, and which we have no control over. That said, since we have no control, it may be the best we can do.
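One way to lean a little less on exact line positions is to look for the dashed separator row and count whatever follows it. This is only a sketch: the function names are made up for illustration, and it assumes the separator row consists solely of dashes and spaces, as in the output shown in the question.

```python
def count_font_rows(pdffonts_output):
    """Count the non-blank rows that follow the dashed separator line."""
    lines = pdffonts_output.splitlines()
    for idx, line in enumerate(lines):
        stripped = line.strip()
        # Assumption: the separator contains nothing but dashes and spaces.
        if stripped and set(stripped) <= {"-", " "}:
            return sum(1 for row in lines[idx + 1:] if row.strip())
    # No separator at all means the format is not the one we expected.
    raise ValueError("unrecognised pdffonts output")


def looks_scanned(pdffonts_output):
    """Treat a PDF with no listed fonts as (probably) scanned."""
    return count_font_rows(pdffonts_output) == 0
```

It still depends on a human-oriented table, but it fails loudly if the layout changes instead of silently giving the wrong answer.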

Also, it looks like you could probably be using subprocess.run(cmd, stdout=subprocess.PIPE), which is recommended in Python 3. It won't simplify your code, but it will push you to be explicit and safe about handling the results of the subprocess. (For example, right now you're not doing anything with err, and it's unclear if that's a mistake or not.)
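A rough sketch of what that could look like. The helper name run_and_capture is hypothetical; text=True needs Python 3.7 or newer, and check=True is one possible choice for making failures explicit.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd and return its stdout as text; raise if it exits non-zero."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # captured and attached to the exception on failure
        text=True,               # decode bytes to str
        check=True,              # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

With this, the question's call would become out = run_and_capture(['pdffonts', pdf_file]), and a failing pdffonts run raises instead of being mistaken for a PDF without fonts.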

As for the fundamental approach:
I think it's probably the best you can do, if you're stuck with the "scanned vs searchable" dichotomy. You could also try using pdftotext to see if there's text in the document.
But as I said above, the presence of text doesn't absolutely mean that the document can reliably be searched, and the absence of font information doesn't absolutely mean that the document is a scan.
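The pdftotext idea might be sketched like this. The function names and the min_chars threshold are assumptions picked for illustration, not recommended values; passing "-" as the output file makes pdftotext write to stdout.

```python
import subprocess


def extract_text(pdf_file):
    """Return the text pdftotext finds in pdf_file ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_file, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout


def has_extractable_text(text, min_chars=20):
    """True if text holds at least min_chars non-whitespace characters.

    min_chars is an arbitrary illustrative threshold; tune it for your data.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

Counting non-whitespace characters avoids treating an output made only of page breaks and newlines as text. As noted above, a positive result still does not prove the text is reliable.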

* It was not seamless.

Question 4

Good point about setting the scanned property! Would you mind explaining why subprocess.run() is "better" than using subprocess.Popen?

Question 5

The documentation page recommends using it, so it's "recommended" :) I would guess that it's recommended because it can handle almost all use-cases, while being easy to use, easy to use safely, and maybe a little less verbose? I haven't actually used the subprocess module in python3, so I'm not sure.

Question 6

How about checking the '/Resources' entry of each page?

I believe any text in a PDF (an electronic document) is very likely to come with a font. The purpose of PDF is to be a portable file format, so it keeps the font definitions.

If you are a PyPDF2 user, try

import PyPDF2

pdf_reader = PyPDF2.PdfFileReader(input_file_location)
page_data = pdf_reader.getPage(page_num)
resources = page_data['/Resources']
# Assumption: x_object is the page's /XObject dictionary (where images live)
x_object = resources['/XObject'] if '/XObject' in resources else {}
if '/Font' in resources:
    print("[Info]: Looks like there is text in the PDF, contains:", resources.keys())
elif len(x_object) != 1:
    print("[Info]: PDF Contains:", resources.keys())
for obj in x_object:
    obj_ = x_object[obj]
    if obj_['/Subtype'] == '/Image':
        print("[Info]: PDF is image only")
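The decision logic can also be separated from the PyPDF2 calls so it is easier to test. The sketch below is hypothetical: classify_page and its labels are invented here, and it assumes the resources have already been resolved into plain dict-like mappings.

```python
def classify_page(resources):
    """Classify one page from a mapping shaped like its /Resources dictionary."""
    if "/Font" in resources:
        return "text"
    # Assumption: each XObject entry is a plain mapping with a /Subtype key.
    x_objects = resources.get("/XObject", {})
    subtypes = [obj.get("/Subtype") for obj in x_objects.values()]
    if subtypes and all(subtype == "/Image" for subtype in subtypes):
        return "image-only"
    return "unknown"
```

Returning a label per page, rather than printing, lets the caller decide how to treat mixed documents, for example ones where only some pages carry fonts.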

Accepted answer (the first answer above): ShapeOfMatter, 2019-05-24 12:34:02Z

