4
\$\begingroup\$

I am in a situation where I need to determine if a PDF file is either:

  • Scanned
  • Searchable (contains text)

To do this, I am simply running the command pdffonts on the PDF file:

import subprocess

# Check if the resource definition indicates this is a scanned PDF
cmd = ['pdffonts', pdf_file]
proc = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, bufsize=0, text=True, shell=False)
# err is always None here because stderr is not piped
out, err = proc.communicate()
scanned = True
for idx, line in enumerate(out.splitlines()):
    # Lines 0 and 1 are the table header; a line 2 means a font is listed
    if idx == 2:
        scanned = False

Imagine I have two PDF files:

  • scanned.pdf
  • notscanned.pdf

The results of the above:

scanned.pdf:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------

notscanned.pdf

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TNNLVN+Calibri                       TrueType          WinAnsi          yes yes yes      9  0

As you can see, the notscanned.pdf file contains font information.

In my Python script, I iterate over each line in the command line output:

for idx, line in enumerate(out.splitlines()):
    if idx == 2:
        scanned = False

If a line with index 2 exists (the third line of output), then the output contains font information.

I am wondering if this approach is viable? I've been searching for a solution for quite a bit, and it seems there is no way to know 100%.

asked May 24, 2019 at 11:19
\$\endgroup\$
1

2 Answers

3
\$\begingroup\$

I agree that there's no way to know 100% if a PDF contains proper text or an image of a scanned hard copy. I've seen PDFs that weren't scanned, but which contained a single JPG per page, and I've seen PDFs that were scanned but had an OCR version of the text seamlessly* underlaid.

Your existing code can be improved a little. You don't need to set scanned and then update it in an if inside a for loop; you can just check the condition you're interested in directly:

scanned = len(out.splitlines()) <= 2

I'll be honest, I don't like the simplicity of that. We're writing code that relies on the incidental formatting of a table that was designed for human reading, and which we have no control over. That said, since we have no control, it may be the best we can do.
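If you want to lean a little less on the exact line count, one option is to look for the dashed separator row and count whatever comes after it. This is only a sketch: count_font_rows is a hypothetical helper name, and it assumes the pdffonts output keeps the header/separator layout shown in the question.

```python
def count_font_rows(pdffonts_output):
    """Count the font rows that follow the dashed separator line.

    Assumes the layout shown in the question: a header line, a line made
    only of dashes and spaces, then one line per font.
    """
    lines = pdffonts_output.splitlines()
    for idx, line in enumerate(lines):
        stripped = line.strip()
        # The separator row consists solely of '-' and ' ' characters
        if stripped and set(stripped) <= {'-', ' '}:
            return sum(1 for row in lines[idx + 1:] if row.strip())
    # No separator found: the output is not in the format we expected
    raise ValueError("unexpected pdffonts output: no separator line")
```

With this, scanned = count_font_rows(out) == 0, and an unexpected output format fails loudly instead of silently being classified as scanned.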

Also, it looks like you could probably be using subprocess.run(cmd, stdout=subprocess.PIPE), which is recommended in Python 3. It won't simplify your code, but it will push you to be clear and safe about handling the results of the subprocess. (For example, right now you're not doing anything with err, and it's unclear if that's a mistake or not.)
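A rough sketch of what that could look like, assuming Python 3.7+ (for capture_output and text). The helper names run_tool, looks_scanned and pdf_is_scanned are made up for illustration. With check=True a non-zero exit status raises subprocess.CalledProcessError, and a missing pdffonts binary raises FileNotFoundError, so failures can't be mistaken for "no fonts".

```python
import subprocess


def run_tool(cmd):
    """Run cmd and return its stdout; raise if the tool exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def looks_scanned(pdffonts_output):
    """True when the output holds only the two header lines (no fonts)."""
    return len(pdffonts_output.splitlines()) <= 2


def pdf_is_scanned(pdf_file):
    # Passing a list (no shell) avoids quoting problems with file names
    return looks_scanned(run_tool(['pdffonts', pdf_file]))
```

Keeping the parsing in its own function also means it can be checked against sample output without having pdffonts installed.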

As for the fundamental approach:
I think it's probably the best you can do, if you're stuck with the "scanned vs searchable" dichotomy. You could also try using pdftotext to see if there's text in the document.
But as I said above, the presence of text doesn't absolutely mean that the document can reliably be searched, and the absence of font information doesn't absolutely mean that the document is a scan.
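For the pdftotext idea, something along these lines might work. It's a sketch, not a tested solution: extract_text and has_text are hypothetical names, and it relies on pdftotext writing to stdout when the output file is given as '-'. The same caveat applies as above: a scan with an OCR layer will still report text.

```python
import subprocess


def extract_text(pdf_file):
    """Return the text pdftotext finds in pdf_file."""
    # '-' as the output file asks pdftotext to write to stdout
    result = subprocess.run(
        ['pdftotext', pdf_file, '-'],
        capture_output=True, text=True, check=True)
    return result.stdout


def has_text(extracted):
    """True if the extracted text contains any non-whitespace character.

    pdftotext separates pages with form feeds, which count as whitespace.
    """
    return any(not ch.isspace() for ch in extracted)
```

You would then call has_text(extract_text(pdf_file)) and treat a False result as "probably scanned".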

*It was not seamless.

answered May 24, 2019 at 12:34
\$\endgroup\$
2
  • \$\begingroup\$ Good point about setting the scanned property! Would you mind explaining why subprocess.run() is "better" than using subprocess.Popen? \$\endgroup\$ Commented May 25, 2019 at 17:51
  • \$\begingroup\$ The documentation page recommends using it, so it's "recommended" :) I would guess that it's recommended because it can handle almost all use-cases, while being easy to use, easy to use safely, and maybe a little less verbose? I haven't actually used the subprocess module in python3, so I'm not sure. \$\endgroup\$ Commented May 25, 2019 at 20:52
4
\$\begingroup\$

How about checking the '/Resources' entry in the PDF's metadata?

I believe that any text in an electronic PDF is very likely to come with a font. The whole purpose of PDF is to be a portable file format, so it keeps the font definitions.

If you are a PyPDF2 user, try

import PyPDF2

pdf_reader = PyPDF2.PdfFileReader(input_file_location)
page_data = pdf_reader.getPage(page_num)
resources = page_data['/Resources']
if '/Font' in resources:
    print("[Info]: Looks like there is text in the PDF, contains:", resources.keys())
else:
    # No fonts on this page: inspect the external objects (images, forms)
    x_object = resources['/XObject'] if '/XObject' in resources else {}
    if len(x_object) != 1:
        print("[Info]: PDF Contains:", resources.keys())
    for obj in x_object:
        obj_ = x_object[obj]
        if obj_['/Subtype'] == '/Image':
            print("[Info]: PDF is image only")
            break
answered Nov 11, 2019 at 16:13
\$\endgroup\$
