4
\$\begingroup\$

Everything works fine except the timing. It takes a long time on my file, which contains 1000 pages, about 100 of which are of interest.

import re
from PyPDF2 import PdfFileReader, PdfFileWriter
import glob, os

# find pages
def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in slist:
            if re.search(s.lower(), content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

# extract pages
def extractPage(f, fOut, pages):
    file = open(f, 'rb')
    output = PdfFileWriter()
    pdfOne = PdfFileReader(file)
    for i in pages:
        output.addPage(pdfOne.getPage(i))
    outputStream = open(fOut, "wb")
    output.write(outputStream)
    outputStream.close()
    return

os.chdir(r"path\to\mydir")
for pdfFile in glob.glob("*.pdf"):
    print(pdfFile)
    outPdfFile = pdfFile.replace(".pdf","_searched_extracted.pdf")
    stringList = ["string1", "string2"]
    extractPage(pdfFile, outPdfFile, findText(pdfFile, stringList))

The code updated after the suggestions is at:

https://gist.github.com/pra007/099f10b07be5b7126a36438c67ad7a1f

asked Sep 7, 2016 at 9:10
\$\endgroup\$
4
  • 1
    \$\begingroup\$ We don't really care about the overall time but more about the specifics. Instead of python file.py, use python -m cProfile -s cumtime file.py and post the functions that took the most time. \$\endgroup\$ Commented Sep 7, 2016 at 11:22
  • \$\begingroup\$ Is my modified code OK? \$\endgroup\$ Commented Sep 7, 2016 at 11:33
  • \$\begingroup\$ I have rolled back the question to Rev 1. Please see "What to do when someone answers". \$\endgroup\$ Commented Sep 7, 2016 at 16:50
  • \$\begingroup\$ Thanks. I will keep in mind next time not to change the question. \$\endgroup\$ Commented Sep 8, 2016 at 3:41
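
The profiling suggestion in the comments can also be applied from inside a script rather than on the command line. The following is only a minimal sketch using the standard library's cProfile and pstats modules. The names profile_call and slow_sum are hypothetical helpers introduced for illustration. In the question's setting the profiled call would be findText rather than the stand-in workload.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists at most `top` entries sorted by cumulative time,
    the same ordering that `-s cumtime` gives on the command line.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumtime").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in workload; replace with the real function call.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 100_000)
    print(report)
```

If most of the cumulative time shows up under PyPDF2 functions, the search loop itself is probably not the bottleneck.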

2 Answers

4
\$\begingroup\$

You could try profiling, but the code is simple enough that I think you're spending most of the time in PyPDF2 code. Two options:

  • You can preprocess your PDF files and store their text somewhere, which will make the search phase much faster, especially if you run multiple queries on the same PDF files.
  • You can try another parser, such as a Python 3 version of PDFMiner, or even a parser written in a faster language.
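
The preprocessing option could look roughly like the sketch below, which stores the lowercased text of each page in a JSON file next to the PDF and reuses it while the PDF has not changed. This is an assumption about one possible layout, not a tested recipe. The function name load_page_texts, the JSON cache format, and the extract_pages callable are all hypothetical. In the question's setting, extract_pages would wrap the per-page extractText() calls.

```python
import json
import os


def load_page_texts(pdf_path, cache_path, extract_pages):
    """Return the lowercased text of every page, using a cache when possible.

    extract_pages is a caller-supplied callable (hypothetical) that takes
    the PDF path and returns a list with one string per page.
    """
    # Reuse the cache only if it is at least as new as the PDF itself.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(pdf_path)):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)

    # Slow path: run the expensive extraction once and store the result.
    texts = [text.lower() for text in extract_pages(pdf_path)]
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(texts, fh)
    return texts
```

With this in place, the expensive extraction runs once per PDF, and every later query only searches the cached strings.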
answered Sep 7, 2016 at 9:29
\$\endgroup\$
2
  • \$\begingroup\$ Thanks. I thought pdfminer was dead. Let me test pdfminer3k. \$\endgroup\$ Commented Sep 7, 2016 at 9:34
  • \$\begingroup\$ @Rahul Preprocessing sounds better. Isn't it an option for you? \$\endgroup\$ Commented Sep 7, 2016 at 10:14
1
\$\begingroup\$

One thing that might help a lot is to compile your regexes just once. Instead of

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in slist:
            if re.search(s.lower(), content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

try this:

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    searches = [ re.compile(s.lower()) for s in slist ]
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in searches:
            if s.search(content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

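Also, you can short-circuit much sooner than you're doing. Once one pattern matches a page, break out of the inner loop, which also makes the "if i not in pages" check unnecessary: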

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    searches = [ re.compile(s.lower()) for s in slist ]
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in searches:
            if s.search(content) is not None:
                pages.append(i)
                break
    return pages
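
The same two ideas (compile once, stop at the first match) can be tried out without a PDF by running the search over a plain list of page strings. This is only a sketch. The function name matching_pages is hypothetical, and re.IGNORECASE is swapped in for the .lower() calls above as an assumption about what the lowercasing is meant to achieve.

```python
import re


def matching_pages(page_texts, patterns):
    """Return the indices of pages where at least one pattern matches."""
    # Compile every pattern once, outside the page loop.
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    pages = []
    for index, text in enumerate(page_texts):
        # any() stops at the first pattern that matches, like the break above.
        if any(rx.search(text) for rx in compiled):
            pages.append(index)
    return pages
```

Note that the re module keeps an internal cache of recently compiled patterns, so the gain from precompiling may be modest when there are only a few search strings. Measuring before and after is the safest way to tell.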
answered Sep 8, 2016 at 4:15
\$\endgroup\$
0
