I need to read PDF books containing Turkish stories. I found a library called pyPdf, but my test function below does not handle the Turkish characters correctly. I think I need a Turkish codec package. Am I wrong? If I am, how can I solve this problem? If not, where can I find such a codec package?

import os
from StringIO import StringIO

import pyPdf


def getPDFContent(path):
    content = ""
    num_pages = 10  # only the first 10 pages are read
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace runs, treating non-breaking spaces as ordinary spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content


if __name__ == '__main__':
    pdfContent = StringIO(getPDFContent(os.path.abspath("adiaylin-aysekulin.pdf")).encode("utf-8", "ignore"))
    for line in pdfContent:
        print line.strip()
    # raw_input, not input: in Python 2, input() evaluates what is typed
    raw_input("Press Enter to continue...")
asked May 22, 2013 at 16:22
  • What did you mean? Could you explain? Commented May 27, 2013 at 10:27

1 Answer

What error or unexpected output are you getting, specifically?

According to the pyPdf homepage, pyPdf is no longer maintained. But there is a fork called PyPDF2 (GitHub) that promises to "handle a wider range of input PDF instances".

Upgrading to PyPDF2 may solve your problem; I suggest you try that first.
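As a rough sketch of what trying PyPDF2 could look like: the code below assumes a recent PyPDF2 release (2.x or later), where the reader class is `PdfReader` and pages expose `extract_text()`; older releases kept the pyPdf-style names instead. The helper names `clean_text` and `extract_pdf_text` are made up for illustration, and there is no guarantee that this fixes the Turkish characters for a given PDF.

```python
from itertools import islice


def clean_text(text):
    """Collapse whitespace runs, treating U+00A0 as an ordinary space."""
    return u" ".join(text.replace(u"\xa0", u" ").split())


def extract_pdf_text(path, max_pages=10):
    """Return the cleaned text of the first max_pages pages of a PDF."""
    # Imported here so clean_text works even if PyPDF2 is not installed.
    # Assumption: PyPDF2 2.x or later (PdfReader / extract_text names).
    from PyPDF2 import PdfReader

    with open(path, "rb") as handle:
        reader = PdfReader(handle)
        # islice stops early, so short documents do not raise IndexError
        parts = [page.extract_text() or u""
                 for page in islice(reader.pages, max_pages)]
    return clean_text(u"\n".join(parts))
```

Calling `extract_pdf_text("adiaylin-aysekulin.pdf")` would then return one whitespace-normalized string that can be printed or written out as UTF-8.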

answered May 28, 2013 at 12:22

1 Comment

I solved the problem in a different way: I first converted the PDF to text and then read the text file.
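The comment does not say which converter was used, so the sketch below covers only the second step: reading an already converted text file. It assumes the converter wrote UTF-8; the function name `read_stripped_lines` and the default encoding are illustrative choices, not something stated in the thread.

```python
import io


def read_stripped_lines(path, encoding="utf-8"):
    """Return the non-empty lines of a text file, decoded and stripped."""
    # io.open decodes while reading, so Turkish letters arrive as
    # Unicode text rather than raw bytes (works on Python 2 and 3).
    with io.open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

If the converter produced a different encoding (for example a Windows Turkish code page), passing that name as `encoding` should be enough, since Python ships the common Turkish codecs.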
