I need to read PDF books that contain Turkish stories. I found a library called pyPdf, but my test function below does not handle the encoding correctly. I think I need a Turkish codec package. Am I wrong? If so, how can I solve this problem? Otherwise, where can I find this Turkish codec package?
# Python 2
import os
from StringIO import StringIO

import pyPdf


def getPDFContent(path):
    content = ""
    with open(path, "rb") as p:
        pdf = pyPdf.PdfFileReader(p)
        # Read at most the first 10 pages (fewer if the document is shorter).
        num_pages = min(10, pdf.getNumPages())
        for i in range(num_pages):
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace.
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content


if __name__ == '__main__':
    pdfContent = StringIO(getPDFContent(os.path.abspath("adiaylin-aysekulin.pdf")).encode("utf-8", "ignore"))
    for line in pdfContent:
        print line.strip()
    # raw_input, not input: in Python 2, input() evaluates what is typed.
    raw_input("Press Enter to continue...")
What did you want to say? Can you explain? – hinzir, May 27, 2013 at 10:27
1 Answer
What kind of error / unexpected output are you getting specifically?
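One way to answer that question precisely is to list the non-ASCII characters that actually come out of the extraction and check whether they are the expected Turkish letters. The following is only a minimal sketch in Python 3; the helper name `summarize_non_ascii` and the sample string are made up for illustration and are not part of pyPdf.

```python
import unicodedata
from collections import Counter


def summarize_non_ascii(text):
    """Return (char, code point, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Hypothetical sample; in practice, pass the text returned by the extractor.
    sample = "Ad\u0131 Aylin"
    for row in summarize_non_ascii(sample):
        print(row)
```

If the extracted text shows replacement characters or unrelated symbols where letters such as ı, ş, or ğ should be, the problem is more likely in the text extraction itself than in a missing codec.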
According to the pyPdf homepage, pyPdf is no longer maintained. But there is a fork called PyPDF2 (GitHub) that promises to "handle a wider range of input PDF instances".
Upgrading to PyPDF2 may solve your problem; I suggest you try that first.
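As a rough sketch of what the same extraction might look like after switching, the code below assumes Python 3 and PyPDF2 2.0 or later, where the reader class is `PdfReader` and pages expose `extract_text()`. The function names `normalize_whitespace` and `get_pdf_content` are my own, and whether the Turkish characters come out correctly still depends on how the PDF embeds its fonts.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to single spaces."""
    return " ".join(text.replace("\xa0", " ").split())


def get_pdf_content(path, max_pages=10):
    """Extract text from up to max_pages pages of the PDF at path."""
    # Imported here so normalize_whitespace works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(path)
    num_pages = min(max_pages, len(reader.pages))
    parts = []
    for i in range(num_pages):
        # Fall back to an empty string if a page yields no text.
        parts.append(reader.pages[i].extract_text() or "")
    return normalize_whitespace("\n".join(parts))


if __name__ == "__main__":
    import os
    import sys

    # The file name comes from the question; skip quietly if it is not present.
    pdf_path = sys.argv[1] if len(sys.argv) > 1 else "adiaylin-aysekulin.pdf"
    if os.path.exists(pdf_path):
        print(get_pdf_content(pdf_path))
```

In Python 3 the extracted text is already a Unicode string, so the explicit `encode("utf-8", "ignore")` step from the question is not needed just to print it.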