import re

# Opens the file, which was saved as a "Unicode" text file
buffer = open('r.txt', 'r').read()
quotes = re.findall(ur'"[^"^\u201c]*["\u201d].*', buffer)
for quote in quotes:
    print ''
    print quote  # prints each quote found

# Problem: the printed output has rectangular blocks between each character.
Why does this happen, and how can I print the output without the rectangular blocks messing everything up?
asked May 11, 2012 at 15:01
aromamode
The file I used was a basic save, Unicode text file, with text copied from a PDF. – aromamode, May 11, 2012 at 15:40
parisis.files.wordpress.com/2011/01/noam-chomsky.pdf – aromamode, May 11, 2012 at 15:41
How do you know the text file is Unicode? What OS are you running Acrobat in? In Windows it saves as a code page where the quotes are 0x93 and 0x94. – Mark Ransom, May 11, 2012 at 16:03
When I save the text file it gives options for encoding. They are: ANSI, Unicode, Unicode big endian and UTF-8. I used Unicode. I'm running Windows. – aromamode, May 11, 2012 at 16:14
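One way to check which of those save options a file actually used is to look at its first bytes for a byte-order mark (BOM). The sketch below is illustrative only: `sniff_bom` is a hypothetical helper, the sample text is made up rather than taken from the asker's file, and it assumes the editor writes a BOM for the Unicode options. It is written to run on Python 2.7 and Python 3.

```python
import codecs

def sniff_bom(raw):
    """Guess an encoding from a leading byte-order mark, or return None."""
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The 'utf-16' codec reads the BOM and picks the byte order itself.
        return 'utf-16'
    # No BOM: possibly a code page such as cp1252 (the "ANSI" option).
    return None

# Hypothetical sample: curly quotes around a word, built in memory.
sample = u'\u201cHello\u201d'
as_unicode = codecs.BOM_UTF16_LE + sample.encode('utf-16-le')
as_ansi = sample.encode('cp1252')   # curly quotes become bytes 0x93 and 0x94

print(sniff_bom(as_unicode))
print(sniff_bom(as_ansi))
```

For a real file, the same helper could be applied to the first few bytes read in binary mode, for example `open('r.txt', 'rb').read(4)`.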
2 Answers
You're opening it incorrectly. And "Unicode" in Windows is actually UTF-16LE.
import codecs
buffer = codecs.open('r.txt', 'r', encoding='utf-16le').read()
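To see the effect of decoding, here is a minimal self-contained sketch that writes a small UTF-16 file to a temporary directory and reads it back both ways. The sample sentence and the simplified quote pattern are assumptions for illustration, not the asker's data or regex. It uses `io.open` with the `'utf-16'` codec, which consumes the BOM on reading; `'utf-16le'` would decode the same bytes but leave the BOM in the text as `u'\ufeff'`.

```python
import io
import os
import re
import tempfile

# Hypothetical sample text with curly quotes.
text = u'He said \u201cfoo\u201d and left.'
path = os.path.join(tempfile.mkdtemp(), 'r.txt')

# Write the file as UTF-16 with a BOM.
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(text)

# Reading in binary mode shows the raw bytes: ASCII characters are
# stored as two bytes each, one of which is NUL.
with io.open(path, 'rb') as f:
    raw = f.read()

# Reading with the right encoding gives proper text back.
with io.open(path, 'r', encoding='utf-16') as f:
    decoded = f.read()

# A simplified pattern: an opening curly quote up to the closing one.
quotes = re.findall(u'\u201c[^\u201d]*\u201d', decoded)
```

Those NUL bytes in `raw` are a plausible source of the rectangular blocks when the undecoded data is printed to a console.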
answered May 11, 2012 at 15:10
Ignacio Vazquez-Abrams
3 Comments
Mark Ransom
I wonder how that regular expression was finding anything if the read was messed up?
Ignacio Vazquez-Abrams
@Mark: An interesting question. I suspect that my answer isn't completely correct, but it is about 90% of the way there (e.g. the file is in the system encoding instead of UTF-16LE).
aromamode
Thanks for the help. The regex does need some work, which I can do now. Cheers.
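On the question of how the regular expression could match at all: one possible explanation, assuming the file really was UTF-16LE, is that an ASCII double quote is still present as its own byte in the raw data, so a pattern anchored on it can match while carrying the interleaved NUL bytes along. The sketch below uses made-up sample text and a simplified bytes pattern (not the asker's regex) and runs on Python 2.7 and Python 3.

```python
import re

text = u'He said "hi" today'

# Undecoded UTF-16LE: every ASCII character is followed by a NUL byte.
raw = text.encode('utf-16-le')
byte_matches = re.findall(b'"[^"]*"', raw)

# Decoded first: the same pattern, as text, matches cleanly.
text_matches = re.findall(u'"[^"]*"', raw.decode('utf-16-le'))

print(byte_matches)
print(text_matches)
```

The match on the raw bytes contains a NUL after each character inside the quotes, while the match on the decoded text does not.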
This isn't related to Python. Your console window renders Python's output, and that rendering is what breaks.
Use a font in your console window that supports the necessary Unicode characters.
answered May 11, 2012 at 15:04
Aaron Digulla
1 Comment
aromamode
The above isn't really helpful. It seems to me the problem came from how Python was used, and it seems to have been fixed in Python.