I am parsing some old exams to determine how often each question appears (many questions resurface on this year's exam). I am using pyperclip to get the input for re.findall.
This is the regex I am using:
pattern = re.compile(ur'\d.[a-zA-Z .,\']+\?', re.UNICODE)
This is an example question from an older exam (the kind of text I am trying to match):
9. In Wycherley’s The Country Wife, what does Mr. Pinchwife threaten to inscribe on Mrs. Pinchwife’s face with his penknife?
The apostrophe is not one I can find on my keyboard, and trying to execute the code results in this error:
File "examAnalyzer.py", line 7
pattern = re.compile(ur'\d.[a-zA-Z .,\Æ]+\?', re.UNICODE)
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
I am using Python 2.7.11 and Anaconda 4.0, and I edit the Python file in Vim.
You can use the \u representation of the apostrophe, which is \u2019.
Also, the dot should be escaped to match a literal dot symbol.
Use:
ur'\d\.[a-zA-Z .,\'\u2019]+\?'
     ^^            ^^^^^^
When in doubt about the hex representation of a symbol, you can check it at r12a >> apps >> Unicode code converter.
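As a quick check, the corrected pattern can be run against the sample question from the post. This is a minimal sketch, not the asker's script: the variable names are illustrative, and the pattern is written as an ordinary u'' string with doubled backslashes (rather than the ur'' form) so that the same file runs on both Python 2.7 and Python 3. Every non-ASCII character is spelled as a \u2019 escape, so the source file stays pure ASCII and its on-disk encoding does not matter.

```python
import re

# Sample question from the post; the curly apostrophe is written as \u2019
# (RIGHT SINGLE QUOTATION MARK) so this file contains only ASCII bytes.
text = (u"9. In Wycherley\u2019s The Country Wife, what does Mr. Pinchwife "
        u"threaten to inscribe on Mrs. Pinchwife\u2019s face with his penknife?")

# Same pattern as in the answer: a digit, a literal dot, then letters, spaces,
# dots, commas, and straight or curly apostrophes, up to a question mark.
pattern = re.compile(u"\\d\\.[a-zA-Z .,'\u2019]+\\?", re.UNICODE)

matches = pattern.findall(text)
print(len(matches))  # 1
```

Note that \d matches a single digit, so for a question numbered 10 or higher the match would start at the last digit of the number; \d+ would capture the whole number.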
Your Python file declares a file encoding of utf8, but the file itself is saved in another encoding.
You should give the correct encoding in the first line:
# -*- coding: <correct encoding> -*-
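The byte 0x92 in the error message hints at what that other encoding might be. As a hedged illustration (the post never states how the file was actually saved, so Windows-1252 is an assumption here), the sketch below shows that 0x92 is not valid UTF-8 on its own but decodes to U+2019, the curly apostrophe, under cp1252.

```python
raw = b"\x92"  # the byte reported in the SyntaxError

# Under UTF-8, 0x92 is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Under Windows-1252 (the assumed encoding), the same byte is U+2019.
decoded = raw.decode("cp1252")

print(utf8_ok)               # False
print(decoded == u"\u2019")  # True
```

If that assumption holds, the declaration would read # -*- coding: cp1252 -*-; alternatively, the file can be re-saved as UTF-8 so that a utf-8 declaration matches its contents.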
You need a # -*- coding: directive at the top of your script so the Python interpreter knows what encoding has been used (there are several valid forms, see PEP 263), and you also need to tell your editor / IDE to use UTF-8.
\u2019? BTW, the . should be escaped when you need to match a literal dot. Try ur'\d\.[a-zA-Z .,\'\u2019]+\?'