I am currently parsing some old exams to determine how often each question appears (because many questions are likely to resurface on this year's exam). I am using pyperclip to get the input for re.findall.

This is the regex I am using: pattern = re.compile(ur'\d.[a-zA-Z .,\']+\?', re.UNICODE). Here is an example question from an older exam (the kind of text I am trying to match): 9. In Wycherley’s The Country Wife, what does Mr. Pinchwife threaten to inscribe on Mrs. Pinchwife’s face with his penknife? The apostrophe in it (’) is not one I can find on my keyboard, and running the code with it in the pattern results in this error:

 File "examAnalyzer.py", line 7
 pattern = re.compile(ur'\d.[a-zA-Z .,\Æ]+\?', re.UNICODE)
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

I am using Python 2.7.11 and Anaconda 4.0, and the Python file is edited using VIM.

PM 2Ring
asked May 27, 2016 at 20:00
  • Can't you use \u2019? BTW, the . should be escaped when you need to match a literal dot. Try ur'\d\.[a-zA-Z .,\'\u2019]+\?' Commented May 27, 2016 at 20:09
  • Would you look at that, it works flawlessly! Commented May 27, 2016 at 20:16

2 Answers

You can use the \u representation of the apostrophe, which is \u2019.

Also, the dot should be escaped to match a literal dot symbol.

Use

ur'\d\.[a-zA-Z .,\'\u2019]+\?'
     ^^            ^^^^^^

If you are unsure of a character's hex code point, you can look it up with r12a >> apps >> Unicode code converter.
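As a rough illustration of the pattern above, here is a minimal sketch that runs it against the example question from the post. It is written so that it should behave the same on Python 2.7 and Python 3: instead of the ur'' prefix (which Python 3 rejects), it uses a non-raw unicode string with the regex backslashes doubled. The names PATTERN, sample and matches are only illustrative, and the source file stays pure ASCII because the apostrophe is written as \u2019.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Non-raw unicode string: \u2019 is turned into the right single quotation
# mark by Python itself, so the backslashes meant for the regex are doubled.
PATTERN = re.compile('\\d\\.[a-zA-Z .,\'\u2019]+\\?', re.UNICODE)

# The example question from the post, with the apostrophes written as \u2019.
sample = ('9. In Wycherley\u2019s The Country Wife, what does Mr. Pinchwife '
          'threaten to inscribe on Mrs. Pinchwife\u2019s face with his penknife?')

matches = PATTERN.findall(sample)
print(len(matches))  # prints 1
```

Note that \d matches a single digit, so for question numbers of 10 and above the match would start at the last digit; \d+ may be worth considering if that matters.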

answered May 27, 2016 at 20:18

Your Python file declares UTF-8 as its encoding, but the file itself is saved in a different encoding.

You should give the correct encoding in the first line:

# -*- coding: <correct encoding> -*-
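To see why the interpreter complains about byte 0x92, the following sketch may help. It assumes (this is a guess, not something stated in the question) that the file was saved as Windows-1252, where 0x92 is the right single quotation mark. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# The byte reported in the SyntaxError message.
raw = b'\x92'

# 0x92 is a continuation byte in UTF-8, so it cannot start a character.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Assumption: the file was saved as Windows-1252 (cp1252), where 0x92
# maps to U+2019 RIGHT SINGLE QUOTATION MARK.
as_cp1252 = raw.decode('cp1252')

# What the same character looks like when the file really is UTF-8.
utf8_bytes = as_cp1252.encode('utf-8')
```

If that assumption holds, either declaring cp1252 in the coding line or re-saving the file as UTF-8 (in Vim, :set fileencoding=utf-8 followed by :w) should make the declaration and the bytes on disk agree.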
answered May 27, 2016 at 20:15

2 Comments

Which encoding is the correct one? I have declared utf-8 as the encoding in this script.
@gloriousCatnip: If you want to include Unicode literal characters in your script I recommend you use UTF-8. Use a valid UTF-8 # -*- coding: directive at the top of your script so the Python interpreter knows what encoding has been used (there are several valid forms, see PEP 263), and you also need to tell your editor / IDE to use UTF-8.
