Python regex is having problems finding a special unicode character

Asked 9 years, 7 months ago

Viewed 263 times

I am currently parsing some old exams to determine how often each question appears (because many questions are likely to resurface in this year's exam). I am using pyperclip to get the input for re.findall.

This is the regex I am using: pattern = re.compile(ur'\d.[a-zA-Z .,\']+\?', re.UNICODE). Here is an example question from an older exam (the kind of text I am trying to match): 9. In Wycherley’s The Country Wife, what does Mr. Pinchwife threaten to inscribe on Mrs. Pinchwife’s face with his penknife? The apostrophe in it is not one I can find on my keyboard, and with it in the pattern, running the code results in this error:

 File "examAnalyzer.py", line 7
 pattern = re.compile(ur'\d.[a-zA-Z .,\Æ]+\?', re.UNICODE)
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

I am using Python 2.7.11 and Anaconda 4.0, and the Python file is edited using VIM.

edited May 27, 2016 at 20:37

PM 2Ring

asked May 27, 2016 at 20:00

gloriousCatnip

Can't you use \u2019? BTW, the . should be escaped when you need to match a literal dot. Try ur'\d\.[a-zA-Z .,\'\u2019]+\?'

– Wiktor Stribiżew
Commented May 27, 2016 at 20:09
Would you look at that, it works flawlessly!

– gloriousCatnip
Commented May 27, 2016 at 20:16

2 Answers

You can use the \u representation of the apostrophe, which is \u2019.

Also, the dot should be escaped to match a literal dot symbol.

Use

ur'\d\.[a-zA-Z .,\'\u2019]+\?'
     ^^            ^^^^^^

If you are unsure of a character's hex code point, you can look it up with the Unicode code converter in the r12a apps.
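As a quick check, here is a minimal sketch of the suggested pattern run against the example question from the post. The ur'' prefix only exists in Python 2, so this version assumes a plain u'' literal with doubled backslashes, which should behave the same on Python 2.7 and Python 3.3+. The names PATTERN, sample and matches are only for illustration.

```python
import re

# Non-raw unicode literal: backslashes meant for the regex engine are doubled,
# while \u2019 (right single quotation mark) is resolved by Python itself.
PATTERN = re.compile(u"\\d\\.[a-zA-Z .,'\u2019]+\\?", re.UNICODE)

# The example question from the post, with the typographic apostrophe
# written as an escape so the source file stays pure ASCII.
sample = (u"9. In Wycherley\u2019s The Country Wife, what does Mr. Pinchwife "
          u"threaten to inscribe on Mrs. Pinchwife\u2019s face with his penknife?")

matches = PATTERN.findall(sample)
print(len(matches))  # number of questions found in the sample
```

Because the apostrophe is written as an escape sequence, the script contains no non-ASCII bytes, so the encoding the editor saves the file in no longer matters for this line.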

answered May 27, 2016 at 20:18

Wiktor Stribiżew

Your Python file declares a file encoding of UTF-8, but the file itself is saved in a different encoding.

You should give the correct encoding in the first line:

# -*- coding: <correct encoding> -*-
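The byte 0x92 from the error message gives a hint about what happened. The sketch below assumes the file was saved as Windows-1252 (cp1252), where 0x92 is the right single quotation mark; the post does not confirm which encoding Vim actually used, so treat that as a guess. The variable names are illustrative.

```python
# The byte the interpreter complained about in the SyntaxError.
raw = b"\x92"

# 0x92 is a continuation byte in UTF-8, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the file was saved as Windows-1252, where 0x92 maps to U+2019.
as_cp1252 = raw.decode("cp1252")

# What the same character looks like when the file really is UTF-8.
utf8_bytes = u"\u2019".encode("utf-8")

print(utf8_ok, repr(as_cp1252), repr(utf8_bytes))
```

If that assumption holds, either declaring cp1252 in the coding line or re-saving the file as UTF-8 should make the declaration and the bytes on disk agree.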

answered May 27, 2016 at 20:15

Daniel

2 Comments

gloriousCatnip Over a year ago

Which encoding is the correct one? I have declared utf-8 as the encoding in this script.

May 27, 2016 at 20:16

PM 2Ring Over a year ago

@gloriousCatnip: If you want to include Unicode literal characters in your script I recommend you use UTF-8. Use a valid UTF-8 # -*- coding: directive at the top of your script so the Python interpreter knows what encoding has been used (there are several valid forms, see PEP 263), and you also need to tell your editor / IDE to use UTF-8.

May 27, 2016 at 20:46
