Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
 """^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)

For some reason, adding the Verbose flag, X, to compile breaks the pattern.

Also, I wanted whitespace_pattern to recognize a Unicode character (supposedly, we'll get patterns that use non-unicode spaces, and we need to explicitly check for that one Unicode character as a valid space), but the pattern keeps breaking.

asked Apr 4, 2011 at 15:50

prafulfillment

3 Answers

VERBOSE gives you the ability to write comments in your regex to document it.

To make that possible, it ignores whitespace in the pattern (except inside a character class or after a backslash), since you need spaces and line breaks to lay out the comments.

Replace every literal space in your regex with \s to make it explicit that these are spaces you want to match in your pattern, and not just whitespace used to format your comments.

What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret escape sequences such as \n, and lets you write backslashes without doubling them.
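A minimal sketch of the whitespace behaviour described above, assuming Python 3 and the standard re module. The sample string and the variable names are made up for illustration.

```python
import re

# Without VERBOSE, the literal space in the pattern must match a space.
plain = re.compile(r"time signature:")

# With VERBOSE, the unescaped space is dropped, so this pattern behaves
# like r"timesignature:".
verbose_unescaped = re.compile(r"time signature:", re.VERBOSE)

# Writing the space as \s keeps it significant and leaves room for comments.
verbose_escaped = re.compile(
    r"""
    time \s signature:   # the two words, separated by one whitespace character
    """,
    re.VERBOSE,
)

sample = "time signature: 3/4"  # made-up input
print(bool(plain.match(sample)))              # True
print(bool(verbose_unescaped.match(sample)))  # False
print(bool(verbose_escaped.match(sample)))    # True
```

Note that \s matches any whitespace character, not only U+0020, so it is slightly more permissive than the literal space it replaces.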

edited Apr 6, 2011 at 2:15

prafulfillment

answered Apr 4, 2011 at 16:00

Bite code

Always define regexes with the r prefix to indicate they are raw strings.

 r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
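To illustrate why the r prefix matters, here is a small sketch, assuming Python 3. Sequences such as \d and \s happen to pass through a normal string literal unchanged, which is why the pattern in the question mostly worked without the prefix; \b is one that does not. The names and the sample text are made up for illustration.

```python
import re

# In a normal string literal, "\b" is a single backspace character (0x08).
# In a raw string literal, r"\b" is two characters, a backslash and "b",
# which the re module reads as a word boundary.
normal = "\bsig\b"
raw = r"\bsig\b"

print(len(normal), len(raw))  # 5 7

text = "time sig: 4/4"  # made-up input
print(re.search(normal, text))       # None (there is no backspace in text)
print(re.search(raw, text).group())  # sig
```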

answered Apr 4, 2011 at 15:55

Daniel Roseman

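When creating a regex to match Unicode characters, you do not want to put the character itself into the pattern through a Python unicode string: u"\u0020" inserts an actual space character, which VERBOSE then ignores. In your example the regular expression engine needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020". (Note that the re module only interprets \uXXXX escapes itself in Python 3.3 and later.)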

As other answers have mentioned, you should also use the r prefix for time_sig_pattern. After those two changes your code should work fine.
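The two changes can be sketched as follows. This assumes Python 3.3 or later, where the re module understands \uXXXX escapes on its own; the explicit re. prefixes and the sample input are additions for illustration and are not part of the original code.

```python
import re

# Assumption: Python 3.3+, so re translates the six characters \u0020
# into U+0020 (a space) when it compiles the pattern.
whitespace_pattern = r"\u0020"

time_sig_pattern = (
    r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$"""
    % {"ws": whitespace_pattern}
)
time_sig = re.compile(time_sig_pattern, re.UNICODE | re.MULTILINE)

m = time_sig.match("time signature: 6 / 8")  # made-up input
print(m.group("top"), m.group("bottom"))  # 6 8
```

Because the space is spelled as an escape rather than as a literal character, the whitespace_pattern part would also survive the VERBOSE flag; the literal space inside "time signature" would not.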

For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ], as described for re.VERBOSE in the re module documentation.
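As a rough sketch of what the pattern could look like with VERBOSE enabled (assuming Python 3; the layout, the comments, the short name ws, and the sample input are illustrative choices, not the asker's code):

```python
import re

ws = r"\s"  # plays the role of whitespace_pattern from the question

time_sig = re.compile(
    r"""
    ^ %(ws)s*
    time [ ] signature:        # [ ] keeps this space significant under VERBOSE
    %(ws)s* (?P<top>\d+)       # top number
    %(ws)s* \/                 # the slash
    %(ws)s* (?P<bottom>\d+)    # bottom number
    %(ws)s* $
    """ % {"ws": ws},
    re.VERBOSE | re.UNICODE | re.MULTILINE,
)

m = time_sig.match("  time signature: 12 / 8 ")  # made-up input
print(m.group("top"), m.group("bottom"))  # 12 8

# The three spellings of an escaped space behave the same for this input.
for escaped in (r"time\ signature", r"time\ssignature", r"time[ ]signature"):
    print(bool(re.match(escaped, "time signature", re.VERBOSE)))  # True
```

Of the three, [ ] and "\ " match only U+0020, while \s also accepts tabs, newlines, and other whitespace.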

answered Apr 4, 2011 at 16:45

Andrew Clark
