Matching Unicode characters in Python regular expressions

Question 1

I have read through the other questions on Stack Overflow but am still no closer. Sorry if this has already been answered, but I couldn't get anything proposed there to work.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or something more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters such as øæå? I'd like to be able to match those characters as well, both in the tag group above and in the filename group.

Question 2

Make sure you normalize your strings, because different code point sequences can produce the same visual appearance.
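A minimal sketch of that point, assuming Python 3 (the sample strings are mine, not from the question). The letter å can be stored as one precomposed code point or as a plain a followed by a combining ring. \w does not match the combining mark, so the decomposed spelling fails until it is normalized to NFC:

```python
import re
import unicodedata

composed = 'p\u00e5ske'      # 'å' as a single code point (U+00E5)
decomposed = 'pa\u030aske'   # 'a' followed by COMBINING RING ABOVE (U+030A)

# Both render as "påske", but they are different code point sequences.
same_before = composed == decomposed

# \w does not match the combining mark, so the decomposed form fails.
raw_match = re.fullmatch(r'\w+', decomposed)

# Normalizing to NFC composes the pair into U+00E5, and the match succeeds.
normalized = unicodedata.normalize('NFC', decomposed)
nfc_match = re.fullmatch(r'\w+', normalized)

print(same_before, raw_match is None, nfc_match is not None)
```

Normalizing both the stored names and the incoming path to the same form before matching or comparing avoids this class of mismatch.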

Question 3

You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

This is in Python 2. In Python 3 the u prefix is unnecessary because all strings are Unicode (it is accepted again since Python 3.3 but has no effect), and you can leave off the re.UNICODE flag.
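For comparison, here is a rough Python 3 version of the same match. The pattern is the one from the question; the helper name parse_path is made up for illustration:

```python
import re

# Same pattern as in the question; no re.UNICODE flag and no u prefix needed.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')

def parse_path(path):
    """Return the named groups as a dict, or None if the path does not match.

    parse_path is a hypothetical helper, not part of the original answer.
    """
    m = PATTERN.match(path)
    return m.groupdict() if m else None

print(parse_path('/by_tag/påske/øyfjell.jpg'))
print(parse_path('/by_tag/xmas/xmas1.jpg'))
```

Checking for None before calling groupdict() also avoids the AttributeError shown in the question when a path does not match.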

Question 4

+1 for: and input your string as a Unicode string by using the u prefix

Question 5

I don't understand, why do we need to pass in re.UNICODE? (I'm using python 3)

Question 6

@CharlieParker Notice the date of this answer :) In Python 3, re.UNICODE does nothing.

Question 7

You need the UNICODE flag (in Python 2 the subject must also be a unicode string, hence the u prefix):

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE)

Question 8

Is it required for Python3 too?

Question 9

@Kevin - you don't need the unicode flag with Python 3. "Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns..." - docs.python.org/3/howto/regex.html
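A small sketch that shows this default in Python 3. The variable names are mine. A str pattern already carries the re.UNICODE flag, passing it explicitly changes nothing, and re.ASCII is the way to opt out:

```python
import re

word = re.compile(r'\w+')                   # str pattern: Unicode matching by default
ascii_word = re.compile(r'\w+', re.ASCII)   # explicit opt-out: \w is [a-zA-Z0-9_] only

# The compiled pattern already has re.UNICODE set, flag or no flag.
unicode_is_default = bool(word.flags & re.UNICODE)
explicit_is_same = re.compile(r'\w+', re.UNICODE).flags == word.flags

print(unicode_is_default, explicit_is_same)
print(word.fullmatch('påske') is not None)    # matches with default Unicode semantics
print(ascii_word.fullmatch('påske') is None)  # fails because 'å' is not ASCII
```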

Question 11

In Python 2, you need the re.UNICODE flag and the unicode string constructor:

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+

(In the last example, the comma is the fullwidth Chinese comma, U+FF0C, which \w does not match.)
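As a sketch of the Python 3 equivalent: the same substitutions work without unicode() or re.UNICODE, since str is already Unicode. The list names are mine, and the fullwidth comma is written as the escape \uff0c so it is easy to see:

```python
import re

samples = [
    ',./hello-=+',
    ',./cześć-=+',
    ',./привет-=+',
    ',./你好-=+',
    ',./你好\uff0c世界-=+',   # \uff0c is the fullwidth comma
]

# \w matches letters in any script; punctuation (including U+FF0C) is left alone.
results = [re.sub(r'\w+', '___', s) for s in samples]

for original, replaced in zip(samples, results):
    print(original, '->', replaced)
```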

The accepted answer above (the one that uses the u prefix together with re.UNICODE; score 52) was posted by Thomas on 2011-02-17 12:18:18Z and last edited on Jun 30, 2022 at 20:01. Its three comments are from Tamm (2013-12-18), Charlie Parker (2022-06-30), and Thomas (2022-06-30).
