Python unicode regex issue

Asked 10 years, 2 months ago

Viewed 78 times

Why does this work:

>>> ss
u'\U0001f300'
>>> r = re.compile(u"[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this works
<_sre.SRE_Match object at 0x7f359acf03d8>

But this doesn't:

>>> r = re.compile("[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this doesn't

Based on Ignacio's answer below, this also works:

>>> r = re.compile(u"[\U0001F300-\U0001F5FF]+", re.UNICODE)
>>> r.search(ss)
<_sre.SRE_Match object at 0x7f359acf03d8>

edited Oct 22, 2015 at 0:50

asked Oct 22, 2015 at 0:41

Ankur Agarwal

Those u'..' inside the character classes are not doing anything except including u as a legal match - along with the apostrophe, twice.

– Mark Reed
Commented Oct 22, 2015 at 0:50

@MarkReed I don't understand. Based on what you said, how did my very first match succeed (in my post above)?

– Ankur Agarwal
Commented Oct 22, 2015 at 0:52

Your first match says: "match one or more of any of the codepoints u, single apostrophe, and any character from U+1F300 to U+1F5FF". ss contains the single codepoint U+1F300, which meets the requirements.

– Mark Tolonen
Commented Oct 22, 2015 at 0:59

Character classes are "or"s. [ax-z] matches any of a, x, y or z. Your character class matches u or ' or U+1F300 or U+1F301 or ... or U+1F5FE or U+1F5FF.

– Mark Reed
Commented Oct 22, 2015 at 1:03
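
To make the comments above concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode and the u prefix is optional). The variable names are illustrative and do not come from the thread.

```python
import re

# [ax-z] reads as "a or x or y or z".
simple = re.compile(u"[ax-z]")
simple_hits = [ch for ch in u"abcxyz" if simple.match(ch)]

# The class from the question: u, the apostrophe (listed twice),
# and every code point from U+1F300 to U+1F5FF.
question_class = re.compile(u"[u'\U0001F300-\U0001F5FF']+")

matches_cyclone = question_class.search(u"\U0001F300") is not None
matches_plain_u = question_class.search(u"u") is not None
matches_quote = question_class.search(u"'") is not None
matches_other = question_class.search(u"abc") is not None

print(simple_hits)
print(matches_cyclone, matches_plain_u, matches_quote, matches_other)
```

The first match in the question succeeds because U+1F300 is one of the alternatives in the class; the stray u and apostrophe only widen the set of accepted characters.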

re.UNICODE only affects the behavior of the shorthand classes \d, \s, \w (together with their complements and \b, \B); it has nothing to do with the Unicode/byte semantics of the regex engine.

– nhahtdh
Commented Oct 22, 2015 at 4:41
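
A rough illustration of that point, assuming Python 3 rather than the Python 2 used in the question: there, Unicode-aware shorthand classes are already the default for str patterns, so the opposite flag, re.ASCII, is used to show the difference. The sample text is made up for this sketch.

```python
import re

# "café", a space, then ARABIC-INDIC DIGIT THREE (U+0663).
text = u"caf\u00e9 \u0663"

# Default for str patterns in Python 3: \w and \d follow Unicode properties.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# re.ASCII restricts the same shorthand classes to ASCII.
ascii_words = re.findall(r"\w+", text, re.ASCII)
ascii_digits = re.findall(r"\d", text, re.ASCII)

# An explicit code point range behaves the same with or without the flag.
explicit_default = re.findall(u"[\u00e0-\u00ff]", text)
explicit_ascii = re.findall(u"[\u00e0-\u00ff]", text, re.ASCII)

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
print(explicit_default == explicit_ascii)
```

The flag changes what \w and \d mean, but an explicit range such as the one in the question is matched code point by code point either way.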

1 Answer

Use a unicode pattern when performing a search on a unicode haystack.

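Also, the u'...' should not be inside the pattern. The \U escapes already produce Unicode characters in a unicode literal without it; inside a character class, the extra u and apostrophes are just additional characters to match.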
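
As a minimal sketch of both points, assuming Python 3 (where str is always Unicode and mixing bytes patterns with str input is rejected outright); the names symbols, mixed, and runs are illustrative only:

```python
import re

ss = u"\U0001F300"

# Unicode pattern with only the intended range in the class.
symbols = re.compile(u"[\U0001F300-\U0001F5FF]+")

found = symbols.search(ss)
found_text = found.group(0) if found else None

# Runs of characters from the range are picked out of ordinary text.
mixed = u"rain \U0001F327\U0001F300 then sun"
runs = symbols.findall(mixed)

# The stray characters from the original class no longer match.
stray = symbols.search(u"u'")

# Python 3 refuses to apply a bytes pattern to a str haystack.
try:
    re.compile(b"[a-z]+").search(ss)
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True

print(found_text == ss, runs == [u"\U0001F327\U0001F300"], stray, mismatch_rejected)
```

In Python 2, as used in the question, the mismatch is not an error; the byte-string pattern simply fails to match, which is why keeping pattern and haystack the same type matters there.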

answered Oct 22, 2015 at 0:47

Ignacio Vazquez-Abrams