Why does this work:
>>> ss
u'\U0001f300'
>>> r = re.compile(u"[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this works
<_sre.SRE_Match object at 0x7f359acf03d8>
But this doesn't:
>>> r = re.compile("[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this doesn't
Based on Ignacio's answer below, this also works:
>>> r = re.compile(u"[\U0001F300-\U0001F5FF]+", re.UNICODE)
>>> r.search(ss)
<_sre.SRE_Match object at 0x7f359acf03d8>
asked Oct 22, 2015 at 0:41
Ankur Agarwal
1 Answer
Use a unicode pattern when performing a search on a unicode haystack.
Also, the u'...' should not be in the pattern. In a unicode pattern the \U escapes are already Unicode characters without it, so the extra u and apostrophes only add literal characters to the character class.
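As a rough illustration of the second point, here is a minimal sketch. It is written for Python 3, which is an assumption on my part (the question uses Python 2); in Python 3 every str is already Unicode, so no u prefix is needed. The names clean and noisy are made up for this example.

```python
import re

ss = "\U0001F300"

# Character class containing only the intended range.
clean = re.compile("[\U0001F300-\U0001F5FF]+")

# Same range, but with u and two apostrophes left inside the class,
# as in the question's pattern.
noisy = re.compile("[u'\U0001F300-\U0001F5FF']+")

# Both patterns find the emoji.
print(clean.search(ss) is not None)
print(noisy.search(ss) is not None)

# Only the noisy pattern also accepts the stray literal characters.
print(clean.search("u'"))
print(noisy.search("u'").group())
```

The noisy pattern still finds the emoji, which is why the first attempt in the question appeared to work, but it would also match text made up of nothing but u and apostrophes.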
answered Oct 22, 2015 at 0:47
Ignacio Vazquez-Abrams
Comments
The u'..' inside the character class is not doing anything except including u as a legal match - along with the apostrophe, twice.
The pattern reads as "one or more of: u, single apostrophe, and any character from U+1F300 to U+1F5FF". ss contains the single codepoint U+1F300, which meets the requirements.
[ax-z] matches any of a, x, y or z. Your character class matches u or ' or U+1F300 or U+1F301 or ... or U+1F5FE or U+1F5FF.
re.UNICODE only affects the behavior of the shorthand escapes such as \d, \s, \w (and their counterparts \D, \S, \W, \b, \B) and has nothing to do with the Unicode/byte semantics of the regex engine.
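The two points in these comments can be checked with a short sketch. It assumes Python 3, where str patterns use Unicode matching by default and re.ASCII switches the shorthand escapes back to ASCII-only behavior (the reverse of passing re.UNICODE in Python 2). The variable names are invented for the example.

```python
import re

# [ax-z] is the literal 'a' plus the range x..z.
small = re.compile("[ax-z]")
matched = [c for c in "abwxyz" if small.fullmatch(c)]
print(matched)  # ['a', 'x', 'y', 'z']

# The Unicode/ASCII flag changes what \w means...
word_unicode = re.compile(r"\w")            # Unicode-aware by default for str
word_ascii = re.compile(r"\w", re.ASCII)    # restricted to [a-zA-Z0-9_]
print(word_unicode.match("\u00e9") is not None)
print(word_ascii.match("\u00e9") is not None)

# ...but an explicit code point range is unaffected by the flag.
emoji_ascii = re.compile("[\U0001F300-\U0001F5FF]", re.ASCII)
print(emoji_ascii.match("\U0001F300") is not None)
```

So the flag decides how \w and friends classify characters, while whether the emoji range can match at all depends on the pattern and the searched string both being Unicode text.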