I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
I know they exist here for sure. I tried:
re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')
but it doesn't work. String stays the same. What am I doing wrong?
2 Answers 2
Are you using python 2.x or 3.0?
If you're using 2.x, try making the regex string a unicode-escape string, with 'u'. Since it's regex it's good practice to make your regex string a raw string, with 'r'. Also, putting your entire pattern in parentheses is superfluous.
re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)
http://docs.python.org/tutorial/introduction.html#unicode-strings
Edit:
It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects character class aliases like \w or \b, of which this pattern does not use any and so would not be affected by.
3 Comments
u and r prefixes. That's pretty cool!SyntaxError: invalid syntax in Python 3.6Use unicode strings. Use the re.UNICODE flag.
>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+',
re.UNICODE)
>>> myre
<_sre.SRE_Pattern object at 0xb20b378>
>>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> result = myre.sub('', mystr)
>>> len(mystr), len(result)
(38, 22)
>>> print result
بسم الله الرحمن الرحيم
Read the article by Joel Spolsky called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
4 Comments
re in python, you have to know that it doesn't support Unicode character property (like \p{L}). pypi.python.org/pypi/regex does.re.UNICODE flag is useless here, since it only affects shorthand character classes \w, \d, \s.Explore related questions
See similar questions with these tags.