
Python and regular expression with Unicode

Asked 17 years ago

Viewed 114k times

I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'

I know they exist here for sure. I tried:

re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')

but it doesn't work: the string stays the same. What am I doing wrong?

edited Jul 20, 2023 at 22:48 by Karl Knechtel

asked Dec 26, 2008 at 14:40 by bsn

2 Answers

112

Are you using Python 2.x or 3.0?

If you're using 2.x, make the pattern a Unicode string by adding the 'u' prefix; in a plain byte-string literal the \uXXXX escapes are not interpreted as Unicode characters. Since it's a regex, it's also good practice to make the pattern a raw string with the 'r' prefix. Finally, putting the entire pattern in parentheses is superfluous.

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings

Edit:

It's also good practice to use the re.UNICODE flag (also spelled re.U, or (?u) inline) for Unicode regexes, but it only affects character class aliases like \w or \b. This pattern uses none of them, so the flag would not change its behavior.
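As a rough illustration of that last point, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII is the opt-out, so the contrast is shown with re.ASCII rather than by omitting re.UNICODE. The names MARKS, text and the result variables are made up for the example.

```python
import re

# Explicit code point ranges, as in the pattern above. Since Python 3.3 the
# re module itself interprets \uXXXX escapes inside a raw string.
MARKS = r"[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+"

# Three Arabic letters, each followed by a vowel mark (written with escapes
# so the exact code points are visible).
text = "\u0628\u0650\u0633\u0652\u0645\u0650"

# The explicit class behaves the same whatever the flag is.
with_flag = re.sub(MARKS, "", text, flags=re.UNICODE)
without_flag = re.sub(MARKS, "", text)
ascii_mode = re.sub(MARKS, "", text, flags=re.ASCII)

# \w is where the flag matters: Unicode-aware by default in Python 3,
# restricted to [a-zA-Z0-9_] under re.ASCII.
unicode_words = re.findall(r"\w+", without_flag)
ascii_words = re.findall(r"\w+", without_flag, flags=re.ASCII)

print(with_flag == without_flag == ascii_mode)
print(unicode_words, ascii_words)
```

The three substitutions should agree, while the two \w+ searches differ, which is why the flag is harmless but unnecessary for this particular pattern.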

edited Dec 26, 2008 at 16:03

answered Dec 26, 2008 at 14:57 by ʞɔıu

3 Comments

Balthazar Rouberol

Hmm, did not know you could concatenate both u and r prefixes. That's pretty cool!

2013-03-12T09:16:03.96Z

Umair Ayub

@BalthazarRouberol I get SyntaxError: invalid syntax in Python 3.6

2018-06-20T11:28:52.383Z

Mansour.M

You can't use the ur prefix in Python 3. Just use r.

2022-03-28T13:02:18.02Z
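Following up on the comments above, a minimal sketch of the same substitution for Python 3 (3.3 or later is assumed, since that is when the re module started interpreting \uXXXX escapes in patterns). The names ARABIC_MARKS and strip_marks are hypothetical, chosen only for this example.

```python
import re

# Raw string only: the ur prefix no longer exists in Python 3. The \uXXXX
# escapes are left untouched by the raw literal and resolved by re itself.
ARABIC_MARKS = re.compile(r"[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+")


def strip_marks(text: str) -> str:
    """Remove every character matched by the question's character class."""
    return ARABIC_MARKS.sub("", text)


# In Python 3 every str is already Unicode, so no u prefix is needed here.
original = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
cleaned = strip_marks(original)

print(len(original), len(cleaned))
print(cleaned)
```

Compiling the pattern once is a design choice rather than a requirement; re.sub with the same raw string would work equally well for a one-off call.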

Use unicode strings for both the pattern and the text, and use the re.UNICODE flag. In Python 2:

>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', re.UNICODE)
>>> myre
<_sre.SRE_Pattern object at 0xb20b378>
>>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> result = myre.sub('', mystr)
>>> len(mystr), len(result)
(38, 22)
>>> print result
بسم الله الرحمن الرحيم

Read the article by Joel Spolsky called "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

answered Dec 26, 2008 at 15:55 by nosklo

4 Comments

securecurve

@nosklo, why doesn't the curly-brace quantifier that sets the number of characters, e.g. {5}, work with Unicode characters? I'm having problems with it, yet + works fine. Do you have any idea? Thanks!

2013-02-10T11:02:50.067Z

nosklo

@securecurve I have no idea, and without my magic crystal ball there's no way to help. I just tested it, and it works fine for me. If it doesn't work for you, I suggest you ask a new question, providing your code and the result you're getting.

2013-02-20T14:45:02.857Z

noisy

If you use re in Python, be aware that it doesn't support Unicode character properties (like \p{L}). The regex module at pypi.python.org/pypi/regex does.

2013-06-01T13:07:23.27Z

nhahtdh

The re.UNICODE flag is useless here, since it only affects shorthand character classes such as \w, \d and \s.

2015-10-06T07:59:45.047Z
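On the comment above about re lacking \p{...}: one standard-library approximation, which is not what either answer uses, is to filter by Unicode general category with unicodedata instead of a regex. The sketch below assumes Python 3, and strip_nonspacing_marks is a hypothetical helper name. It drops every character in category Mn (nonspacing mark), which is not the same set as the question's character class: that class also lists a few non-mark code points such as U+06D4, and Mn covers marks from other scripts too.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Drop characters whose Unicode general category is Mn (nonspacing mark)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Three Arabic letters, each followed by a vowel mark.
sample = "\u0628\u0650\u0633\u0652\u0645\u0650"
print(strip_nonspacing_marks(sample))
```

Whether the broader Mn filter or the explicit ranges is the better fit depends on which characters actually need to go, so treat this as an alternative to compare against, not a drop-in replacement.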
