Python regex and Unicode [duplicate]

Question 1

I am currently trying to figure out how to use Unicode in a regex in Python.

The regex I want to get to work is the following:

r"([A-ZÜÖÄß]+\s)+"

This should match every run of multiple capitalized words, which may or may not contain umlauts. Funnily enough, it does nearly what I want, but it still ignores umlauts.

For example, in FUßBALL AND MORE only BALL AND MORE is detected.

I already tried simply using the Unicode escapes (Ü becomes \u00DC, etc.), as advised in another thread, but that does not work either. I might try the "regex" library instead of "re", but I would like to know what I am doing wrong right now.

If you are able to enlighten me, please feel free to do so.

Question 2

Well, that makes sense. Yes, I am using Python version 2.7.12. ----- Cool. That means I am not misunderstanding regexes (I feared I had just produced a really stupid regex ;D)

Question 3

Replacing the characters with their ISO 8859-1 (Latin-1) hex escapes worked like a charm: r'(?:[A-Z\xC4\xD6\xDC\xDF]+\s)+' Do you mind posting your comment as an answer? Then I could accept it and close the question. Thanks a lot, by the way!
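For readers checking the escaped pattern from this comment, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode by default (the thread itself uses Python 2.7). The variable names are only illustrative. It shows that the hex escapes name the same characters as the literal letters Ä, Ö, Ü and ß, because the Latin-1 values coincide with the Unicode code points.

```python
import re

# Hex escapes for Ä, Ö, Ü, ß: these are their Latin-1 values, which are
# also their Unicode code points.
escaped = re.compile(r"(?:[A-Z\xC4\xD6\xDC\xDF]+\s)+")
literal = re.compile(r"(?:[A-ZÄÖÜß]+\s)+")

text = "FUßBALL AND MORE"

escaped_match = escaped.search(text).group()
literal_match = literal.search(text).group()

print(repr(escaped_match))             # 'FUßBALL AND '
print(escaped_match == literal_match)  # True
```

The final word MORE is not part of the match because the pattern requires whitespace after every word and the sample text ends without any.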

Question 4

I'll look it over as soon as I am back at my desk. I can't upvote you any more. Somebody must have downvoted your stuff, for reasons, I suppose...

Question 5

Yes. Adding the 'u' seems to work well. I changed the answer status accordingly.

Question 6

So, that means it is another duplicate of a very popular question. Closed as such.

Question 7

Use Unicode strings. Make sure your source is saved in the declared encoding:

# coding: utf8
import re

# Python 2: both the pattern (ur"...") and the subject (u"...") are Unicode strings.
for s in re.finditer(ur"[A-ZÜÖÄß]+", u"FUßBALL AND MORE"):
    print s.group()

Output:

FUßBALL
AND
MORE

Without Unicode strings, your byte strings are in the encoding of your source file. If that is UTF-8, every non-ASCII character is stored as multiple bytes, so a character class such as [A-ZÜÖÄß] is built from individual bytes rather than whole characters. Even with Unicode strings you can still run into problems on a narrow Python build, but only with code points above U+FFFF (such as emoji), because they are stored as UTF-16 surrogate pairs (two code units each). In that case, switch to the latest Python 3.x, where the problem was solved and every Unicode code point has a length of 1.
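As a rough illustration of the points above, here is a sketch that assumes Python 3, where str literals are already Unicode and no u prefix is needed. It is not the Python 2 code from this answer, and the names words and byte_lengths are only illustrative. It runs the same character class over the sample text and shows how many bytes each non-ASCII letter takes once encoded as UTF-8.

```python
import re

text = "FUßBALL AND MORE"

# In Python 3, str is Unicode, so the character class matches whole characters.
words = [m.group() for m in re.finditer(r"[A-ZÜÖÄß]+", text)]
print(words)  # ['FUßBALL', 'AND', 'MORE']

# Encoded as UTF-8, each of these letters occupies two bytes, which is why a
# byte-string pattern no longer sees them as single characters.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in "ÜÖÄß"}
print(byte_lengths)  # {'Ü': 2, 'Ö': 2, 'Ä': 2, 'ß': 2}
```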

Mark Tolonen · Accepted Answer · 2017-10-06 05:25:33Z
