I am currently trying to figure out how to use Unicode in a regex in Python.
The regex I want to get to work is the following:
r"([A-ZÜÖÄß]+\s)+"
This should match all occurrences of multiple capitalized words, which may or may not have umlauts in them. Funnily enough, it does nearly what I want, but it still ignores umlauts.
For example, in FUßBALL AND MORE only BALL AND MORE is detected.
I already tried simply using the Unicode escapes (Ü becomes \u00DC etc.), as was advised in another thread, but that does not work either. I might try the "regex" library instead of "re", but I would kind of like to know what I am doing wrong right now.
If you are able to enlighten me, please feel free to do so.
-
Well, that makes sense. Yes, I am using Python version 2.7.12. Cool. That means I don't misunderstand regexes (I feared I had just produced a really stupid regex ;D ) – Junge, Oct 5, 2017 at 9:10
-
Replacing the chars with their ISO representation worked like a charm. ---> r'(?:[A-Z\xC4\xD6\xDC\xDF]+\s)+' Do you mind posting your comment as an answer? Then I could accept that and close the question. Thank you a lot, by the way! – Junge, Oct 5, 2017 at 9:48
-
I'll look over it as soon as I am back at my desk. I can't upvote you any more. Somebody must have downvoted your stuff, for reasons, I suppose... – Junge, Oct 8, 2017 at 11:33
-
Yes. Adding the 'u' seems to work well. I changed the answer status accordingly. – Junge, Oct 9, 2017 at 6:44
-
So, that means it is another duplicate of a very popular question. Closed as such. – Wiktor Stribiżew, Oct 9, 2017 at 6:49
1 Answer
Use Unicode strings. Make sure your source is saved in the declared encoding:
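#coding:utf8
# Python 2 syntax (the asker is on 2.7.12): note the u prefix on both literals.
import re
# Pattern and subject are both Unicode strings, so ß is a single character.
for s in re.finditer(ur"[A-ZÜÖÄß]+", u"FUßBALL AND MORE"):
    print s.group()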
Output:
FUßBALL
AND
MORE
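For readers on Python 3, the same approach might look like the sketch below. It is not part of the code above: it assumes Python 3, where every str literal is already Unicode, so the u prefix is dropped. The names pattern, text and words are only illustrative.

```python
import re

# In Python 3 every str literal is Unicode, so no u prefix is needed.
pattern = re.compile(r"[A-ZÜÖÄß]+")
text = "FUßBALL AND MORE"

# Collect each run of characters from the class as one word.
words = [m.group() for m in pattern.finditer(text)]
print(words)
```

Each of the three words comes back whole, including the ß in the first one.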
Without Unicode strings, your byte strings are in the encoding of your source file. If that is UTF-8, each non-ASCII character takes up several bytes, so the pattern and the text are compared byte by byte rather than character by character. Even with Unicode strings you can still run into problems on a narrow Python 2 build, but only with code points above U+FFFF (such as emoji), because they are stored as UTF-16 surrogate pairs (two code units each). In that case, switch to Python 3.3 or later, where the problem was solved and every Unicode code point has a length of 1.
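The difference between bytes and characters can be made visible with a small sketch. It assumes Python 3.3 or later, where bytes and str are separate types, and all variable names are illustrative. It shows the ß becoming two UTF-8 bytes, an ASCII-only byte pattern breaking the word apart at those bytes, and an emoji counting as one code point while needing two UTF-16 code units.

```python
import re

text = "FUßBALL AND MORE"
raw = text.encode("utf-8")  # the same text as UTF-8 bytes

# ß (U+00DF) is one character but two bytes in UTF-8.
eszett_bytes = "ß".encode("utf-8")

# An ASCII-only byte pattern cannot match those two bytes,
# so the first word is split around them.
byte_hits = re.findall(rb"[A-Z]+", raw)

# With Unicode text and a Unicode pattern the word stays intact.
text_hits = re.findall(r"[A-ZÜÖÄß]+", text)

# A code point above U+FFFF has length 1 in Python 3.3+,
# but needs two 16-bit code units (a surrogate pair) in UTF-16.
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2

print(eszett_bytes, byte_hits, text_hits, len(emoji), utf16_units)
```

The split in byte_hits resembles the symptom described in the question, where the part of the word before the ß is lost.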