Issue 26784: regular expression problem at umlaut handling


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70971

classification

Title:	regular expression problem at umlaut handling
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	not a bug
Assigned To:	Nosy List:	arbyter, ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2016-04-16 16:48 by arbyter, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg263567 - (view)	Author: Marcus (arbyter)	Date: 2016-04-16 16:48
Working with the example string "E-112233-555-11 | Bläh - Bläh", the following code leads to an exception under Python 2.7.10 (OSX), whereas the same code works under Python 3.5.1 (OSX).

    import re
    s = "E-112233-555-11 | Bläh - Bläh"
    expr = re.compile(r"(?P<p>[A-Z]{1}-[0-9]{0,}(-[0-9]{0,}(-[0-9]{0,})?)?)?(( [\|] )?(?P<a>[\s\w]*)?)? - (?P<j>[\s\w]*)?", re.UNICODE)
    res = re.match(expr, s)
    a = (res.group('p'), res.group('a'), res.group('j'))
    print(a)

When I change the first umlaut in "Bläh" from ä to ü, it works as expected on Python 2 and 3. A change from ä to ö, however, leads to a crash again. Ideas?
msg263569 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2016-04-16 17:41
First, in the context of Python, a crash means a core dump or its equivalent on Windows. In this case the code just doesn't work as you expected. The short answer: s should be a unicode string. In your code "ä" is encoded as the 8-bit string '\xc3\xa4'. When matched, every byte is independently expanded to the Unicode range. The first byte becomes u'\xc3' = u'Ã', the second byte becomes u'¤', which is non-alphanumeric. '[\s\w]*' doesn't match u'Ã¤'. "ü" is encoded as the 8-bit string '\xc3\xbc'. The second byte becomes u'¼', which is numeric. '[\s\w]*' matches u'Ã¼'.
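The byte-by-byte expansion described above can be reproduced as a rough sketch in Python 3, which is an assumption of this example (the report itself concerns Python 2.7). The helper name `as_python2_saw_it` is hypothetical: it encodes the text as UTF-8 and then maps every byte to the code point with the same value, which is what a Latin-1 decode does.

```python
import re

def as_python2_saw_it(text):
    # Hypothetical helper: UTF-8 encode, then map each byte to the code
    # point with the same numeric value (Latin-1), one character per byte.
    return text.encode("utf-8").decode("latin-1")

WORD_OR_SPACE = re.compile(r"[\s\w]*")

def fully_matches(text):
    # True if the character class covers every expanded byte of the text.
    return WORD_OR_SPACE.fullmatch(as_python2_saw_it(text)) is not None

for ch in "äüö":
    expanded = as_python2_saw_it(ch)
    print(repr(ch), [hex(ord(c)) for c in expanded], fully_matches(ch))
```

Under this simulation "ü" passes because its second byte lands on a numeric character, while "ä" and "ö" fail because their second bytes land on a currency sign and a punctuation mark respectively.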
msg263570 - (view)	Author: Marcus (arbyter)	Date: 2016-04-16 17:54
Thx for your explanation. You explained why [\s\w]* didn't match for "ä". In my situation it didn't match the first "ä", but the second time I used [\s\w]* in the same regex it matched the second "ä". What's the explanation for this?
msg263572 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2016-04-16 18:10
Sorry, I don't understand you. If the regex failed to match the first "ä", it can't match the second "ä". Do you have an example?
msg263575 - (view)	Author: Marcus (arbyter)	Date: 2016-04-16 18:32
When I replace the first "ä" with a random letter, the untouched expression has no problem matching the second word, which also contains an "ä".

    s = "E-112233-555-11 | Bläh - Bläh"  # untouched string
    s = "E-112233-555-11 | Bloh - Bläh"  # string where the first ä is replaced by an "o"
msg263577 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2016-04-16 18:48
Because "[\s\w]*" matches only a part of "Bläh": "Bl\xc3".
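The partial match can be illustrated with a sketch that runs the reported expression in Python 3, simulating the Python 2 byte string by expanding each UTF-8 byte to its own character. This is an approximation under that assumption, not output captured from Python 2.7; the names `EXPR`, `as_python2_saw_it` and `groups` are made up for the example.

```python
import re

# The expression from the report, split over two raw strings for readability.
EXPR = re.compile(
    r"(?P<p>[A-Z]{1}-[0-9]{0,}(-[0-9]{0,}(-[0-9]{0,})?)?)?"
    r"(( [\|] )?(?P<a>[\s\w]*)?)? - (?P<j>[\s\w]*)?",
    re.UNICODE,
)

def as_python2_saw_it(text):
    # Hypothetical helper: one character per UTF-8 byte, as in the
    # byte-string case discussed in this thread.
    return text.encode("utf-8").decode("latin-1")

def groups(s):
    # Return the three named groups, or None when the match fails
    # (the case where res.group(...) raised in the original report).
    res = EXPR.match(s)
    if res is None:
        return None
    return (res.group("p"), res.group("a"), res.group("j"))

print(groups(as_python2_saw_it("E-112233-555-11 | Bläh - Bläh")))
print(groups(as_python2_saw_it("E-112233-555-11 | Bloh - Bläh")))
print(groups("E-112233-555-11 | Bläh - Bläh"))
```

With the first "ä" in place, group a cannot reach the " - " separator, so the whole match fails. With "Bloh", group a succeeds and group j simply stops after "Bl\xc3", because re.match does not require the pattern to consume the entire string. In Python 2 the fix suggested earlier in the thread amounts to passing a unicode string, for example a u"..." literal or s.decode('utf-8').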

History
Date	User	Action	Args
2022-04-11 14:58:29	admin	set	github: 70971
2016-04-16 18:48:00	serhiy.storchaka	set	messages: + msg263577
2016-04-16 18:32:35	arbyter	set	messages: + msg263575
2016-04-16 18:10:10	serhiy.storchaka	set	messages: + msg263572
2016-04-16 17:54:11	arbyter	set	messages: + msg263570
2016-04-16 17:41:33	serhiy.storchaka	set	status: open -> closed resolution: not a bug messages: + msg263569 stage: resolved
2016-04-16 17:11:18	SilentGhost	set	nosy: + ezio.melotti, pitrou, serhiy.storchaka, mrabarnett components: + Regular Expressions
2016-04-16 16:48:11	arbyter	create
