Issue 12728: Python re lib fails case insensitive matches on Unicode data


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56937

classification

Title:	Python re lib fails case insensitive matches on Unicode data
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.4, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:	17381	Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	Arfrever, ezio.melotti, gvanrossum, lemburg, loewis, mrabarnett, pitrou, python-dev, serhiy.storchaka, tchrist, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2011-08-11 18:48 by tchrist, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
sigmata.python	tchrist, 2011-08-11 18:48	Test case proving Python lib re is erroneously using casemapping when it is supposed to use casefolding
re_ignore_case_2.patch	serhiy.storchaka, 2014-10-31 16:10
re_cases.py	serhiy.storchaka, 2014-11-07 21:39

Messages (9)
msg141916	Author: Tom Christiansen (tchrist)	Date: 2011-08-11 18:48
The Python re library is broken in its approach to case-insensitive matches. It erroneously attempts to compare lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get wrong answers.

I include a small test case that illustrates this bug. The bug exists on both 2.7 and 3.2, and on both wide builds and narrow builds. For comparison, I also show results using Matthew Barnett's regex library, which gets all 5 tests correct where re gets all 5 tests wrong.

A sample run is:

    FAIL: re pattern Ι is not the same as string ͅ
    PASS: regex pattern Ι is indeed the same as string ͅ
    FAIL: re pattern Μ is not the same as string μ
    PASS: regex pattern Μ is indeed the same as string μ
    FAIL: re pattern s is not the same as string s
    PASS: regex pattern s is indeed the same as string s
    FAIL: re pattern ΣΤΙΓΜΑΣ is not the same as string στιγμας
    PASS: regex pattern ΣΤΙΓΜΑΣ is indeed the same as string στιγμας
    FAIL: re pattern POST is not the same as string post
    PASS: regex pattern POST is indeed the same as string post
    re lib passed 0 of 5 tests
    regex lib passed 5 of 5 tests
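The distinction between comparing lowercase mappings and comparing casefolds can be sketched with plain string methods. This is only an illustration, assuming Python 3.3 or later (where `str.casefold` exists); the helper names and the code points in `PAIRS` are chosen for the example and are not taken from the attached sigmata.python.

```python
def lower_equal(a: str, b: str) -> bool:
    """Caseless comparison via lowercase mapping (the approach the report criticises)."""
    return a.lower() == b.lower()


def casefold_equal(a: str, b: str) -> bool:
    """Caseless comparison via Unicode case folding (str.casefold, Python 3.3+)."""
    return a.casefold() == b.casefold()


# Illustrative pairs (an assumption of this sketch, not the report's test data).
PAIRS = [
    ("\u0399", "\u0345"),  # GREEK CAPITAL LETTER IOTA vs COMBINING GREEK YPOGEGRAMMENI
    ("\u039c", "\u00b5"),  # GREEK CAPITAL LETTER MU vs MICRO SIGN
    ("S", "\u017f"),       # LATIN CAPITAL LETTER S vs LATIN SMALL LETTER LONG S
]

for a, b in PAIRS:
    print(
        f"U+{ord(a):04X} vs U+{ord(b):04X}: "
        f"lower={lower_equal(a, b)} casefold={casefold_equal(a, b)}"
    )
```

For each of these pairs the lowercase comparison reports a mismatch while the casefold comparison reports a match, which is the class of failure the sample run describes.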
msg141987	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2011-08-12 19:28
I am not sure that everyone will agree that this is a bug, rather than a feature request, or that if a bug, that it should be changed in existing releases and possibly break running code. The doc just says, somewhat vaguely, that IGNORECASE "works for Unicode characters as expected". I have added others as nosy for their opinions. The test file should have omitted the gratuitous and distracting warnings, especially the one that effectively scolds Windows users for running Windows. With those omitted, the test cases given would form the basis for an added TestCase.
msg141988	Author: Tom Christiansen (tchrist)	Date: 2011-08-12 20:09
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> I am not sure that everyone will agree that this is a bug, rather than a feature request, or that if a bug, that it should be changed in existing releases and possibly break running code. The doc just says, somewhat vaguely, that IGNORECASE "works for Unicode characters as expected". I have added others as nosy for their opinions.

Working as expected for Unicode characters means it must follow Unicode's rules for casefolding. Otherwise you don't have Unicode at all; you just have ISO 10646. Unicode is not merely a larger character repertoire; again, that is merely ISO 10646. Unicode is all about the rules for processing this larger repertoire. This is a very common mistake, so common that it is in the Unicode FAQ:

    Q: What is the relation between ISO/IEC 10646 and Unicode?
    A: In 1991, the ISO Working Group responsible for ISO/IEC 10646 (JTC 1/SC 2/WG 2) and the Unicode Consortium decided to create one universal standard for coding multilingual text. Since then, the ISO 10646 Working Group (SC 2/WG 2) and the Unicode Consortium have worked together very closely to extend the standard and to keep their respective versions synchronized. [EH]

    Q: So are they the same thing?
    A: No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.

    http://unicode.org/faq/unicode_iso.html

Part of those functional character specifications can be found in the three casefolding fields of the file UnicodeData.txt and also in two auxiliary files of the Unicode distribution, CaseFolding.txt and SpecialCasing.txt.

The Unicode Character Database is not optional. If you do not use it, you do not have Unicode; instead you merely have ISO 10646, which is of zero practical use to anyone compared with Unicode. I'm sure that Python would not want to be stuck having something of no use to anyone when everyone else actually supports Unicode. One is not allowed to make up one's own rules that run counter to Unicode's and still make the claim that one is working on Unicode, since that is in fact not what one is doing.

Based on all that, Python does not do case insensitive matching on Unicode, a condition contrary to its documented claims. That clearly makes it a bug that needs fixing rather than a feature request to be summarily ignored.

> The test file should have omitted the gratuitous and distracting warnings, especially the one that effectively scolds Windows users for running Windows. With those omitted, the test cases given would form the basis for an added TestCase.

I have absolutely no idea what on earth you could possibly be referring to. Honestly. I ran my tests on both releases (2.7 and 3.2), on both builds (wide and narrow), and on both platforms (Unix and Mac). The warnings are in there so I can make sure I have everything set up correctly to run the tests, and will understand why I get more failures than expected in the event that things are not set up appropriately.

Let me make perfectly clear that I have never in my life come anywhere near a Microsoft system, let alone touched one, and that I furthermore never shall. I have not the foggiest notion what in the world you are complaining about. If the problem is that you are for some reason unable to create a Python with full Unicode support under Microsoft, that is hardly my fault. Render unto Caesar that which is Caesar's: complain to Microsoft about Microsoft's bugs, not to me, as I am wholly blameless of their problems.

If you don't like my test cases, you know where to find vi. I suppose I could always send you the program that writes these programs for me, but as I knew you wouldn't like it, I withheld it. You already have all that you need to see exactly where the bugs are and how to fix them.

--tom
msg143034	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2011-08-26 21:04
This bug could do with a little less attitude. That said, I think it is a bug and should be fixed, at the very least for Python 3.3. As always, it is a matter of much debate to what extent bugs can be fixed in previous Python versions (specifically, 2.7 and 3.2) without breaking more code than it fixes, and I don't want to jump the gun on that issue. Let's first see what it takes to fix this for 3.3.
msg227236	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-09-21 20:45
Here is a preliminary patch which fixes case-insensitive regular expression matching of unicode strings. It is incomplete: it needs the patches from issue17381, which fix other aspects of case-insensitive matching, to be applied first. One bug is left for Turkish letters, because this matching is not transitive. Three pairs of letters should match: ı ~ I ~ i ~ İ. All other combinations should not match (ı !~ i, I !~ İ, ı !~ İ). This patch doesn't fix this bug.
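The non-transitivity described above can be modelled with a small, purely illustrative relation. This is a hypothetical sketch of the three pairs listed in the message, not code from the patch; `MATCHING_PAIRS`, `matches` and `is_transitive` are names invented for the example.

```python
# Hypothetical model of the three pairs that should match: ı ~ I, I ~ i, i ~ İ.
MATCHING_PAIRS = {
    frozenset(("\u0131", "I")),  # LATIN SMALL LETTER DOTLESS I ~ I
    frozenset(("I", "i")),       # I ~ i
    frozenset(("i", "\u0130")),  # i ~ LATIN CAPITAL LETTER I WITH DOT ABOVE
}

LETTERS = ["\u0131", "I", "i", "\u0130"]


def matches(a: str, b: str) -> bool:
    """Return True if a and b are identical or form one of the listed pairs."""
    return a == b or frozenset((a, b)) in MATCHING_PAIRS


def is_transitive(chars) -> bool:
    """Check whether a ~ b and b ~ c always implies a ~ c over chars."""
    return all(
        matches(a, c)
        for a in chars
        for b in chars
        for c in chars
        if matches(a, b) and matches(b, c)
    )


print(is_transitive(LETTERS))
```

Because the relation is not transitive, it cannot be reproduced by mapping each character to a single canonical key and comparing keys, since equality of keys is always transitive. That is one way to see why these letters need separate handling.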
msg230349	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-10-31 16:10
Here are the complete patch and the script used to generate the equivalence table.
msg230830	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-11-07 21:39
Could anyone please review this? The script has been updated so that it is now compatible with 2.7. There are some differences in the equivalence table between 2.7 and 3.4 (e.g. 'ΐ' (U+0390) is not equivalent to 'ΐ' (U+1FD3) in 2.7).
msg230951	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014-11-10 10:47
New changeset 4caa695af94c by Serhiy Storchaka in branch '2.7':
Issue #12728: Different Unicode characters having the same uppercase but
https://hg.python.org/cpython/rev/4caa695af94c

New changeset 47b3084dd6aa by Serhiy Storchaka in branch '3.4':
Issue #12728: Different Unicode characters having the same uppercase but
https://hg.python.org/cpython/rev/47b3084dd6aa

New changeset 09ec09cfe539 by Serhiy Storchaka in branch 'default':
Issue #12728: Different Unicode characters having the same uppercase but
https://hg.python.org/cpython/rev/09ec09cfe539
msg230952	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-11-10 10:52
This solution (with a hardcoded table of equivalent lowercases) is temporary. In the future the re engine will be changed to support correct caseless matching of different lowercase forms internally.
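As a rough idea of what a table of equivalent lowercases could contain, the sketch below groups characters that are their own lowercase form but share the same uppercase. It is an assumption-laden illustration, not the attached re_cases.py and not the table in the committed patch; `lowercase_equivalents` is a hypothetical helper, and the result depends on the Unicode version of the running interpreter.

```python
import sys
from collections import defaultdict


def lowercase_equivalents():
    """Group lowercase characters that share one uppercase mapping."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        upper = ch.upper()
        # Keep only characters that are their own lowercase
        # and that have a distinct uppercase form.
        if ch.lower() == ch and upper != ch:
            groups[upper].append(ch)
    # Only groups with several distinct lowercase forms are interesting.
    return {u: chars for u, chars in groups.items() if len(chars) > 1}


TABLE = lowercase_equivalents()

# Show a few groups: the micro sign and Greek mu, s and long s, both sigmas.
for upper in ("\u039c", "S", "\u03a3"):
    print(upper, [f"U+{ord(c):04X}" for c in TABLE[upper]])
```

A matcher that only compares `lower()` results treats the members of each group as different characters, which is the gap such a table is meant to close.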

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56937
2014-11-10 10:52:04	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg230952 stage: patch review -> resolved
2014-11-10 10:47:07	python-dev	set	nosy: + python-dev messages: + msg230951
2014-11-07 21:39:11	serhiy.storchaka	set	files: + re_cases.py messages: + msg230830
2014-11-07 21:32:19	serhiy.storchaka	set	files: - re_cases.py
2014-11-07 21:31:42	serhiy.storchaka	set	files: - re_ignore_case.patch
2014-10-31 16:10:18	serhiy.storchaka	set	files: + re_ignore_case_2.patch, re_cases.py messages: + msg230349
2014-09-21 20:45:06	serhiy.storchaka	set	files: + re_ignore_case.patch dependencies: + IGNORECASE breaks unicode literal range matching assignee: serhiy.storchaka versions: + Python 3.5, - Python 3.3 keywords: + patch nosy: + serhiy.storchaka messages: + msg227236 stage: needs patch -> patch review
2013-07-10 19:12:44	terry.reedy	set	versions: + Python 3.4, - Python 3.2
2011-08-26 21:04:03	gvanrossum	set	nosy: + gvanrossum messages: + msg143034
2011-08-13 00:56:17	mrabarnett	set	nosy: + mrabarnett
2011-08-12 20:09:32	tchrist	set	messages: + msg141988
2011-08-12 19:28:48	terry.reedy	set	versions: + Python 3.2, Python 3.3 nosy: + terry.reedy, lemburg, pitrou, loewis messages: + msg141987 stage: needs patch
2011-08-12 18:01:36	Arfrever	set	nosy: + Arfrever
2011-08-12 00:20:11	ezio.melotti	set	nosy: + ezio.melotti
2011-08-11 19:50:28	tchrist	set	type: behavior components: + Regular Expressions, - Library (Lib)
2011-08-11 18:48:20	tchrist	create
