Issue 5828: Invalid behavior of unicode.lower


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50078

classification

Title:	Invalid behavior of unicode.lower
Type:	behavior	Stage:	patch review
Components:	Unicode	Versions:	Python 3.0, Python 3.1, Python 2.7, Python 2.6

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	loewis	Nosy List:	amaury.forgeotdarc, doerwalter, jarek, loewis, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2009-04-24 10:39 by jarek, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
diff.txt	doerwalter, 2009-04-24 12:57
diff2.txt	doerwalter, 2009-04-24 14:15
diff3.txt	doerwalter, 2009-04-25 09:16
mud.diff	loewis, 2009-04-25 11:38
diff4.txt	doerwalter, 2009-04-25 13:37

Messages (14)
msg86400	Author: Jarek Sobieszek (jarek)	Date: 2009-04-24 10:39
u'\u1d79'.lower() returns u'\x00'. I think it should return u'\u1d79', at least according to my understanding of UnicodeData.txt (the lowercase field is empty).
msg86401	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-24 10:49
It does return u'\u1d79' for me on Python 2.5.2:

>>> u'\u1d79'.lower()
u'\u1d79'
>>> import sys
>>> sys.version
'2.5.2 (r252:60911, Apr 8 2008, 18:54:00) \n[GCC 3.3.5 (Debian 1:3.3.5-13)]'

However, on 2.6.2 it's broken:

>>> u'\u1d79'.lower()
u'\x00'
>>> import sys
>>> sys.version
'2.6.2 (r262:71600, Apr 19 2009, 18:38:49) \n[GCC 4.0.1 (Apple Inc. build 5490)]'
msg86405	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-24 12:57
The attached patch (diff.txt) fixes the problem for me; however, it breaks the test suite. The change seems to have been introduced in r66362. Assigning to Martin.
msg86406	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2009-04-24 13:05
The same change should be applied to _PyUnicode_ToTitlecase as well.
msg86411	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-24 14:15
Updated the patch (diff2.txt) as requested by Amaury.
msg86425	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2009-04-24 18:51
Py3.0.1:

>>> '\u1d79'.lower()
'\x00'

I am guessing that this bug is in 2.7 and 3.1 as well.
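
One way to check whether a given interpreter is affected is to scan every code point and look for case conversions that produce a NUL character. The following is only a sketch, written for Python 3 (where str is Unicode). The helper name find_nul_case_mappings is made up for illustration and is not part of any patch attached to this issue.

```python
import sys


def find_nul_case_mappings():
    """Return (code point, method name) pairs whose case conversion yields NUL.

    Hypothetical helper for illustration; not taken from the issue's patches.
    """
    offenders = []
    # Start at 1: U+0000 legitimately maps to itself.
    for code_point in range(1, sys.maxunicode + 1):
        ch = chr(code_point)
        for name in ("lower", "upper", "title"):
            if "\x00" in getattr(ch, name)():
                offenders.append((code_point, name))
    return offenders


if __name__ == "__main__":
    for code_point, name in find_nul_case_mappings():
        print("U+%04X: %s() returned NUL" % (code_point, name))
```

On an interpreter with the behavior reported above, U+1D79 with lower() would be expected to show up in the output; on a fixed interpreter the script should print nothing.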
msg86447	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-25 09:16
Here is a third version of the patch. AFAICT the logic of the Unicode database is as follows:

* If NODELTA_MASK is not set, delta is an offset.
* If NODELTA_MASK is set and delta != 0, delta is the upper/lower/title case character.
* If NODELTA_MASK is set and delta == 0, there is no upper/lower/title case variant (i.e. the method returns the original character).

Is this the correct interpretation? I've also updated the test suite (changed the checksum and added a new test). (BTW, the patch is against the py3k branch.)
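
The three cases in this message can be written out as a small decision function. This is a Python sketch of the interpretation proposed here, not the C code from the patch. The function name, the argument layout and the numeric value of NODELTA_MASK are assumptions made for illustration.

```python
NODELTA_MASK = 0x800  # assumed flag value, for illustration only


def apply_case_mapping(code_point, flags, delta):
    """Decode one case-mapping entry following the three rules in the message.

    Hypothetical helper: 'flags' and 'delta' stand in for the fields of a
    type record in the generated Unicode database.
    """
    if not flags & NODELTA_MASK:
        # Rule 1: delta is an offset relative to the character itself.
        return code_point + delta
    if delta != 0:
        # Rule 2: delta holds the target character directly.
        return delta
    # Rule 3: no case variant exists, so the character maps to itself.
    return code_point


# 'A' (U+0041) lowercases to 'a' via an offset of +32.
print(hex(apply_case_mapping(0x41, 0, 32)))
# U+1D79 with the flag set and a zero delta stays unchanged.
print(hex(apply_case_mapping(0x1D79, NODELTA_MASK, 0)))
```

Under this reading, the reported u'\x00' result corresponds to rule 3 being absent: a zero delta with the flag set would be returned as the target character.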
msg86476	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2009-04-25 11:38
I think the patch is incorrect; the bug is already in makeunicodedata.py. For U+1d79, it should set the lowercase letter to U+1d79. If you look at makeunicodedata.py, you see that the entire logic is bogus: when the column is absent, it should default it to the character itself (except for titlecase, where it should default it to uppercase). Then, if it finds that one of the characters can't be delta-encoded, it should go back to changing the previous mappings as well. I'm attaching an untested patch that should do that. Also see issue4971, which is related.
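
The generator-side logic described in this message can be sketched as follows. This is not the code from mud.diff or makeunicodedata.py. The function name, the return format and the signed 16-bit range used for the "can be delta-encoded" check are assumptions for illustration. The field positions (12 = uppercase, 13 = lowercase, 14 = titlecase) follow the UnicodeData.txt layout.

```python
def case_record(line):
    """Compute the case mappings for one UnicodeData.txt line.

    Hypothetical helper. Missing columns default to the character itself,
    except titlecase, which defaults to the uppercase mapping.
    """
    fields = line.strip().split(";")
    char = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else char
    lower = int(fields[13], 16) if fields[13] else char
    title = int(fields[14], 16) if fields[14] else upper

    deltas = (upper - char, lower - char, title - char)
    # Assumed limit: each delta must fit in a signed 16-bit field.
    if all(-32768 <= d <= 32767 for d in deltas):
        return ("delta",) + deltas
    # If any mapping cannot be delta-encoded, store all three as
    # absolute code points so that none of them is left at zero.
    return ("absolute", upper, lower, title)


# U+1D79 has an uppercase mapping (U+A77D) but an empty lowercase field.
print(case_record("1D79;LATIN SMALL LETTER INSULAR G;Ll;0;L;;;;;N;;;A77D;;A77D"))
```

With the defaults applied before the encoding decision, the lowercase entry for U+1D79 is stored as U+1D79 itself rather than as zero, which matches the behavior this message asks for.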
msg86506	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-25 13:37
I've merged your version of the patch with my changes to the test suite and regenerated the Unicode database. Attached is the resulting patch (diff4.txt).
msg86507	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2009-04-25 13:47
Feel free to check it into trunk, and merge into the other three branches from there. If you don't want to do that, assign it back to me.
msg86511	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-25 14:10
Checked in: r71894 (trunk), r71895 (release26-maint)
msg86512	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-25 14:17
Checked in: r71896 (py3k), r71897 (release30-maint)
msg86513	Author: Walter Dörwald (doerwalter) * (Python committer)	Date: 2009-04-25 14:20
BTW, are the steps to regenerate the Unicode database documented somewhere? What I did was:

cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/UnicodeData.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/CompositionExclusions.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/EastAsianWidth.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/DerivedCoreProperties.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/ucd/UnicodeData-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/CompositionExclusions-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/EastAsianWidth-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/DerivedCoreProperties-3.2.0.txt .
./python.exe Tools/unicode/makeunicodedata.py
msg86514	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2009-04-25 14:46
> BTW, are the steps to regenerate the Unicode database documented
> somewhere?

I don't think so, but your procedure looks right. Regenerating the database is often more difficult, though, in particular when we upgrade to a new version. Often, the new version will add new complications which have to be dealt with, so a deep understanding of makeunicodedata.py is often needed to be able to use it. Welcome to the club :-)

History
Date	User	Action	Args
2022-04-11 14:56:48	admin	set	github: 50078
2009-04-25 14:46:47	loewis	set	messages: + msg86514
2009-04-25 14:21:04	doerwalter	set	assignee: doerwalter -> loewis
2009-04-25 14:20:53	doerwalter	set	messages: + msg86513
2009-04-25 14:17:33	doerwalter	set	status: open -> closed resolution: fixed messages: + msg86512
2009-04-25 14:10:37	doerwalter	set	messages: + msg86511
2009-04-25 13:47:12	loewis	set	assignee: loewis -> doerwalter messages: + msg86507
2009-04-25 13:37:12	doerwalter	set	files: + diff4.txt messages: + msg86506
2009-04-25 11:38:30	loewis	set	files: + mud.diff keywords: + patch messages: + msg86476
2009-04-25 09:16:31	doerwalter	set	files: + diff3.txt messages: + msg86447
2009-04-24 18:51:31	terry.reedy	set	nosy: + terry.reedy messages: + msg86425 versions: + Python 3.0, Python 3.1, Python 2.7
2009-04-24 14:15:39	doerwalter	set	files: + diff2.txt messages: + msg86411
2009-04-24 13:05:49	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg86406
2009-04-24 12:57:23	doerwalter	set	files: + diff.txt nosy: + loewis messages: + msg86405 assignee: loewis stage: patch review
2009-04-24 10:49:32	doerwalter	set	nosy: + doerwalter messages: + msg86401
2009-04-24 10:39:58	jarek	create
