Message 90086 - Python tracker

In-reply-to
Author	christoph
Recipients	christoph
Date	2009-07-03.23:14:46
SpamBayes Score	2.408251e-05
Marked as misclassified	No
Message-id	<1246662889.32.0.498842923824.issue6412@psf.upfronthosting.co.za>

Content

Titlecase handling, i.e. istitle() and title(), is buggy when the string
includes combining diacritical marks:
>>> u'H\u0301ngh'.istitle()
False
>>> u'H\u0301ngh'.title()
u'H\u0301Ngh'
>>>
The given string is already in titlecase, so the following result
is expected:
>>> u'H\u0301ngh'.istitle()
True
>>> u'H\u0301ngh'.title()
u'H\u0301ngh'
>>>
UTR#21 Case Mappings defines the following algorithm for titlecase
mapping [1]:
For each character C, find the preceding character B.
    Ignore any intervening case-ignorable characters when finding B.
If B exists and is cased,
    map C to UCD_lower(C).
Otherwise,
    map C to UCD_title(C).
The class of 'case-ignorable' characters is defined in [2] and includes
Nonspacing Marks (Mn), as listed in [3]. This covers combining diacritical
marks, among others. Currently these are treated like spaces and therefore
split words; they should instead be ignored when finding B.
A patch including the above test case is attached.
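For illustration, the UTR#21 rule quoted above can be sketched in a few lines of Python 3. This is not the attached patch and not CPython's implementation. The helper names (is_case_ignorable, is_cased, utr21_title) are made up for this sketch, and it makes two simplifying assumptions: only Nonspacing Marks (Mn) count as case-ignorable, and "cased" is approximated by the general categories Lu, Ll and Lt.

```python
import unicodedata


def is_case_ignorable(ch):
    # Assumption: only Nonspacing Marks (Mn) are treated as case-ignorable.
    # The definition in [2] covers more characters than this.
    return unicodedata.category(ch) == "Mn"


def is_cased(ch):
    # Approximation: upper-, lower- and titlecase letters count as cased.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def utr21_title(s):
    """Titlecase s, skipping case-ignorable characters when finding B."""
    out = []
    prev_cased = False  # is B (the nearest non-ignorable predecessor) cased?
    for ch in s:
        # If B exists and is cased, lowercase C; otherwise titlecase C.
        out.append(ch.lower() if prev_cased else ch.title())
        # A case-ignorable character never becomes B for what follows.
        if not is_case_ignorable(ch):
            prev_cased = is_cased(ch)
    return "".join(out)


if __name__ == "__main__":
    # The combining acute accent is a Nonspacing Mark.
    print(unicodedata.category("\u0301"))
    # The test string from this report is left unchanged by the sketch.
    print(ascii(utr21_title("H\u0301ngh")))
```

Under these assumptions the combining mark no longer splits the word, so utr21_title() returns the report's test string unchanged, while a space still starts a new word.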
[1] http://unicode.org/reports/tr21/tr21-5.html#Case_Conversion_of_Strings
[2] http://unicode.org/reports/tr21/tr21-5.html#Definitions
[3] http://www.fileformat.info/info/unicode/category/Mn/list.htm

History
Date	User	Action	Args
2009-07-03 23:14:49	christoph	set	recipients: + christoph
2009-07-03 23:14:49	christoph	set	messageid: <1246662889.32.0.498842923824.issue6412@psf.upfronthosting.co.za>
2009-07-03 23:14:47	christoph	link	issue6412 messages
2009-07-03 23:14:47	christoph	create