This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2009-07-03 23:14 by christoph, last changed 2022-04-11 14:56 by admin.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| test_unicode.titlecase.diff | christoph, 2009-07-03 23:14 | Patch adding a test case for istitle() |
| unicodeobject.titlecase.diff | christoph, 2009-07-03 23:21 | Incomplete patch fixing title() and istitle() |
| unicodeobject.titlecase.2.diff | christoph, 2009-09-14 21:18 | Patch fixing title() and istitle() |
| unicodeobject.titlecase.3.diff | christoph, 2009-09-29 10:40 | |
| Messages (14) | |||
|---|---|---|---|
| msg90086 | Author: Christoph Burgmer (christoph) | Date: 2009-07-03 23:14 | |
Titlecase, i.e. istitle() and title(), is buggy when the string includes combining diacritical marks.

    >>> u'H\u0301ngh'.istitle()
    False
    >>> u'H\u0301ngh'.title()
    u'H\u0301Ngh'

The string given is already in titlecase, so the following result is expected:

    >>> u'H\u0301ngh'.istitle()
    True
    >>> u'H\u0301ngh'.title()
    u'H\u0301ngh'

UTR#21 Case Mappings defines the following algorithm for titlecase mapping [1]:

For each character C, find the preceding character B. Ignore any intervening case-ignorable characters when finding B.
If B exists and is cased, map C to UCD_lower(C).
Otherwise, map C to UCD_title(C).

The class of 'case-ignorable' is defined under [2] and includes Nonspacing Marks (Mn) as listed in [3]. This includes diacritical marks and others. These should not be handled like spaces, which they currently are, thus dividing words.

A patch including the above test case is attached.

[1] http://unicode.org/reports/tr21/tr21-5.html#Case_Conversion_of_Strings
[2] http://unicode.org/reports/tr21/tr21-5.html#Definitions
[3] http://www.fileformat.info/info/unicode/category/Mn/list.htm |
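The UTR#21 rule quoted above can be sketched in a few lines of Python 3. This is only an illustration and not the attached patch: the helper names (`is_case_ignorable`, `is_cased`, `utr21_title`) are hypothetical, the case-ignorable class is approximated by the general categories Mn, Me, Cf, Lm and Sk, and "cased" is approximated by the categories Lu, Ll and Lt.

```python
import unicodedata

# Assumption: approximate "case-ignorable" by general category only.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
# Assumption: approximate "cased" by the three cased letter categories.
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def is_case_ignorable(ch):
    return unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


def is_cased(ch):
    return unicodedata.category(ch) in CASED_CATEGORIES


def utr21_title(s):
    """Titlecase s, skipping case-ignorable characters when looking back."""
    out = []
    prev_is_cased = False  # is B (the preceding relevant character) cased?
    for ch in s:
        # Map C to lower if B is cased, otherwise to title.
        out.append(ch.lower() if prev_is_cased else ch.title())
        # Case-ignorable characters never become B, so they do not
        # update the state and therefore do not split words.
        if not is_case_ignorable(ch):
            prev_is_cased = is_cased(ch)
    return "".join(out)


print(ascii(utr21_title("H\u0301ngh")))
```

Under these assumptions the combining accent no longer acts as a word break, so the example string from the report comes back unchanged.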
|||
| msg90087 | Author: Christoph Burgmer (christoph) | Date: 2009-07-03 23:21 | |
Adding an incomplete patch that still needs a function Py_UNICODE_ISCASEIGNORABLE defining the case-ignorable class. I don't want to touch capitalize(), as I don't fully understand its semantics or where it differs from title(). Following UTR#21, though, it seems that it is not the first character that should be uppercased, but the first cased character. |
|||
| msg90563 | Author: Christoph Burgmer (christoph) | Date: 2009-07-16 07:55 | |
Casing algorithms should follow Section 3.13 "Default Case Algorithms" in the standard itself, not UTR#21. See http://www.unicode.org/Public/5.2.0/ucd/DerivedCoreProperties-5.2.0d11. Unicode 5.2. A nice mail on the Unicode mailing list explains this a bit: http://www.unicode.org/mail-arch/unicode-ml/y2009- |
|||
| msg92635 | Author: Christoph Burgmer (christoph) | Date: 2009-09-14 21:18 | |
Attaching a full patch that solves it the old way (UTR#21). The correct way for the latest Unicode version would be to first implement the word breaking algorithm described in UAX#29 [1].

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries |
|||
| msg92636 | Author: Christoph Burgmer (christoph) | Date: 2009-09-14 21:24 | |
I should add that I didn't include the two header files generated by Tools/unicode/makeunicodedata.py. |
|||
| msg93263 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2009-09-29 09:05 | |
The patch looks good, but it doesn't include the few extra characters that are also considered case-ignorable:

* U+0027 APOSTROPHE
* U+00AD SOFT HYPHEN (SHY)
* U+2019 RIGHT SINGLE QUOTATION MARK

Could you add those as well? Thanks. |
|||
| msg93265 | Author: Christoph Burgmer (christoph) | Date: 2009-09-29 09:20 | |
> * U+0027 APOSTROPHE

hardcoded (see below)

> * U+00AD SOFT HYPHEN (SHY)

has the "Format (Cf)" property and thus is included automatically

> * U+2019 RIGHT SINGLE QUOTATION MARK

hardcoded (see below)

I hardcoded some characters into Tools/unicode/makeunicodedata.py:

    >>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027', u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019', u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
    : · · ״ ‧ : : : ' . ‘ ’ . . ' .

Those cannot currently be extracted automatically, as neither DerivedCoreProperties.txt nor the source file for property "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.

As I said, the patch is only a second best solution, as the correct path would be implementing the word breaking algorithm as described in the newest standard. This patch is just an improvement over the current situation. |
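The distinction drawn above (soft hyphen picked up through its general category, the two apostrophe-like characters needing to be listed by hand) can be checked with the standard `unicodedata` module. This is a small Python 3 sketch; the set `CATEGORY_DERIVED` and the helper `covered_by_category` are hypothetical names, and the category set is an assumed approximation of the category-derived part of the case-ignorable class.

```python
import unicodedata

# Assumption: general categories that make a character case-ignorable
# without any hardcoding.
CATEGORY_DERIVED = {"Mn", "Me", "Cf", "Lm", "Sk"}


def covered_by_category(ch):
    """True if ch would be picked up from its general category alone."""
    return unicodedata.category(ch) in CATEGORY_DERIVED


for ch in ("'", "\u00ad", "\u2019"):
    print("U+%04X %-28s category=%s covered=%s" % (
        ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        covered_by_category(ch),
    ))
```

Only U+00AD has category Cf; U+0027 and U+2019 are punctuation (Po and Pf), which is why a category-based check alone would miss them.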
|||
| msg93267 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2009-09-29 09:44 | |
Christoph Burgmer wrote:
>
> Christoph Burgmer <cburgmer@ira.uka.de> added the comment:
>
>> * U+0027 APOSTROPHE
> hardcoded (see below)
>> * U+00AD SOFT HYPHEN (SHY)
> has the "Format (Cf)" property and thus is included automatically
>> * U+2019 RIGHT SINGLE QUOTATION MARK
> hardcoded (see below)
>
> I hardcoded some characters into Tools/unicode/makeunicodedata.py:
>>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
> u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
> u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
> : · · ״ ‧ : : : ' . ‘ ’ . . ' .
>
> Those cannot currently be extracted automatically, as neither
> DerivedCoreProperties.txt nor the source file for property
> "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.

As long as those code points are defined somewhere in the Unicode standard files, that's ok. It would be good to add a comment explaining the above in the code.

BTW: It's better to use "if (....)" instead of \-line joining. The parens will automatically have Python do the line joining for you and it looks better.

> As I said, the patch is only a second best solution, as the correct
> path would be implementing the word breaking algorithm as described in
> the newest standard. This patch is just an improvement over the current
> situation.

We could handle the word-breaking in a separate new method. For .title(), I think your patch is an improvement and it will fix most of the cases that issue7008 mentions. |
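The style remark about "if (....)" versus backslash line joining can be illustrated as follows. Both functions are hypothetical and are not taken from the patch; they only contrast the two ways of continuing a long condition.

```python
def is_mid_char_backslash(ch):
    # Backslash continuation: works, but breaks as soon as any
    # whitespace follows the trailing backslash.
    if ch == "'" or ch == "\u2019" or \
       ch == "\u2018" or ch == ".":
        return True
    return False


def is_mid_char_parens(ch):
    # Parenthesized condition: Python joins the lines implicitly,
    # so no backslash is needed.
    if (ch == "'" or ch == "\u2019" or
            ch == "\u2018" or ch == "."):
        return True
    return False


print(is_mid_char_backslash("'"), is_mid_char_parens("'"))
```

The two variants behave identically; the parenthesized form is simply more robust to editing.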
|||
| msg93273 | Author: Christoph Burgmer (christoph) | Date: 2009-09-29 10:40 | |
New patch:

- updated comments to reflect needed integration of DerivedCoreProperties.txt
- cleaned up if(...) construct
- updated (from issue7008) and integrated testcase

When applying this patch, run Tools/unicode/makeunicodedata.py to regenerate the header files.

Note, though, that with this patch str and unicode objects will not behave equally:

    >>> s = "This isn't right"
    >>> s.title() == unicode(s).title()
    False |
|||
| msg94036 | Author: Jeff Senn (senn) (Python committer) | Date: 2009-10-14 21:50 | |
Referred to this from issue 4610... anyone following this might want to look there as well. |
|||
| msg94037 | Author: Jeff Senn (senn) (Python committer) | Date: 2009-10-14 21:55 | |
So, is it not considered a bug that:

    >>> "This isn't right".title()
    "This Isn'T Right"

!?!?!? |
|||
| msg94039 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2009-10-14 22:28 | |
Jeff Senn wrote:
>
> Jeff Senn <senn@users.sourceforge.net> added the comment:
>
> So, is it not considered a bug that:
>
>>>> "This isn't right".title()
> "This Isn'T Right"
>
> !?!?!?

That's http://bugs.python.org/issue7008 and is fixed as part of http://bugs.python.org/issue6412 |
|||
| msg112791 | Author: Christoph Burgmer (christoph) | Date: 2010-08-04 11:33 | |
@Terry How is the behavior changed? To me it seems the same as initially reported. The results are consistent but nonetheless wrong. It's not about whether you agree with the result, but rather about following the Unicode standard. |
|||
| msg112840 | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2010-08-04 17:48 | |
Christoph is responding above to a previous version of this message with an erroneous conclusion based on a misreading of his original message.

The proposed patch makes this issue overlap #7008, which had some contentious discussion, so I am adding some people from that to this nosy list so they may opine here.

Otherwise starting over: 3.1 has the same bug.

    3.1.2
    >>> 'H\u0301ngh'.istitle()
    False
    >>> 'H\u0301ngh'=='H\u0301ngh'.title()
    False
    >>> 'H\u0301ngh'.title()
    'H́Ngh' # in IDLE, the accent is over the H

The problem is that .title() treats the accent that looks like an apostrophe, '\u0301', as if it were an apostrophe "'". The latter are documented as forming word boundaries, as in

    >>> "De'souza".title()
    "De'Souza"
    >>> "O'brian".title()
    "O'Brian"

Here is the beginning of the 3.1.2 title() doc:

"str.title() Return a titlecased version of the string where words start with an uppercase character and the remaining characters are lowercase. The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:"

That means that

    >>> "This Isn'T Right".istitle()
    True

is correct as documented. I interpret the conclusion of #7008, based on Guido's msg93242, as saying that that should be left alone, but I interpret previous messages and the test in unicodeobject.titlecase.3.diff as saying this would become False. Such a change would badly affect the prior examples where the post-apostrophe capital *is* wanted. That is why that change was rejected in #7008.

So I think ' should be removed from the current patch. I do not know about the other chars that are hard-coded.

With or without that, there is the issue of whether the current behavior really contradicts the somewhat vague doc and whether change would break enough code that this issue should be treated as a feature change for 3.2 only.

Reading this from msg93265, "As I said, the patch is only a second best solution, as the correct path would be implementing the word breaking algorithm as described in the newest standard. This patch is just an improvement over the current situation.", makes me wonder whether .title and .istitle should be left alone until the right solution is implemented. |
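For code that needs contractions and possessives kept intact regardless of how this issue is resolved, the str.title() documentation suggests a regular-expression workaround. A sketch along those lines follows; `titlecase` is just a helper name, and the pattern only handles ASCII letters, so it does not address the combining-mark case this issue is about.

```python
import re


def titlecase(s):
    # Treat letters plus an optional apostrophe suffix as one word,
    # then capitalize each matched word as a whole.
    return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",
        lambda mo: mo.group(0).capitalize(),
        s,
    )


print(titlecase("this isn't right"))
```

Because the apostrophe is matched as part of the word, the letter after it stays lowercase. Names such as "De'souza", where the capital after the apostrophe is wanted, would need separate handling.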
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:50 | admin | set | github: 50661 |
| 2017-11-08 20:08:20 | Serhiy Int | set | nosy: + Serhiy Int |
| 2012-09-26 17:10:40 | ezio.melotti | set | versions: + Python 3.3, Python 3.4, - Python 3.1 |
| 2010-08-04 17:48:02 | terry.reedy | set | nosy: + rhettinger, pitrou, r.david.murray; messages: + msg112840; versions: + Python 3.1, Python 3.2 |
| 2010-08-04 16:33:25 | terry.reedy | set | messages: - msg112715 |
| 2010-08-04 11:33:52 | christoph | set | messages: + msg112791 |
| 2010-08-03 22:43:46 | terry.reedy | set | versions: - Python 2.6, Python 2.5; nosy: + terry.reedy; messages: + msg112715; stage: needs patch -> patch review |
| 2009-10-14 22:28:29 | lemburg | set | messages: + msg94039 |
| 2009-10-14 21:55:23 | senn | set | messages: + msg94037 |
| 2009-10-14 21:50:35 | senn | set | nosy: + senn; messages: + msg94036 |
| 2009-09-29 10:40:29 | christoph | set | files: + unicodeobject.titlecase.3.diff; messages: + msg93273 |
| 2009-09-29 09:44:38 | lemburg | set | messages: + msg93267 |
| 2009-09-29 09:20:11 | christoph | set | messages: + msg93265 |
| 2009-09-29 09:05:47 | lemburg | set | nosy: + lemburg; messages: + msg93263 |
| 2009-09-16 07:55:37 | ggenellina | set | nosy: + ggenellina |
| 2009-09-14 21:24:31 | christoph | set | messages: + msg92636 |
| 2009-09-14 21:18:28 | christoph | set | files: + unicodeobject.titlecase.2.diff; messages: + msg92635 |
| 2009-07-16 07:55:11 | christoph | set | messages: + msg90563 |
| 2009-07-03 23:21:00 | christoph | set | files: + unicodeobject.titlecase.diff; messages: + msg90087 |
| 2009-07-03 23:19:05 | ezio.melotti | set | versions: + Python 2.7; nosy: + ezio.melotti; priority: normal; type: behavior; stage: needs patch |
| 2009-07-03 23:14:47 | christoph | create | |