Issue 12266: str.capitalize contradicts oneself

This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56475

classification

Title:	str.capitalize contradicts oneself
Type:	behavior
Stage:	resolved
Components:	Interpreter Core
Versions:	Python 3.2, Python 3.3, Python 2.7

process

Status:	closed
Resolution:	fixed
Dependencies:
Superseder:	str.upper converts to title (issue 12204)
Assigned To:	ezio.melotti
Nosy List:	belopolsky, eric.araujo, ezio.melotti, lemburg, py.user, python-dev, r.david.murray
Priority:	normal
Keywords:	patch

Created on 2011-06-05 05:54 by py.user, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue12266.diff	ezio.melotti, 2011-08-14 17:58	Patch against 3.2 + tests.

Messages (15)
msg137682	Author: py.user (py.user) *	Date: 2011-06-05 05:54
specification

  str.capitalize()
  Return a copy of the string with its first character capitalized and the rest lowercased.

>>> '\u1ffc', '\u1ff3'
('ῼ', 'ῳ')
>>> '\u1ffc'.isupper()
False
>>> '\u1ff3'.islower()
True
>>> s = '\u1ff3\u1ff3\u1ffc\u1ffc'
>>> s
'ῳῳῼῼ'
>>> s.capitalize()
'ῼῳῼῼ'
>>>

A: lower
B: title
A -> B & !B -> A
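A minimal sketch of what the session above relies on, assuming a current Python 3 and the standard `unicodedata` module: U+1FFC belongs to the titlecase category (Lt) rather than the uppercase category (Lu), so a test for "uppercase" alone does not match it. The helper name `describe` is illustrative and does not come from the report.

```python
import unicodedata

def describe(ch):
    """Return (category, isupper, istitle, islower) for a single character."""
    return (unicodedata.category(ch), ch.isupper(), ch.istitle(), ch.islower())

# The two characters from the report: titlecase omega and lowercase omega.
for ch in ('\u1ffc', '\u1ff3'):
    print('U+%04X' % ord(ch), unicodedata.name(ch), describe(ch))
```

The titlecase character reports `istitle()` as true while `isupper()` is false, which is the asymmetry the report points at.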
msg137694	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2011-06-05 13:37
This looks like a duplicate of #12204.
msg137720	Author: py.user (py.user) *	Date: 2011-06-06 00:18
In http://bugs.python.org/issue12204 Marc-Andre Lemburg wrote:

  I suggest to close this ticket as invalid or to add a note to the documentation explaining how the mapping is applied (and when not).

This is a different problem: str.capitalize makes the first character big, but it doesn't make the rest small. Clarifying the documentation is not enough.

Lowering works:

>>> '\u1ffc'
'ῼ'
>>> '\u1ffc'.lower()
'ῳ'
>>>
msg140780	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-07-21 04:59
Indeed, this seems to be a different issue, and it might be worth fixing.
Given this definition:

  str.capitalize()
  Return a copy of the string with its first character capitalized and the rest lowercased.

we might implement capitalize like:

>>> def mycapitalize(s):
...     return s[0].upper() + s[1:].lower()
...
>>> 'fOoBaR'.capitalize()
'Foobar'
>>> mycapitalize('fOoBaR')
'Foobar'

And this would yield the correct result:

>>> s = u'\u1ff3\u1ff3\u1ffc\u1ffc'
>>> print s
ῳῳῼῼ
>>> print s.capitalize()
ῼῳῼῼ
>>> print mycapitalize(s)
ῼῳῳῳ
>>> s.capitalize().istitle()
False
>>> mycapitalize(s).istitle()
True

This doesn't happen because the actual implementation of str.capitalize checks if a char is uppercase (and not if it's titlecase too) before converting it to lowercase. This can be fixed by doing:

diff -r cb44fef5ea1d Objects/unicodeobject.c
--- a/Objects/unicodeobject.c   Thu Jul 21 01:11:30 2011 +0200
+++ b/Objects/unicodeobject.c   Thu Jul 21 07:57:21 2011 +0300
@@ -6739,7 +6739,7 @@
     }
     s++;
     while (--len > 0) {
-        if (Py_UNICODE_ISUPPER(*s)) {
+        if (Py_UNICODE_ISUPPER(*s) || Py_UNICODE_ISTITLE(*s)) {
             *s = Py_UNICODE_TOLOWER(*s);
             status = 1;
         }
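The effect of the one-line change can be sketched in Python. This is only a model of the C loop over the characters after the first one, not the C code itself; the function names `lower_rest_old` and `lower_rest_patched` are hypothetical, and `str.isupper()`, `str.istitle()` and `str.lower()` on single characters stand in for the `Py_UNICODE_*` macros.

```python
def lower_rest_old(rest):
    """Model of the loop before the patch: only uppercase chars are lowered."""
    return ''.join(c.lower() if c.isupper() else c for c in rest)

def lower_rest_patched(rest):
    """Model of the patched loop: uppercase or titlecase chars are lowered."""
    return ''.join(c.lower() if (c.isupper() or c.istitle()) else c for c in rest)

# Everything after the first character of the string from the report.
rest = '\u1ff3\u1ffc\u1ffc'
print(lower_rest_old(rest))      # titlecase characters are left alone
print(lower_rest_patched(rest))  # titlecase characters are lowered too
```

For plain ASCII input such as 'OoBaR' both models agree, which is consistent with the bug going unnoticed for ordinary text.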
msg140796	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2011-07-21 08:34
I think it would be better to use this code:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

Since this actually implements what the doc-string says.

Note that title case is not the same as upper case. Title case is a special case that gets applied when using a string as a title of a text and may well include characters that are lower case but which are only used in titles.
msg140798	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-07-21 08:52
Do you mean "if (!Py_UNICODE_ISLOWER(*s)) {" (with the '!')?

This sounds fine to me, but with this approach all the uncased characters will go through a Py_UNICODE_TO* macro, whereas with the current code only the cased ones are converted. I'm not sure this matters too much though.

OTOH if the non-lowercase cased chars are always either upper or titlecased, checking for both should be equivalent.
msg140799	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2011-07-21 09:02
Ezio Melotti wrote:
>
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
> Do you mean "if (!Py_UNICODE_ISLOWER(*s)) {" (with the '!')?

Sorry, here's the correct version:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (!Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

> This sounds fine to me, but with this approach all the uncased characters will go through a Py_UNICODE_TO* macro, whereas with the current code only the cased ones are converted. I'm not sure this matters too much though.
>
> OTOH if the non-lowercase cased chars are always either upper or titlecased, checking for both should be equivalent.

AFAIK, there are characters that don't have a case mapping at all. It may also be the case that a non-cased character still has a lower/upper case mapping, e.g. for typographical reasons. Someone would have to check this against the current Unicode database.
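The trade-off discussed in the last two messages can be sketched in Python. This is a rough model under stated assumptions, not the C implementation: `str.lower()` on one character stands in for `Py_UNICODE_TOLOWER`, the function names are hypothetical, and the returned counter simply records how often the conversion would be invoked.

```python
def lower_rest_not_lower(rest):
    """Model of the corrected proposal: convert every char that is not lowercase."""
    out, calls = [], 0
    for c in rest:
        if not c.islower():
            c = c.lower()  # stands in for Py_UNICODE_TOLOWER
            calls += 1
        out.append(c)
    return ''.join(out), calls

def lower_rest_upper_or_title(rest):
    """Model of the earlier patch: convert only uppercase or titlecase chars."""
    out, calls = [], 0
    for c in rest:
        if c.isupper() or c.istitle():
            c = c.lower()
            calls += 1
        out.append(c)
    return ''.join(out), calls

# Uncased characters ('1', '!') go through the conversion only in the first model.
print(lower_rest_not_lower('oO1!'))
print(lower_rest_upper_or_title('oO1!'))
```

Both models return the same text for this input; they differ only in how many characters pass through the conversion, which is the cost Ezio mentions for uncased characters.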
msg140857	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-07-22 03:30
>>> import sys; hex(sys.maxunicode)
'0x10ffff'
>>> import unicodedata; unicodedata.unidata_version
'6.0.0'

import unicodedata
all_chars = list(map(chr, range(0x110000)))
Ll = [c for c in all_chars if unicodedata.category(c) == 'Ll']
Lu = [c for c in all_chars if unicodedata.category(c) == 'Lu']
Lt = [c for c in all_chars if unicodedata.category(c) == 'Lt']
Lo = [c for c in all_chars if unicodedata.category(c) == 'Lo']
Lm = [c for c in all_chars if unicodedata.category(c) == 'Lm']

>>> [len(x) for x in [Ll, Lu, Lt, Lo, Lm]]
[1759, 1436, 31, 97084, 210]
>>> sum(1 for c in Lu if c.lower() == c)
471  # uppercase chars with no lower
>>> sum(1 for c in Lt if c.lower() == c)
0  # titlecase chars with no lower
>>> sum(1 for c in Ll if c.upper() == c)
760  # lowercase chars with no upper
>>> sum(1 for c in Lo if c.upper() != c or c.title() != c or c.lower() != c)
0  # "Letter, other" chars with a different upper/title/lower case
>>> sum(1 for c in Lm if c.upper() != c or c.title() != c or c.lower() != c)
0  # "Letter, modifier" chars with a different upper/title/lower case
>>> sum(1 for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c))
85  # non-letter chars with a different upper/title/lower case
>>> [c for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c)]
['ͅ', 'Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ', 'Ⅷ', 'Ⅸ', 'Ⅹ', 'Ⅺ', 'Ⅻ', 'Ⅼ', 'Ⅽ', 'Ⅾ', 'Ⅿ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'Ⓐ', 'Ⓑ', 'Ⓒ', 'Ⓓ', 'Ⓔ', 'Ⓕ', 'Ⓖ', 'Ⓗ', 'Ⓘ', 'Ⓙ', 'Ⓚ', 'Ⓛ', 'Ⓜ', 'Ⓝ', 'Ⓞ', 'Ⓟ', 'Ⓠ', 'Ⓡ', 'Ⓢ', 'Ⓣ', 'Ⓤ', 'Ⓥ', 'Ⓦ', 'Ⓧ', 'Ⓨ', 'Ⓩ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ']
>>> list(c.lower() for c in _)
['ͅ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ']
>>> len(_)
85
>>> {unicodedata.category(c) for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c)}
{'So', 'Mn', 'Nl'}

So == Symbol, Other
Mn == Mark, Nonspacing
Nl == Number, Letter
msg140858	Author: py.user (py.user) *	Date: 2011-07-22 04:26
>>> [c for c in all_chars if c not in L and ...

L ?
msg140859	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-07-22 04:34
L = set(sum([Ll, Lu, Lt, Lo, Lm], []))
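With `L` defined, the survey of cased non-letter characters can be condensed into a self-contained script. This is a sketch assuming a current Python 3; the helper names `has_case_mapping` and `cased_non_letters` are illustrative, and the counts depend on the Unicode version bundled with the interpreter, so they need not match the 6.0.0 figures quoted above.

```python
import sys
import unicodedata

# The five letter categories that make up L in the session above.
LETTER_CATEGORIES = {'Ll', 'Lu', 'Lt', 'Lo', 'Lm'}

def has_case_mapping(ch):
    """True if upper(), title() or lower() changes the character."""
    return ch.upper() != ch or ch.title() != ch or ch.lower() != ch

def cased_non_letters():
    """Collect characters outside the letter categories that still have a case mapping."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) not in LETTER_CATEGORIES and has_case_mapping(ch):
            found.append(ch)
    return found

non_letters = cased_non_letters()
categories = {unicodedata.category(c) for c in non_letters}
print(unicodedata.unidata_version, len(non_letters), sorted(categories))
```

Checking the category against a small set avoids building the large `L` set and the per-category lists, at the cost of not reporting the per-category counts.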
msg142071	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-08-14 17:58
Attached patch + tests.
msg142099	Author: Roundup Robot (python-dev) (Python triager)	Date: 2011-08-15 06:22
New changeset c34772013c53 by Ezio Melotti in branch '3.2':
#12266: Fix str.capitalize() to correctly uppercase/lowercase titlecased and cased non-letter characters.
http://hg.python.org/cpython/rev/c34772013c53

New changeset eab17979a586 by Ezio Melotti in branch '2.7':
#12266: Fix str.capitalize() to correctly uppercase/lowercase titlecased and cased non-letter characters.
http://hg.python.org/cpython/rev/eab17979a586
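A quick way to check the behaviour described in the changeset message, assuming a Python 3 interpreter that includes this fix. The sample strings and the helper name `rest_is_lowercased` are illustrative and are not taken from the committed tests; the check looks only at the tail of the result, because the handling of the first character has varied between Python versions.

```python
# Each sample starts lowercase and continues with non-lowercase cased characters.
samples = {
    'titlecase letters': '\u1ff3\u1ff3\u1ffc\u1ffc',
    'circled letters': '\u24d0\u24b7\u24b8',
    'roman numerals': '\u2170\u2161\u2162',
}

def rest_is_lowercased(s):
    """True if s.capitalize() ends with the lowercased remainder of s."""
    return s.capitalize().endswith(s[1:].lower())

for label, s in samples.items():
    print(label, rest_is_lowercased(s))
```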
msg142100	Author: Roundup Robot (python-dev) (Python triager)	Date: 2011-08-15 06:26
New changeset 1ea72da11724 by Ezio Melotti in branch 'default':
#12266: merge with 3.2.
http://hg.python.org/cpython/rev/1ea72da11724
msg142101	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-08-15 06:31
Fixed, thanks for the report!
msg142103	Author: Roundup Robot (python-dev) (Python triager)	Date: 2011-08-15 07:04
New changeset d3816fa1bcdf by Ezio Melotti in branch '2.7':
#12266: move the tests in test_unicode.
http://hg.python.org/cpython/rev/d3816fa1bcdf

History
Date	User	Action	Args
2022-04-11 14:57:18	admin	set	github: 56475
2011-08-15 07:04:58	python-dev	set	messages: + msg142103
2011-08-15 06:31:57	ezio.melotti	set	status: open -> closed resolution: duplicate -> fixed messages: + msg142101
2011-08-15 06:26:43	python-dev	set	messages: + msg142100
2011-08-15 06:22:45	python-dev	set	nosy: + python-dev messages: + msg142099
2011-08-14 17:58:37	ezio.melotti	set	files: + issue12266.diff keywords: + patch messages: + msg142071
2011-07-22 04:34:09	ezio.melotti	set	messages: + msg140859
2011-07-22 04:26:16	py.user	set	messages: + msg140858
2011-07-22 03:30:13	ezio.melotti	set	messages: + msg140857
2011-07-21 09:02:34	lemburg	set	messages: + msg140799
2011-07-21 08:52:55	ezio.melotti	set	messages: + msg140798
2011-07-21 08:34:22	lemburg	set	messages: + msg140796
2011-07-21 04:59:52	ezio.melotti	set	status: closed -> open assignee: ezio.melotti messages: + msg140780 versions: + Python 2.7, Python 3.2, Python 3.3, - Python 3.1
2011-07-21 04:20:56	ezio.melotti	set	nosy: + lemburg, belopolsky, ezio.melotti, eric.araujo
2011-06-06 00:22:44	py.user	set	title: str.capitalize contradicts -> str.capitalize contradicts oneself
2011-06-06 00:21:55	py.user	set	type: behavior
2011-06-06 00:18:57	py.user	set	messages: + msg137720
2011-06-05 13:37:41	r.david.murray	set	status: open -> closed superseder: str.upper converts to title nosy: + r.david.murray messages: + msg137694 resolution: duplicate stage: resolved
2011-06-05 05:54:59	py.user	create
