Issue 10521: str methods don't accept non-BMP fillchar on a narrow Unicode build

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54730

classification

Title:	str methods don't accept non-BMP fillchar on a narrow Unicode build
Type:	behavior	Stage:	needs patch
Components:	Interpreter Core	Versions:	Python 3.2, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	out of date
Assigned To:	Nosy List:	amaury.forgeotdarc, belopolsky, benjamin.peterson, eric.smith, ezio.melotti, lemburg, pitrou, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-11-24 15:25 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue10521-isalpha.diff	ezio.melotti, 2010-11-25 06:28	Proof of concept that fixes isalpha
issue10521-unicode-next.diff	belopolsky, 2010-11-25 07:13

Messages (18)
msg122280 - (view)	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-24 15:25
>>> 'xyz'.center(20, '\U00100140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long

str.ljust and str.rjust are similarly affected.
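For comparison, here is a minimal sketch of the same calls on Python 3.3 or later, where (as noted at the end of this thread) PEP 393 removed the narrow/wide build distinction. The variable names are illustrative only.

```python
# A code point outside the BMP (plane 16, private use), as in the report.
fill = '\U00100140'

# On Python 3.3+, a str is a sequence of code points, so this is one character.
assert len(fill) == 1

centered = 'xyz'.center(20, fill)
left = 'xyz'.ljust(20, fill)
right = 'xyz'.rjust(20, fill)

# Only lengths are printed, to stay independent of the terminal encoding.
print(len(centered), len(left), len(right))
```

On a narrow 2.7 or 3.2 build, the same fill string has length 2, which is why the call in the report raises TypeError.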
msg122284 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2010-11-24 15:33
The question is, what should it do with such an input? Pretend it's a single char (but other chars in the source string won't get the same treatment)? Treat it as a two-char string (but then center() and friends should logically be extended to accept strings of arbitrary lengths)?
msg122285 - (view)	Author: Eric V. Smith (eric.smith) * (Python committer)	Date: 2010-11-24 15:57
str.__format__ and friends (int, float, complex) also have this same problem. For example, when they're computing the "fill" character:

>>> format('', 'x^')
''
>>> format('', '\U00100140^')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid conversion specification
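A short sketch of how the format mini-language treats the same fill character on Python 3.3 or later, assuming a build with PEP 393 strings. The values being formatted are arbitrary examples.

```python
fill = '\U00100140'  # non-BMP fill character from the report

# Fill and alignment with no width: nothing to pad, so the result is empty.
empty = format('', fill + '^')

# The fill character is accepted for str, int and float alike.
centered = format('abc', fill + '^7')       # two fill characters on each side
right_aligned = format(42, fill + '>6')     # four fill characters, then '42'
as_float = format(1.5, fill + '<8.2f')      # '1.50', then four fill characters

print(len(empty), len(centered), len(right_aligned), len(as_float))
```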
msg122296 - (view)	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-24 19:06
On Wed, Nov 24, 2010 at 10:33 AM, Antoine Pitrou <report@bugs.python.org> wrote:
..
> The question is, what should it do with such an input?

I think the rule for such functions should be that if input.encode('utf-8') is the same on wide and narrow builds, then the output.encode('utf-8') should be the same.

> Pretend it's a single char (but other chars in the source string won't get the same treatment)?

Yes, and surrogate pairs in the source string should count for one char as well.

> Treat it as a two-char string (but then center() and friends should logically be
> extended to accept strings of arbitrary lengths)?

No. For better or worse, on wide builds these methods effectively operate on code points. They don't interpret multi-code-point graphemes or take grapheme width into account:

-------------------- 123 --------------------

Application code has to ascertain that it is dealing with fixed-width characters in the target font before using these methods for text alignment.
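The point about graphemes can be illustrated with a small sketch. It assumes Python 3.3+ and uses an example string that is not from the thread: the same visible character written as one code point and as two.

```python
import unicodedata

decomposed = 'e\u0301'                                # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize('NFC', decomposed)   # single code point U+00E9

# Both render as one grapheme, but center() counts code points.
padded_decomposed = decomposed.center(6, '-')
padded_composed = composed.center(6, '-')

# Same total length in code points, different amount of visible padding.
print(len(decomposed), len(composed))
print(padded_decomposed.count('-'), padded_composed.count('-'))
```

Both results have length 6, yet the decomposed form receives one fewer fill character, so the two outputs do not line up visually.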
msg122310 - (view)	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2010-11-24 20:37
Alexander Belopolsky wrote:
>
> New submission from Alexander Belopolsky <belopolsky@users.sourceforge.net>:
>
> >>> 'xyz'.center(20, '\U00100140')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long
>
> str.ljust and str.rjust are similarly affected.

I don't think we should change that for the formatting methods. See my reply on python-dev:

str.center(n) centers the string in a padded string that is composed of n code units. Whether that operation will result in a text that's centered visually on output is a completely different story.

The original string could contain surrogates, it could also contain combining code points, so the visual presentation of the result may very well not be centered at all; it may not even appear as having the length n to the user.

Since we're not going to change the semantics of those APIs, it is OK to not support padding with non-BMP code points on UCS-2 builds. Supporting such cases would only cause problems:

* if the methods would pad with surrogates, the resulting string would no longer have length n, breaking the assumption that len(str.center(n)) == n

* if the methods would pad with half the number of surrogates to make sure that len(str.center(n)) == n, the resulting output to e.g. a terminal would be further off than what you already have with surrogates and combining code points in the original string.
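The arithmetic behind the two bullet points can be sketched on a modern Python by counting UTF-16 code units, which is what len() reported on a narrow build. The helper utf16_units is hypothetical and exists only for this illustration.

```python
def utf16_units(s):
    """Number of UTF-16 code units in s (len() on a narrow build)."""
    return len(s.encode('utf-16-le')) // 2

fill = '\U00100140'
width = 20

# Padding by code points, as a wide (or PEP 393) build does.
padded = 'xyz'.center(width, fill)
code_points = len(padded)          # equals width
code_units = utf16_units(padded)   # 3 + 17 fill characters * 2 units each

# Padding so that the code-unit length stays at width instead:
units_to_fill = width - utf16_units('xyz')
whole_pairs, leftover_units = divmod(units_to_fill, 2)

print(code_points, code_units)
print(whole_pairs, leftover_units)
```

The first case shows the length assumption failing when measured in code units; the second shows that 17 units cannot be filled with whole surrogate pairs.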
msg122329 - (view)	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-25 05:01
On Wed, Nov 24, 2010 at 3:37 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
> I don't think we should change that for the formatting methods.

That's a reasonable position. What about

>>> unicodedata.category('\N{OLD ITALIC LETTER A}')
'Lo'
>>> '\N{OLD ITALIC LETTER A}'.isalpha()
False

The str.isalpha() method is underspecified in the reference manual, but a comment in unicodectype.c describes Py_UNICODE_ISALPHA as follows:

/* Returns 1 for Unicode characters having the category 'Ll', 'Lu', 'Lt',
   'Lo' or 'Lm', 0 otherwise. */

I don't have a wide build handy, but I am fairly sure '\N{OLD ITALIC LETTER A}'.isalpha() would produce True there. The result above is simply a consequence of surrogates being considered non-letters:

>>> [c.isalpha() for c in '\N{OLD ITALIC LETTER A}']
[False, False]
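A sketch that reproduces both views on Python 3.3+: the whole code point, and the two surrogates a narrow build would have iterated over. The surrogates are computed by hand here because a modern str no longer stores them.

```python
import unicodedata

ch = '\N{OLD ITALIC LETTER A}'   # U+10300, outside the BMP

whole_category = unicodedata.category(ch)
whole_isalpha = ch.isalpha()

# Split the code point into the UTF-16 surrogate pair used on narrow builds.
offset = ord(ch) - 0x10000
high = chr(0xD800 + (offset >> 10))
low = chr(0xDC00 + (offset & 0x3FF))

pair_categories = [unicodedata.category(c) for c in (high, low)]
pair_isalpha = [c.isalpha() for c in (high, low)]

print(whole_category, whole_isalpha)
print(pair_categories, pair_isalpha)
```

Each surrogate on its own has category 'Cs', which is not one of the letter categories listed in the unicodectype.c comment.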
msg122330 - (view)	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-25 05:03
Here is another str method not ready for non-BMP chars:

>>> u = '\U00010140'
>>> u.translate({ord(u):ord('A')})
'𐅀'

(expected 'A')

>>> u = 'B'
>>> u.translate({ord(u):ord('A')})
'A'
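For reference, a sketch of the expected behavior, assuming Python 3.3+ where the translation table is looked up per code point rather than per code unit.

```python
u = '\U00010140'                 # non-BMP character from the example above
table = {ord(u): ord('A')}       # map that single code point to 'A'

translated = u.translate(table)

# Characters absent from the table pass through unchanged.
mixed = ('x' + u + 'y').translate(table)

print(translated, mixed)
```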
msg122336 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2010-11-25 06:28
I think that methods like str.isalpha can and should be fixed. Since _PyUnicode_IsAlpha now accepts a Py_UCS4, the body of unicode_isalpha can be changed to convert normal chars and surrogate pairs to a Py_UCS4 before calling Py_UNICODE_ISALPHA.

The attached patch is a proof of concept of this approach and returns True for '\N{OLD ITALIC LETTER A}'.isalpha() on a narrow build. It still has a number of issues that should be addressed (check for narrow builds, check for lone surrogates, check for high surrogate at the end of a string, fix compiler warnings ...) but it should be good enough as a PoC.

I would also suggest introducing a set of macros to handle surrogates (e.g. detect, combine) and using it in all the functions that need to work with them.
msg122339 - (view)	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-25 07:13
Here is another proof of concept patch for the isalpha issue that introduces a higher level abstraction macro - Py_UNICODE_NEXT. It should be possible to reuse this macro in all isxyz methods and other places where surrogates are currently processed. It should be possible to come up with a pure macro definition of Py_UNICODE_NEXT.
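The idea of a "next code point" abstraction can be sketched in Python. This is not the content of either attached patch; the function names iter_code_points, to_units and isalpha_units are hypothetical, and the logic only illustrates combining a surrogate pair into one code point before classifying it.

```python
import struct

def iter_code_points(units):
    """Yield code points from a list of UTF-16 code units (ints).

    A high surrogate followed by a low surrogate is combined into one
    code point; a lone surrogate is yielded unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        i += 1
        if 0xD800 <= unit <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            unit = 0x10000 + ((unit - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        yield unit

def to_units(s):
    """Return the UTF-16 code units of a well-formed string as ints."""
    data = s.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))

def isalpha_units(units):
    """isalpha() over code units, classifying whole code points."""
    code_points = list(iter_code_points(units))
    return bool(code_points) and all(chr(cp).isalpha() for cp in code_points)

units = to_units('a\N{OLD ITALIC LETTER A}')
print([hex(u) for u in units], isalpha_units(units))
```

A lone high surrogate at the end of the input falls through the pair check and is classified on its own, which covers one of the edge cases listed for the first proof of concept.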
msg122340 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2010-11-25 07:44
issue9200 already proposes a similar change to str.is* methods.
msg122483 - (view)	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2010-11-26 23:37
As a practical matter, I think that for at least the next decade, people are at least as likely to want to fill with a composed, multi-BMP-codepoint 'char' (grapheme) as with a non-BMP char. So to me, failure with the latter is no worse than failure with the former. The underlying problem is that centering k chars within n spaces with fill i is based on one-char per code encodings and fixed pitch fonts with one-char per space. That model is not universally applicable, so I do not consider it a bug that functions based on that model are also not universally applicable. Perhaps docs should be clearer about the limitations of many of the string methods in the new context. A full general solution to the general problem of centering requires a shift to physical units (points or mm) and detailed font information, including kerning. This is beyond the scope of a string method. So I consider this a feature request for a partial generalization of unclear utility and unclear definition.
msg122487 - (view)	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-27 00:08
On Fri, Nov 26, 2010 at 6:37 PM, Terry J. Reedy <report@bugs.python.org> wrote:
>
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
>
> As a practical matter, I think that for at least the next decade, people are at least as likely to
> want to fill with a composed, multi-BMP-codepoint 'char' (grapheme) as with a non-BMP char.
> So to me, failure with the latter is no worse than failure with the former.
>

I disagree. '\N{AEGEAN WORD SEPARATOR DOT}' ('𐄁') looks like a reasonably shaped fill character, while, say, 'Z\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}' ('Ź̀') does not. Yet this is not the point of this bug report. The point is that a Python user should not care (much) about how many bytes per character Python uses under the hood or what the numeric value is of a character that she can enter in her program.

> The underlying problem is that centering k chars within n spaces with fill i is based
> on one-char per code encodings and fixed pitch fonts with one-char per space.

No. ' Section Title '.center(40, '*') will look good regardless of font width, and even more so when combined with a <center> tag or its equivalent in a given application.
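A sketch of the three fill characters discussed in this exchange, assuming Python 3.3+. The non-BMP fill is accepted there, while the composed grapheme is rejected on every build because it is more than one code point long.

```python
title = ' Section Title '

# Plain ASCII fill.
banner = title.center(40, '*')

# Non-BMP fill: a single code point on Python 3.3+.
aegean = title.center(40, '\N{AEGEAN WORD SEPARATOR DOT}')

# Composed grapheme: three code points, so it is not a valid fill character.
grapheme = 'Z\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}'
try:
    title.center(40, grapheme)
    rejected = False
except TypeError:
    rejected = True

print(banner)
print(len(aegean), len(grapheme), rejected)
```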
msg122488 - (view)	Author: Eric V. Smith (eric.smith) * (Python committer)	Date: 2010-11-27 00:25
I think these macros would be a reasonable approach. I think str.center, etc. should support non-BMP chars, because to not do so can raise an exception. Supporting composed graphemes seems like another problem altogether. And while we could fix that, it's clearly a larger step.
msg122507 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2010-11-27 08:16
I agree that s.center(n, char).encode('utf-8') should be the same on both builds, even if their len() will be different, for the following reasons:

1) the string will eventually be encoded, and if the result is the same on both builds, it will look the same too;

2) trying to keep the same len() will generate different results, and it won't work in the case of an odd width like 'foo'.center(5, surrogate_pair), because you can't insert half of a surrogate pair.
msg122548 - (view)	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2010-11-27 20:26
After reading the additional messages here and on a similar issue Alexander opened after this, I see the point of wanting to make the difference between the two types of builds as transparent as sensibly possible. From that viewpoint, rejection of composed chars is not as bad because both types of builds act the same.
msg144630 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-09-29 20:25
This issue has been fixed in Python 3.3 thanks to PEP 393.
msg144632 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-09-29 20:28
It can still be fixed on 2.7/3.2 though.
msg150691 - (view)	Author: Benjamin Peterson (benjamin.peterson) * (Python committer)	Date: 2012-01-05 21:12
I'm just going to close this and say "use 3.3".

History
Date	User	Action	Args
2022-04-11 14:57:09	admin	set	github: 54730
2012-01-05 21:12:48	benjamin.peterson	set	status: open -> closed nosy: + benjamin.peterson messages: + msg150691 resolution: out of date
2011-09-29 20:28:18	ezio.melotti	set	messages: + msg144632 versions: + Python 2.7
2011-09-29 20:25:00	vstinner	set	nosy: + vstinner messages: + msg144630
2010-11-27 20:26:03	terry.reedy	set	messages: + msg122548
2010-11-27 08:16:12	ezio.melotti	set	messages: + msg122507
2010-11-27 00:25:09	eric.smith	set	messages: + msg122488
2010-11-27 00:08:30	belopolsky	set	messages: + msg122487
2010-11-26 23:37:24	terry.reedy	set	nosy: + terry.reedy messages: + msg122483
2010-11-25 07:44:36	amaury.forgeotdarc	set	messages: + msg122340
2010-11-25 07:31:12	ezio.melotti	set	nosy: + amaury.forgeotdarc
2010-11-25 07:13:24	belopolsky	set	files: + issue10521-unicode-next.diff messages: + msg122339
2010-11-25 06:28:48	ezio.melotti	set	files: + issue10521-isalpha.diff keywords: + patch messages: + msg122336
2010-11-25 05:03:54	belopolsky	set	messages: + msg122330
2010-11-25 05:01:52	belopolsky	set	messages: - msg122313
2010-11-25 05:01:37	belopolsky	set	messages: + msg122329
2010-11-24 21:23:57	belopolsky	set	messages: + msg122313
2010-11-24 20:37:48	lemburg	set	nosy: + lemburg messages: + msg122310
2010-11-24 19:55:56	ezio.melotti	set	nosy: + ezio.melotti
2010-11-24 19:06:15	belopolsky	set	messages: + msg122296
2010-11-24 15:57:33	eric.smith	set	nosy: + eric.smith messages: + msg122285
2010-11-24 15:33:42	pitrou	set	nosy: + pitrou messages: + msg122284
2010-11-24 15:25:23	belopolsky	create
