This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2010年01月07日 01:07 by ezio.melotti, last changed 2022年04月11日 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| unicode_formatchar.patch | vstinner, 2010年01月07日 01:55 | |
| issue7649.diff | ezio.melotti, 2010年01月07日 13:35 | Patch to raise an error + unittests |
| issue7649v2.diff | ezio.melotti, 2010年02月22日 19:00 | Patch to raise an error + unittests |
| issue7649v3.diff | ezio.melotti, 2010年02月23日 13:45 | Patch to raise an error + unittests |
| issue7649v4.diff | ezio.melotti, 2010年02月25日 14:46 | Patch to raise an error + unittests |
| Messages (35) | |||
|---|---|---|---|
| msg97334 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年01月07日 01:07 | |
On Py2.x, u'%c' % char returns the wrong result if char is in the range '\x80'-'\xFF':
>>> u'%c' % '\x7f'
u'\x7f'    # correct
>>> u'%c' % '\x80'
u'\uff80'  # broken
>>> u'%c' % '\xFF'
u'\uffff'  # broken
This is what happens with %s and 0x80:
>>> u'%s' % '\x80'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
>>> u'%c' % 0x80
u'\x80'
u'%c' % '\x80' should raise the same error raised by u'%s' % '\x80'. |
|||
| msg97336 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年01月07日 01:41 | |
This looks like a signed vs. unsigned char problem. What platform are you on? |
|||
| msg97338 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年01月07日 01:55 | |
The problem is the char => Py_UNICODE cast in the formatchar() function. char may be unsigned, but most of the time it's signed, and a signed => unsigned cast extends the sign.
Example: (unsigned short)(signed char)0x80 gives 0xFF80.
The attached patch fixes the issue and includes a unit test. |
|||
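As a stand-alone illustration of that cast (a demo only, not part of any patch attached here, and assuming a platform where plain char is signed), the following C program reproduces the sign extension for a 2-byte Py_UNICODE (narrow build) and a 4-byte Py_UNICODE (wide build), and shows how casting through unsigned char avoids it:

```c
#include <stdio.h>

int main(void)
{
    char c = '\x80';  /* plain char is signed on most platforms */

    unsigned short narrow = (unsigned short)c;                 /* like a 2-byte Py_UNICODE */
    unsigned int   wide   = (unsigned int)c;                   /* like a 4-byte Py_UNICODE */
    unsigned short fixed  = (unsigned short)(unsigned char)c;  /* no sign extension */

    printf("narrow build:  0x%04X\n", (unsigned)narrow);  /* 0xFF80     -> u'\uff80'     */
    printf("wide build:    0x%08X\n", wide);               /* 0xFFFFFF80 -> u'\Uffffff80' */
    printf("unsigned cast: 0x%04X\n", (unsigned)fixed);    /* 0x0080     -> u'\x80'       */
    return 0;
}
```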
| msg97339 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年01月07日 02:01 | |
The problem is specific to Python 2.x. With Python 3, "%c" expects one unicode character (e.g. "a"). My patch fixes the char => Py_UNICODE conversion, but raising an error may be better, to be consistent with u"%s" % "\x80" (and to prepare the migration to Python 3). |
|||
| msg97340 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年01月07日 02:48 | |
Shouldn't it work the same as it does for integers?
>>> u'%c' % 0x7f
u'\x7f'
>>> u'%c' % '\x7f'
u'\x7f'
>>> u'%c' % 0x80
u'\x80'
>>> u'%c' % '\x80'
u'\uff80'
That would imply to me it shouldn't be an error, it should just return u'\x80'. |
|||
| msg97349 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年01月07日 11:49 | |
If we allow it to work on 2.7, the code will break:
1) when ported to Py3, where mixing byte strings and unicode is not allowed;
2) when used on Py<2.7, where the behavior is broken;
3) when converted to str.format, where only ints are accepted.
If we raise an error, the user will have to use either unicode strings or ints, and will avoid further problems. Moreover, the fact that no one noticed these problems means that this is not a common operation, so no one probably expects it to work anyway, and raising an error could even help find the problem if someone used %c in older versions. |
|||
| msg97350 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年01月07日 11:55 | |
There's no perfect answer. And since we've gotten this far without anyone caring, and 2.7 is basically the end of life for this issue, perhaps doing nothing is the best course. Any change we make will affect code that runs in both 2.6 and 2.7; doing nothing preserves compatibility on the 2.x side. |
|||
| msg97353 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月07日 13:02 | |
Tested on 2.5...
~ $ python2.5
Python 2.5.2 (r252:60911, Jan  4 2009, 21:59:32)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'%c' % '\x80'
u'\Uffffff80'
>>>
On 2.7 (trunk) I get the same behaviour as described in msg97334. |
|||
| msg97354 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年01月07日 13:35 | |
First patch that makes u'%c' % '\x80' raise a UnicodeDecodeError. I could reproduce the problem on Linux 32/64bit, Windows 32bit and Python from 2.4 to 2.7. The '\Uffffff80' is returned by wide builds. |
|||
| msg98108 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年01月21日 11:43 | |
@Ezio: Your patch leaks a reference: the result of PyUnicode_FromString(...) is not destroyed (Py_DECREF'ed) on success. |
|||
| msg99811 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月22日 19:00 | |
New patch (issue7649v2.diff) with refleak fixed and improved unittests. |
|||
| msg99856 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年02月22日 22:20 | |
Could you add a comment on why you're calling PyUnicode_FromString and then throwing away the result? I believe it's so you'll get the same error checking as PyUnicode_FromString, but it's sufficiently tricky that I think it deserves a comment.
Now that I think about it, having done the conversion in PyUnicode_FromString, why not use:
    buf[0] = PyUnicode_AS_UNICODE(s)[0];
? This way you skip the cast and you know for a fact you're using the same interpretation as PyUnicode_FromString, having used its output.
Also, since you know the size you may as well call PyUnicode_FromStringAndSize. It's trivially faster and makes the intent clearer, I think. |
|||
| msg99900 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月23日 06:18 | |
Indeed, some comments would be helpful; I'll add them. I also already tried to reuse 's' and extract the first char using unicode_getitem, but it returns a PyObject and anyway it's probably more expensive/complicated than calling PyString_AS_STRING again. I'll try with [0] and/or with PyUnicode_FromStringAndSize and make a new patch this afternoon. |
|||
| msg99912 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月23日 13:45 | |
Here's a new patch that uses PyUnicode_FromStringAndSize and PyUnicode_AS_UNICODE(s)[0]. I also added some comments and tested it and it seems to work fine. |
|||
| msg99914 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年02月23日 13:51 | |
The patch looks good to me, in that it implements your desired functionality well. I haven't been following the issue closely enough to know whether or not this new functionality is the right way to go. (I'm not saying it's not, just that I don't have an opinion on it.) |
|||
| msg99973 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年02月23日 23:23 | |
Committed: r78392 (r78394). |
|||
| msg100010 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年02月24日 08:15 | |
> --- python/trunk/Objects/unicodeobject.c (original)
> +++ python/trunk/Objects/unicodeobject.c Wed Feb 24 00:16:07 2010
> @@ -8170,6 +8170,7 @@
>          size_t buflen,
>          PyObject *v)
>  {
> +    PyObject *s;
>      /* presume that the buffer is at least 2 characters long */
>      if (PyUnicode_Check(v)) {
>          if (PyUnicode_GET_SIZE(v) != 1)
> @@ -8180,7 +8181,14 @@
>      else if (PyString_Check(v)) {
>          if (PyString_GET_SIZE(v) != 1)
>              goto onError;
> -        buf[0] = (Py_UNICODE)PyString_AS_STRING(v)[0];
> +        /* #7649: if the char is a non-ascii (i.e. in range(0x80,0x100)) byte
> +           string, "u'%c' % char" should fail with a UnicodeDecodeError */
> +        s = PyUnicode_FromStringAndSize(PyString_AS_STRING(v), 1);
> +        /* if the char is not decodable return -1 */
> +        if (s == NULL)
> +            return -1;
> +        buf[0] = PyUnicode_AS_UNICODE(s)[0];
> +        Py_DECREF(s);
That's a *very* inefficient way of doing this.
Could you please check for chars above 0x7f first and then use PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API (this API should not have been backported from Python 3.x into Python 2.6, but it's too late to change that now)!
Thanks.
|
|||
| msg100018 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2010年02月24日 09:44 | |
> Could you please check for chars above 0x7f first and then use
> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii). But why is it necessary to check for chars above 0x7f?
> (this API should not have been backported from Python 3.x
> into Python 2.6,
This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example. |
|||
| msg100021 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年02月24日 10:02 | |
Amaury Forgeot d'Arc wrote:
>> Could you please check for chars above 0x7f first and then use
>> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
> I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
> But why is it necessary to check for chars above 0x7f?
The Python default encoding has to be ASCII compatible, so it's better to use a short-cut for pure-ASCII characters and avoid the complete round-trip via a temporary Unicode object.
>> (this API should not have been backported from Python 3.x
>> into Python 2.6,
> This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
The function is really just yet another interface to the PyUnicode_DecodeUTF8() API, and its name is misleading in that: Python 2.x uses the default encoding for converting strings without a known encoding to Unicode, the docs for the API say that it decodes Latin-1 (!), and the interface makes it look like a drop-in replacement for PyString_FromStringAndSize(), which it isn't for Python 2.x.
For Python 3.x, the default encoding is fixed to UTF-8, so the situation is different (though the docs are still wrong); however, I don't see the advantage of using a less explicit name over the direct use of PyUnicode_DecodeUTF8(). |
|||
| msg100023 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2010年02月24日 10:39 | |
>> But why is it necessary to check for chars above 0x7f?
> The Python default encoding has to be ASCII compatible,
Yes, but it is not necessarily as strict. For example, after I managed to set the default encoding to latin-1, u"%s" % chr(0x80) works; I suppose %c should do the same. |
|||
| msg100024 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年02月24日 10:50 | |
Amaury Forgeot d'Arc wrote:
>>> But why is it necessary to check for chars above 0x7f?
>> The Python default encoding has to be ASCII compatible,
> Yes, but it is not necessarily as strict.
> For example, after I managed to set the default encoding to latin-1,
> u"%s" % chr(0x80) works; I suppose %c should do the same.
Right, and it will. I only meant the check as a fast path for characters <0x80. All other characters would need to go through PyUnicode_Decode(). We can use such a fast path, since the default encoding has to be ASCII compatible. |
|||
| msg100025 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月24日 10:59 | |
I was trying to decode mainly to get a UnicodeDecodeError automatically. If I check whether the char is outside the ASCII range (i.e. >0x7F), I think I'd have to set the error message for the UnicodeDecodeError manually and possibly duplicate it (unless we use a different error message that says that %c accepts only ASCII chars).
Also, I agree that if u'%s' % chr(0x80) works when the default encoding is not ASCII, then %c should work as well. Trying to decode it with the default encoding and letting the UnicodeDecodeError propagate seems like a good solution to me (and performance shouldn't be a problem here, since apparently no one uses u'%c' with non-ASCII byte strings). I will try to make another patch. |
|||
| msg100027 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2010年02月24日 11:03 | |
Marc-André's remark was that if char<0x80 (the vast majority), it's not necessary to call any decode function: just copy the byte. Other cases (error, or non-default encoding) may use a slower path. |
|||
| msg100028 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月24日 11:07 | |
Ok, adding a fast path shouldn't make the code more complicated, so I'll include it in the patch, thanks for the feedback. |
|||
| msg100059 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年02月24日 18:03 | |
I'm working on a similar issue for int.__format__('c'). What's not clear to me is why this doesn't work the same as chr(i). That is, shouldn't chr(i) == ('%c' % i) hold for i in range(256)? And if that's so, why not just copy chr's implementation:
if (x < 0 || x >= 256) {
    PyErr_SetString(PyExc_ValueError,
                    "chr() arg not in range(256)");
    return NULL;
}
s[0] = (char)x;
return PyString_FromStringAndSize(s, 1);
|
|||
| msg100060 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2010年02月24日 18:09 | |
The operation here is different: u'%c' % chr(i), which involves a str->unicode conversion. |
|||
| msg100062 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年02月24日 18:17 | |
Of course. Sorry about that. But then why not copy unichr()? It calls PyUnicode_FromOrdinal. I guess my real question is: Should "'%c' % i" be identical to "chr(i)" and should "u'%c' % i" be identical to "unichr(i)"? And by "identical" I mean return the same results and throw the same errors (with a possible difference in the error text). |
|||
| msg100063 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月24日 18:25 | |
At least from my point of view, the difference between ints and chars is:
* "u'%c' % 0xB5" means "create a Unicode char with codepoint 0xB5", i.e. the Unicode char μ U+00B5 MICRO SIGN;
* "u'%c' % '\xB5'" means "create a Unicode char converting the byte '\xB5' to Unicode", i.e. an Е U+0415 CYRILLIC CAPITAL LETTER IE if the byte string is encoded in iso-8859-5, or ต U+0E15 THAI CHARACTER TO TAO if it's encoded in iso-8859-11, and so on for every different encoding. |
|||
| msg100065 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月24日 18:47 | |
Just to clarify: "u'%c' % some_int" should behave like unichr(some_int); "u'%c' % some_chr" should behave like "u'%s' % some_chr". |
|||
| msg100070 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年02月24日 21:40 | |
Ezio Melotti wrote:
> Just to clarify:
> "u'%c' % some_int" should behave like unichr(some_int);
> "u'%c' % some_chr" should behave like "u'%s' % some_chr".
That's correct. I guess that in practice "u'%c' % some_chr" is not all that common. Otherwise, the fact that it doesn't raise an error if some_chr is not compatible with the default encoding would have been noticed earlier on. |
|||
| msg100087 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月25日 14:46 | |
The latest patch (issue7649v4.diff) checks whether the char is ASCII or non-ASCII: if the char is ASCII, it converts it directly to Unicode; otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails.
I tested it setting iso-8859-1 and utf-8 as the default encoding and the behavior was consistent with "%s"; however, the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not.
(Also, http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.) |
|||
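For readers following along, here is a minimal sketch of the approach described above (the helper name and layout are hypothetical illustrations, not the committed issue7649v4.diff; it assumes the Python 2.x C API and that v is already known to be a length-1 str, as in the formatchar() hunk quoted in msg100010):

```c
#include <Python.h>

/* Hypothetical helper (not CPython source): convert the single byte of a
   length-1 str to a Py_UNICODE the same way u'%s' would.  ASCII bytes are
   copied directly; anything else is decoded with the default encoding
   (encoding == NULL), so a UnicodeDecodeError propagates on failure. */
static int
single_byte_to_unicode(PyObject *v, Py_UNICODE *buf)
{
    unsigned char ch = (unsigned char)PyString_AS_STRING(v)[0];

    if (ch < 0x80) {
        /* fast path: plain ASCII, no codec round-trip needed */
        buf[0] = (Py_UNICODE)ch;
        return 0;
    }
    else {
        /* non-ASCII byte: go through the default encoding */
        PyObject *s = PyUnicode_Decode(PyString_AS_STRING(v), 1, NULL, NULL);
        if (s == NULL)
            return -1;   /* not decodable: let the error propagate */
        buf[0] = PyUnicode_AS_UNICODE(s)[0];
        Py_DECREF(s);
        return 0;
    }
}
```

With this shape, u'%c' % '\x80' fails or succeeds exactly like u'%s' % '\x80' under whatever default encoding is in effect.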
| msg100091 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年02月25日 16:43 | |
Ezio Melotti wrote:
> The latest patch (issue7649v4.diff) checks whether the char is ASCII or non-ASCII: if the char is ASCII, it converts it directly to Unicode; otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails.
Thanks. The patch looks good now... but doesn't apply cleanly anymore, since your first version has already made it into trunk and the 2.6 branch.
> I tested it setting iso-8859-1 and utf-8 as the default encoding and the behavior was consistent with "%s"; however, the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not.
I think that's fine. While we do still allow setting the default to something other than ASCII in 2.x, we don't support such tricks, so there's no need to test for them.
> (Also, http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.)
The last part of that sentence should be removed. |
|||
| msg100092 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月25日 17:06 | |
The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version). About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding(). |
|||
| msg100093 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年02月25日 17:13 | |
Ezio Melotti wrote:
> The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version).
Ah, sorry, my fault.
> About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding().
I guess it's better to open a separate ticket for that. |
|||
| msg100095 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010年02月25日 17:58 | |
issue7649v4.diff has been committed in r78449 (trunk) and r78450 (release26-maint). |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:56:56 | admin | set | github: 51898 |
| 2010年02月25日 17:58:08 | ezio.melotti | set | status: open -> closed resolution: fixed messages: + msg100095 stage: patch review -> resolved |
| 2010年02月25日 17:13:29 | lemburg | set | messages: + msg100093 |
| 2010年02月25日 17:06:38 | ezio.melotti | set | messages: + msg100092 |
| 2010年02月25日 16:43:15 | lemburg | set | messages: + msg100091 |
| 2010年02月25日 14:46:19 | ezio.melotti | set | files: + issue7649v4.diff messages: + msg100087 stage: resolved -> patch review |
| 2010年02月24日 21:40:13 | lemburg | set | messages: + msg100070 |
| 2010年02月24日 18:47:14 | ezio.melotti | set | messages: + msg100065 |
| 2010年02月24日 18:25:59 | ezio.melotti | set | messages: + msg100063 |
| 2010年02月24日 18:17:55 | eric.smith | set | messages: + msg100062 |
| 2010年02月24日 18:09:27 | amaury.forgeotdarc | set | messages: + msg100060 |
| 2010年02月24日 18:03:05 | eric.smith | set | messages: + msg100059 |
| 2010年02月24日 11:07:07 | ezio.melotti | set | messages: + msg100028 |
| 2010年02月24日 11:03:12 | amaury.forgeotdarc | set | messages: + msg100027 |
| 2010年02月24日 10:59:33 | ezio.melotti | set | messages: + msg100025 |
| 2010年02月24日 10:50:15 | lemburg | set | messages: + msg100024 |
| 2010年02月24日 10:39:33 | amaury.forgeotdarc | set | messages: + msg100023 |
| 2010年02月24日 10:02:42 | lemburg | set | messages: + msg100021 |
| 2010年02月24日 09:44:00 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg100018 |
| 2010年02月24日 08:15:41 | lemburg | set | status: closed -> open nosy: + lemburg messages: + msg100010 resolution: fixed -> (no value) |
| 2010年02月24日 06:43:35 | ezio.melotti | set | stage: patch review -> resolved |
| 2010年02月23日 23:23:38 | vstinner | set | status: open -> closed resolution: fixed messages: + msg99973 |
| 2010年02月23日 13:51:36 | eric.smith | set | messages: + msg99914 |
| 2010年02月23日 13:45:41 | ezio.melotti | set | files: + issue7649v3.diff messages: + msg99912 |
| 2010年02月23日 06:18:15 | ezio.melotti | set | messages: + msg99900 |
| 2010年02月22日 22:20:56 | eric.smith | set | messages: + msg99856 |
| 2010年02月22日 19:00:08 | ezio.melotti | set | keywords: + needs review files: + issue7649v2.diff messages: + msg99811 |
| 2010年01月21日 11:43:48 | vstinner | set | messages: + msg98108 |
| 2010年01月07日 13:35:56 | ezio.melotti | set | files: + issue7649.diff messages: + msg97354 |
| 2010年01月07日 13:02:37 | flox | set | nosy: + flox messages: + msg97353 |
| 2010年01月07日 11:55:58 | eric.smith | set | messages: + msg97350 |
| 2010年01月07日 11:49:16 | ezio.melotti | set | messages: + msg97349 |
| 2010年01月07日 02:50:47 | eric.smith | set | priority: high -> normal keywords: + easy stage: test needed -> patch review |
| 2010年01月07日 02:48:37 | eric.smith | set | messages: + msg97340 |
| 2010年01月07日 02:01:52 | vstinner | set | messages: + msg97339 |
| 2010年01月07日 01:55:45 | vstinner | set | files: + unicode_formatchar.patch nosy: + vstinner messages: + msg97338 keywords: + patch |
| 2010年01月07日 01:46:04 | eric.smith | set | nosy: + doerwalter |
| 2010年01月07日 01:41:57 | eric.smith | set | messages: + msg97336 |
| 2010年01月07日 01:41:23 | eric.smith | set | nosy: + eric.smith |
| 2010年01月07日 01:07:59 | ezio.melotti | create | |