Issue 9769: PyUnicode_FromFormatV() doesn't handle non-ascii text correctly

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53978

classification

Title:	PyUnicode_FromFormatV() doesn't handle non-ascii text correctly
Type:	enhancement	Stage:
Components:	Interpreter Core, Unicode	Versions:	Python 3.5

process

Dependencies:	Superseder:
Status:	closed	Resolution:	out of date
Assigned To:	Nosy List:	amaury.forgeotdarc, belopolsky, ezio.melotti, reingart
Priority:	low	Keywords:	patch

Created on 2010-09-03 23:52 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
pyunicode_fromformat_ascii.patch	vstinner, 2010-09-08 18:17
pyunicode_fromformat_utf8.patch	reingart, 2012-10-28 21:22	PyUnicode_FromFormatV patch to use UTF-8

Messages (15)
msg115542	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-09-03 23:52
I'm trying to document the encoding of all bytes arguments of the C API: see #9738. I tried to understand which encoding is used by PyUnicode_FromFormat() (and PyErr_Format(), which calls PyUnicode_FromFormatV()). It looks like ISO-8859-1, see unicodeobject.c near line 1106:

    for (f = format; *f; f++) {
        if (*f == '%') {
            ...
        } else
            *s++ = *f; <~~~~ here
    }

... oh wait, it doesn't work for non-ascii text! Test in gdb:

    (gdb) print _PyObject_Dump(PyUnicodeUCS2_FromFormat("iso-8859-1:\xd0\xff"))
    object : 'iso-8859-1:\uffd0\uffff'
    type : str
    refcount: 1
    address : 0x83d5d80

b'\xd0\xff' is decoded as '\uffd0\uffff' :-( It's a bug.

--

PyUnicode_FromFormatV() should raise an error on a non-ascii format character, or decode it correctly as... ISO-8859-1 or something else. It's difficult to support multibyte encodings (like utf-8); ISO-8859-1 is fine.

If we choose to raise an error, how can the user format a non-ascii string? Using its_unicode_format.format(...arguments...) or its_unicode_format % arguments? Is it easy to call these methods in C?
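
The '\uffd0\uffff' result above is what sign extension would produce when a signed char is widened to a 16-bit unit. The following is a minimal sketch of that effect, not the CPython code: ucs2_t, widen_as_signed() and widen_as_latin1() are hypothetical names, and the first function forces the signed interpretation explicitly, since whether plain char is signed depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Py_UNICODE (UCS2 build). */
typedef uint16_t ucs2_t;

/* Widen each byte through signed char: bytes >= 0x80 become negative
   values and wrap around to 0xFF80..0xFFFF in the 16-bit result. */
static void widen_as_signed(const char *src, ucs2_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (ucs2_t)(signed char)src[i];
}

/* Widen each byte through unsigned char: the byte value is kept as the
   code point, which is exactly an ISO-8859-1 (Latin-1) decode. */
static void widen_as_latin1(const char *src, ucs2_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (ucs2_t)(unsigned char)src[i];
}
```

On the usual two's-complement platforms, passing the two bytes "\xd0\xff" to widen_as_signed() gives 0xFFD0 and 0xFFFF, matching the gdb output, while widen_as_latin1() gives 0x00D0 and 0x00FF.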
msg115609	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2010-09-04 19:26
2 remarks:
- PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
- Very recently (r84472, r84485), some C files of CPython source code were converted to utf-8. And most of the time, the format given to PyUnicode_FromFormat is a string literal. So it would make sense for PyUnicode_FromFormat to consider the format string as encoded in utf-8.

This is worth asking on python-dev though.
msg115820	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-09-07 23:21
> PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.

Really? I don't see how "*s++ = *f;" (where s is Py_UNICODE* and f is char*) can decode utf-8. It looks more like ISO-8859-1.

> Very recently (r84472, r84485), some C files of CPython source code
> were converted to utf-8

Python source code (C and Python) is written in ASCII, except maybe some headers or some tests written in Python with a #coding:xxx header (or without the header, but in utf-8, for Python3). I don't think that a C file calls PyErr_Format() or PyUnicode_FromFormat(V)() with a non-ascii format string.
msg115825	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2010-09-07 23:54
> > PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
> Really?

The format looks more like latin-1, right. But the payload of a "%s" item is decoded as utf-8.

> I don't think that a C file calls PyErr_Format() or
> PyUnicode_FromFormat(V)() with a non-ascii format string.

At the moment, it's true. My remark is that utf-8 tends to be applied to all kinds of files; if someone once decides that non-ascii chars are allowed in (some) string constants, they will be stored in utf-8.
msg115889	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-09-08 18:17
> My remark is that utf-8 tends to be applied to all kinds of files;
> if someone once decides that non-ascii chars are allowed in (some)
> string constants, they will be stored in utf-8.

In this case, it will be better to raise an error on a non-ascii byte (character) in the format string. It's better to raise an error than to interpret utf-8 as iso-8859-1 (mojibake!).

Since nobody noticed this bug (PyUnicode_FromFormat/PyErr_Format expect ISO-8859-1), I suppose that nobody uses non-ASCII format strings. Python builtin errors are not localised. If an application uses gettext, I suppose that the error will be raised in the Python code, not in the C API.

The attached patch changes PyUnicode_FromFormatV (and so PyUnicode_FromFormat and PyErr_Format) to reject any non-ascii byte (character) in the format string. I added a test and documented the format string encoding (which is now ASCII).

See also #9738 for the documentation about function argument encoding.
msg116045	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-09-10 21:37
@amaury: Do you agree to reject non-ascii bytes?

TODO: document format encoding in Doc/c-api/*.rst.
msg116046	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2010-09-10 21:52
Yes, let's be conservative and reject non-ascii bytes in the format string.
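
The conservative check agreed on here amounts to scanning the format string for any byte above 127. A minimal sketch of such a scan follows; it is not the code of the attached patch, and first_non_ascii() is a hypothetical helper that only locates the offending byte, leaving error reporting to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the first byte > 127 in a NUL-terminated format
   string, or -1 if the whole string is ASCII. */
static long first_non_ascii(const char *format)
{
    assert(format != NULL);
    const unsigned char *start = (const unsigned char *)format;
    for (const unsigned char *p = start; *p != '\0'; p++) {
        if (*p > 127)
            return (long)(p - start);
    }
    return -1;
}
```

A caller could treat any result other than -1 as a reason to refuse the format string instead of widening the byte blindly.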
msg116071	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-09-11 00:55
Fixed by r84704 in Python 3.2.
msg121561	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-19 19:42
I don't understand Victor's argument in msg115889. According to the UTF-8 RFC, <http://www.ietf.org/rfc/rfc2279.txt>:

  - US-ASCII values do not appear otherwise in a UTF-8 encoded
    character stream. This provides compatibility with file systems
    or other software (e.g. the printf() function in C libraries) that
    parse based on US-ASCII values but are transparent to other
    values.

This means that printf-like formatters should not care whether the format string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. (Passing in a multibyte encoding pretending to be bytes would of course lead to havoc, but the C type system will protect you from that.)

It is also fairly simple to sanity-check for UTF-8 if necessary, but in the case of PyUnicode_FromFormat, the resulting string will be decoded as UTF-8, so all characters in the format string will be checked anyway.

Am I missing something?
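
The RFC property quoted above can be illustrated with a byte-level scan. In UTF-8, every byte of a multibyte sequence is 0x80 or above, so a scanner that looks for the ASCII byte '%' cannot be fooled by the inside of a non-ASCII character. The sketch below only covers that parsing half of the argument, not the widening of the remaining bytes; count_directives() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Count printf-style directives in a format string by looking only at
   ASCII '%' bytes. "%%" is treated as a literal percent sign. All other
   bytes, including UTF-8 multibyte sequences, are skipped untouched. */
static size_t count_directives(const char *format)
{
    size_t count = 0;
    assert(format != NULL);
    for (const char *p = format; *p != '\0'; p++) {
        if (*p != '%')
            continue;
        if (p[1] == '%') {
            p++;            /* skip the second '%' of an escaped pair */
            continue;
        }
        count++;
    }
    return count;
}
```

The same function gives the same answer whether the surrounding text is ASCII, Latin-1 or UTF-8, which is the transparency the RFC describes.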
msg121563	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-11-19 20:06
On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I don't understand Victor's argument in msg115889. According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
>
> - US-ASCII values do not appear otherwise in a UTF-8 encoded
> character stream. This provides compatibility with file systems
> or other software (e.g. the printf() function in C libraries) that
> parse based on US-ASCII values but are transparent to other
> values.

Most C functions, including printf, work on multibyte strings, not on (wide) character strings. Whereas PyUnicode_FromFormatV() converts the format string (bytes) to unicode (characters). If you would like a comparison in C, it's like printf()+mbstowcs() in the same function.

> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding.

It may be true with bytes input and bytes output (e.g. PyString_FromFormatV() of Python2), but it's no longer true with bytes input and str output (e.g. PyUnicode_FromFormatV() of Python3).

> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.

I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210 lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder is just: "unicode_char = (Py_UNICODE)byte;" + an if before it to check that 0 <= byte <= 127.

Nobody noticed my change just because the whole Python code base only uses ASCII for the format argument of PyUnicode_FromFormatV().

Victor
msg121568	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-19 20:58
On Fri, Nov 19, 2010 at 3:06 PM, STINNER Victor <report@bugs.python.org> wrote:
> .. Whereas PyUnicode_FromFormatV() converts the format string
> (bytes) to unicode (characters). If you would like a comparison in C, it's
> like printf()+mbstowcs() in the same function.
>

I see. So it is really the

    else
        *s++ = *f;

that surreptitiously widens the characters.

..
> I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210
> lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas ASCII decode
> is just: "unicode_char = (Py_UNICODE)byte;" + an if before to check that 0 <=
> byte <= 127).

I don't think we need 210 lines to replace "*s++ = *f" with proper UTF-8 logic. Even if we do, the code can be shared with PyUnicode_DecodeUTF8, and a UTF-8 iterator may be a welcome addition to the Python C API.
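
To give a rough idea of how much code "proper UTF-8 logic" needs, here is a minimal sketch of a strict decoder for a single code point. It is an illustration only, not taken from CPython or from any patch on this issue: utf8_decode_one() is a hypothetical name, and it does none of the error handling or stateful decoding that PyUnicode_DecodeUTF8Stateful() provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s, where n bytes are available.
   Returns the number of bytes consumed, or 0 if the input is empty,
   truncated or not valid UTF-8 (bad lead or continuation byte,
   overlong form, surrogate, or value above U+10FFFF). */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* ASCII: one byte */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2 bytes */
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3 bytes */
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4 bytes */
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;                           /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a 10xxxxxx continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                           /* overlong, out of range, or surrogate */
    *out = cp;
    return len;
}
```

A format-string loop could call such a function wherever it currently copies a single byte, advancing by the returned length and reporting an error when it returns 0.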
msg121582	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-11-20 00:15
On Friday 19 November 2010 21:58:25 you wrote:
> > I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long
> > (210 lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas
> > ASCII decode is just: "unicode_char = (Py_UNICODE)byte;" + an if before
> > to check that 0 <= byte <= 127).
>
> I don't think we need 210 lines to replace "*s++ = *f" with proper
> UTF-8 logic. Even if we do, the code can be shared with
> PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
> Python C API.

Why should we do that? ASCII format is just fine. Remember that PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would use a non-ASCII format in C. If someone does that, (s)he should open a new issue for that :-) But I don't think that we should make the code more complex if it's just useless.

Victor
msg121693	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-20 17:38
On Fri, Nov 19, 2010 at 7:15 PM, STINNER Victor <report@bugs.python.org> wrote:
..
>
> Why should we do that? ASCII format is just fine. Remember that
> PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would
> use non-ASCII format in C.

Why not? The gettext manual is full of examples with internationalized format strings.

> If someone does that, (s)he should open a new issue
> for that :-)

Why a new issue? The title of this issue fits perfectly, and IMO it is hard to argue that to "handle non-ascii text correctly" means to raise an error when non-ascii text is encountered.
msg129838	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-03-02 00:25
I still consider that ASCII format strings should be enough for everyone.

> > If someone does that, (s)he should open a new issue for that :-)
>
> Why new issue?

Ok, so I just remove myself from the nosy list.
msg174080	Author: Mariano Reingart (reingart)	Date: 2012-10-28 21:22
(moved from issue #16343)

Working on an internationalization proposal <http://python.org.ar/pyar/TracebackInternationalizationProposal> (issue #16344), I've stopped at this problem (#9769), where multibyte encodings (like utf-8) are not supported by PyUnicode_FromFormatV().

Besides my proposal, I think utf-8 should be supported for consistency with the other unicode functions, like PyUnicode_FromString() or even unicode_fromformat_arg().

Attached is a patch that:
- enhances the iterator to detect multibyte sequences, with sanity checks on start & continuation bytes
- replaces unicode_write_cstr with PyUnicode_DecodeUTF8Stateful
- adds tests

Hope it helps; this is my first patch for cpython and my C skills are a bit rusty, so excuse me if there is any newbie glitch.

History
Date	User	Action	Args
2022-04-11 14:57:06	admin	set	github: 53978
2015-10-02 21:12:14	vstinner	set	status: open -> closed resolution: out of date
2014-06-29 23:59:58	belopolsky	set	assignee: belopolsky -> versions: + Python 3.5, - Python 3.4
2012-10-28 21:22:43	reingart	set	files: + pyunicode_fromformat_utf8.patch nosy: + reingart messages: + msg174080
2012-10-28 20:18:19	chris.jerdonek	set	versions: + Python 3.4, - Python 3.3
2012-10-28 20:14:48	chris.jerdonek	link	issue16343 superseder
2011-03-02 01:08:58	vstinner	set	nosy: - vstinner
2011-03-02 00:25:43	vstinner	set	nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti messages: + msg129838
2011-03-02 00:22:11	belopolsky	set	priority: normal -> low nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
2011-03-02 00:21:23	belopolsky	set	assignee: belopolsky type: enhancement resolution: fixed -> (no value) nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti versions: + Python 3.3, - Python 3.2
2010-11-20 17:38:45	belopolsky	set	messages: + msg121693
2010-11-20 00:15:55	vstinner	set	messages: + msg121582
2010-11-19 20:58:22	belopolsky	set	messages: + msg121568
2010-11-19 20:06:13	vstinner	set	messages: + msg121563
2010-11-19 19:55:32	ezio.melotti	set	nosy: + ezio.melotti
2010-11-19 19:42:51	belopolsky	set	status: closed -> open nosy: + belopolsky messages: + msg121561
2010-09-11 00:55:11	vstinner	set	status: open -> closed resolution: fixed messages: + msg116071
2010-09-10 21:52:12	amaury.forgeotdarc	set	messages: + msg116046
2010-09-10 21:37:58	vstinner	set	messages: + msg116045
2010-09-08 18:17:16	vstinner	set	files: + pyunicode_fromformat_ascii.patch keywords: + patch messages: + msg115889
2010-09-07 23:54:09	amaury.forgeotdarc	set	messages: + msg115825
2010-09-07 23:21:46	vstinner	set	messages: + msg115820
2010-09-04 19:26:13	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg115609
2010-09-03 23:52:59	vstinner	create
