This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: PyUnicode_FromFormatV() doesn't handle non-ascii text correctly
Type: enhancement Stage:
Components: Interpreter Core, Unicode Versions: Python 3.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, belopolsky, ezio.melotti, reingart
Priority: low Keywords: patch

Created on 2010-09-03 23:52 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name                          Uploaded                      Description
pyunicode_fromformat_ascii.patch   vstinner, 2010-09-08 18:17
pyunicode_fromformat_utf8.patch    reingart, 2012-10-28 21:22    PyUnicode_FromFormatV patch to use UTF-8
Messages (15)
msg115542 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-03 23:52
I'm trying to document the encoding of all bytes arguments of the C API: see #9738. I tried to understand which encoding is used by PyUnicode_FromFormat*() (and PyErr_Format(), which calls PyUnicode_FromFormatV()). It looks like ISO-8859-1, see unicodeobject.c near line 1106:
    for (f = format; *f; f++) {
        if (*f == '%') {
            ...
        } else
            *s++ = *f;    <~~~~ here
    }
... oh wait, it doesn't work for non-ascii text! Test in gdb:
(gdb) print _PyObject_Dump(PyUnicodeUCS2_FromFormat("iso-8859-1:\xd0\xff"))
object : 'iso-8859-1:\uffd0\uffff'
type : str
refcount: 1
address : 0x83d5d80
b'\xd0\xff' is decoded as '\uffd0\uffff' :-( It's a bug.
--
PyUnicode_FromFormatV() should raise an error on non-ascii format character, or decode it correctly as... ISO-8859-1 or something else. It's difficult to support multi byte encodings (like utf-8), ISO-8859-1 is fine. If we choose to raise an error, how can the user format a non-ascii string? Using its_unicode_format.format(...arguments...) or its_unicode_format % arguments? Is it easy to call these methods in C?
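The '\uffd0' in the gdb output is consistent with sign extension: where plain char is signed, the byte 0xD0 is a negative value, and storing it into a 16-bit Py_UNICODE gives 0xFFD0 rather than 0x00D0. A minimal sketch of that effect outside CPython follows. The names `ucs2_char`, `widen_signed`, `widen_latin1` and `widen_string` are hypothetical and only stand in for the loop quoted above; this is not the CPython code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Py_UNICODE (UCS2 build). */
typedef uint16_t ucs2_char;

/* Mirrors "*s++ = *f" on a platform where plain char is signed.
   The cast through signed char makes the behaviour the same on
   platforms where plain char happens to be unsigned. */
static ucs2_char widen_signed(char byte)
{
    return (ucs2_char)(signed char)byte;
}

/* ISO-8859-1 decoding: convert to unsigned char before widening. */
static ucs2_char widen_latin1(char byte)
{
    return (ucs2_char)(unsigned char)byte;
}

/* Widens n bytes of src into dst using the given per-byte function. */
static void widen_string(const char *src, ucs2_char *dst, size_t n,
                         ucs2_char (*widen)(char))
{
    assert(src != NULL && dst != NULL && widen != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = widen(src[i]);
}
```

For the bytes "\xd0\xff", `widen_signed` produces 0xFFD0 0xFFFF (the values seen in gdb), while `widen_latin1` produces 0x00D0 0x00FF.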
msg115609 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-09-04 19:26
Two remarks:
- PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
- Very recently (r84472, r84485), some C files of CPython source code were converted to utf-8. And most of the time, the format given to PyUnicode_FromFormat is a string literal.
So it would make sense for PyUnicode_FromFormat to consider the format string as encoded in utf-8. This is worth asking on python-dev though.
msg115820 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-07 23:21
> PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
Really? I don't see how "*s++ = *f;" (where s is Py_UNICODE* and f is char*) can decode utf-8. It looks more like ISO-8859-1.
> Very recently (r84472, r84485), some C files of CPython source code
> were converted to utf-8
Python source code (C and Python) is written in ASCII except maybe some headers or some tests written in Python with #coding:xxx header (or without the header, but in utf-8, for Python3). I don't think that a C file calls PyErr_Format() or PyUnicode_FromFormat(V)() with a non-ascii format string.
msg115825 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-09-07 23:54
> > PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
> Really?
The *format* looks more like latin-1, right. But the payload of a "%s" item is decoded as utf-8.
> I don't think that a C file calls PyErr_Format() or
> PyUnicode_FromFormat(V)() with a non-ascii format string.
At the moment, that's true. My point is that utf-8 tends to be applied to all kinds of files; if someone one day decides that non-ascii chars are allowed in (some) string constants, they will be stored in utf-8.
msg115889 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-08 18:17
> My point is that utf-8 tends to be applied to all kinds of files;
> if someone one day decides that non-ascii chars are allowed in (some)
> string constants, they will be stored in utf-8.
In that case, it is better to raise an error on a non-ascii byte (character) in the format string than to interpret utf-8 as iso-8859-1 (mojibake!). Since nobody noticed this bug (PyUnicode_FromFormat/PyErr_Format expects ISO-8859-1), I suppose that nobody uses non-ASCII format strings: the format string is always ASCII.
Python builtin errors are not localised. If an application uses gettext, I suppose that the error will be raised in the Python code, not in the C API.
The attached patch changes PyUnicode_FromFormatV (and so PyUnicode_FromFormat and PyErr_Format) to reject non-ascii bytes (characters) in the format string. I added a test and documented the format string encoding (which is now ASCII). See also #9738 for the documentation about function argument encoding.
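The check described here can be pictured as a single pass over the format string before any formatting happens. The sketch below is an assumption about the general shape only, not the contents of the attached patch; `first_non_ascii` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset of the first byte above 127 in a NUL-terminated
   format string, or -1 if the whole string is ASCII. A caller could
   raise an error instead of formatting when the result is not -1. */
static ptrdiff_t first_non_ascii(const char *format)
{
    assert(format != NULL);
    for (const char *f = format; *f != '\0'; f++) {
        /* Compare as unsigned char so the test does not depend on
           whether plain char is signed. */
        if ((unsigned char)*f > 127)
            return f - format;
    }
    return -1;
}
```

Note that only the format string is checked; the arguments substituted for directives such as "%s" are handled separately.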
msg116045 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-10 21:37
@amaury: Do you agree to reject non-ascii bytes?
TODO: document format encoding in Doc/c-api/*.rst.
msg116046 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-09-10 21:52
Yes, let's be conservative and reject non-ascii bytes in the format string.
msg116071 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-11 00:55
Fixed by r84704 in Python 3.2.
msg121561 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-19 19:42
I don't understand Victor's argument in msg115889. According to UTF-8 RFC, <http://www.ietf.org/rfc/rfc2279.txt>:
 - US-ASCII values do not appear otherwise in a UTF-8 encoded
 character stream. This provides compatibility with file systems
 or other software (e.g. the printf() function in C libraries) that
 parse based on US-ASCII values but are transparent to other
 values.
This means that printf-like formatters should not care whether the format string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. (Passing in a multibyte encoding pretending to be bytes would of course lead to havoc, but the C type system will protect you from that.)
It is also fairly simple to sanity-check for UTF-8 if necessary, but in the case of PyUnicode_FromFormat, the resulting string will be decoded as UTF-8, so all characters in the format string will be checked anyway.
Am I missing something?
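The RFC property quoted above can be illustrated with a byte-oriented scanner: every byte of a UTF-8 multibyte sequence is 0x80 or above, so a search for '%' (0x25) cannot match inside one. The sketch below covers only the parsing side of the argument; it says nothing about how the literal bytes between directives get converted to characters. `count_directives` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Counts printf-style directives by looking at raw bytes only.
   "%%" is treated as a literal percent sign, not as a directive. */
static size_t count_directives(const char *format)
{
    assert(format != NULL);
    size_t n = 0;
    for (const char *f = format; *f != '\0'; f++) {
        if (*f != '%')
            continue;       /* includes all UTF-8 lead/continuation bytes */
        if (f[1] == '%') {
            f++;            /* skip the second '%' of "%%" */
            continue;
        }
        n++;
    }
    return n;
}
```

The count is the same whether the non-directive text is ASCII, Latin-1 or UTF-8, because none of those encodings reuse the byte 0x25 for anything but '%'.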
msg121563 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-11-19 20:06
On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I don't understand Victor's argument in msg115889. According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
> 
> - US-ASCII values do not appear otherwise in a UTF-8 encoded
> character stream. This provides compatibility with file systems
> or other software (e.g. the printf() function in C libraries) that
> parse based on US-ASCII values but are transparent to other
> values.
Most C functions, including printf, work on multi*byte* strings, not on (wide)
character strings, whereas PyUnicode_FromFormatV() converts the format string
(bytes) to unicode (characters). If you would like a comparison in C, it's
like printf()+mbstowcs() in the same function.
> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. 
That may be true with bytes input and bytes output (e.g. PyString_FromFormatV()
in Python 2), but it is no longer true with bytes input and str output (e.g.
PyUnicode_FromFormatV() in Python 3).
> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.
I chose to use ASCII instead of UTF-8 because a UTF-8 decoder is long (210
lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder
is just "unicode_char = (Py_UNICODE)byte;" plus an if beforehand to check that
0 <= byte <= 127.
Nobody noticed my change simply because the whole Python code base only passes
ASCII strings as the format argument of PyUnicode_FromFormatV().
Victor
msg121568 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-19 20:58
On Fri, Nov 19, 2010 at 3:06 PM, STINNER Victor <report@bugs.python.org> wrote:
> .. Whereas PyUnicode_FromFormatV() converts the format string
> (bytes) to unicode (characters). If you would like a comparison in C, it's
> like printf()+mbstowcs() in the same function.
>
I see. So it is really the
 else
 *s++ = *f;
that surreptitiously widens the characters.
..
> I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210
> lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas ASCII decode
> is just: "unicode_char = (Py_UNICODE)byte;" + an if before to check that 0 <=
> byte <= 127).
I don't think we need 210 lines to replace "*s++ = *f" with proper
UTF-8 logic. Even if we do, the code can be shared with
PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
Python C API.
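For a sense of scale, a strict decoder for a single UTF-8 code point fits in a few dozen lines. The sketch below is a hypothetical helper (`utf8_decode_one` is not a CPython function and is not taken from either attached patch); it rejects truncated input, stray continuation bytes, overlong forms, surrogates and values above U+10FFFF, but does no error recovery or stateful decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from s, reading at most n bytes.
   On success stores the code point in *cp and returns the number of
   bytes consumed (1 to 4). Returns 0 on invalid or truncated input. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    uint32_t c = s[0];
    size_t len;
    uint32_t min;   /* smallest code point allowed for this length */

    if (c < 0x80) {
        *cp = c;    /* ASCII: one byte, value unchanged */
        return 1;
    } else if ((c & 0xE0) == 0xC0) {
        len = 2; min = 0x80;    c &= 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
        len = 3; min = 0x800;   c &= 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
        len = 4; min = 0x10000; c &= 0x07;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }
    if (n < len)
        return 0;   /* sequence is cut short */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

A caller would loop over the literal parts of the format string, advancing by the returned length and treating 0 as an error.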
msg121582 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-11-20 00:15
On Friday 19 November 2010 21:58:25 you wrote:
> > I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long
> > (210 lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas
> > ASCII decode is just: "unicode_char = (Py_UNICODE)byte;" + an if before
> > to check that 0 <= byte <= 127).
> 
> I don't think we need 210 lines to replace "*s++ = *f" with proper
> UTF-8 logic. Even if we do, the code can be shared with
> PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
> Python C API.
Why should we do that? ASCII format is just fine. Remember that 
PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would 
use non-ASCII format in C. If someone does that, (s)he should open a new issue 
for that :-) But I don't think that we should make the code more complex if 
it's just useless.
Victor
msg121693 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-20 17:38
On Fri, Nov 19, 2010 at 7:15 PM, STINNER Victor <report@bugs.python.org> wrote:
..
>
> Why should we do that? ASCII format is just fine. Remember that
> PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would
> use non-ASCII format in C.
Why not? The gettext manual is full of examples with internationalized format strings.
> If someone does that, (s)he should open a new issue
> for that :-)
Why new issue? The title of this issue fits perfectly and IMO it is
hard to argue that to "handle non-ascii text correctly" means to raise
an error when non-ascii text is encountered.
msg129838 - Author: STINNER Victor (vstinner) (Python committer) Date: 2011-03-02 00:25
I still consider that ASCII format strings should be enough for everyone.
> > If someone does that, (s)he should open a new issue for that :-)
>
> Why new issue?
Ok, so I just remove myself from the nosy list.
msg174080 - Author: Mariano Reingart (reingart) Date: 2012-10-28 21:22
(moved from issue #16343)
While working on an internationalization proposal <http://python.org.ar/pyar/TracebackInternationalizationProposal> (issue #16344),
I got stuck on this problem (#9769): multibyte encodings (like utf-8) are not supported by PyUnicode_FromFormatV().
Apart from my proposal, I think utf-8 should be supported for consistency with the other unicode functions, like PyUnicode_FromString() or even unicode_fromformat_arg().
Attached is a patch that:
- enhances the iterator to detect multibyte sequences, with sanity checks on start and continuation bytes
- replaces unicode_write_cstr with PyUnicode_DecodeUTF8Stateful
- adds tests
Hope it helps. This is my first patch for cpython and my C skills are a bit rusty, so excuse me if there is any newbie glitch.
History
Date  User  Action  Args
2022-04-11 14:57:06  admin  set  github: 53978
2015-10-02 21:12:14  vstinner  set  status: open -> closed
    resolution: out of date
2014-06-29 23:59:58  belopolsky  set  assignee: belopolsky ->
    versions: + Python 3.5, - Python 3.4
2012-10-28 21:22:43  reingart  set  files: + pyunicode_fromformat_utf8.patch
    nosy: + reingart
    messages: + msg174080
2012-10-28 20:18:19  chris.jerdonek  set  versions: + Python 3.4, - Python 3.3
2012-10-28 20:14:48  chris.jerdonek  link  issue16343 superseder
2011-03-02 01:08:58  vstinner  set  nosy: - vstinner
2011-03-02 00:25:43  vstinner  set  nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
    messages: + msg129838
2011-03-02 00:22:11  belopolsky  set  priority: normal -> low
    nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
2011-03-02 00:21:23  belopolsky  set  assignee: belopolsky
    type: enhancement
    resolution: fixed -> (no value)
    nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
    versions: + Python 3.3, - Python 3.2
2010-11-20 17:38:45  belopolsky  set  messages: + msg121693
2010-11-20 00:15:55  vstinner  set  messages: + msg121582
2010-11-19 20:58:22  belopolsky  set  messages: + msg121568
2010-11-19 20:06:13  vstinner  set  messages: + msg121563
2010-11-19 19:55:32  ezio.melotti  set  nosy: + ezio.melotti
2010-11-19 19:42:51  belopolsky  set  status: closed -> open
    nosy: + belopolsky
    messages: + msg121561
2010-09-11 00:55:11  vstinner  set  status: open -> closed
    resolution: fixed
    messages: + msg116071
2010-09-10 21:52:12  amaury.forgeotdarc  set  messages: + msg116046
2010-09-10 21:37:58  vstinner  set  messages: + msg116045
2010-09-08 18:17:16  vstinner  set  files: + pyunicode_fromformat_ascii.patch
    keywords: + patch
    messages: + msg115889
2010-09-07 23:54:09  amaury.forgeotdarc  set  messages: + msg115825
2010-09-07 23:21:46  vstinner  set  messages: + msg115820
2010-09-04 19:26:13  amaury.forgeotdarc  set  nosy: + amaury.forgeotdarc
    messages: + msg115609
2010-09-03 23:52:59  vstinner  create
