Message121563
| Author |
vstinner |
| Recipients |
amaury.forgeotdarc, belopolsky, ezio.melotti, vstinner |
| Date |
2010-11-19.20:06:13 |
| Message-id |
<201011192106.06983.victor.stinner@haypocalc.com> |
| In-reply-to |
<1290195773.07.0.831312765946.issue9769@psf.upfronthosting.co.za> |
| Content |
On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I don't understand Victor's argument in msg115889. According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
>
> - US-ASCII values do not appear otherwise in a UTF-8 encoded
> character stream. This provides compatibility with file systems
> or other software (e.g. the printf() function in C libraries) that
> parse based on US-ASCII values but are transparent to other
> values.
Most C functions, including printf(), work on multi*byte* strings, not on
(wide) character strings, whereas PyUnicode_FromFormatV() converts the format
string (bytes) to unicode (characters). If you would like a comparison in C,
it's like printf() + mbstowcs() in the same function.
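
As a rough illustration of that comparison, here is a minimal sketch of a
single function that first formats bytes and then decodes them into wide
characters. The name format_to_wide(), the fixed 256-byte buffer and the error
convention are assumptions made for the example; this is not how
PyUnicode_FromFormatV() is implemented.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: format into bytes, then convert the bytes to wide
 * characters using the current locale. Returns the number of wide
 * characters written (excluding the terminator), or (size_t)-1 on error. */
static size_t format_to_wide(wchar_t *out, size_t out_len,
                             const char *format, ...)
{
    char bytes[256];   /* assumed upper bound, for illustration only */
    va_list args;
    int n;
    size_t converted;

    if (out_len == 0)
        return (size_t)-1;

    /* Step 1: printf-style formatting, bytes in and bytes out. */
    va_start(args, format);
    n = vsnprintf(bytes, sizeof(bytes), format, args);
    va_end(args);
    if (n < 0 || (size_t)n >= sizeof(bytes))
        return (size_t)-1;

    /* Step 2: decode the multibyte result into wide characters. */
    converted = mbstowcs(out, bytes, out_len);
    if (converted == (size_t)-1 || converted >= out_len)
        return (size_t)-1;   /* invalid sequence or output truncated */
    return converted;
}
```

The second step is where the encoding of the format string starts to matter:
the bytes have to be interpreted as characters, which a bytes-only printf()
never needs to do.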
> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding.
That may be true with bytes input and bytes output (e.g. PyString_FromFormatV()
in Python 2), but it is no longer true with bytes input and str output (e.g.
PyUnicode_FromFormatV() in Python 3).
> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.
I chose to use ASCII instead of UTF-8 because a UTF-8 decoder is long (210
lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII
decoder is just "unicode_char = (Py_UNICODE)byte;" plus an if beforehand to
check that 0 <= byte <= 127.
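
To show how small such an ASCII decoder is, here is a minimal sketch. The
function name decode_ascii() and the use of uint32_t in place of Py_UNICODE
are assumptions made so that the example compiles without the Python headers;
it is not the actual CPython code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode len bytes of ASCII into code points.
 * Returns 0 on success, -1 if a byte falls outside the range 0..127. */
static int decode_ascii(const unsigned char *bytes, size_t len, uint32_t *out)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (bytes[i] > 127)
            return -1;   /* non-ASCII byte: reject instead of guessing */
        out[i] = (uint32_t)bytes[i];   /* code point equals the byte value */
    }
    return 0;
}
```

The caller is assumed to provide an output array with room for at least len
code points.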
Nobody noticed my change simply because the whole Python code base only passes
ASCII strings as the format argument of PyUnicode_FromFormatV().
Victor |