Message100021

Author:                   lemburg
Recipients:               amaury.forgeotdarc, doerwalter, eric.smith, ezio.melotti, flox, lemburg, vstinner
Date:                     2010-02-24 10:02:41
SpamBayes Score:          1.7136292e-13
Marked as misclassified:  No
Message-id:               <4B84F940.40800@egenix.com>
In-reply-to:              <1267004642.18.0.456785037879.issue7649@psf.upfronthosting.co.za>

Content:
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
>> Could you please check for chars above 0x7f first and then use
>> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
>
> I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
> But why is it necessary to check for chars above 0x7f?
The Python default encoding has to be ASCII-compatible,
so it's better to use a short-cut for pure-ASCII strings
and avoid the complete round-trip via a temporary Unicode
object.
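A minimal Python-level sketch of that short-cut (the actual change would live in C inside CPython's Unicode implementation; the helper name `char_to_unicode` is hypothetical):

```python
import sys

def char_to_unicode(raw: bytes) -> str:
    """Hypothetical helper mirroring the proposed C-level fast path."""
    # Short-cut: the default encoding must be ASCII-compatible, so
    # pure-ASCII input can skip the full codec round-trip.
    if all(byte < 0x80 for byte in raw):
        return raw.decode("ascii")
    # Non-ASCII input goes through the default encoding, analogous to
    # PyUnicode_Decode(s, size, NULL, NULL) in the C API.
    return raw.decode(sys.getdefaultencoding())
```

For ASCII input the fast path and the codec path produce the same result; the point is only to avoid the decoder machinery in the common case.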
>> (this API should not have been backported from Python 3.x
>> to Python 2.6,
> This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
The function is really just another interface to the
PyUnicode_DecodeUTF8() API, and its name is misleading in that:
Python 2.x uses the default encoding for converting strings without
a known encoding to Unicode, the docs for the API say that
it decodes Latin-1 (!), and the interface makes it look like
a drop-in replacement for PyString_FromStringAndSize(), which
it isn't for Python 2.x.
For Python 3.x, the default encoding is fixed to UTF-8, so the
situation is different (though the docs are still wrong);
however, I don't see the advantage of using a less explicit
name over the direct use of PyUnicode_DecodeUTF8().
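The 3.x equivalence can be probed from Python itself via ctypes; this is a sketch that assumes a CPython interpreter exposing both functions through `ctypes.pythonapi`:

```python
import ctypes

# PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
from_string = ctypes.pythonapi.PyUnicode_FromStringAndSize
from_string.restype = ctypes.py_object
from_string.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]

# PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
decode_utf8 = ctypes.pythonapi.PyUnicode_DecodeUTF8
decode_utf8.restype = ctypes.py_object
decode_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t, ctypes.c_char_p]

data = "h\xe9llo".encode("utf-8")
# In Python 3 both calls decode the bytes as UTF-8 and agree:
assert from_string(data, len(data)) == decode_utf8(data, len(data), b"strict")
```

Since both spellings do the same work in 3.x, the only difference left is explicitness, which is the argument for preferring the PyUnicode_DecodeUTF8() name.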