Message100021

Author:                   lemburg
Recipients:               amaury.forgeotdarc, doerwalter, eric.smith, ezio.melotti, flox, lemburg, vstinner
Date:                     2010-02-24 10:02:41
SpamBayes Score:          1.7136292e-13
Marked as misclassified:  No
Message-id:               <4B84F940.40800@egenix.com>
In-reply-to:              <1267004642.18.0.456785037879.issue7649@psf.upfronthosting.co.za>

Content:
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
>> Could you please check for chars above 0x7f first and then use
>> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
>
> I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
> But why is it necessary to check for chars above 0x7f?
The Python default encoding has to be ASCII-compatible,
so it's better to use a short-cut for pure-ASCII strings
and avoid the complete round-trip via a temporary Unicode
object.
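A minimal Python-level sketch of that short-cut (the actual change would live in C inside CPython's Unicode implementation; the helper name `char_to_unicode` is hypothetical):

```python
import sys

def char_to_unicode(raw: bytes) -> str:
    """Hypothetical helper mirroring the proposed C-level fast path."""
    # Short-cut: the default encoding must be ASCII-compatible, so
    # pure-ASCII input can skip the full codec round-trip.
    if all(byte < 0x80 for byte in raw):
        return raw.decode("ascii")
    # Non-ASCII input goes through the default encoding, analogous to
    # PyUnicode_Decode(s, size, NULL, NULL) in the C API.
    return raw.decode(sys.getdefaultencoding())
```

For ASCII input the fast path and the codec path produce the same result; the point is only to avoid the decoder machinery in the common case.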
>> (this API should not have been backported from Python 3.x
>> to Python 2.6,
> This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
The function is really just another interface to the
PyUnicode_DecodeUTF8() API, and its name is misleading in that:
Python 2.x uses the default encoding for converting strings without
a known encoding to Unicode, the docs for the API say that
it decodes Latin-1 (!), and the interface makes it look like
a drop-in replacement for PyString_FromStringAndSize(), which
it isn't for Python 2.x.
For Python 3.x, the default encoding is fixed to UTF-8, so the
situation is different (though the docs are still wrong);
however, I don't see the advantage of using a less explicit
name over the direct use of PyUnicode_DecodeUTF8().
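The 3.x equivalence can be probed from Python itself via ctypes; this is a sketch that assumes a CPython interpreter exposing both functions through `ctypes.pythonapi`:

```python
import ctypes

# PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
from_string = ctypes.pythonapi.PyUnicode_FromStringAndSize
from_string.restype = ctypes.py_object
from_string.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]

# PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
decode_utf8 = ctypes.pythonapi.PyUnicode_DecodeUTF8
decode_utf8.restype = ctypes.py_object
decode_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t, ctypes.c_char_p]

data = "h\xe9llo".encode("utf-8")
# In Python 3 both calls decode the bytes as UTF-8 and agree:
assert from_string(data, len(data)) == decode_utf8(data, len(data), b"strict")
```

Since both spellings do the same work in 3.x, the only difference left is explicitness, which is the argument for preferring the PyUnicode_DecodeUTF8() name.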