Message 104894 - Python tracker

Author	loewis
Recipients	Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, pitrou, vstinner
Date	2010-05-03.22:11:20
Message-id	<4BDF4A06.1020603@v.loewis.de>
In-reply-to	<4BDF47DB.8020105@egenix.com>

Content

> Your name will end up being partially escaped as surrogate:
>
> 'L\udcf6wis'
>
> Further processing will fail

That depends on the further processing, no?

> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in range(256)

Where did you get this error from?

> It doesn't work if an application tries to work *with* the data,
> e.g. tries to convert it

Converting it to what?

> parse it

Parsing will work fine.

> decode it

It's a string. You shouldn't decode it.

> The reason is
> that information included by the use of the 'surrogateescape'
> error handler is lost along the way and this then causes data
> corruption.

And how would that not happen if it was bytes? The problems you describe
were one of the primary motivations to switch to Unicode: it's *byte*
strings that have these problems.
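
For readers following the exchange, here is a minimal sketch of the behaviour under discussion. It assumes the original name arrived as the Latin-1 bytes b"L\xf6wis" and was decoded as UTF-8 with the 'surrogateescape' handler; that input and the variable names are illustrative assumptions, not taken from the messages above.

```python
# Assumed input: "Löwis" encoded as Latin-1, which is not valid UTF-8.
raw = b"L\xf6wis"

# Decoding as UTF-8 with surrogateescape maps the undecodable byte 0xF6
# to the lone surrogate U+DCF6 instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")

# Ordinary string operations still work on the escaped string.
parts = name.split("w")

# Encoding without the handler fails, as in the quoted traceback.
try:
    name.encode("latin-1")
except UnicodeEncodeError as exc:
    latin1_error = exc
else:
    latin1_error = None

# Encoding with the same codec and handler restores the original bytes.
roundtrip = name.encode("utf-8", errors="surrogateescape")
```

Whether "further processing" fails therefore depends on the operation: splitting and round-tripping succeed, while encoding to a codec without the handler raises.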

History
Date	User	Action	Args
2010-05-03 22:11:23	loewis	set	recipients: + loewis, lemburg, gregory.p.smith, pitrou, vstinner, ezio.melotti, Arfrever
2010-05-03 22:11:20	loewis	link	issue8603 messages
2010-05-03 22:11:20	loewis	create
