This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Ill-formed surrogates not treated as errors during encoding/decoding
Type: behavior
Stage: test needed
Components: Unicode
Versions: Python 3.1

process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: Rhamphoryncus, benjamin.peterson, ezio.melotti, hippietrail, jwilk, lemburg, loewis, pitrou, python-dev
Priority: release blocker
Keywords: patch

Created on 2008-08-24 21:56 by Rhamphoryncus, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name | Uploaded | Description
surrogates.diff | loewis, 2009-05-02 09:48 |
Messages (17)
msg71889 - Author: Adam Olsen (Rhamphoryncus) Date: 2008-08-24 21:56
The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
UTF-32 should be treated as errors. Lone surrogates in UTF-16 should
probably be treated as errors too (but only during encoding/decoding;
unicode objects on UTF-16 builds should allow them to be created through
slicing).
http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40
Lone surrogate in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'
Surrogate pair in UTF-8:
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'
On a UTF-32 build, encoding a surrogate pair with UTF-16, then decoding
again will produce the proper non-surrogate scalar value. This has
security implications, although they rarely arise because characters
outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'
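The arithmetic behind that round trip can be sketched as follows. This is an illustration only: the function name combine_surrogates is hypothetical, and the formula is the standard UTF-16 pairing rule rather than anything taken from the CPython sources.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one scalar value."""
    # A well-formed pair is a high surrogate (D800..DBFF) followed by
    # a low surrogate (DC00..DFFF); anything else is ill-formed.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the example above: U+D801 U+DC00 -> U+10400
scalar = combine_surrogates(0xD801, 0xDC00)
print(hex(scalar))  # 0x10400
```

A lone surrogate has no partner to combine with, which is why it cannot denote a scalar value on its own.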
Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'
I have gotten a report of a user decoding bad data using
x.decode('utf-8', 'replace'), then getting an error from Gtk+ when the
ill-formed surrogates reached it.
Fixing this would cause issue 3297 to blow up loudly, rather than silently.
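For comparison, here is a minimal sketch of the behaviour requested in this report, assuming a current CPython 3 interpreter in which the strict utf-8 codec rejects surrogates. The helper is_rejected is a hypothetical name used only for this illustration.

```python
def is_rejected(func) -> bool:
    """Return True if calling func() raises a UnicodeError."""
    try:
        func()
    except UnicodeError:
        return True
    return False


# Decoding the CESU-8 style bytes for U+D801 with the strict utf-8 codec.
decode_rejected = is_rejected(lambda: b"\xed\xa0\x81".decode("utf-8"))

# Encoding a lone surrogate with the strict utf-8 codec.
encode_rejected = is_rejected(lambda: "\ud801".encode("utf-8"))

# With errors="replace", the bad bytes become replacement characters,
# so no surrogate code point reaches downstream consumers.
replaced = b"\xed\xa0\x81".decode("utf-8", "replace")
has_surrogate = any(0xD800 <= ord(c) <= 0xDFFF for c in replaced)

print(decode_rejected, encode_rejected, has_surrogate)
```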
msg86736 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-28 13:13
We could fix it for 3.1, and perhaps leave 2.7 unchanged if some people
rely on this (for whatever reason).
msg86817 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-04-29 16:54
While it's probably ok to fix the codecs, there's an issue which makes
this difficult at least for the utf-8 codec:
The marshal module uses utf-8 to write Unicode objects and these can and
need to be able to store the full range of supported UCS2/UCS4 code
points, including lone surrogates.
If the utf-8 codec were changed to raise an error for these, marshal
would no longer be able to write/read Unicode objects.
It is likely that other existing Python code (outside the std lib) also
relies on this ability.
Changing this would only be possible in 3.1.
The marshal module would then also have to be changed to use a different
encoding which does support encoding lone surrogates.
See issue 3297 for a discussion of UTF-8/16 vs. UCS2/4, the
implications, motivations, etc.
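The marshal requirement described above can be checked with a short round trip. This sketch assumes a current CPython 3 interpreter, where marshal still accepts strings containing lone surrogates even though the strict utf-8 codec does not.

```python
import marshal

lone = "\ud801"  # a lone high surrogate, as in the original report

# marshal must be able to write and read back any str object,
# including one that holds an unpaired surrogate.
payload = marshal.dumps(lone)
restored = marshal.loads(payload)

# The strict utf-8 codec, by contrast, refuses the same string.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(restored == lone, strict_ok)
```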
msg86824 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-29 20:39
I think we could preserve the marshal format with yet another error
handler - one that emits half surrogates into their intuitive form.
msg86839 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-04-30 08:26
On 2009-04-29 22:39, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> I think we could preserve the marshal format with yet another error
> handler - one that emits half surrogates into their intuitive form.
That's a good idea. We could have an error handler which then lets
the codec accept lone surrogates for utf-8 or just add a new codec
which does this and use that for marshal.
Still, this can only go into 3.1.
msg86873 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-01 09:13
Here is a patch that implements this proposed approach. It introduces a
"surrogates" error handler, useful only for the utf-8 codec.
If this is accepted, the implementation of PEP 383 can be simplified
significantly, essentially removing the need for a separate utf-8b codec
(as that could be done in the error handler, as for the other codecs).
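A rough sketch of how such a handler is used from Python code. One assumption to flag: the patch in this thread names the handler "surrogates", whereas the handler available in released CPython 3 versions is, to my knowledge, registered as "surrogatepass"; the example uses the latter so that it runs on a current interpreter.

```python
lone = "\ud801"

# With the handler, the utf-8 codec emits the surrogate in its
# "intuitive" three-byte form instead of raising an error.
encoded = lone.encode("utf-8", "surrogatepass")

# The same handler lets the decoder accept those bytes again.
decoded = encoded.decode("utf-8", "surrogatepass")

print(encoded)          # b'\xed\xa0\x81'
print(decoded == lone)  # True
```

Code that does not pass the handler explicitly keeps the strict behaviour, so only callers that opt in (such as marshal) can produce or consume lone surrogates.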
msg86874 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-01 09:21
rietveld: http://codereview.appspot.com/52081 
msg86896 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-01 19:48
Fixed indexing error.
msg86913 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-01 21:30
http://codereview.appspot.com/52081/diff/1/5
File Doc/library/codecs.rst (right):
http://codereview.appspot.com/52081/diff/1/5#newcode326
Line 326: In addition, the following error handlers are specific to only selected
"In addition, the following error handlers are specific to a single
codec." sounds better
http://codereview.appspot.com/52081/diff/1/5#newcode335
Line 335:
There should probably be a versionchanged directive indicating that
"surrogates" was added in 3.1.
http://codereview.appspot.com/52081/diff/1/6
File Lib/test/test_codecs.py (right):
http://codereview.appspot.com/52081/diff/1/6#newcode544
Line 544: def test_surrogates(self):
I think this should be split into 2 tests. "test_lone_surrogates" and
"test_surrogate_handler".
http://codereview.appspot.com/52081/diff/1/4
File Objects/unicodeobject.c (right):
http://codereview.appspot.com/52081/diff/1/4#newcode157
Line 157: static PyObject *unicode_encode_call_errorhandler(const char *errors,
These prototypes are longer than 80 chars in some places. I don't think the
arguments need to line up with the starting parenthesis.
http://codereview.appspot.com/52081/diff/1/4#newcode2393
Line 2393: s, size, &exc, i-1, i, &newpos);
"exc" is never Py_DECREFed.
http://codereview.appspot.com/52081/diff/1/4#newcode4110
Line 4110: if (!PyUnicode_Check(repunicode)) {
Is there a test of this case somewhere?
http://codereview.appspot.com/52081/diff/1/2
File Python/codecs.c (right):
http://codereview.appspot.com/52081/diff/1/2#newcode758
Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
I believe PyErr_GivenExceptionMatches is more appropriate here, but
given the rest of the file uses PyObject_IsInstance, it likely doesn't
matter.
http://codereview.appspot.com/52081/diff/1/2#newcode771
Line 771: return NULL;
This leaks "object".
http://codereview.appspot.com/52081 
msg86936 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-02 09:44
Reviewers: report_bugs.python.org, Benjamin,
Message:
Issues fixed in r72188.
http://codereview.appspot.com/52081/diff/1/5
File Doc/library/codecs.rst (right):
http://codereview.appspot.com/52081/diff/1/5#newcode326
Line 326: In addition, the following error handlers are specific to only selected
On 2009-05-01 21:25:44, Benjamin wrote:
> "In addition, the following error handlers are specific to a single codec."
> sounds better
Done.
http://codereview.appspot.com/52081/diff/1/5#newcode335
Line 335:
On 2009-05-01 21:25:44, Benjamin wrote:
> There should probably be a versionchanged directive indicating that
> "surrogates" was added in 3.1.
Done.
http://codereview.appspot.com/52081/diff/1/6
File Lib/test/test_codecs.py (right):
http://codereview.appspot.com/52081/diff/1/6#newcode544
Line 544: def test_surrogates(self):
On 2009-05-01 21:25:44, Benjamin wrote:
> I think this should be split into 2 tests. "test_lone_surrogates" and
> "test_surrogate_handler".
Done.
http://codereview.appspot.com/52081/diff/1/4
File Objects/unicodeobject.c (right):
http://codereview.appspot.com/52081/diff/1/4#newcode157
Line 157: static PyObject *unicode_encode_call_errorhandler(const char *errors,
On 2009-05-01 21:25:44, Benjamin wrote:
> These prototypes are longer than 80 chars in some places. I don't think the
> arguments need to line up with the starting parenthesis.
Done.
http://codereview.appspot.com/52081/diff/1/4#newcode2393
Line 2393: s, size, &exc, i-1, i, &newpos);
On 2009-05-01 21:25:44, Benjamin wrote:
> "exc" is never Py_DECREFed.
Done.
http://codereview.appspot.com/52081/diff/1/4#newcode4110
Line 4110: if (!PyUnicode_Check(repunicode)) {
On 2009-05-01 21:25:44, Benjamin wrote:
> Is there a test of this case somewhere?
No. This is temporary - for PEP 383, I will have to support error
handlers returning bytes here, also.
http://codereview.appspot.com/52081/diff/1/2
File Python/codecs.c (right):
http://codereview.appspot.com/52081/diff/1/2#newcode758
Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
On 2009-05-01 21:25:44, Benjamin wrote:
> I believe PyErr_GivenExceptionMatches is more appropriate here, but given the
> rest of the file uses PyObject_IsInstance, it likely doesn't matter.
No. The interface is that an exception instance must be passed;
GivenExceptionMatches would also allow for tuples and types.
http://codereview.appspot.com/52081/diff/1/2#newcode771
Line 771: return NULL;
On 2009-05-01 21:25:44, Benjamin wrote:
> This leaks "object".
Done.
Please review this at http://codereview.appspot.com/52081
Affected files:
 M Doc/library/codecs.rst
 M Lib/test/test_bytes.py
 M Lib/test/test_codecs.py
 M Lib/test/test_unicode.py
 M Lib/test/test_unicodedata.py
 M Objects/unicodeobject.c
 M Python/codecs.c
 M Python/marshal.c 
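The error-handler interface discussed in the review above (the handler receives an exception instance and checks its type) has a Python-level counterpart in codecs.register_error. The following is a sketch only: the handler name "example-hexsurrogates" and the function hex_surrogates are hypothetical and are not part of the patch.

```python
import codecs


def hex_surrogates(exc):
    """Replace unencodable characters with an ASCII <U+XXXX> marker."""
    # The handler is always called with an exception *instance*,
    # so an isinstance check is enough to tell encoding from decoding.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position to resume encoding at.
    return replacement, exc.end


# Hypothetical handler name, registered only for this illustration.
codecs.register_error("example-hexsurrogates", hex_surrogates)

result = "a\ud801b".encode("utf-8", "example-hexsurrogates")
print(result)  # b'a<U+D801>b'
```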
msg86954 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-02 15:32
I think the new patch looks fine.
msg86966 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-02 18:54
Something I overlooked is that PyCodec_SurrogateErrors isn't exposed in
any headers.
msg86967 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-02 18:57
Committed as r72208, blocked as r72209.
As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
msg86968 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-02 19:01
On 2009-05-02, Martin v. Löwis <report@bugs.python.org> wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> Committed as r72208, blocked as r72209.
>
> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
Why? All the other error handlers are exposed.
msg86970 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-02 19:11
>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
> 
> Why? All the other error handlers are exposed.
Sure - but what for? IMO, they all shouldn't be exposed.
msg86971 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-02 19:15
On 2009-05-02, Martin v. Löwis <report@bugs.python.org> wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
>>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
>>
>> Why? All the other error handlers are exposed.
>
> Sure - but what for? IMO, they all shouldn't be exposed.
The only reason I can think of is consistency, but I don't care that much.
msg275006 - Author: Roundup Robot (python-dev) (Python triager) Date: 2016-09-08 12:47
New changeset 2150eadb54c7 by Serhiy Storchaka in branch 'default':
Remove old typo.
https://hg.python.org/cpython/rev/2150eadb54c7 
History
Date | User | Action | Args
2022-04-11 14:56:38 | admin | set | github: 47922
2016-09-08 12:47:50 | python-dev | set | nosy: + python-dev; messages: + msg275006
2010-04-07 14:25:41 | ezio.melotti | set | nosy: lemburg, loewis, Rhamphoryncus, pitrou, benjamin.peterson, jwilk, ezio.melotti, hippietrail
2009-06-16 02:47:10 | hippietrail | set | nosy: + hippietrail
2009-05-02 19:15:09 | benjamin.peterson | set | messages: + msg86971
2009-05-02 19:11:02 | loewis | set | messages: + msg86970
2009-05-02 19:01:14 | benjamin.peterson | set | messages: + msg86968
2009-05-02 18:57:28 | loewis | set | status: open -> closed; resolution: accepted; messages: + msg86967
2009-05-02 18:54:29 | benjamin.peterson | set | messages: + msg86966
2009-05-02 15:32:13 | benjamin.peterson | set | assignee: benjamin.peterson -> loewis; messages: + msg86954
2009-05-02 09:48:08 | loewis | set | files: + surrogates.diff
2009-05-02 09:47:45 | loewis | set | files: - surrogates.diff
2009-05-02 09:44:06 | loewis | set | messages: + msg86936
2009-05-01 21:30:48 | benjamin.peterson | set | messages: + msg86913
2009-05-01 19:48:31 | loewis | set | files: + surrogates.diff; messages: + msg86896
2009-05-01 19:47:36 | loewis | set | files: - surrogates.diff
2009-05-01 09:21:49 | loewis | set | messages: + msg86874
2009-05-01 09:13:53 | loewis | set | files: + surrogates.diff; priority: high -> release blocker; assignee: benjamin.peterson; keywords: + patch; nosy: + benjamin.peterson; messages: + msg86873
2009-04-30 08:27:03 | lemburg | set | messages: + msg86839
2009-04-29 20:39:33 | loewis | set | messages: + msg86824
2009-04-29 16:54:26 | lemburg | set | messages: + msg86817
2009-04-28 17:20:22 | pitrou | set | nosy: + lemburg, loewis
2009-04-28 13:13:33 | pitrou | set | priority: high; versions: + Python 3.1; nosy: + pitrou; messages: + msg86736; stage: test needed
2009-04-25 15:05:34 | jwilk | set | nosy: + jwilk
2008-09-02 06:44:56 | ezio.melotti | set | nosy: + ezio.melotti
2008-08-24 21:57:15 | Rhamphoryncus | set | type: behavior; components: + Unicode
2008-08-24 21:56:51 | Rhamphoryncus | create |