Issue 13916: disallow the "surrogatepass" handler for non utf-* encodings

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58124

classification

Title:	disallow the "surrogatepass" handler for non utf-* encodings
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 3.5

process

Dependencies:	Superseder:
Status:	open	Resolution:
Assigned To:	vstinner	Nosy List:	ezio.melotti, kennyluck, loewis, python-dev, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-01-31 23:51 by kennyluck, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
surrogatepass_non_utf.patch	serhiy.storchaka, 2014-05-15 10:40	review
surrogatepass_cp_utf8.patch	serhiy.storchaka, 2014-05-15 15:11	review
surrogatepass_cp65001.patch	serhiy.storchaka, 2014-05-16 12:01	review
cp_encoding_name.patch	vstinner, 2014-05-16 12:54	review

Messages (20)
msg152416 - (view)	Author: Kang-Hao (Kenny) Lu (kennyluck)	Date: 2012-01-31 23:51
Currently the "surrogatepass" handler always encodes the surrogates in UTF-8, so the behavior of, say, "\udc80".encode("latin-1", "surrogatepass").decode("latin-1") might be unexpected, and I don't even know what, say, "\udc80\udc80".encode("big5", "surrogatepass").decode("big5") would return. Although the documentation says "surrogatepass" is specific to utf-8, the current behavior is arguably not too harmful thanks to PyBytesObject's '\0' terminator (so that ((p[0] & 0xf0) == 0xe0 || (p[1] & 0xc0) == 0x80 || (p[2] & 0xc0) == 0x80) in PyCodec_SurrogatePassErrors would not crash). However, I suggest we have the system either 1) raise LookupError early, or 2) raise the original Unicode(Decode|Encode)Error as soon as PyCodec_SurrogatePassErrors is called. I prefer the former. Having this could shorten PyCodec_SurrogatePassErrors significantly in the patch I will shortly submit for issue #12892, as all the error conditions for utf-8, utf-16 and utf-32 are predictable* and almost all the conditionals could be removed.
(The * statement is arguable if someone initializes interp->codec_search_path before _PyCodecRegistry_Init and the utf-16/32 encoders are overwritten. I don't think we need to worry about this too much though. Or am I wrong here?)
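A minimal sketch of why the result looked unexpected. It does not call the Latin-1 codec with "surrogatepass" (that behavior depends on the Python version); instead it produces the UTF-8 bytes the handler was described as emitting and then reads them back as Latin-1, which is an assumption made only to illustrate the mismatch:

```python
# Encoding a lone surrogate with UTF-8 and "surrogatepass" yields the
# three-byte UTF-8-style form of the code point.
utf8_bytes = "\udc80".encode("utf-8", "surrogatepass")

# Reading those bytes back as Latin-1 gives three unrelated characters,
# not the original surrogate.
misread = utf8_bytes.decode("latin-1")

# Only a UTF-8 decode with the same handler restores the surrogate.
restored = utf8_bytes.decode("utf-8", "surrogatepass")

print(utf8_bytes, repr(misread), repr(restored))
```

The Latin-1 round trip therefore cannot return the original one-character string once the handler has substituted UTF-8 bytes.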
msg159520 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-04-28 12:19
I fail to see the problem. If the error handler does not produce meaningful results in some context, then just don't use it. The whole point of error handlers is that they handle errors; using them shouldn't ever cause errors/exceptions.
msg159525 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-04-28 14:36
The problem is that "surrogatepass" is specific to utf-8 and there is no standard way to decode lone surrogates in utf-16.
>>> "\udc80\udc80".encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 2-3: illegal encoding
With utf-32 this "works" only thanks to a bug -- utf-32 allows lone surrogates (issue #12892). If "surrogatepass" worked with utf-16 and utf-32, it would be natural to raise ValueError for other encodings.
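For comparison, the same round trip can be checked on an interpreter that includes the issue #12892 change, which a later message in this thread says made the handler work with the UTF-16* and UTF-32* codecs. The sketch below assumes such an interpreter (Python 3.4 or later is assumed here); the list of codec names is chosen only for illustration:

```python
text = "\udc80\udc80"  # two lone low surrogates, as in the session above

round_trips = {}
for name in ("utf-8", "utf-16", "utf-16-le", "utf-32", "utf-32-be"):
    # Encode and decode with the same handler and compare with the input.
    data = text.encode(name, "surrogatepass")
    round_trips[name] = data.decode(name, "surrogatepass") == text

print(round_trips)
```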
msg159528 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-04-28 17:09
I see. The proper reaction for a codec that can't handle a certain error then is to raise the original exception. I'm -1 on raising LookupError when trying to find the error handler - this would suggest that the error handler does not exist, which is not true. As for simplifying the implementation: it might be reasonable to special-case surrogatepass inside the individual codecs, rather than looking up the error handler. Then the error handler could just be identical to "strict", except that UTF-8, UTF-16, and UTF-32 individually special-case this error handler in their encoders and decoders.
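The idea of "identical to strict except for the UTF codecs" can be approximated at the Python level. This is only a sketch, not the C-level special-casing proposed above: the handler name `surrogatepass_utf_only` and the prefix check are hypothetical, and it simply delegates to the built-in handler obtained with `codecs.lookup_error`.

```python
import codecs

# The built-in handler, used only for encodings we decide to allow.
_builtin_surrogatepass = codecs.lookup_error("surrogatepass")
_UTF_PREFIXES = ("utf8", "utf16", "utf32")


def surrogatepass_utf_only(exc):
    """Hypothetical handler: pass surrogates for UTF-8/16/32, else fail."""
    encoding = getattr(exc, "encoding", "") or ""
    normalized = encoding.lower().replace("-", "").replace("_", "")
    if normalized.startswith(_UTF_PREFIXES):
        return _builtin_surrogatepass(exc)
    # Any other codec: re-raise the original error, as "strict" would.
    raise exc


codecs.register_error("surrogatepass_utf_only", surrogatepass_utf_only)

encoded = "\udc80".encode("utf-8", "surrogatepass_utf_only")
try:
    "\udc80".encode("latin-1", "surrogatepass_utf_only")
    latin1_rejected = False
except UnicodeEncodeError:
    latin1_rejected = True
print(encoded, latin1_rejected)
```

Re-raising the original exception keeps the error message tied to the failing codec and avoids suggesting that the handler itself does not exist, which was the objection to LookupError.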
msg218597 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-15 07:26
This issue was mainly resolved in issue12892. The surrogatepass error handler now works with UTF-16* and UTF-32* encodings. But for other encodings it behaves as for UTF-8 (preserving the old behavior). Should we change the behavior for non-UTF encodings and raise an exception (as for "strict")?
msg218600 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-15 09:47
Here is a patch which disallows the surrogatepass handler for non-utf encodings. Please test it on Windows.
msg218601 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-05-15 10:04
Serhiy Storchaka wrote:
> Here is a patch
I don't see your patch.
msg218602 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-15 10:40
Oh, sorry.
msg218603 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2014-05-15 11:22
LGTM
msg218605 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014-05-15 11:37
New changeset 5e98a50e0f55 by Serhiy Storchaka in branch 'default':
Issue #13916: Disallowed the surrogatepass error handler for non UTF-*
http://hg.python.org/cpython/rev/5e98a50e0f55
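A short way to observe the effect of this change from Python code, assuming an interpreter that includes the changeset (Python 3.5 or later is assumed). The helper name `surrogatepass_allowed` is hypothetical:

```python
def surrogatepass_allowed(encoding):
    """Return True if a lone surrogate can be encoded with "surrogatepass"."""
    try:
        "\udc80".encode(encoding, "surrogatepass")
    except UnicodeEncodeError:
        return False
    return True


for name in ("utf-8", "utf-16", "utf-32", "latin-1", "ascii"):
    print(name, surrogatepass_allowed(name))
```

With the change applied, the UTF-* codecs should report True, while Latin-1 and ASCII raise the same UnicodeEncodeError they would raise under "strict".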
msg218611 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-05-15 13:12
It makes sense to restrict surrogatepass to UTF-* encodings. The UTF-8, UTF-16 and UTF-32 encoders reject surrogate characters, but the UTF-7 encoder does not. Is it a bug? I'm asking because PyCodec_SurrogatePassErrors() doesn't support UTF-7. IMO your change is important enough to be mentioned in the What's New in Python 3.5 document, and maybe also in the documentation of the codecs module.
msg218612 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-05-15 13:47
Windows buildbots are unhappy.
http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/8355/steps/test/logs/stdio
======================================================================
ERROR: test_surrogatepass_handler (test.test_codecs.CP65001Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 883, in test_surrogatepass_handler
    self.assertEqual("abc\ud800def".encode("cp65001", "surrogatepass"),
UnicodeEncodeError: 'CP_UTF8' codec can't encode character '\ud800' in position 3: invalid character
======================================================================
FAIL: test_encode (test.test_codecs.CP65001Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 818, in test_encode
    encoded = text.encode('cp65001', errors)
UnicodeEncodeError: 'CP_UTF8' codec can't encode character '\udc80' in position 0: invalid character
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 821, in test_encode
    'errors=%r: %s' % (text, errors, err))
AssertionError: Unable to encode '\udc80' to cp65001 with errors='surrogatepass': 'CP_UTF8' codec can't encode character '\udc80' in position 0: invalid character
======================================================================
FAIL: test_cp1252 (test.test_codecs.CodePageTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 2849, in test_cp1252
    (b'[\x98]', 'surrogatepass', None),
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 2781, in check_decode
    codecs.code_page_decode, cp, raw, errors, True)
AssertionError: UnicodeDecodeError not raised by code_page_decode
msg218613 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-15 15:11
Here is a patch which adds support for cp65001 and fixes test_cp1252. Please test it on Windows Vista. Lone surrogates are not illegal in UTF-7 (see RFC 1642), so the error handler is not called and explicit support for UTF-7 is not needed. Could you please help with documenting this change in the What's New in Python 3.5 document? I don't think this change is worth a special mention in the codecs documentation; it is already documented that surrogatepass is supported only for utf-8, utf-16* and utf-32*.
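The UTF-7 point can be checked directly. This is a small sketch assuming CPython's built-in "utf-7" codec behaves as described in the message; no error handler is passed, so the default "strict" applies in both calls:

```python
lone = "\ud800"

# The UTF-7 encoder accepts a lone surrogate without consulting any handler.
utf7_bytes = lone.encode("utf-7")

# The UTF-8 encoder rejects the same input under the default "strict" handler.
try:
    lone.encode("utf-8")
    utf8_strict_fails = False
except UnicodeEncodeError:
    utf8_strict_fails = True

print(utf7_bytes, utf8_strict_fails)
```

Because the UTF-7 encoder never reports an error for this input, "surrogatepass" has nothing to handle there.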
msg218615 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-05-15 15:23
> Here is a patch, which adds support for cp65001
The name of the encoding is "cp65001", not something like "cp-utf8". And there is no alias like "cp_65001", there is only "cp65001".
msg218617 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-15 16:23
But the exception reports CP_UTF8.
msg218655 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-16 12:01
Here is a patch which tests the encoding name against "cp65001" instead of "CP_UTF8". I can't test on Windows and don't know which of the two patches is correct.
msg218658 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014-05-16 12:48
New changeset 8ee2b73cda7a by Victor Stinner in branch 'default':
Issue #13916: Fix surrogatepass error handler on Windows
http://hg.python.org/cpython/rev/8ee2b73cda7a
msg218659 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-05-16 12:54
> But an exception reports about CP_UTF8.
Oh, that's my fault! And it is a bug: "CP_UTF8" is the Windows constant, but it is not a valid Python codec name. The attached patch cp_encoding_name.patch fixes this issue. I don't think that Py_LOWER() is needed because the encoding name of Unicode errors from the code page codec is "cpXXX". It cannot be "CPXXX" unless you manually create a Unicode error exception. What do you think? Py_LOWER or not?
msg218740 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-05-18 10:51
I have no opinion.
msg225448 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014-08-17 15:19
Could you please finish this issue, Victor?

History
Date	User	Action	Args
2022-04-11 14:57:26	admin	set	github: 58124
2016-09-10 10:18:20	ncoghlan	unlink	issue17909 dependencies
2014-08-17 15:19:53	serhiy.storchaka	set	assignee: vstinner messages: + msg225448 stage: resolved ->
2014-05-18 10:51:04	serhiy.storchaka	set	assignee: serhiy.storchaka -> (no value) messages: + msg218740
2014-05-16 12:54:46	vstinner	set	files: + cp_encoding_name.patch messages: + msg218659
2014-05-16 12:48:58	python-dev	set	messages: + msg218658
2014-05-16 12:01:46	serhiy.storchaka	set	files: + surrogatepass_cp65001.patch messages: + msg218655
2014-05-15 16:23:15	serhiy.storchaka	set	messages: + msg218617 title: disallow the "surrogatepass" handler for non utf-* encodings -> disallow the "surrogatepass" handler for non utf-* encodings
2014-05-15 15:23:26	vstinner	set	messages: + msg218615
2014-05-15 15:11:49	serhiy.storchaka	set	files: + surrogatepass_cp_utf8.patch messages: + msg218613
2014-05-15 13:47:06	vstinner	set	status: closed -> open resolution: fixed -> messages: + msg218612
2014-05-15 13:12:01	vstinner	set	messages: + msg218611
2014-05-15 11:40:10	serhiy.storchaka	set	status: open -> closed assignee: serhiy.storchaka resolution: fixed stage: resolved
2014-05-15 11:37:54	python-dev	set	nosy: + python-dev messages: + msg218605
2014-05-15 11:22:18	loewis	set	messages: + msg218603
2014-05-15 10:40:09	serhiy.storchaka	set	files: + surrogatepass_non_utf.patch keywords: + patch messages: + msg218602
2014-05-15 10:04:06	vstinner	set	messages: + msg218601
2014-05-15 09:47:15	serhiy.storchaka	set	type: behavior -> enhancement messages: + msg218600 versions: + Python 3.5, - Python 3.1, Python 3.2, Python 3.3
2014-05-15 07:26:14	serhiy.storchaka	set	messages: + msg218597
2013-05-05 13:11:10	serhiy.storchaka	link	issue17909 dependencies
2012-04-28 17:09:37	loewis	set	messages: + msg159528
2012-04-28 14:36:53	serhiy.storchaka	set	messages: + msg159525
2012-04-28 12:19:02	loewis	set	nosy: + loewis messages: + msg159520
2012-04-27 21:48:59	serhiy.storchaka	set	nosy: + serhiy.storchaka
2012-01-31 23:54:17	vstinner	set	nosy: + vstinner
2012-01-31 23:51:10	kennyluck	create
