This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python Developer's Guide.
Created on 2014-04-22 20:58 by deleted250130, last changed 2022-04-11 14:58 by admin.
| Messages (14) | |||
|---|---|---|---|
| msg217021 - (view) | Author: (deleted250130) | Date: 2014-04-22 20:58 | |
I have made some tests with encoding/decoding in conjunction with unicode-escape and got some strange results:
>>> print('ä')
ä
>>> print('ä'.encode('utf-8'))
b'\xc3\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape'))
Ã¤
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape'))
b'\\xc3\\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape').decode('utf-8'))
\xc3\xa4
Shouldn't .decode('unicode-escape').encode('unicode-escape') cancel itself out, so that 'ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape') returns the same result as 'ä'.encode('utf-8')?
|
|||
| msg217024 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-22 21:13 | |
No. x.encode('unicode-escape').decode('unicode-escape') should return the same result, and it does.
The bug, I think, is that bytes.decode('unicode-escape') is not objecting to the non-ascii characters. It appears to be treating them as latin1, and that strikes me as broken.
|
|||
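A minimal interpreter sketch (Python 3) of both points above: escaping and then unescaping is the identity, while the decoder passes non-ASCII bytes through as if they were Latin-1:
>>> 'ä'.encode('unicode-escape').decode('unicode-escape')   # escape, then unescape: round-trips
'ä'
>>> b'\xc3\xa4'.decode('unicode-escape')   # non-ASCII bytes are passed through as if Latin-1
'Ã¤'
>>> b'\xc3\xa4'.decode('utf-8')            # compare: the UTF-8 interpretation of the same bytes
'ä'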
| msg217033 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014-04-22 21:56 | |
unicode_escape codec is deprecated since Python 3.3. Please use UTF-8 or something else. |
|||
| msg217055 - (view) | Author: (deleted250130) | Date: 2014-04-23 06:42 | |
The documentation says that unicode_internal is deprecated since Python 3.3, but not unicode_escape. Also, isn't unicode_escape different from UTF-8? My original intention was to convert two-character escape sequences in a string into their control characters. For example, the file test.txt contains the 17-byte UTF-8 raw content "---a---\n---ä---". Now I want to convert '\\n' to '\n':
>>> file = open('test.txt', 'r')
>>> content = file.read()
>>> file.close()
>>> content = content.encode('utf-8').decode('unicode-escape')
>>> print(content)
---a---
---Ã¤---
I'm now successfully getting 2 lines, but I noticed that I'm not getting the ä anymore. After that I took a deeper look and opened this ticket.
If unicode_escape really does get deprecated, maybe I could simply replace the escape sequences for characters 0-31 and 127 myself to achieve practically the same behavior.
|
|||
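A short sketch of the same pipeline, with the 17-byte content of test.txt written as an in-memory bytes literal (an assumption, since the file itself is not attached): the '\\n' gets expanded to a newline, but the ä comes out mangled:
>>> raw = b'---a---\\n---\xc3\xa4---'   # the 17-byte UTF-8 raw content described above
>>> content = raw.decode('utf-8')        # what reading test.txt as UTF-8 text returns
>>> print(content.encode('utf-8').decode('unicode-escape'))
---a---
---Ã¤---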
| msg217094 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-23 22:07 | |
Using unicode_escape to decode non-ASCII data is simply wrong. It can't work. |
|||
| msg217095 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-23 22:17 | |
To understand why, understand that a byte string has no inherent encoding. So when you call b'utf8string'.decode('unicode_escape'), Python has no way to know how to interpret the non-ASCII bytes in that bytestring. If you want the unicode_escape representation of something, you want to do 'string'.encode('unicode_escape'). If you then want that as a Python string, you can do:
'mystring'.encode('unicode_escape').decode('ascii')
In theory there ought to be a way to use the codecs module to go directly from unicode string to unicode-escaped string, but I don't know how to do it, since the proposal for the 'transform' method was rejected :)
Just to bend your brain a bit further, note that this does work:
>>> codecs.decode(codecs.encode('ä', 'unicode-escape').decode('ascii'), 'unicode-escape')
'ä'
|
|||
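A small sketch of the pattern msg217095 describes, assuming Python 3: the escaped representation as bytes, the same as a str, and the way back:
>>> 'ä'.encode('unicode_escape')                   # the escaped representation, as bytes
b'\\xe4'
>>> 'ä'.encode('unicode_escape').decode('ascii')   # the same, as a Python str
'\\xe4'
>>> b'\\xe4'.decode('unicode_escape')              # and back to the original character
'ä'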
| msg217096 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-23 22:19 | |
Also, I'm not sure what this should do, but what it does do doesn't look right:
>>> codecs.decode('ä', 'unicode-escape')
'Ã¤'
|
|||
| msg218519 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-05-14 11:00 | |
Sworddragon, try to use content.encode('ascii', 'backslashreplace').decode('unicode-escape').
It is too late to change the unicode-escape encoding.
|
|||
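A brief sketch of this suggestion applied to the string from msg217055 (the content value is assumed from that message): the escape sequence is expanded and, unlike before, the ä survives:
>>> content = '---a---\\n---ä---'   # the string read from test.txt in msg217055
>>> print(content.encode('ascii', 'backslashreplace').decode('unicode-escape'))
---a---
---ä---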
| msg221191 - (view) | Author: (deleted250130) | Date: 2014-06-21 19:22 | |
> It is too late to change the unicode-escape encoding.
So it will stay at ISO-8859-1? If yes, I think this ticket can be closed as wont fix. |
|||
| msg221198 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2014-06-21 20:32 | |
I disagree. The current decoder implementation is clearly incorrect: the unicode-escape encoding only uses bytes < 128, so decoding non-ASCII bytes should fail, and the examples in msg217021 should all give UnicodeDecodeErrors. As this is an incompatible change, we need to deprecate the current behavior in 3.5 and change it in 3.6. |
|||
| msg221204 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2014-06-21 21:21 | |
The unicode-escape codec was used in Python 2 to convert Unicode literals in source code to Unicode objects. Before PEP 263, Unicode literals in source code were interpreted as Latin-1. See http://legacy.python.org/dev/peps/pep-0263/ for details. The implementation is correct, but doesn't necessarily match today's realities anymore. |
|||
| msg221308 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2014-06-22 20:35 | |
As you say, the unicode-escape codec is tied to the Python language definition. So if the language changes, the codec needs to change as well. A Unicode literal in source code might be using any encoding, so to be on the safe side, restricting it to ASCII is meaningful. Or else, if we want to use the default source encoding (as it did in 2.x), we should assume UTF-8 (per PEP 3120). Using ISO-8859-1 is clearly wrong for 3.x. |
|||
| msg221442 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-06-24 09:44 | |
Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility. |
|||
| msg221447 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2014-06-24 10:08 | |
On 24.06.2014 11:44, Serhiy Storchaka wrote:
> Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility.
Indeed. unicode-escape was also designed to be able to read back raw-unicode-escape encoded data, so changing the decoder to not accept Latin-1 code points would break that as well. It may be better to simply create a new codec that rejects non-ASCII encoded bytes when decoding, and perhaps call that 'unicode-repr'. |
|||
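A short sketch of that readback property, assuming Python 3: raw-unicode-escape leaves Latin-1-range characters as raw bytes, so unicode-escape can only read its output back because the decoder accepts those bytes:
>>> data = 'ä€'.encode('raw_unicode_escape')   # Latin-1 range stays raw, higher code points become \uXXXX escapes
>>> data
b'\xe4\\u20ac'
>>> data.decode('unicode_escape')              # readable only because non-ASCII bytes are accepted as Latin-1
'ä€'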
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:02 | admin | set | github: 65530 |
| 2014-06-24 10:08:35 | lemburg | set | messages: + msg221447 |
| 2014-06-24 09:44:53 | serhiy.storchaka | set | messages: + msg221442 |
| 2014-06-22 20:35:08 | loewis | set | messages: + msg221308 |
| 2014-06-21 21:21:50 | lemburg | set | messages: + msg221204 |
| 2014-06-21 20:32:23 | loewis | set | nosy: + loewis; messages: + msg221198 |
| 2014-06-21 19:22:01 | deleted250130 | set | status: pending -> open; messages: + msg221191 |
| 2014-05-25 08:02:34 | serhiy.storchaka | set | status: open -> pending |
| 2014-05-14 11:00:50 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg218519 |
| 2014-04-23 22:19:16 | r.david.murray | set | messages: + msg217096 |
| 2014-04-23 22:17:11 | r.david.murray | set | messages: + msg217095 |
| 2014-04-23 22:07:41 | r.david.murray | set | messages: + msg217094 |
| 2014-04-23 06:42:48 | deleted250130 | set | messages: + msg217055 |
| 2014-04-22 21:56:14 | vstinner | set | messages: + msg217033 |
| 2014-04-22 21:13:34 | r.david.murray | set | nosy: + ncoghlan, r.david.murray, lemburg; messages: + msg217024 |
| 2014-04-22 20:58:23 | deleted250130 | create | |