This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python Developer's Guide.
Created on 2014-04-22 20:58 by deleted250130, last changed 2022-04-11 14:58 by admin.
| Messages (14) | |||
|---|---|---|---|
| msg217021 - (view) | Author: (deleted250130) | Date: 2014-04-22 20:58 | |
I have made some tests with encoding/decoding in conjunction with unicode-escape and got some strange results:
>>> print('ä')
ä
>>> print('ä'.encode('utf-8'))
b'\xc3\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape'))
Ã¤
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape'))
b'\\xc3\\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape').decode('utf-8'))
\xc3\xa4
Shouldn't .decode('unicode-escape').encode('unicode-escape') cancel itself out, so that 'ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape') returns the same result as 'ä'.encode('utf-8')?
|
|||
| msg217024 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-22 21:13 | |
No. x.encode('unicode-escape').decode('unicode-escape') should return the same result, and it does.
The bug, I think, is that bytes.decode('unicode-escape') is not objecting to the non-ascii characters. It appears to be treating them as latin1, and that strikes me as broken.
|
|||
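A minimal interpreter sketch (Python 3) of both points above: escaping and then unescaping is the identity, while the decoder passes non-ASCII bytes through as if they were Latin-1:
>>> 'ä'.encode('unicode-escape').decode('unicode-escape')   # escape, then unescape: round-trips
'ä'
>>> b'\xc3\xa4'.decode('unicode-escape')   # non-ASCII bytes are passed through as if Latin-1
'Ã¤'
>>> b'\xc3\xa4'.decode('utf-8')            # compare: the UTF-8 interpretation of the same bytes
'ä'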
| msg217033 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014-04-22 21:56 | |
unicode_escape codec is deprecated since Python 3.3. Please use UTF-8 or something else. |
|||
| msg217055 - (view) | Author: (deleted250130) | Date: 2014-04-23 06:42 | |
The documentation says that unicode_internal is deprecated since Python 3.3, but not unicode_escape. Also, isn't unicode_escape different from UTF-8? My original intention was to convert two-character escape sequences in a string into their control characters. For example, the file test.txt contains the 17-byte UTF-8 raw content "---a---\n---ä---". Now I want to convert '\\n' to '\n':
>>> file = open('test.txt', 'r')
>>> content = file.read()
>>> file.close()
>>> content = content.encode('utf-8').decode('unicode-escape')
>>> print(content)
---a---
---Ã¤---
I'm now successfully getting 2 lines, but I noticed that I'm not getting the ä anymore. After that I took a deeper look and opened this ticket.
If unicode_escape really does get deprecated, maybe I could simply replace the escape sequences for characters 0-31 and 127 myself to achieve practically the same behavior.
|
|||
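A short sketch of the same pipeline, with the 17-byte content of test.txt written as an in-memory bytes literal (an assumption, since the file itself is not attached): the '\\n' gets expanded to a newline, but the ä comes out mangled:
>>> raw = b'---a---\\n---\xc3\xa4---'   # the 17-byte UTF-8 raw content described above
>>> content = raw.decode('utf-8')        # what reading test.txt as UTF-8 text returns
>>> print(content.encode('utf-8').decode('unicode-escape'))
---a---
---Ã¤---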
| msg217094 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-23 22:07 | |
Using unicode_escape to decode non-ASCII data is simply wrong. It can't work. |
|||
| msg217095 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-23 22:17 | |
To understand why, understand that a byte string has no inherent encoding. So when you call b'utf8string'.decode('unicode_escape'), Python has no way to know how to interpret the non-ASCII bytes in that bytestring. If you want the unicode_escape representation of something, you want to do 'string'.encode('unicode_escape'). If you then want that as a Python string, you can do:
'mystring'.encode('unicode_escape').decode('ascii')
In theory there ought to be a way to use the codecs module to go directly from unicode string to unicode-escaped string, but I don't know how to do it, since the proposal for the 'transform' method was rejected :)
Just to bend your brain a bit further, note that this does work:
>>> codecs.decode(codecs.encode('ä', 'unicode-escape').decode('ascii'), 'unicode-escape')
'ä'
|
|||
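A small sketch of the pattern msg217095 describes, assuming Python 3: the escaped representation as bytes, the same as a str, and the way back:
>>> 'ä'.encode('unicode_escape')                   # the escaped representation, as bytes
b'\\xe4'
>>> 'ä'.encode('unicode_escape').decode('ascii')   # the same, as a Python str
'\\xe4'
>>> b'\\xe4'.decode('unicode_escape')              # and back to the original character
'ä'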
| msg217096 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2014-04-23 22:19 | |
Also, I'm not sure what this should do, but what it does do doesn't look right:
>>> codecs.decode('ä', 'unicode-escape')
'Ã¤'
|
|||
| msg218519 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-05-14 11:00 | |
Sworddragon, try to use content.encode('ascii', 'backslashreplace').decode('unicode-escape').
It is too late to change the unicode-escape encoding.
|
|||
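A brief sketch of this suggestion applied to the string from msg217055 (the content value is assumed from that message): the escape sequence is expanded and, unlike before, the ä survives:
>>> content = '---a---\\n---ä---'   # the string read from test.txt in msg217055
>>> print(content.encode('ascii', 'backslashreplace').decode('unicode-escape'))
---a---
---ä---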
| msg221191 - (view) | Author: (deleted250130) | Date: 2014-06-21 19:22 | |
> It is too late to change the unicode-escape encoding.
So it will stay at ISO-8859-1? If yes, I think this ticket can be closed as wont fix. |
|||
| msg221198 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2014-06-21 20:32 | |
I disagree. The current decoder implementation is clearly incorrect: the unicode-escape encoding only uses bytes < 128, so decoding non-ASCII bytes should fail, and the examples in msg217021 should all give UnicodeDecodeErrors. As this is an incompatible change, we need to deprecate the current behavior in 3.5 and change it in 3.6. |
|||
| msg221204 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2014-06-21 21:21 | |
The unicode-escape codec was used in Python 2 to convert Unicode literals in source code to Unicode objects. Before PEP 263, Unicode literals in source code were interpreted as Latin-1. See http://legacy.python.org/dev/peps/pep-0263/ for details. The implementation is correct, but doesn't necessarily match today's realities anymore. |
|||
| msg221308 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2014-06-22 20:35 | |
As you say, the unicode-escape codec is tied to the Python language definition. So if the language changes, the codec needs to change as well. A Unicode literal in source code might be using any encoding, so to be on the safe side, restricting it to ASCII is meaningful. Or else, if we want to use the default source encoding (as it did in 2.x), we should assume UTF-8 (per PEP 3120). Using ISO-8859-1 is clearly wrong for 3.x. |
|||
| msg221442 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-06-24 09:44 | |
Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility. |
|||
| msg221447 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2014-06-24 10:08 | |
On 24.06.2014 11:44, Serhiy Storchaka wrote:
> Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility.
Indeed. unicode-escape was also designed to be able to read back raw-unicode-escape encoded data, so changing the decoder to not accept Latin-1 code points would break that as well. It may be better to simply create a new codec that rejects non-ASCII encoded bytes when decoding, and perhaps call that 'unicode-repr'. |
|||
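A short sketch of that readback property, assuming Python 3: raw-unicode-escape leaves Latin-1-range characters as raw bytes, so unicode-escape can only read its output back because the decoder accepts those bytes:
>>> data = 'ä€'.encode('raw_unicode_escape')   # Latin-1 range stays raw, higher code points become \uXXXX escapes
>>> data
b'\xe4\\u20ac'
>>> data.decode('unicode_escape')              # readable only because non-ASCII bytes are accepted as Latin-1
'ä€'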
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:02 | admin | set | github: 65530 |
| 2014-06-24 10:08:35 | lemburg | set | messages: + msg221447 |
| 2014-06-24 09:44:53 | serhiy.storchaka | set | messages: + msg221442 |
| 2014-06-22 20:35:08 | loewis | set | messages: + msg221308 |
| 2014-06-21 21:21:50 | lemburg | set | messages: + msg221204 |
| 2014-06-21 20:32:23 | loewis | set | nosy: + loewis; messages: + msg221198 |
| 2014-06-21 19:22:01 | deleted250130 | set | status: pending -> open; messages: + msg221191 |
| 2014-05-25 08:02:34 | serhiy.storchaka | set | status: open -> pending |
| 2014-05-14 11:00:50 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg218519 |
| 2014-04-23 22:19:16 | r.david.murray | set | messages: + msg217096 |
| 2014-04-23 22:17:11 | r.david.murray | set | messages: + msg217095 |
| 2014-04-23 22:07:41 | r.david.murray | set | messages: + msg217094 |
| 2014-04-23 06:42:48 | deleted250130 | set | messages: + msg217055 |
| 2014-04-22 21:56:14 | vstinner | set | messages: + msg217033 |
| 2014-04-22 21:13:34 | r.david.murray | set | nosy: + ncoghlan, r.david.murray, lemburg; messages: + msg217024 |
| 2014-04-22 20:58:23 | deleted250130 | create | |