Suppose I have the following string that I want to decode as utf-8:
str ='\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
# expect 'אאא'
Using python 3, I would expect the following to work, but it doesn't:
bytes(str, 'ascii').decode('unicode-escape')
# prints ×ばつ'
bytes(str, 'ascii').decode('utf-8')
# prints '\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
Any help?
asked Oct 31, 2016 at 18:27
user2129817
951 gold badge1 silver badge7 bronze badges
1 Answer 1
You can do it with multiple trips through encode/decode.
print(st.encode('ascii').decode('unicode-escape').encode('iso-8859-1').decode('utf-8'))
The first is the preferred alternate to bytes. The second converts the escape sequences to their equivalent characters. The third takes advantage of Unicode being based on ISO-8859-1 for the first 256 code points to convert those characters directly back into bytes. Finally you can decode the UTF-8 string.
answered Oct 31, 2016 at 18:49
Mark Ransom
310k45 gold badges423 silver badges660 bronze badges
Sign up to request clarification or add additional context in comments.
Comments
lang-py