2

I'm using Python 3.5 and I want to convert the string '\xe1BA\x06\xbe\x084' into the bytes object b'\xe1BA\x06\xbe\x084'.

But using '\xe1BA\x06\xbe\x084'.encode('ascii') or '\xe1BA\x06\xbe\x084'.encode('utf-8') doesn't work.

With .encode('utf-8'), the result is
b'\xc3\xa1BA\x06\xc2\xbe\x084', which differs from the
b'\xe1BA\x06\xbe\x084' that I want.

How can I deal with this?

asked Sep 18, 2016 at 4:53
1
  • It looks like that string should be a bytestring before it gets to your code. What library / interface is giving it to you? Commented Sep 18, 2016 at 18:21

2 Answers

4

Use the latin1 codec.

>>> '\xe1BA\x06\xbe\x084'.encode('latin1')
b'\xe1BA\x06\xbe\x084'

This works because Python's latin1 codec maps each of the first 256 Unicode code points (U+0000 to U+00FF) to the single byte with the same value. Unicode took those code points from the ISO-8859-1 standard, so encoding such characters back with that codec gets you exactly those bytes.
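As a small illustration of that mapping (a sketch using the string from the question, not part of the original answer), the code points of the string can be compared with the byte values each codec produces:

```python
s = '\xe1BA\x06\xbe\x084'

# Code point of every character in the string.
code_points = [ord(ch) for ch in s]

# latin1 emits one byte per character, with the same numeric value.
latin1_bytes = s.encode('latin1')

# utf-8 emits two bytes for every character above U+007F.
utf8_bytes = s.encode('utf-8')

print(code_points)         # [225, 66, 65, 6, 190, 8, 52]
print(list(latin1_bytes))  # [225, 66, 65, 6, 190, 8, 52]
print(list(utf8_bytes))    # [195, 161, 66, 65, 6, 194, 190, 8, 52]
```

The two characters above U+007F ('\xe1' and '\xbe') are the ones that utf-8 expands into two bytes each, which is why the utf-8 result in the question is longer.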

While the other answer is useful (looping through all available codecs to see every matching output is great), keep in mind that other codecs may work for some specific strings but do not map every character to the byte with the same value. For example:

>>> '\xfe'.encode('iso8859_9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/encodings/iso8859_9.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 0: character maps to <undefined>
>>> '\xfe'.encode('latin1')
b'\xfe'
>>> 

Of course, the raw_unicode_escape codec can be useful if your intent is to encode everything to a byte form that also allows any character above \xff to be represented in the \uXXXX form:

>>> 'あ'.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u3042' in position 0: ordinal not in range(256)
>>> 'あ'.encode('raw_unicode_escape')
b'\\u3042'
>>> 

Naturally, pick the strategy that makes the most sense for your intent.
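One way to make that choice explicit is a small helper like the sketch below. The name to_raw_bytes and its allow_escape parameter are hypothetical, introduced only for illustration; the helper just combines the two codecs shown above.

```python
def to_raw_bytes(text, allow_escape=False):
    """Hypothetical helper: turn a str into bytes, one byte per character.

    Characters above U+00FF cannot be stored in a single byte. By default
    they raise UnicodeEncodeError; with allow_escape=True the whole string
    is encoded with raw_unicode_escape instead, which writes such
    characters in the \\uXXXX form.
    """
    try:
        return text.encode('latin1')
    except UnicodeEncodeError:
        if not allow_escape:
            raise
        return text.encode('raw_unicode_escape')


print(to_raw_bytes('\xe1BA\x06\xbe\x084'))     # b'\xe1BA\x06\xbe\x084'
print(to_raw_bytes('あ', allow_escape=True))   # b'\\u3042'
```

Raising by default is a deliberate choice in this sketch: a silent fallback would change the length of the output without the caller noticing.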

answered Sep 18, 2016 at 5:23

3 Comments

@Loïc legacy reasons, but I figured I post the answer right away before I edit in the explanation which I had to wiki because seriously I don't have all the various background in my head.
Thanks for the answer and the explanation.
@DaChen you're welcome, and I am surprised that this wasn't asked before because I can't seem to find a suitable duplicate entry for this question specifically for Python 3 and string conversion to identical raw byte sequence (they all talk about actual encoding, but not at the byte level).
1

You can try all kinds of encodings to see which ones match what you want.

s = '\xe1BA\x06\xbe\x084'
target = b'\xe1BA\x06\xbe\x084'

# Codec names to try; some are not text encodings or do not exist on
# every platform or Python version, so failures are skipped below.
code_list = ["ascii", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500",
    "cp720", "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858",
    "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874",
    "cp875", "cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251",
    "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", "euc_jp",
    "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp",
    "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext",
    "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6",
    "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14",
    "iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_u", "mac_cyrillic", "mac_greek",
    "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis",
    "shift_jis_2004", "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16",
    "utf_16_be", "utf_16_le", "utf_7", "utf_8", "utf_8_sig", "idna", "mbcs", "palmos",
    "punycode", "raw_unicode_escape", "rot_13", "undefined", "unicode_escape",
    "base64_codec", "bz2_codec", "hex_codec", "quopri_codec",
    "string_escape"]

for name in code_list:
    try:
        encoded = s.encode(name)
    except Exception:
        # Unknown codec, non-text codec, or a character it cannot encode.
        continue
    if encoded == target:
        print('**{:>20}** ==> {}'.format(name, encoded))

RESULT:

**              cp1252** ==> b'\xe1BA\x06\xbe\x084'
**              cp1254** ==> b'\xe1BA\x06\xbe\x084'
**              cp1258** ==> b'\xe1BA\x06\xbe\x084'
**             latin_1** ==> b'\xe1BA\x06\xbe\x084'
**           iso8859_9** ==> b'\xe1BA\x06\xbe\x084'
**              palmos** ==> b'\xe1BA\x06\xbe\x084'
**  raw_unicode_escape** ==> b'\xe1BA\x06\xbe\x084'
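A match for this one string does not mean a codec maps every character to the byte with the same value. As a rough follow-up check (a sketch; the helper name is_identity_codec is made up for illustration), each candidate can be tested against all 256 code points from U+0000 to U+00FF:

```python
# A few of the codecs that matched the example string above.
candidates = ['cp1252', 'cp1254', 'cp1258', 'latin_1', 'iso8859_9']

# One character for every code point from U+0000 to U+00FF.
all_chars = ''.join(chr(i) for i in range(256))


def is_identity_codec(name):
    """Return True if the codec maps each of the 256 characters to the
    single byte with the same value (hypothetical helper)."""
    try:
        return all_chars.encode(name) == bytes(range(256))
    except (LookupError, UnicodeEncodeError):
        # Unknown codec, or at least one character it cannot encode.
        return False


for name in candidates:
    print('{:>12}: {}'.format(name, is_identity_codec(name)))
```

Codecs such as cp1252 fail this check because they cannot encode some characters in that range (for example '\x80'), even though they happen to agree with latin_1 on the string from the question.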
answered Sep 18, 2016 at 5:42

1 Comment

This is quite a useful way. Thanks.
