How to read binary strings in Python

Question 1

I am wondering how the binary encoding of a string is written in Python.

For example,

>>> b'\x25'
b'%'

or

>>> b'\xe2\x82\xac'.decode()
'€'

but

>>> b'\xy9'
File "<stdin>", line 1
SyntaxError: (value error) invalid \x escape at position 0

Could you please explain what \xe2 stands for and how this binary encoding works?

Question 2

That's hexadecimal, which uses the digits 0-9 and a-f. Python complains because y is not a valid hexadecimal digit.
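As a small sketch of that point, the snippet below compiles the offending literal from a raw string so the error can be caught at run time, and checks the digits against `string.hexdigits`. The names `source`, `raised`, and `is_hex` are illustrative only.

```python
import string

# The same literal as in the question, kept in a raw string so that
# this example file itself still parses.
source = r"b'\xy9'"

try:
    compile(source, '<example>', 'eval')
    raised = False
except SyntaxError:
    # 'y' is not a hexadecimal digit, so the \x escape is rejected.
    raised = True

# string.hexdigits is '0123456789abcdefABCDEF'
is_hex = all(ch in string.hexdigits for ch in 'y9')

print(raised, is_hex)
```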

Question 3

\x is used to introduce a hexadecimal value, and must be followed by exactly two hexadecimal digits. For example, \xe2 represents the byte (in decimal) 226 (= 14 * 16 + 2).

In the first case, the two literals b'\x25' and b'%' are identical; Python displays byte values using their ASCII equivalents where possible.
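The arithmetic and the display rule can be checked directly. This is only a sketch using built-ins; the variable names are made up for illustration.

```python
# \xe2 is a single byte whose value is 0xe2 in hexadecimal.
data = b'\xe2'
byte_value = data[0]               # indexing a bytes object yields an int

assert byte_value == 14 * 16 + 2   # e -> 14, 2 -> 2
assert byte_value == int('e2', 16)

# b'\x25' and b'%' are the same one-byte value; repr() shows the ASCII form.
same = b'\x25' == b'%'
shown = repr(b'\x25')

print(byte_value, same, shown)
```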

Question 4

Why does it have to be followed by exactly two hexadecimal digits? Why not just one?

Question 5

Because then \x25 would be ambiguous: is it the single character %, or is it a two-character sequence consisting of the ASCII control character START OF TEXT (\x02) followed by the character 5?
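Because the width is fixed at two digits, the parse is unambiguous. The sketch below (variable names are illustrative) shows that a third digit simply becomes a separate byte, and that a small value needs a leading zero.

```python
# The escape consumes exactly two hex digits; what follows is a separate byte.
three_digits = b'\x255'
assert three_digits == b'%5'
assert len(three_digits) == 2

# To get the control character START OF TEXT (0x02) followed by '5',
# the small value must be written with a leading zero.
stx_then_five = b'\x025'
assert list(stx_then_five) == [0x02, 0x35]   # '5' is 0x35 in ASCII

print(list(three_digits), list(stx_then_five))
```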

Question 6

Right, thanks. I guess it's also because each hexadecimal digit is 4 bits, so two hex digits are 8 bits, or one byte.

Question 7

Yes. In a bytes value, \x is used to specify a single byte by its hexadecimal value. In a str value, you can use \x (primarily for historical reasons) to specify a Unicode code point using a single value between 0 and 255, \u to specify one using a 16-bit value (four hex digits), or \U using a 32-bit value (eight hex digits).
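A short sketch of the three str escapes, using only built-ins. The code point U+1F600 is just an example chosen here because it does not fit in 16 bits.

```python
# In a str literal, each escape names one Unicode code point.
assert '\x25' == '%'                 # \x: two hex digits, code points 0-255
assert ord('\xe2') == 0xE2           # the code point U+00E2, not the byte 0xE2
assert ord('\u20ac') == 0x20AC       # \u: four hex digits
assert '\U000020ac' == '\u20ac'      # \U: eight hex digits, same character
assert ord('\U0001F600') == 0x1F600  # beyond 16 bits, so only \U can spell it

# The str '\xe2' is one character, but its UTF-8 encoding is two bytes,
# which is different from the single byte b'\xe2'.
as_utf8 = '\xe2'.encode('utf-8')
print(as_utf8)
```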

Question 8

I assume you are using Python 3. In Python 3 the default encoding for bytes.decode() is UTF-8, so b'\xe2\x82\xac'.decode() is in fact b'\xe2\x82\xac'.decode('UTF-8').

It gives the character '€', which is U+20AC in Unicode, and the UTF-8 encoding of U+20AC is indeed the 3 bytes b'\xe2\x82\xac'.

So all ASCII characters (code points below 128) are encoded as a single byte with the same value as the Unicode code point. Non-ASCII characters that fit in a single 16-bit Unicode value (the range known as the Basic Multilingual Plane) are encoded in UTF-8 as 2 or 3 bytes.
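The byte counts can be verified with a round trip. This is a sketch with arbitrarily chosen sample characters; the fourth one lies outside the Basic Multilingual Plane and is included only for contrast.

```python
euro = '\u20ac'
encoded = euro.encode('utf-8')
assert encoded == b'\xe2\x82\xac'
assert encoded.decode() == euro       # decode() defaults to UTF-8

# Code point (as hex) -> number of bytes in its UTF-8 encoding.
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']
lengths = {hex(ord(ch)): len(ch.encode('utf-8')) for ch in samples}

# An ASCII character keeps its code point as its single byte value.
ascii_byte = 'A'.encode('utf-8')[0]

print(lengths, ascii_byte)
```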

chepner · Accepted Answer · 2017-07-27 13:11:29Z
