Why do I need to use latin-1 instead of utf-8 when using python with arduino?

Question 1

When reading and writing over a Python serial port connection to an Arduino, the results are not as expected unless I use latin-1 ('ISO-8859-1'). For example, on the Arduino I have

int outP = 5;
//...
int outV = Serial.read();
analogWrite(outP, outV);

While with python I have

serial_port.write(chr(255).encode())

I read 3.78 V from the pin, whereas if I use

serial_port.write(chr(255).encode(encoding = 'latin-1'))

I get 5.04 V. I have read that latin-1 and utf-8 don't always match, but is there something about arduino that requires using latin-1? Of course, the two encodings produce different bytes: chr(255) gives b'\xc3\xbf' (which displays as 'Ã¿' when read as latin-1) with utf-8, or b'\xff' ('ÿ') with latin-1. But why does arduino work with latin-1?

FYI other options that work are

v = 255
serial_port.write(v.to_bytes(1, byteorder = 'big'))
serial_port.write(bytes([v]))

Question 2

but is there something about arduino that requires using latin-1?

No, not really.

What it comes down to is that Serial.read() reads bytes, irrespective of whatever encoding they may be used with. ISO-8859-1 only encodes character code points in the 0-255 range, and it maps each one to the single byte with the same value. So when you choose a 0-255 code point and send it as ISO-8859-1, it gets sent as 1 byte, which is how your code has been written to receive it.
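As a quick check of that one-byte property, here is a minimal sketch that can be run in a python3 REPL. It only uses built-in str.encode; the variable names are just for illustration.

```python
# Latin-1 maps every code point 0..255 to the single byte with the same value.
latin1_is_identity = all(
    chr(i).encode('latin-1') == bytes([i]) for i in range(256)
)

# UTF-8 keeps the one-byte property only for code points 0..127.
utf8_single_byte = [i for i in range(256) if len(chr(i).encode('utf-8')) == 1]

print(latin1_is_identity)                          # True
print(utf8_single_byte[0], utf8_single_byte[-1])   # 0 127
```

So chr(v).encode() happens to work for values up to 127 and silently changes behaviour from 128 upward.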

A 255 code point as utf-8 would require multiple bytes to encode and so would result in multiple Serial.read() calls for a given value.

If you open a python3 REPL and put chr(255).encode() or more explicitly chr(255).encode('utf-8') you will see it results in b'\xc3\xbf'. So on the Arduino side you will see this as two separate Serial.read() results, 0xC3 and 0xBF.
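That also seems to account for the 3.78 V reading in the question. The sketch below is only a rough model, assuming the Arduino loop calls analogWrite() once per received byte so that the last byte wins, and that the meter averages the PWM output linearly against the 5.04 V full-scale reading. simulate_sketch and approx_voltage are hypothetical helper names, not part of any library.

```python
def simulate_sketch(payload):
    """Model of the Arduino side: one analogWrite() value per received byte."""
    return list(payload)  # iterating over bytes yields ints in 0..255


def approx_voltage(duty, v_full=5.04):
    """Average PWM voltage for an 8-bit duty value (assumed linear)."""
    return v_full * duty / 255


utf8_writes = simulate_sketch(chr(255).encode('utf-8'))
latin1_writes = simulate_sketch(chr(255).encode('latin-1'))

print(utf8_writes)     # [195, 191], i.e. 0xC3 then 0xBF
print(latin1_writes)   # [255]

# The pin is left at the last value written.
print(round(approx_voltage(utf8_writes[-1]), 2))     # 3.78
print(round(approx_voltage(latin1_writes[-1]), 2))   # 5.04
```

Under those assumptions the UTF-8 case leaves the pin at a duty of 191 out of 255, which is about 3.78 V.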

When you're just taking in strings as UTF-8 and splatting them out via Serial.println(), the Arduino is blissfully unaware that they're UTF-8 and not ISO-8859-1. Really you could use any encoding you wanted, the caveat being that if you're using C strings to store them, the encoding would need to be one that doesn't allow a null byte to appear in the middle of the string.
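To illustrate the null byte caveat, here is a small sketch comparing UTF-8 with UTF-16 (chosen here only as an example of an encoding that does embed zero bytes). The split on b'\x00' is a stand-in for how a C string would be cut off at the first terminator.

```python
text = "Arduino"

utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16-le')

# UTF-8 never produces a zero byte unless the text itself contains U+0000.
print(0 in utf8)    # False

# UTF-16 pads ASCII characters with a zero byte each.
print(0 in utf16)   # True

# What a C string would effectively hold: everything before the first NUL.
as_c_string = utf16.split(b'\x00')[0]
print(as_c_string)  # b'A'
```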

You may want to look into using struct.pack rather than cobbling together python bytestrings with sequences of chr() and .encode(whatever). struct.pack will also result in a bytestring, but it's more purpose built for doing this.

You can import struct and then struct.pack('B', 255) will result in b'\xff', same as chr(255).encode('iso8859-1'), but it at least expresses intent. 'B' here signifies an unsigned char (one byte, 0-255); you'll find the format specifiers in the struct module documentation. You'll get a greater benefit when you begin using multiple fields, where struct.pack will be a lot less unwieldy than chr, encode and string concatenation.
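A minimal sketch of the multi-field case follows. The packet layout (pin, duty, delay in milliseconds) is hypothetical and not from the question; the Arduino sketch would have to read the same four bytes and reassemble them. The '<' prefix asks for little-endian with no padding, which should match how AVR-based boards lay out multi-byte integers.

```python
import struct

pin, duty, delay_ms = 5, 255, 1000

# '<' little-endian, no padding
# 'B' unsigned char (1 byte), 'H' unsigned short (2 bytes)
packet = struct.pack('<BBH', pin, duty, delay_ms)
print(packet)       # b'\x05\xff\xe8\x03'
print(len(packet))  # 4

# Round trip, as the receiving side would decode it.
print(struct.unpack('<BBH', packet))  # (5, 255, 1000)

# Out-of-range values fail loudly instead of turning into extra bytes.
try:
    struct.pack('B', 256)
    range_checked = False
except struct.error:
    range_checked = True
print(range_checked)  # True
```

The resulting bytes object can be passed straight to serial_port.write(packet).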

Question 3

Yes, len(chr(255).encode(encoding = 'latin-1')) returns 1, while len(chr(255).encode()) returns 2

Question 4

ok thanks, what format should be used with struct.pack?

Question 5

I have migrated my previous comment into the answer and addressed your question there.

timemage · Accepted Answer · 2021-03-09 15:38:21Z

