Python Unicode Bug

Question 1

I'm writing a virtual machine in RPython using PyPy. When I tried to add Unicode support, I ran into an unusual problem. I'll use the letter "á" in my examples.

# The char in the example is á
print len(char)
OUTPUT:
2

I understand that the letter "á" takes two bytes, hence the length of 2. The problem shows up when I run the example below.

# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(ord(char))
OUTPUT:
0x22
0xc3
0xa1
0x22

As you can see, there are 4 numbers. The two 0x22 values are for the quotes, but there is only 1 letter between the quotes and yet it produces two numbers. Some machines I tested this script on produced this output instead:

OUTPUT:
0x22
0xe1
0x22

Is there any way to make the output the same on both machines? The script is exactly the same on each.

Question 2

Unrelated: to convert a bytestring into a hex string, use print(binascii.hexlify(instr)).
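As a small illustration of that suggestion, here is a sketch that hex-dumps the four bytes from the question's first output. The value assigned to instr is an assumption (the quoted "á" as UTF-8 bytes), and it is written as a bytes literal so the same lines run on Python 2 and Python 3.

```python
import binascii

# Assumed input: a double quote, the UTF-8 bytes for "á", and a double quote.
instr = b'"\xc3\xa1"'

# hexlify returns two hex digits per input byte.
hex_text = binascii.hexlify(instr)
print(hex_text)
```

Each pair of digits in the result corresponds to one line of the loop output in the question.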

Question 3

Your code in the question is for Python 2 (judging by the print statement and the content of '"á"').

Question 5

How would the input change?

unutbu · Answer 1 · 2014-04-24 22:34:06Z

The program is not being given the same input on the two machines:

In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True

When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it produces depend on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.

So you may see the input as the same, but the console (and thus the program) receives different input.

If your program decodes the bytes with the appropriate encoding and then works with unicode, it will behave the same on both machines from that point on. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
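To make the point concrete, here is a minimal sketch of the "decode first, then work with unicode" idea. The two byte strings are the ones from the comparison above; the variable names are made up for illustration, and bytes literals are used so the snippet also runs on Python 3.

```python
# The same glyph as two different consoles would deliver it.
windows_bytes = b'\xe1'       # "á" encoded as cp1252
unix_bytes = b'\xc3\xa1'      # "á" encoded as UTF-8

# Decode each input with the encoding it was actually written in.
windows_text = windows_bytes.decode('cp1252')
unix_text = unix_bytes.decode('utf-8')

# After decoding, both machines hold the same single code point.
same_text = windows_text == unix_text
code_point = hex(ord(windows_text))
print(same_text, code_point)
```

In a real program the encoding name would come from the input source (for console input, sys.stdin.encoding) rather than being hard-coded.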

Ian Clelland · Answer 2 · 2014-04-25 03:07:17Z

You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?

The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python 3 by default, len(instr) will be 3 (including the quotes).

In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(instr) will be 4, and you would see your first output.
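The two outputs in the question can be reproduced side by side by treating the same quoted string once as text and once as UTF-8 bytes. This is only a sketch; the names text and raw are illustrative, and bytearray is used so that iterating yields integers on both Python 2 and Python 3.

```python
# The quoted string as Unicode text: quote, U+00E1, quote.
text = u'"\u00e1"'

# The same string as UTF-8 bytes, which is what a Python 2 str would hold.
raw = text.encode('utf-8')

# One entry per code point versus one entry per byte.
text_codes = [hex(ord(ch)) for ch in text]
byte_codes = [hex(b) for b in bytearray(raw)]

print(len(text), text_codes)
print(len(raw), byte_codes)
```

The first list matches the three-line output and the second matches the four-line output.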

jfs · Answer 3 · 2014-04-25 04:09:31Z

The issue is that you use bytestrings to work with text data. You should use Unicode instead.

It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.

If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:

unicode_text = bytestring.decode(encoding)

It should resolve your initial issue.

There are also Unicode normalization forms e.g.:

import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
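Normalization matters because the same visible letter can be stored in more than one way. The sketch below assumes a decomposed input ("a" followed by a combining acute accent) purely for illustration and shows NFC folding it into the single code point U+00E1.

```python
import unicodedata

# "a" followed by U+0301 COMBINING ACUTE ACCENT renders as "á".
decomposed = u'a\u0301'

# NFC composes the pair into the single precomposed code point.
composed = unicodedata.normalize('NFC', decomposed)

# NFD goes the other way and splits it again.
round_trip = unicodedata.normalize('NFD', composed)

print(len(decomposed), len(composed), hex(ord(composed)))
```

Comparing or counting characters is only reliable once both sides are in the same normalization form.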

"If I don't change the encoding in the program how can I output unicode characters for example?"

You might mean that you have a sequence of bytes, e.g., '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding; in utf-8, it is the Unicode code point U+00E1. It may be something different in a different character encoding. Please read the article linked above: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Unless your terminal happens to use the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted, e.g., instead of á you might get ├б on the screen.
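The corrupted output mentioned above can be reproduced directly. As a sketch, assuming the terminal interprets bytes as cp866 (one encoding that yields exactly those two glyphs), decoding the UTF-8 bytes with the wrong codec gives ├б, and converting between encodings means decoding with the source encoding and re-encoding with the target one.

```python
# UTF-8 bytes for "á", as they might be stored in the input file.
data = b'\xc3\xa1'

# Interpreted with the right encoding: one code point, U+00E1.
correct = data.decode('utf-8')

# Interpreted as cp866 (assumed terminal encoding): two unrelated glyphs.
garbled = data.decode('cp866')

# Converting properly: decode with the source encoding, encode for the target.
for_cp1252_terminal = correct.encode('cp1252')

print([hex(ord(ch)) for ch in garbled])
```

The code points printed are U+251C (├) and U+0431 (б), which is the mojibake shown above.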

In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.

Is there any way to decode a string without the decode method? In RPython I can't use the .decode method.

@user3566150: I don't know whether RPython supports encodings at all. Where is the data coming from? Why does it use different character encodings on different machines?

The data is coming from a text file. RPython uses ascii by default since it's based on Python 2. You can use the unicode() function so long as it only has 1 parameter, and you can say u"Some string" to produce a unicode string, but you can't use "Something".decode("utf8"). There are a couple of functions in RPython for messing with unicode, but I found a problem in those as well. They can convert the escaped unicode \uE1, for example, but they can't do every unicode character: it says Unicode Decode Error when I try to decode past \uF5.

@user3566150: don't mix up the character encoding of the RPython source code (ascii) and a possible encoding of external data, which can be anything. á is not ascii. Who writes the file? Why is the character encoding different in the data file?

If I don't change the encoding in the program how can I output unicode characters for example?
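On the question of decoding without the .decode method: UTF-8 can be decoded by hand with a few bit operations. The function below is a hypothetical sketch in plain Python (decode_utf8 is a made-up name, not an RPython or standard library API), and whether it passes RPython translation unchanged is an untested assumption. It returns integer code points and does not reject overlong encodings or surrogates.

```python
def decode_utf8(data):
    """Return the list of code points in a UTF-8 byte sequence.

    `data` must support len() and indexing that yields integers 0..255,
    for example a bytearray. Raises ValueError on malformed input.
    """
    code_points = []
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        if first < 0x80:              # 0xxxxxxx: single byte (ASCII)
            value, extra = first, 0
        elif 0xC0 <= first < 0xE0:    # 110xxxxx: one continuation byte
            value, extra = first & 0x1F, 1
        elif 0xE0 <= first < 0xF0:    # 1110xxxx: two continuation bytes
            value, extra = first & 0x0F, 2
        elif 0xF0 <= first < 0xF8:    # 11110xxx: three continuation bytes
            value, extra = first & 0x07, 3
        else:
            raise ValueError("invalid start byte at offset %d" % i)
        if i + extra >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for j in range(i + 1, i + 1 + extra):
            cont = data[j]
            if (cont & 0xC0) != 0x80:  # continuation bytes look like 10xxxxxx
                raise ValueError("invalid continuation byte at offset %d" % j)
            value = (value << 6) | (cont & 0x3F)
        code_points.append(value)
        i += 1 + extra
    return code_points


# The four bytes from the question decode to three code points.
quoted = decode_utf8(bytearray(b'"\xc3\xa1"'))
print([hex(cp) for cp in quoted])
```

Working on integers rather than characters keeps the sketch independent of how the host language represents strings; turning each code point back into a character is a separate step.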
