I'm making a virtual machine in RPython using PyPy. When I tried to add Unicode support, I ran into an unusual problem. I'll use the letter "á" in my examples.

# The char in the example is á
print len(char)
OUTPUT:
2

I understand that the letter "á" takes two bytes, hence the length of 2. The trouble starts with the example below.

# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(ord(char))
OUTPUT:
0x22
0xc3
0xa1
0x22

As you can see, there are 4 numbers. The two 0x22 values are the quotes, but there is only 1 letter between the quotes and yet two numbers for it. My question is this: some machines I tested this script on produced this output instead:

OUTPUT:
0x22
0xe1
0x22

Is there any way to make the output the same on both machines? The script is exactly the same on each.

asked Apr 24, 2014 at 22:30
  • unrelated: to convert a bytestring into a hex string: print(binascii.hexlify(instr)) Commented Apr 25, 2014 at 4:07
  • Your code in the question is for Python 2 (judging by the print statement and the content of '"á"') Commented Apr 25, 2014 at 4:13

3 Answers

The program is not being given the same input on the two machines:

In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True

When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it produces depend on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.

So you may see the input as the same, but the console (and thus the program) receives different input.

If your program decodes the bytes with the appropriate encoding and then works with unicode, it will behave the same on both machines from that point on. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
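As a rough illustration of the point above, here is a minimal sketch in ordinary Python (not RPython). The helper name `code_points` and the two sample inputs are hypothetical; the encodings are hard-coded here, whereas a real program would take them from something like sys.stdin.encoding.

```python
def code_points(raw, encoding):
    """Decode raw bytes with the given encoding and return hex code points."""
    text = raw.decode(encoding)
    return [hex(ord(ch)) for ch in text]

# Hypothetical inputs: what each console might send for "á" (with quotes).
windows_input = b'"\xe1"'       # cp1252 console: one byte for á
unix_input = b'"\xc3\xa1"'      # UTF-8 console: two bytes for á

# Once decoded with the matching encoding, both inputs give the same result.
print(code_points(windows_input, 'cp1252'))
print(code_points(unix_input, 'utf-8'))
```

Both calls print ['0x22', '0xe1', '0x22'], because the loop now runs over code points rather than over encoded bytes.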

answered Apr 24, 2014 at 22:34

1 Comment

How would the input change?

You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?

The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode by default in Python 3, len(char) will be 3 (including the quotes).

In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
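The two lengths can be checked side by side. This is only a sketch, assuming the input really is UTF-8; the variable names are illustrative, and the b'' literal makes the snippet behave the same on Python 2.7 and Python 3.

```python
raw = b'"\xc3\xa1"'            # UTF-8 bytes for "á", quotes included
text = raw.decode('utf-8')     # the same content as Unicode text

print(len(raw))     # 4: counts bytes
print(len(text))    # 3: counts code points
```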

answered Apr 25, 2014 at 3:07

The issue is that you use bytestrings to work with text data. You should use Unicode instead.

It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.

If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:

unicode_text = bytestring.decode(encoding)

It should resolve your initial issue.

There are also Unicode normalization forms e.g.:

import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
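Normalization matters because the same visible character can be stored as different code point sequences. A small sketch, using á as the example (the variable names are illustrative only):

```python
import unicodedata

composed = u'\u00e1'        # á as a single code point
decomposed = u'a\u0301'     # 'a' followed by a combining acute accent

# The two strings render the same but do not compare equal as-is.
same_before = (composed == decomposed)

# NFC composes and NFD decomposes, so either form makes them comparable.
same_nfc = (unicodedata.normalize('NFC', decomposed) == composed)
same_nfd = (unicodedata.normalize('NFD', composed) == decomposed)

print(same_before, same_nfc, same_nfd)
```

Here same_before is False while same_nfc and same_nfd are both True, which is why normalizing text to one form before comparing it is a common precaution.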

"If I don't change the encoding in the program, how can I output unicode characters, for example?"

You might mean that you have a sequence of bytes, e.g., '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding; in utf-8 it is the Unicode code point U+00E1. It may be something different in a different character encoding. Please read the article I linked above, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Unless your terminal happens to use the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted, e.g., instead of á you might get ├б on the screen.
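The corruption can be reproduced in ordinary Python. In this sketch, cp866 is an assumption: it is one encoding that happens to turn the UTF-8 bytes for á into ├б, and the helper name `transcode` is hypothetical.

```python
def transcode(raw, source_encoding, target_encoding):
    """Convert bytes from one character encoding to another via Unicode."""
    return raw.decode(source_encoding).encode(target_encoding)

data = b'\xc3\xa1'                 # UTF-8 bytes for á

# Reading the bytes with the wrong encoding yields two unrelated characters.
garbled = data.decode('cp866')     # U+251C (├) and U+0431 (б)

# Reading them with the right encoding yields the single intended character.
correct = data.decode('utf-8')     # U+00E1 (á)

# Converting for a cp1252 terminal produces the single byte 0xE1.
for_cp1252 = transcode(data, 'utf-8', 'cp1252')

print(repr(garbled), repr(correct), repr(for_cp1252))
```

The snippet prints repr() values rather than the characters themselves so that it does not depend on the encoding of the terminal it runs in.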

In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.
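If the decode method is unavailable, UTF-8 is regular enough to decode by hand. The following is a minimal sketch in ordinary Python and has not been verified under RPython; the function name `utf8_code_points` is hypothetical. It only checks the basic byte layout and does not reject overlong encodings or surrogate code points.

```python
def utf8_code_points(raw):
    """Decode UTF-8 bytes into a list of integer code points by hand."""
    data = bytearray(raw)          # yields integers on Python 2 and 3
    points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                 # 0xxxxxxx: one byte (ASCII)
            value, extra = first, 0
        elif first >> 5 == 0x06:         # 110xxxxx: two-byte sequence
            value, extra = first & 0x1F, 1
        elif first >> 4 == 0x0E:         # 1110xxxx: three-byte sequence
            value, extra = first & 0x0F, 2
        elif first >> 3 == 0x1E:         # 11110xxx: four-byte sequence
            value, extra = first & 0x07, 3
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        if i + extra >= len(data):
            raise ValueError("truncated UTF-8 sequence at offset %d" % i)
        for j in range(1, extra + 1):
            cont = data[i + j]
            if cont >> 6 != 0x02:        # continuation bytes are 10xxxxxx
                raise ValueError("invalid continuation byte at offset %d" % (i + j))
            value = (value << 6) | (cont & 0x3F)
        points.append(value)
        i += extra + 1
    return points

print([hex(p) for p in utf8_code_points(b'"\xc3\xa1"')])
```

For the question's input this prints ['0x22', '0xe1', '0x22'], matching the second output in the question.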

answered Apr 25, 2014 at 4:09

5 Comments

Is there any way to decode a string without the decode method? In RPython I can't use the .decode method.
@user3566150: I don't know whether RPython supports encodings at all. Where is the data coming from? Why does it use different character encodings on different machines?
The data is coming from a text file. RPython uses ascii by default since it's based on Python 2. You can use the unicode() function so long as it only has 1 parameter and you can say u"Some string" to produce a unicode string but you can use "Something".decode("utf8"). There are a couple of functions in RPython for messing with unicode but I found a problem in those as well. They can convert the escaped unicode \uE1 for example, but they can't do every unicode character, it says Unicode Decode Error when I try to decode past \uF5.
@user3566150: don't mix the character encoding of the RPython source code (ascii) and a possible encoding of external data that can be anything. á is not ascii. Who writes the file? Why is the character encoding different in the data file?
If I don't change the encoding in the program how can I output unicode characters for example?
