11

In python:

u'\u3053\n'

Is it utf-16?

I don't know much about Unicode and encodings, but strings like this keep coming up in my dataset, for example a = u'\u3053\n'.

print gives an exception and decoding gives an exception.

>>> a.encode("utf-16")
'\xff\xfeS0\n\x00'
>>> a.encode("utf-8")
'\xe3\x81\x93\n'
>>> print a.encode("utf-8")
πüô
>>> print a.encode("utf-16")
■S0

What's going on here?

SilentGhost
asked Aug 4, 2009 at 19:22

4 Answers

11

It's a Unicode character that doesn't seem to be displayable in your terminal's encoding. print tries to encode the unicode object in the encoding of your terminal, and if this can't be done you get an exception.

On a terminal that can display utf-8 you get:

>>> print u'\u3053'
こ

Your terminal doesn't seem to be able to display UTF-8; otherwise at least the print a.encode("utf-8") line would produce the correct character.
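The garbled output in the question is consistent with a console that interprets bytes as code page 437. That is an assumption, since the question never names the terminal's encoding. A minimal sketch, written in Python 3 syntax rather than the Python 2 of the question, that reproduces both symptoms:

```python
# Assumption: the asker's console uses code page 437 (a common Windows
# console default); the question itself does not say which encoding it uses.
text = u'\u3053'  # HIRAGANA LETTER KO

utf8_bytes = text.encode('utf-8')  # b'\xe3\x81\x93'

# A cp437 console shows each UTF-8 byte as its own character,
# which turns the three bytes into three unrelated glyphs.
mojibake = utf8_bytes.decode('cp437')

# Printing the text directly means encoding it for the console first,
# and that fails because cp437 has no Hiragana.
try:
    text.encode('cp437')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ascii() keeps this line printable on any terminal.
print(ascii(mojibake), encodable)
```

Under that assumption, the three glyphs are the same πüô the question shows, and the failed encode is the exception that a plain print raises.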

answered Aug 4, 2009 at 19:35

1 Comment

Thanks, yes. PowerShell, and even PowerShell ISE, doesn't seem "compatible" (for lack of a better understanding) with Unicode in Python. stackoverflow.com/questions/2105022/…
8

You ask:

u'\u3053\n'

Is it utf-16?

The answer is no: it's a Unicode string, not text in any specific encoding. UTF-16 is an encoding, that is, one way of turning Unicode code points into bytes.

To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:

screenshot
(source: aleax.it)

...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1, so it would mangle things somewhat like yours. I can set it to UTF-8, but its font doesn't have a huge number of glyphs, so it would display somewhat-useless placeholder glyphs instead.
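To make the distinction between a Unicode string and its encodings concrete, here is a small sketch in Python 3 syntax. The helper name `printable_for` is made up for this illustration; it relies on the standard `backslashreplace` error handler so that characters the target encoding cannot represent become escapes instead of raising an exception.

```python
import sys

def printable_for(text, encoding):
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes (hypothetical helper)."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

a = u'\u3053\n'

# One Unicode string, several different byte representations.
as_utf8 = a.encode('utf-8')
as_utf16le = a.encode('utf-16-le')

# Decoding with the matching codec gives back the same string.
assert as_utf8.decode('utf-8') == a
assert as_utf16le.decode('utf-16-le') == a

# sys.stdout.encoding can be None when output is not a terminal.
target = sys.stdout.encoding or 'ascii'
print(printable_for(a, target), end='')
```

On a UTF-8 terminal this prints the Hiragana letter itself; on one that cannot encode it, it prints the escape \u3053 rather than failing.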

Glorfindel
answered Aug 5, 2009 at 2:15

1 Comment

Is it possible to print utf-16 characters in python?
3

Character U+3053 "HIRAGANA LETTER KO".

The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x53\x30, then there's the \n from the original string, encoded as \n\x00. (Each of the characters has its bytes "reversed" as it's using little endian UTF-16 encoding.)

The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
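The byte layouts described here can be checked by hand. A minimal sketch in Python 3 syntax (`bytes([...])` behaves differently in Python 2), assuming the three-byte UTF-8 form, which covers code points from U+0800 to U+FFFF:

```python
import codecs

cp = 0x3053  # code point of HIRAGANA LETTER KO

# UTF-16, little endian: one 16-bit unit, low byte first.
utf16le = bytes([cp & 0xFF, cp >> 8])  # 0x53 0x30, which is ASCII "S0"

# UTF-8, three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
utf8 = bytes([
    0xE0 | (cp >> 12),          # top 4 bits
    0x80 | ((cp >> 6) & 0x3F),  # middle 6 bits
    0x80 | (cp & 0x3F),         # low 6 bits
])

assert utf16le == u'\u3053'.encode('utf-16-le') == b'S0'
assert utf8 == u'\u3053'.encode('utf-8') == b'\xe3\x81\x93'

# The byte order mark U+FEFF, encoded little endian, is the \xff\xfe prefix.
assert codecs.BOM_UTF16_LE == u'\ufeff'.encode('utf-16-le') == b'\xff\xfe'
```

The explicit 'utf-16-le' codec is used because plain 'utf-16' writes a BOM and picks the machine's native byte order, so its output can differ between platforms.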

Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?

answered Aug 4, 2009 at 19:37


1

Here's the Unicode HowTo Doc for Python 2.6.2:

http://docs.python.org/howto/unicode.html

Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.

answered Aug 4, 2009 at 19:33

