Python: print unicode char escaped

Question 1

I have tried to convert an ASCII string to an escaped pseudo-Unicode string using Python, but have failed so far.

What I want to do: Convert ASCII 'a' to ASCII String "<U0061>"

I can convert "a" with unicode('a'), but cannot save the numerical value of a in an ASCII string.

How can I do that?

Question 2

What I want to do: Convert ASCII 'a' to ASCII String "U+0061"

Question 3

What if it's outside the BMP?

Question 4

Why is the BMP relevant? A code point is a code point. Doesn't Python have abstract characters?

Question 5

You can use ord() to convert a character to its character value (str) or code point (unicode). You can then use the appropriate string formatting to convert it into a text representation.

'U+%04X' % (ord(u'A'),)
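As a rough sketch of how the same idea covers both formats asked for in the question, the snippet below wraps the `ord()` plus string-formatting approach in two small helpers. It is written for Python 3, where every `str` element is a full code point; the helper names `u_plus` and `angle_escape` are made up for this illustration.

```python
def u_plus(ch):
    """Return the 'U+XXXX' notation for a single character."""
    # %04X pads to at least four hex digits; larger code points simply use more.
    return "U+%04X" % ord(ch)


def angle_escape(text):
    """Return text with every character written as '<UXXXX>'."""
    return "".join("<U%04X>" % ord(ch) for ch in text)


print(u_plus("a"))            # U+0061
print(angle_escape("a"))      # <U0061>
print(u_plus("\U00010EEB"))   # U+10EEB (a code point outside the BMP)
```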

Question 6

This fails: `'U+%04X' % (ord(u'\U000010eeb'),)`. But in Perl, `printf "U+%05X\n", ord(chr(0x10eebA))` correctly prints the expected U+10EEBA. What’s wrong with Python’s model of strings and/or characters? Something seems broken here.

Question 7

@tchrist, do you have a wide or narrow Python build? A Unicode character in a narrow build (typical on Windows) is limited to the 0-0xffff range.
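Some background on why the build matters: a narrow Python 2 build stores a code point above U+FFFF as two UTF-16 code units (a surrogate pair), so the string has length 2 and `ord()` rejects it. From Python 3.3 onward there is no narrow/wide distinction and `ord()` returns the full code point. The sketch below only illustrates the surrogate arithmetic; the function names are invented for this example and it is not how CPython itself is implemented.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10EEB)
print("%04X %04X" % (high, low))             # D803 DEEB
print("U+%04X" % from_surrogates(high, low))  # U+10EEB
```

On a narrow build, recombining the two code units this way is what lets you print the real code point instead of getting an error from `ord()`.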

Question 8

@Constantin, apparently I have a "narrow build", whatever that means. I’m on an Apple, not a Microsoft box. How do you write portable Python if you are forced to always think about characters’ physical layouts in the various encodings? I just want to process a UTF-8 stream, no matter whether characters are ASCII, in the BMP, in the SMP, or anywhere else. How come Unicode characters aren’t just Unicode characters? Why is it possible to build Python so it cannot handle Unicode correctly?

Question 9

@tchrist: If you have a UCS-4 build then it works as expected. The problem is that most people do not use characters outside the BMP so they complain about the doubled memory usage.

Question 10

@Constantin: If I had to choose between fast and correct, I’d choose correct every time. After all, if it doesn’t have to be correct, I can make any bit of code arbitrarily fast.

Question 11

Here is a minimal sample that lets you use Ignacio's solution through Python's built-in encoding/decoding engine. See http://docs.python.org/library/codecs.html if you need something more robust (with proper error handling, etc.).

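# -*- coding: utf-8 -*-
# Python 2 syntax (print statement, u"" literals).
import codecs

def encode(text, errors="strict"):
    # Codec encoders return (encoded_output, number_of_input_items_consumed).
    return ("".join("<U%04x>" % ord(char) for char in text), len(text))

def search(name):
    # Codec search function: return a CodecInfo for our name, None otherwise.
    if name == "unicode_ltgt":
        info = codecs.CodecInfo(encode, None, None, None)
        info.name = "unicode_ltgt"
        info.encode = encode
        return info
    return None

codecs.register(search)

if __name__ == "__main__":
    a = u"maçã"
    print a.encode("unicode_ltgt")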

(Just importing this as a module installs the codec "unicode_ltgt" and makes it available to any ".encode" call, as in the example above.)
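On Python 3 the sample needs two adjustments: `print` is a function, and `str.encode` expects the codec's encoder to hand back `bytes` rather than text. The following is a minimal sketch under those assumptions; the names `ltgt_encode` and `search_ltgt` are chosen here for illustration, no decoder is provided, and the uppercase hex digits are a stylistic choice rather than part of the original sample.

```python
import codecs


def ltgt_encode(text, errors="strict"):
    """Encode text as ASCII bytes, writing each character as '<UXXXX>'."""
    escaped = "".join("<U%04X>" % ord(ch) for ch in text)
    # Encoders return (output, number_of_input_characters_consumed).
    return escaped.encode("ascii"), len(text)


def search_ltgt(name):
    """Codec search function: answer only for the 'unicode_ltgt' name."""
    if name == "unicode_ltgt":
        # decode=None: this sketch supports encoding only.
        return codecs.CodecInfo(encode=ltgt_encode, decode=None,
                                name="unicode_ltgt")
    return None


codecs.register(search_ltgt)

print("maçã".encode("unicode_ltgt"))  # b'<U006D><U0061><U00E7><U00E3>'
```

Because the result is `bytes`, call `.decode("ascii")` on it if you want the escaped form back as text.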

Ignacio Vazquez-Abrams · Accepted Answer · 2010-11-02 09:06:46Z


Comments on the accepted answer, in the order they are quoted above:

tchrist, 2010-11-02T12:20:31.55Z
Constantin, 2010-11-02T15:35:28.54Z
tchrist, 2010-11-02T17:42:22.47Z
Ignacio Vazquez-Abrams, 2010-11-02T21:59:31.117Z
tchrist, 2010-11-03T17:36:52.61Z
