
I have tried to convert an ASCII string to a pseudo-Unicode escaped string using Python, but have failed so far.

What I want to do: Convert ASCII 'a' to ASCII String "<U0061>"

I can convert 'a' with unicode('a'), but I cannot save the numerical value of 'a' in an ASCII string.

How can I do that?

kennytm
asked Nov 2, 2010 at 9:04
  • What I want to do: Convert ASCII 'a' to ASCII String "U+0061" Commented Nov 2, 2010 at 9:05
  • What if it's outside the BMP? Commented Nov 2, 2010 at 9:07
  • Why is the BMP relevant? A code point is a code point. Doesn't python have abstract characters? Commented Nov 2, 2010 at 12:21

2 Answers


You can use ord() to convert a character to its character value (str) or code point (unicode). You can then use the appropriate string formatting to convert it into a text representation.

'U+%04X' % (ord(u'A'),)
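
To get the "<U0061>" form asked for in the question, the same ord() plus string formatting idea can be wrapped in a small helper. This is only a sketch, assuming Python 3; the function name to_angle_escape is made up for illustration.

```python
def to_angle_escape(text):
    """Return every character of text as <UXXXX> (at least four hex digits)."""
    # ord() gives the code point; %04X pads it to four uppercase hex digits.
    return "".join("<U%04X>" % ord(ch) for ch in text)


print(to_angle_escape("a"))   # <U0061>
print(to_angle_escape("ab"))  # <U0061><U0062>
```

The result is a plain ASCII-only string, so it can be written anywhere ASCII is accepted.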
answered Nov 2, 2010 at 9:06

11 Comments

This fails: `'U+%04X' % (ord(u'\U000010eeb'),)`. But in Perl, `printf "U+%05X\n", ord(chr(0x10eebA))` correctly prints the expected U+10EEBA. What’s wrong with Python’s model of strings and/or characters? Something seems broken here.
@tchrist, do you have a wide or narrow python build? Unicode character in narrow build (typical on Windows) is limited to 0-0xffff range.
@Constantine, apparently I have a "narrow build" — whatever that means. I’m on an Apple, not a Microsoft box. How do you write portable Python if you are forced to always think about characters’ physical layouts in the various encodings? I just want to process a UTF-8 stream, no matter whether characters are ASCII, in the BMP, in the SMP, or anywhere else. How come Unicode characters aren’t just Unicode characters? Why is it possible to build Python so it cannot handle Unicode correctly?
@tchrist: If you have a UCS-4 build then it works as expected. The problem is that most people do not use characters outside the BMP so they complain about the doubled memory usage.
@Constantin: If I had to choose between fast and correct, I’d choose correct every time. After all, if it doesn’t have to be correct, I can make any bit of code arbitrarily fast.
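
For readers on Python 3.3 or later, the narrow/wide build distinction discussed in these comments no longer applies (PEP 393 removed it), so a code point above U+FFFF is a single character. A minimal sketch, assuming Python 3.3+ and using an arbitrary supplementary-plane code point as the example:

```python
import sys

# One code point above U+FFFF; the \U escape takes exactly eight hex digits.
ch = "\U00010EEB"

# %04X pads to at least four digits and simply grows for larger values.
label = "U+%04X" % ord(ch)

print(len(ch), label)          # 1 U+10EEB
print(hex(sys.maxunicode))     # 0x10ffff
```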

Here is a minimal sample that lets you use Ignacio's solution with Python's built-in encoding/decoding engine. Check http://docs.python.org/library/codecs.html if you need something more complete (with proper error handling, etc.).

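# -*- coding: utf-8 -*-
# Python 2 code: registers an encode-only codec named "unicode_ltgt".
import codecs

def encode(text, error="strict"):
    # Codec encoders return a pair: (encoded text, number of characters consumed).
    return ("".join("<U%04x>" % ord(char) for char in text), len(text))

def search(name):
    # Codec search function: return a CodecInfo for our name, None otherwise.
    if name == "unicode_ltgt":
        info = codecs.CodecInfo(encode, None, None, None)
        info.name = "unicode_ltgt"
        info.encode = encode
        return info
    return None

codecs.register(search)

if __name__ == "__main__":
    a = u"maçã"
    print a.encode("unicode_ltgt")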

(Just by importing this as a module, the codec "unicode_ltgt" is registered and becomes available to any ".encode" call, as in the given example.)
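
The sample above targets Python 2. On Python 3, str.encode() requires the encoder to return bytes, so a port might look like the sketch below. The names ltgt_encode and ltgt_search are made up for illustration, and the errors argument is accepted but ignored, as in the original.

```python
import codecs


def ltgt_encode(text, errors="strict"):
    # Build the <UXXXX> form, then turn it into ASCII bytes,
    # because str.encode() on Python 3 only accepts a bytes result.
    escaped = "".join("<U%04X>" % ord(ch) for ch in text)
    return escaped.encode("ascii"), len(text)


def ltgt_search(name):
    # Codec search function: answer only for our codec name.
    if name == "unicode_ltgt":
        return codecs.CodecInfo(encode=ltgt_encode, decode=None,
                                name="unicode_ltgt")
    return None


codecs.register(ltgt_search)

result = "ma\u00e7\u00e3".encode("unicode_ltgt")  # the word "maçã"
print(result)  # b'<U006D><U0061><U00E7><U00E3>'
```

No decoder is registered here, so this codec only works in the encode direction.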

Sam
answered Nov 2, 2010 at 14:50
