I have tried to convert an ASCII string to a pseudo-Unicode escaped string using Python, but have failed so far.
What I want to do: convert ASCII 'a' to the ASCII string "<U0061>".
I can convert "a" with unicode('a'), but I cannot save the numerical value of 'a' in an ASCII string.
How can I do that?
- What I want to do: convert ASCII 'a' to the ASCII string "U+0061". – kwinsch, Nov 2, 2010 at 9:05
- What if it's outside the BMP? – Ignacio Vazquez-Abrams, Nov 2, 2010 at 9:07
- Why is the BMP relevant? A code point is a code point. Doesn't Python have abstract characters? – tchrist, Nov 2, 2010 at 12:21
2 Answers
You can use ord() to convert a character to its character value (str) or code point (unicode). You can then use the appropriate string formatting to convert it into a text representation.
'U+%04X' % (ord(u'A'),)
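To produce either form asked about in the question for a whole string, the same ord() plus string-formatting idea can be applied to each character. This is only a minimal sketch, assuming Python 3 (where every string is Unicode and ord() returns the full code point, including for characters outside the BMP); the helper names to_angle_escapes and to_u_plus are hypothetical and not part of any library.

```python
def to_angle_escapes(text):
    """Hypothetical helper: write each code point as <UXXXX>."""
    # %04X pads to at least four hex digits; longer code points keep all digits.
    return "".join("<U%04X>" % ord(ch) for ch in text)


def to_u_plus(text):
    """Hypothetical helper: write each code point as U+XXXX, space separated."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


if __name__ == "__main__":
    print(to_angle_escapes("a"))
    print(to_u_plus("A"))
    # A code point outside the BMP, as raised in the comments.
    print(to_u_plus(chr(0x10EEBA)))
```

The join over a generator keeps the output pure ASCII no matter what characters the input contains.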
11 Comments
… But in Perl, `printf "U+%05X\n", ord(chr(0x10eebA))` correctly prints the expected U+10EEBA. What's wrong with Python's model of strings and/or characters? Something seems broken here.

Here is a minimalist sample that lets you use Ignacio's solution with Python's built-in encoding/decoding machinery. Check http://docs.python.org/library/codecs.html if you need something more robust (with proper error handling, etc.).
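# -*- coding: utf-8 -*-
# Python 2 code (print statement, u"" literals).
import codecs

def encode(text, errors="strict"):
    # A codec encoder returns (encoded output, number of input characters consumed).
    return ("".join("<U%04x>" % ord(char) for char in text), len(text))

def search(name):
    # Codec search function: return a CodecInfo for our name, None otherwise.
    if name == "unicode_ltgt":
        info = codecs.CodecInfo(encode, None, None, None)
        info.name = "unicode_ltgt"
        info.encode = encode
        return info
    return None

codecs.register(search)

if __name__ == "__main__":
    a = u"maçã"
    print a.encode("unicode_ltgt")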
(Just by importing this as a module, the codec "unicode_ltgt" is installed and becomes available to any ".encode" call, as in the example above.)
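For readers on Python 3, the same registration idea might look roughly like the sketch below. It rests on a few assumptions: Python 3's str.encode() only accepts codecs whose encoder returns bytes, so the escaped text is encoded to ASCII first; the decode function is a hypothetical addition that was not in the sample above; and the regular expression assumes escapes of four to six hex digits.

```python
import codecs
import re

CODEC_NAME = "unicode_ltgt"  # same codec name as in the sample above


def encode(text, errors="strict"):
    # Python 3's str.encode() requires the encoder to return bytes.
    escaped = "".join("<U%04x>" % ord(char) for char in text)
    return (escaped.encode("ascii"), len(text))


def decode(data, errors="strict"):
    # Hypothetical inverse: turn each <Uxxxx> escape back into its character.
    raw = bytes(data)
    text = re.sub(r"<U([0-9a-fA-F]{4,6})>",
                  lambda m: chr(int(m.group(1), 16)),
                  raw.decode("ascii"))
    return (text, len(raw))


def search(name):
    # Codec search function: return a CodecInfo for our name, None otherwise.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode, decode, name=CODEC_NAME)
    return None


codecs.register(search)

if __name__ == "__main__":
    print("maçã".encode(CODEC_NAME))
```

Like the sample above, this ignores the errors argument, so it is a starting point rather than a complete codec.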