If I have a unicode string such as:
s = u'c\r\x8f\x02\x00\x00\x02\u201d'
how can I convert this to just a regular string that isn't in unicode format; i.e. I want to extract:
f = '\x00\x00\x02\u201d'
and I do not want it in unicode format. I need this because I have to convert the last four characters of s to an integer value, but if I try it directly on s:
int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
Traceback (most recent call last):
File "<pyshell#48>", line 1, in <module>
int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
File "C:\Python27\lib\encodings\hex_codec.py", line 24, in hex_encode
output = binascii.b2a_hex(input)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 3: ordinal not in range(128)
yet if I do it with f:
int(f.encode('hex'), 16)
664608376369508L
And this is the correct integer value I want to extract from s. Is there a way to do this?
Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972
OK, so I think what's happening here is you're trying to read four bytes from a byte-oriented device, and decode that to an integer, interpreting the bytes as a 32-bit word in big-endian order.
To do this, use the struct module and byte strings:
>>> import struct
>>> struct.unpack('>i', '\x00\x00\x03\xCC')[0]
972
(I'm not sure why you were trying to reverse the string then hex-encode; that would put the bytes in the wrong order and give much too large output.)
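To make the byte-order point concrete, here is a small sketch that builds the big-endian value by hand and compares it with struct for both byte orders. The helper name be32 is made up for illustration, and the b'' literals are used so the snippet runs on Python 2.6+ as well as Python 3:

```python
import struct

def be32(data):
    """Combine bytes into an unsigned integer, most significant byte first."""
    value = 0
    for byte in bytearray(data):      # bytearray yields ints on Python 2 and 3
        value = (value << 8) | byte
    return value

reply = b'\x00\x00\x03\xcc'           # the sample reply quoted above

manual = be32(reply)                            # 0x03 * 256 + 0xcc
big_endian = struct.unpack('>I', reply)[0]      # same bytes, big-endian
little_endian = struct.unpack('<I', reply)[0]   # same bytes, wrong order: 0xcc030000
```

Here manual and big_endian agree on 972, while reading the same four bytes in little-endian order gives 0xcc030000, which is far too large.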
I don't know how you're reading from the device, but at some point you've decoded the bytes into a text (Unicode) string. Judging from the U+201D character in there I would guess that the device originally gave you a byte 0x94 and you decoded it using code page 1252 or another similar Windows default (‘ANSI’) code page.
>>> struct.unpack('>i', '\x00\x00\x02\x94')[0]
660
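The code page 1252 guess can be checked with a short sketch. The raw bytes below are the hypothesised device reply, not something confirmed by the question:

```python
import struct

raw = b'\x00\x00\x02\x94'            # hypothetical raw reply from the device
text = raw.decode('cp1252')          # the suspected accidental decode step

# In code page 1252, byte 0x94 is U+201D (right double quotation mark),
# so the decoded text should equal the tail of s from the question.
matches = (text == u'\x00\x00\x02\u201d')

value = struct.unpack('>i', raw)[0]  # the integer the device most likely meant
```

If the guess is right, matches is True and value is 660.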
It may be possible to reverse the incorrect decoding step by encoding back to bytes using the same mapping, but this is dicey and depends on which encodings are involved (not all bytes are mapped to anything usable in all encodings). Better would be to look at where the input is coming from, find where that decode step is happening, and get rid of it so you keep hold of the raw bytes the device sent you.
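As a stopgap only, reversing the decode might look like the sketch below. It assumes the text really was decoded as code page 1252. The function name recover_bytes and the fallback rule are my own assumptions: Python's cp1252 codec has no mapping for a few code points such as U+008F (which appears in s), so characters below U+0100 that fail to encode are passed through as the byte with the same value.

```python
import struct

def recover_bytes(text, encoding='cp1252'):
    """Best-effort reversal of an accidental decode (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        try:
            out.extend(ch.encode(encoding))
        except UnicodeEncodeError:
            # Assumption: an unmapped character below U+0100 was a raw byte
            # that passed through the original decode unchanged.
            if ord(ch) > 0xFF:
                raise
            out.append(ord(ch))
    return bytes(out)

s = u'c\r\x8f\x02\x00\x00\x02\u201d'   # the string from the question
raw = recover_bytes(s)
value = struct.unpack('>i', raw[-4:])[0]  # last four bytes as a big-endian int
```

With the string from the question, raw[-4:] comes out as the bytes 00 00 02 94 and value as 660, matching the guess above. That only holds if code page 1252 really was the encoding used, which is why removing the decode step at the source is the safer fix.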
[...] \u201d in there then by definition you want a Unicode string. You should review your requirements and probably update your question with an unambiguous problem statement.
[...] c\r\x8f\x02? Also, s is not UTF-8, and \u201d in a bytestring literal produces an actual backslash and the characters u201d, so if you really want that result (and 664608376369508L would seem to indicate you do), you've got a really weird conversion in mind. Maybe you messed up your data somewhere upstream, and you should fix it there.
[...] \u201d character is. This protocol talks to a device that sends back s. In s, only what's listed in f contains data. I need to decode f into an integer. (The 664608376369508L I listed is not correct.) Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972, but when I receive something like: \u201d or similar, I don't know how to handle it.