Okay. So I have a library that's giving me a value like this:
>>> x
'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> type(x)
str
>>> print(x)
ADC (10^-6 mm?/s):Sep 05 2017 11-58-19 CDT
It's not ASCII, and it doesn't appear to be UTF-8 either:
>>> x.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb2 in position 13: invalid start byte
and I can't just convert it:
>>> y = unicode(x)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 13: ordinal not in range(128)
But I can do this, with straight-up copy and paste:
>>> y = u'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> type(y)
unicode
>>> y.encode('utf-8')
'ADC (10^-6 mm\xc2\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> print(y)
ADC (10^-6 mm²/s):Sep 05 2017 11-58-19 CDT
I want to turn x into type unicode. Assigning the value as a literal works for some reason. Is there some way to use the same rules for literal assignment to decode my x?
Sorry. I know I'm missing something super basic here.
1 Answer
It looks like the library is giving you strings in the latin-1 encoding (or possibly code page 1252). This is annoying, isn't it... you have to guess what the correct encoding is! (This is one of the motivating factors for Python 3.)
y = x.decode('latin-1')
Note that in latin-1, '\xb2' becomes u'\xb2' when decoded. This is true for all latin-1 characters, since the bottom 256 code points for Unicode are the same as latin-1.
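As a rough sketch of the point above (not the library's own code), the snippet below decodes a byte string like the one in the question and checks the byte-to-code-point claim. The value of x is copied from the question and written as a bytes literal so the same lines should run on Python 2.6+ and Python 3.3+. The cp1252 comparison is only there to show where the two candidate encodings would disagree.

```python
# Byte string as returned by the library (value copied from the question).
x = b'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'

# Decoding as latin-1 maps each byte to the code point with the same number.
y = x.decode('latin-1')
assert y == u'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'

# Re-encoding as UTF-8 gives the two-byte sequence seen in the question.
assert y.encode('utf-8') == b'ADC (10^-6 mm\xc2\xb2/s):Sep 05 2017 11-58-19 CDT'

# latin-1 decoding never fails: every byte value 0..255 decodes to the
# code point with the same number.
for i in range(256):
    assert ord(bytes(bytearray([i])).decode('latin-1')) == i

# cp1252 agrees with latin-1 for 0xb2 but differs in the 0x80-0x9f range.
assert b'\xb2'.decode('cp1252') == u'\xb2'
assert b'\x80'.decode('cp1252') == u'\u20ac'   # euro sign
assert b'\x80'.decode('latin-1') == u'\x80'    # C1 control character
```

If the data ever contains bytes in the 0x80-0x9f range (curly quotes, the euro sign), that is a hint the source is really cp1252 rather than latin-1.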
5 Comments
Why does x = u'\xb2' work? sys.getdefaultencoding() is ascii, sys.stdin.encoding is utf-8. Not a latin-1 or 8859 in there, so why does it try that encoding?
u'\xb2' is a Unicode string; it is not encoded (well, technically it is, but the encoding is a technical detail hidden in the implementation of the unicode class). It's the same as u'\u00b2', or unichr(0xb2), or however you want to specify "a Unicode string containing the character U+00B2".
x.decode('latin1'). (See PEP-263).