Okay. So I have a library that's giving me a value like this:

>>> x
'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> type(x)
str
>>> print(x)
ADC (10^-6 mm?/s):Sep 05 2017 11-58-19 CDT

It's not ascii, and it doesn't appear to be UTF-8 either:

>>> x.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb2 in position 13: invalid start byte

and I can't just convert it:

>>> y = unicode(x)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 13: ordinal not in range(128)

But I can do this, with straight-up copy and paste:

>>> y = u'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> type(y)
unicode
>>> y.encode('utf-8')
'ADC (10^-6 mm\xc2\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> print(y)
ADC (10^-6 mm²/s):Sep 05 2017 11-58-19 CDT

I want to turn x into type unicode. Assigning the value as a literal works for some reason. Is there some way to use the same rules for literal assignment to decode my x?

Sorry. I know I'm missing something super basic here.

asked Sep 6, 2017 at 16:15
  • x.decode('latin1'). (See PEP-263). Commented Sep 6, 2017 at 16:20

1 Answer

It looks like the library is giving you strings in the latin-1 encoding (or possibly code page 1252). This is annoying, isn't it... you have to guess what the correct encoding is! (This is one of the motivating factors for Python 3.)

y = x.decode('latin-1')

Note that in latin-1, '\xb2' becomes u'\xb2' when decoded. The same holds for every byte value, since the first 256 Unicode code points (U+0000 through U+00FF) are identical to latin-1.
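To illustrate the byte-to-code-point mapping, here is a minimal sketch. It uses a b'...' bytes literal so that it runs on Python 2.6+ and Python 3 alike (on Python 2, b'...' is the same str type as x in the question). The shortened value and the variable names are only for illustration.

```python
# A shortened stand-in for the value the library returns (raw bytes).
raw = b'ADC (10^-6 mm\xb2/s)'

# latin-1 maps every byte 0x00-0xFF to the code point with the same
# number, so this decode cannot raise UnicodeDecodeError.
text = raw.decode('latin-1')

# bytearray yields integers on both Python 2 and Python 3, which makes
# the byte-by-byte comparison portable.
same = all(ord(ch) == b for ch, b in zip(text, bytearray(raw)))

# Re-encoding as UTF-8 turns U+00B2 into the two bytes C2 B2.
utf8 = text.encode('utf-8')

# Code page 1252 differs from latin-1 only for bytes 0x80-0x9F,
# so 0xB2 decodes to the same character under either guess.
same_in_cp1252 = raw.decode('cp1252') == text
```

If the library ever emits bytes in the 0x80-0x9F range (curly quotes, the euro sign), the two guesses diverge, and cp1252 is probably the better one to try.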

answered Sep 6, 2017 at 16:25

5 Comments

Thanks! For some reason, I thought latin-1 and ascii were identical encodings.
Thinking about this more: why does x = u'\xb2' work? sys.getdefaultencoding() is ascii, sys.stdin.encoding is utf-8. Not a latin-1 or 8859 in there so why does it try that encoding?
@Nate: Encoding doesn't matter here. u'\xb2' is a Unicode string, it is not encoded (well, technically it is, but the encoding is a technical detail hidden in the implementation of the unicode class). It's the same as u'\u00b2', or unichr(0xb2), or however you want to specify "a Unicode string containing the character U+00B2".
Or what I should say is that it appears as ASCII in the source code, so the encoding of the source code doesn't matter.
Aaaaaahhhhhhhh that makes sense. I would have thought I needed to actually say \u00b2 if I wanted a unicode char, and that \xb2 would be, y'know, a byte value that required an encoding. Thanks!
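The point made in these comments can be checked directly. This is a small sketch, not part of the original answer: inside a unicode literal, \xb2 and \u00b2 both name the code point U+00B2, while inside a bytes literal \xb2 is a single raw byte that still needs an encoding. The variable names are illustrative, and the code runs on Python 2.6+ and Python 3.3+.

```python
# In a unicode literal, \x escapes name code points, not bytes.
text_x = u'\xb2'
text_u = u'\u00b2'
same_character = (text_x == text_u)

# In a bytes literal, \xb2 is one raw byte with no character meaning yet.
raw = b'\xb2'

# The unicode string holds one character; how many bytes it occupies
# depends on the encoding chosen when it is written out.
utf8_bytes = text_x.encode('utf-8')
latin1_bytes = text_x.encode('latin-1')

# Decoding the raw byte as latin-1 gives back that same character.
round_trip = (raw.decode('latin-1') == text_x)
```

Because the escape sequence is spelled with plain ASCII characters in the source file, the source encoding never enters into it.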
