What encoding does Python 2.7's unicode literal expect?

Question 1

Okay. So I have a library that's giving me a value like this:

>>> x
'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> type(x)
str
>>> print(x)
ADC (10^-6 mm?/s):Sep 05 2017 11-58-19 CDT

It's not ASCII, and it doesn't appear to be UTF-8 either:

>>> x.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb2 in position 13: invalid start byte

and I can't just convert it:

>>> y = unicode(x)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 13: ordinal not in range(128)

But I can do this, with straight-up copy and paste:

>>> y = u'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> type(y)
unicode
>>> y.encode('utf-8')
'ADC (10^-6 mm\xc2\xb2/s):Sep 05 2017 11-58-19 CDT'
>>> print(y)
ADC (10^-6 mm²/s):Sep 05 2017 11-58-19 CDT

I want to turn x into type unicode. Assigning the value as a literal works for some reason. Is there some way to decode my x using the same rules that the literal assignment uses?

Sorry. I know I'm missing something super basic here.

Question 2

Use x.decode('latin1'). (See PEP 263.)

Question 3

It looks like the library is giving you strings in the latin-1 encoding (or possibly code page 1252). This is annoying, isn't it... you have to guess what the correct encoding is! (This is one of the motivating factors for Python 3.)

y = x.decode('latin-1')

Note that in latin-1, '\xb2' becomes u'\xb2' when decoded. This holds for every latin-1 byte, since the first 256 Unicode code points (U+0000 through U+00FF) are the same as latin-1.
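To make the point above concrete, here is a minimal sketch, assuming the library's bytes really are latin-1. It is written to run on both Python 2.7 and Python 3, so the byte string uses a b'' prefix; the names raw, text, and code_points are only illustrative. It also shows where code page 1252 would give a different result, which is why the encoding still has to be guessed.

```python
# -*- coding: utf-8 -*-
# The byte string from the question (a plain str in Python 2, bytes in Python 3).
raw = b'ADC (10^-6 mm\xb2/s):Sep 05 2017 11-58-19 CDT'

# UTF-8 rejects a lone 0xb2 byte, as the question's traceback shows.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 never fails: each byte value N becomes code point U+00NN.
text = raw.decode('latin-1')

# Check that mapping for all 256 possible byte values.
all_bytes = bytes(bytearray(range(256)))
code_points = [ord(ch) for ch in all_bytes.decode('latin-1')]

# cp1252 agrees with latin-1 for 0xb2 but not for bytes such as 0x80.
same_for_b2 = b'\xb2'.decode('cp1252') == b'\xb2'.decode('latin-1')
differs_for_80 = b'\x80'.decode('cp1252') != b'\x80'.decode('latin-1')

# Once decoded, the text can be re-encoded to any target encoding.
utf8_bytes = text.encode('utf-8')
```

If the data could contain bytes in the 0x80 to 0x9f range, it is worth checking a sample against both codecs before settling on one.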

Question 4

Thanks! For some reason, I thought latin-1 and ascii were identical encodings.

Question 5

Thinking about this more: why does x = u'\xb2' work? sys.getdefaultencoding() is ascii, sys.stdin.encoding is utf-8. Not a latin-1 or 8859 in there so why does it try that encoding?

Question 6

@Nate: Encoding doesn't matter here. u'\xb2' is a Unicode string; it is not encoded (well, technically it is, but the encoding is a technical detail hidden in the implementation of the unicode class). It's the same as u'\u00b2', or unichr(0xb2), or however you want to specify "a Unicode string containing the character U+00B2".

Question 7

Or what I should say is that it appears as ASCII in the source code, so the encoding of the source code doesn't matter.
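A short sketch of that distinction, written to run on both Python 2.7 and Python 3 (the unichr fallback and the variable names are assumptions added for illustration). Inside a unicode literal, \xb2 names a code point; inside a byte literal, the same escape names a raw byte that still needs a codec.

```python
# -*- coding: utf-8 -*-
try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3 merged unichr into chr

# Three spellings of the same one-character Unicode string, U+00B2.
from_hex_escape = u'\xb2'
from_u_escape = u'\u00b2'
from_code_point = unichr(0xb2)
all_equal = from_hex_escape == from_u_escape == from_code_point

# In a byte literal the escape is a single byte, not a character.
byte_literal = b'\xb2'

# Decoding that byte as latin-1 happens to yield the same code point.
decoded = byte_literal.decode('latin-1')

# Encoding the character as UTF-8 takes two bytes.
utf8_form = from_hex_escape.encode('utf-8')
```

The escapes themselves are plain ASCII in the source file, so the result does not depend on the source encoding declared under PEP 263.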

Question 8

Aaaaaahhhhhhhh that makes sense. I would have thought I needed to actually say \u00b2 if I wanted a unicode char, and that \xb2 would be, y'know, a byte value that required an encoding. Thanks!

Dietrich Epp · Accepted Answer · 2017-09-06 16:25:28Z


The five comments on the accepted answer were posted, in order, by:

Nate · 2017-09-06 16:38:24Z
Nate · 2017-09-06 21:08:42Z
Dietrich Epp · 2017-09-06 21:23:44Z
Dietrich Epp · 2017-09-06 21:24:06Z
Nate · 2017-09-07 14:51:42Z
