I have some external data I need to import. How do I encode the input string as unicode/utf8?
Here is an example of a problematic line:
>>> 'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 5: ordinal not in range(128)
The answer is given, but I suggest you invest some time learning about Unicode; you won't regret it :) nedbatchelder.com/text/unipain.html – Niclas Nilsson, Nov 26, 2012 at 8:19
3 Answers
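.encode("utf8") expects the source to be a unicode string. You are calling it on a "regular" (byte) string, so Python 2 first tries to decode it implicitly with the default "ascii" codec. That implicit step fails on the non-ASCII bytes, which is why encode raises a UnicodeDecodeError. You should do something like: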
original_string.decode('original_encoding').encode('utf-8')
In your case my guess would be:
'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.decode("iso8859-1").encode("utf8")
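Spelled out step by step, the same conversion might look like the sketch below. It assumes the guess above is right and the source really is ISO 8859-1. The names raw, text and utf8_bytes are only for illustration, and the b'' prefix is used so the snippet behaves the same on Python 2.6+ and Python 3.

```python
# Raw bytes as they arrived from the external source.
# Assumption: they are encoded as ISO 8859-1 (Latin-1).
raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'

# Step 1: decode the bytes into a unicode string, naming the source encoding.
text = raw.decode('iso8859-1')

# Step 2: encode the unicode string into UTF-8 bytes.
utf8_bytes = text.encode('utf-8')

# Each accented character is one byte in Latin-1 but two bytes in UTF-8,
# so the UTF-8 version is three bytes longer for the three accented letters.
extra_bytes = len(utf8_bytes) - len(raw)
```

Doing it in two named steps makes it easier to see which half fails if the guessed encoding turns out to be wrong.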
To convert bytes to a Unicode string use decode instead of encode.
Also that is not UTF-8. I guess it's Latin-1:
>>> print 'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.decode("latin1")
Compañía Dominicana de Teléfonos, C. por A. - CODETEL
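One rough way to check the "not UTF-8" claim is to try a strict UTF-8 decode and see whether it raises. This is only a sketch: the variable names are made up, and the Latin-1 fallback is still a guess. Latin-1 maps every possible byte to some character, so that decode never fails, and the only real confirmation is that the result reads correctly.

```python
raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'

# A strict UTF-8 decode rejects these bytes: 0xf1 starts a multi-byte
# sequence, but the byte after it (0xed) is not a valid continuation byte.
try:
    raw.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Latin-1 decoding always succeeds, so inspect the text to confirm the guess.
text = raw.decode('latin1')
```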
encode converts from a unicode string to a sequence of bytes. decode converts from a sequence of bytes to a unicode string. You want decode, because your data are already encoded.
More generally, if you're reading a string from an external source, you always want to decode, because there's no such thing as a "unicode string" out there in the world. There are only representations of that unicode string in various encodings. Unicode strings are like a Platonic ideal that can only be transmitted through the corporeal medium of encodings.
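If the external data comes from a file, one way to apply this is to let io.open do the decoding on the way in and the encoding on the way out. The sketch below is self-contained, so it first writes a throwaway temporary file to stand in for the external source. The Latin-1 encoding and the sample content are assumptions, not something known about your data.

```python
import io
import os
import tempfile

# Stand-in for the external source: a temp file holding Latin-1 bytes.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'Tel\xe9fonos')

    # Decode at the boundary: io.open returns unicode text when an
    # encoding is given (assumed here to be Latin-1).
    with io.open(path, encoding='latin-1') as f:
        text = f.read()

    # Encode at the boundary on the way out: write the same text as UTF-8.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Read the raw bytes back to see what actually landed on disk.
    with open(path, 'rb') as f:
        stored = f.read()
finally:
    os.remove(path)
```

Inside the program you then only ever handle unicode strings, and bytes appear only where data enters or leaves.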