UTF8 encoding error

Asked 13 years, 1 month ago

Viewed 3k times

I have some external data I need to import. How do I encode the input string as unicode/utf8?

Here is an example of a problematic line:

>>> 'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 5: ordinal not in range(128)

edited Nov 26, 2012 at 8:40 by dda

asked Nov 26, 2012 at 8:11 by Tzury Bar Yochay

The answer is given, but I suggest you invest some time to learn about Unicode. You won't regret it :) nedbatchelder.com/text/unipain.html

– Niclas Nilsson, commented Nov 26, 2012 at 8:19

3 Answers

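.encode("utf8") expects the source to be a unicode string. You are calling it on a "regular" (byte) string, so Python 2 first tries to decode it to unicode with the default "ascii" codec, and that implicit decode fails on the non-ASCII byte 0xf1. You should do something like: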

original_string.decode('original_encoding').encode('utf-8')

In your case my guess would be:

'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.decode("iso8859-1").encode("utf8")
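As a minimal sketch of the same two steps, here is the conversion split into a decode and an encode. It is written so it should run on Python 3, where the `b` prefix marks a byte string (in the Python 2 of the question a plain literal is already bytes). The variable names are illustrative, and ISO 8859-1 is still only a guess at the original encoding.

```python
# Bytes as they arrive from the external source (encoding assumed to be ISO 8859-1).
raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'

# Step 1: decode the bytes into a unicode string using the source encoding.
text = raw.decode('iso8859-1')

# Step 2: encode the unicode string into the target encoding.
utf8_bytes = text.encode('utf-8')

print(repr(text))
print(repr(utf8_bytes))
```

Each accented character is one byte in the input and two bytes in the UTF-8 output, so `utf8_bytes` is three bytes longer than `raw`.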

edited Nov 26, 2012 at 8:41 by dda

answered Nov 26, 2012 at 8:15 by abbot

To convert bytes to a Unicode string use decode instead of encode.

Also that is not UTF-8. I guess it's Latin-1:

>>> print 'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.decode("latin1")
Compañía Dominicana de Teléfonos, C. por A. - CODETEL
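One way to check the "not UTF-8" claim is to try UTF-8 first and fall back only when it fails. The sketch below assumes Python 3 syntax, and `decode_with_fallback` is a hypothetical helper, not a standard library function. Note that Latin-1 maps every possible byte to a character, so it never fails; a successful Latin-1 decode does not prove the data really is Latin-1.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError('none of the candidate encodings matched')


raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'
text, used = decode_with_fallback(raw)
print(used)
print(text)
```

Here the UTF-8 attempt is rejected because 0xf1 announces a multi-byte sequence and the following byte 0xed is not a valid continuation byte, so the helper falls through to Latin-1.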

edited Nov 26, 2012 at 8:42 by dda

answered Nov 26, 2012 at 8:15 by Mark Byers

encode converts from a unicode string to a sequence of bytes. decode converts from a sequence of bytes to a unicode string. You want decode, because your data are already encoded.

More generally, if you're reading a string from an external source, you always want to decode, because there's no such thing as a "unicode string" out there in the world. There are only representations of that unicode string in various encodings. Unicode strings are like a Platonic ideal that can only be transmitted through the corporeal medium of encodings.
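In practice this usually means decoding once where the data enters the program and encoding once where it leaves. The sketch below illustrates that with a temporary file standing in for the external source. The file, its Latin-1 encoding, and the variable names are all assumptions made for the example, and it is written for Python 3 (`io.open` with an `encoding` argument also exists in Python 2.6+).

```python
import io
import os
import tempfile

raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL\n'

# Simulate the external data: a file whose bytes are (assumed) Latin-1.
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as handle:
    handle.write(raw)
    source_path = handle.name

# Decode on the way in: io.open turns the bytes into unicode strings.
with io.open(source_path, encoding='latin-1') as source:
    lines = [line.rstrip(u'\n') for line in source]

# Encode on the way out: the same text is written back as UTF-8.
target_path = source_path + '.utf8'
with io.open(target_path, 'w', encoding='utf-8') as target:
    for line in lines:
        target.write(line + u'\n')

# Read the result back as raw bytes to inspect the new encoding.
with open(target_path, 'rb') as check:
    utf8_bytes = check.read()

os.remove(source_path)
os.remove(target_path)
```

Between the two `io.open` calls the program only ever handles unicode strings, which is the point of the answer: encodings exist at the edges, not in the middle.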

answered Nov 26, 2012 at 8:15 by BrenBarn
