This seems to be a common question among international developers, but I haven't found a straight answer yet. I'm getting the following string from a feed: "Carlos e Carlos mostram o que há de melhor na internet"
The following error is returned to the console: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 31-33: invalid data
thanks in advance,
fbr
We're unable to see the code you're using, so it's really hard to give a "straight" answer. Also, it's hard to know where you find this "string" and what encoding it uses when you found it. Without any code or any data, there can't be a straight answer. – S.Lott, Feb 15, 2011 at 20:12
1 Answer
You can't just decode using an arbitrary encoding, even if it is UTF-8; you must decode using the encoding declared in the HTTP headers (the charset parameter of Content-Type) or an equivalent declaration within the document (such as the META element of HTML).
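As a rough sketch of that idea, the following reads the charset parameter from a Content-Type header value and decodes the body with it. It is written in Python 3 syntax, and the helper names `charset_from_content_type` and `decode_body`, as well as the sample header and bytes, are made up for illustration; how you obtain the header depends on the library that fetches the feed.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lowercase, or the
    # supplied default when the header carries no charset parameter.
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode raw_bytes with the encoding the server declared (hypothetical helper)."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        raise ValueError("no charset declared in Content-Type")
    return raw_bytes.decode(charset)


# Assumed example: the feed was served as Latin-1, so "á" is the single byte 0xE1.
raw = b"Carlos e Carlos mostram o que h\xe1 de melhor na internet"
text = decode_body(raw, "text/xml; charset=ISO-8859-1")
print(text)
```

Raising when no charset is declared is only one possible choice; a caller could instead look for a declaration inside the document before giving up.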
If the encoding isn't available or is incorrect, then you should specify in the decode operation what will happen on an invalid byte sequence; usually 'replace' suffices for this. Here the text is encoded as Latin-1 but decoded as UTF-8, which reproduces the problem:
>>> print u'Carlos e Carlos mostram o que há de melhor na internet'.encode('latin1').decode('utf-8', 'replace')
Carlos e Carlos mostram o que h�e melhor na internet
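The two pieces of advice can be combined: try the declared encoding first, and only fall back to a lossy decode when it is missing or wrong. This is a minimal sketch in Python 3 syntax; `decode_with_fallback` and its parameters are hypothetical names, and UTF-8 as the fallback is an assumption, not something the feed guarantees.

```python
def decode_with_fallback(raw_bytes, declared=None, fallback="utf-8"):
    """Decode with the declared encoding; otherwise decode lossily with fallback."""
    if declared:
        try:
            return raw_bytes.decode(declared)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in the declared encoding.
            # LookupError: Python does not know the declared encoding name.
            pass
    # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
    return raw_bytes.decode(fallback, "replace")


raw = b"o que h\xe1 de melhor"  # Latin-1 bytes, as in the example above

good = decode_with_fallback(raw, declared="latin-1")
lossy = decode_with_fallback(raw)  # no declaration: UTF-8 with replacement
print(good)
print(lossy)
```

The fallback path never raises, but it loses information: the replacement character cannot be turned back into the original letter, so it is better suited to display than to storage.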