Python encoding UnicodeDecodeError

Question 1

I have been plugging away at this for hours and I just can't seem to get to the bottom of it. I have been through this website in detail, and although others seem to have a similar problem, the solutions given just don't work for me.

I have a Python script which reads the HTML of a website and uses Beautiful Soup to find things like the head, body, H1s, etc., and then stores them in a utf-8 MySQL table.

Seems straightforward, but I keep running into:

UnicodeDecodeError: 'ascii' codec can't decode byte xxxxxx

This happens when I encode. I have tried everything I can find to stop it, but to no avail. Here is one version of the code:

soup = BeautifulSoup(strIndexPage)
strIndexPageBody = str(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore') # I know ignore is not best practice but I am really not interested in anything outside the ascii character set
strIndexPageBody = strIndexPageBody.replace('"','&quot;')
strIndexPageBody = strIndexPageBody.replace("'","&rsquo")

An earlier version where I tried to convert to utf-8 works better, but I end up with the

character present in some of the HTML, which breaks the MySQL insert/update. Obviously I have tried searching for this character and replacing it, but then Python tells me I have a non-ASCII character in my code!

I have read tons of articles that say I should look at the encoding of the HTML first, decode, and then encode to suit, but the encoding does not always come back from BS, and/or is not declared within the HTML.

I am sure there is a simple way around this but I can't find it.

Thanks for any help.

Question 2

Shouldn't &rsquo end in a semicolon? Also it's not the same as '.

Question 3

Please stop focusing on the last two lines - they are not where the error is. It errors on the encoding as the error message suggests.

Question 4

When Python complains about a non-ascii character in your code, it probably means you need to add a # coding: utf-8 magic comment at the top (it needs to be one of the first two lines). That's assuming you're saving the Python file in UTF-8.
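A minimal sketch of the two usual ways around that complaint, assuming the file is saved as UTF-8: the declaration comment itself, and escape sequences that keep the source pure ASCII so the declaration is not even needed. The right single quote (U+2019) is only a stand-in here, since the actual offending character is not shown in the question.

```python
# -*- coding: utf-8 -*-
# The declaration above only takes effect on line 1 or 2 of a file (PEP 263).
# Python 3 already assumes UTF-8 source, so it matters mainly for Python 2.

# Hypothetical sample text; U+2019 stands in for the unknown character.
text = u"it\u2019s"

# Writing the character as an escape keeps the source file ASCII-only.
cleaned = text.replace(u"\u2019", u"'")
```

With the escape form, the search-and-replace works whether or not the editor saved the file as UTF-8.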

Question 5

Very similar to stackoverflow.com/questions/5236437/…

Question 6

Interesting - will give it a shot tomorrow - thanks for your input.

Question 7

Note that you're getting a decode error from a call to encode. This is the ugliest part of Python 2: it lets you try to encode a string that is already encoded, by first decoding it as ascii. What you're doing is equivalent to this:

s.decode('ascii', 'strict').encode('ascii', 'ignore')

I think this should do what you expect:

soup = BeautifulSoup(strIndexPage)
strIndexPageBody = unicode(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore')

Note that we're calling unicode, so we get a unicode string that we can validly try to encode.
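The hidden decode step can be spelled out explicitly. This is a sketch written for Python 3, where byte strings have no encode method, so the implicit Python 2 behaviour has to be emulated by hand; the sample bytes and the helper name `py2_style_encode` are made up for illustration.

```python
# Hypothetical sample: UTF-8 bytes, like the result of str(soup.body) in Python 2.
raw = "caf\u00e9 \u2019quoted\u2019".encode("utf-8")

def py2_style_encode(byte_string):
    """Emulate Python 2's str.encode('ascii', 'ignore'):
    an implicit strict ASCII decode, followed by the requested encode."""
    return byte_string.decode("ascii", "strict").encode("ascii", "ignore")

try:
    py2_style_encode(raw)
    error_name = None
except UnicodeDecodeError as exc:
    # The failure comes from the implicit decode, not from the encode.
    error_name = type(exc).__name__

# Decoding with the real encoding first gives text that can be encoded safely.
text = raw.decode("utf-8")
ascii_only = text.encode("ascii", "ignore")
```

The 'ignore' handler only applies to the encode step, which is why it cannot suppress the error raised by the decode before it.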

Question 8

If soup.body is a str object which encodes non-ascii characters, then passing it to unicode will give a UnicodeDecodeError; on the other hand, if it's already a unicode object, then passing it to unicode is redundant.

Question 9

@ekhumoro: It's a BeautifulSoup.Tag object, which can be flattened with either str or unicode.

Question 10

Yeah, sorry - I probably should have guessed that :/

Question 11

@Thomas K: Thank you very much for your help - Your explanation is succinct and I now understand where I was going wrong.

Question 12

BeautifulSoup's UnicodeDammit should be able to detect the encoding of a document even when it isn't specified.

What happens when you run this on the page in question?:

from BeautifulSoup import UnicodeDammit
UnicodeDammit(html_string).unicode

What specific line of code is throwing the error and can we have a sample of problematic HTML?
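For pages with no declared charset, the general idea of guessing can be sketched with the standard library alone. This is not UnicodeDammit's actual algorithm, just a rough stand-in that tries candidate encodings in order; the function name, the candidate list, and the sample bytes are all assumptions.

```python
def decode_with_fallbacks(raw, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"

# Hypothetical page bytes saved as windows-1252 with no charset declared.
sample = "na\u00efve".encode("windows-1252")
text, used = decode_with_fallbacks(sample)
```

Trying UTF-8 first is a common choice because invalid UTF-8 tends to fail loudly, whereas single-byte encodings accept almost anything and can silently produce the wrong characters.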

Question 13

I skimmed over that earlier - I will give it a try and report back - thanks for your help.

Question 14

The thing is, UnicodeDammit is used by default when parsing a page with BeautifulSoup, so you shouldn't have to do anything special.

Question 15

I see - BS does not error - the error occurs when I try to encode it.

Question 16

If you're encoding unicode to ascii, and you're setting it to ignore characters that can't be encoded, it shouldn't be raising UnicodeDecodeError exceptions. What is the line of code that raises the exception and what is the object being encoded?
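A small sketch of that point, in Python 3 syntax with a made-up sample string: encoding real text to ASCII with a non-strict error handler does not raise, and the choice of handler decides what happens to the characters that don't fit.

```python
# Hypothetical sample text containing one non-ASCII character (U+00E9).
text = "caf\u00e9"

ignored = text.encode("ascii", "ignore")                 # drops the character
replaced = text.encode("ascii", "replace")               # substitutes "?"
as_entities = text.encode("ascii", "xmlcharrefreplace")  # numeric HTML entity
```

For HTML destined for a database, 'xmlcharrefreplace' may be worth considering, since it keeps the character recoverable while the stored bytes stay ASCII.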

Thomas K · Accepted Answer · 2011-11-10 23:50:04Z

6

Note that you're getting a decode error from a call to encode. This is the ugliest part of Python 2: it lets you try to encode a string that is already encoded, by first decoding it as ascii. What you're doing is equivalent to this:

s.decode('ascii', 'strict').encode('ascii', 'ignore')

I think this should do what you expect:

soup = BeautifulSoup(strIndexPage)
strIndexPageBody = unicode(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore')

Note that we're calling unicode, so we get a unicode string that we can validly try to encode.


4 Comments

ekhumoro · 2011-11-11T00:19:38.623Z

If soup.body is a str object which encodes non-ascii characters, then passing it to unicode will give a UnicodeDecodeError; on the other hand, if it's already a unicode object, then passing it to unicode is redundant.

Thomas K · 2011-11-11T00:24:04.767Z

@ekhumoro: It's a BeautifulSoup.Tag object, which can be flattened with either str or unicode.

ekhumoro · 2011-11-11T00:55:09.827Z

Yeah, sorry - I probably should have guessed that :/

dan360 · 2011-11-11T11:47:29.667Z

@Thomas K: Thank you very much for your help - Your explanation is succinct and I now understand where I was going wrong.
