I have been plugging away at this for hours and I just can't seem to get to the bottom of it. I have been through this website in detail, and although others seem to have had a similar problem, the solutions given just don't work for me.
I have a Python script which reads the HTML of a website and uses Beautiful Soup to find things like the head, body, H1s, etc., and then stores them in a UTF-8 MySQL table.
Seems straightforward, but I keep running into this when I encode:
UnicodeDecodeError: 'ascii' codec can't decode byte xxxxxx
I have tried everything I can find to stop this happening, but to no avail. Here is one version of the code:
soup = BeautifulSoup(strIndexPage)
strIndexPageBody = str(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore') # I know ignore is not best practice but I am really not interested in anything outside the ascii character set
strIndexPageBody = strIndexPageBody.replace('"','&quot;')
strIndexPageBody = strIndexPageBody.replace("'","&rsquo")
An earlier version where I tried to convert to UTF-8 works better, but I end up with the ` character present in some of the HTML, which breaks the MySQL insert/update. Obviously I have tried searching for this character and replacing it, but then Python tells me I have a non-ASCII character in my code!
I have read tons of articles that say I should look at the encoding of the HTML first, decode, and then encode to suit, but the encoding does not always come back from BS, and/or is not declared within the HTML.
I am sure there is a simple way around this but I can't find it.
Thanks for any help.
2 Answers
Note that you're getting a decode error from a call to encode. This is the ugliest part of Python 2: it lets you try to encode a string that is already encoded, by first decoding it as ascii. What you're doing is equivalent to this:
s.decode('ascii', 'strict').encode('ascii', 'ignore')
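To see why a decode error comes out of an encode call, here is a minimal sketch that spells out the hidden step by hand. The sample bytes and the helper name implicit_encode are made up for illustration; written this way it behaves the same on Python 2 and Python 3.

```python
# Sample byte string containing non-ASCII bytes (an assumption for the demo).
raw = u'caf\u00e9'.encode('utf-8')


def implicit_encode(byte_string):
    """Hypothetical helper: the two steps Python 2 performs on str.encode()."""
    # Hidden step: the byte string is first decoded as strict ASCII.
    as_text = byte_string.decode('ascii', 'strict')
    # Requested step: only reached if the decode above succeeded.
    return as_text.encode('ascii', 'ignore')


try:
    implicit_encode(raw)
    error_name = None
except UnicodeDecodeError as exc:
    # The 'ignore' handler never gets a chance to run.
    error_name = type(exc).__name__

print(error_name)
```

The 'ignore' argument only applies to the encode step, so it cannot suppress a failure in the decode that runs before it.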
I think this should do what you expect:
soup = BeautifulSoup(strIndexPage)
strIndexPageBody = unicode(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore')
Note that we're calling unicode, so we get a unicode string that we can validly try to encode.
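The same idea, decode first and only then encode, can be sketched without BeautifulSoup at all. The function name to_ascii_bytes and the default of UTF-8 for the source encoding are assumptions for the example, not something from the question.

```python
def to_ascii_bytes(data, source_encoding='utf-8'):
    """Hypothetical helper: return ASCII-only bytes, dropping everything else."""
    if isinstance(data, bytes):
        # Byte input: turn it into text first. Undecodable bytes become
        # U+FFFD, which the encode step below then drops.
        data = data.decode(source_encoding, 'replace')
    # Text input: safe to encode, silently discarding non-ASCII characters.
    return data.encode('ascii', 'ignore')


print(to_ascii_bytes(u'caf\u00e9 \u2019ok\u2019'))
```

If the real encoding of the page is known, passing it as source_encoding would lose fewer characters than relying on the 'replace' fallback.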
4 Comments
If soup.body is a str object which encodes non-ascii characters, then passing it to unicode will give a UnicodeDecodeError; on the other hand, if it's already a unicode object, then passing it to unicode is redundant.
soup.body is a BeautifulSoup.Tag object, which can be flattened with either str or unicode.
BeautifulSoup's UnicodeDammit should be able to detect the encoding of a document even when it isn't specified.
What happens when you run this on the page in question?:
from BeautifulSoup import UnicodeDammit
UnicodeDammit(html_string).unicode
What specific line of code is throwing the error and can we have a sample of problematic HTML?
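For a rough feel of what encoding detection involves when nothing is declared, here is a crude stdlib-only stand-in. This is not how UnicodeDammit works internally; it just tries a list of candidate encodings in order, and the function name and candidate list are assumptions for the example.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252')):
    """Hypothetical helper: return (text, encoding) for the first clean decode."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    # latin-1 maps every byte to a code point, so this decode cannot fail.
    return raw.decode('latin-1'), 'latin-1'


text, used = decode_with_fallback(b'it\x92s')
print(used)
```

The order of the candidates matters: UTF-8 goes first because it rejects most byte sequences that were not written as UTF-8, whereas single-byte encodings accept almost anything.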
4 Comments
UnicodeDecodeError exceptions. What is the line of code that raises the exception and what is the object being encoded?
Shouldn't &rsquo end in a semicolon? Also it's not the same as '.
Add a # coding: utf-8 magic comment at the top (it needs to be one of the first two lines). That's assuming you're saving the Python file in UTF-8.
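An alternative to the magic comment is to keep the source file pure ASCII and write the troublesome character as an escape. The sketch below assumes the character in question is the right single quotation mark (U+2019), which the question does not confirm; the function name escape_quotes is also made up for the example.

```python
# Written as an escape, so no coding declaration is needed for this file.
RIGHT_SINGLE_QUOTE = u'\u2019'


def escape_quotes(text):
    """Hypothetical helper: replace quote characters with HTML entities."""
    return (text.replace(u'"', u'&quot;')
                .replace(u"'", u'&#39;')  # plain apostrophe, not &rsquo;
                .replace(RIGHT_SINGLE_QUOTE, u'&rsquo;'))


print(escape_quotes(u'it\u2019s "ok"'))
```

Note that each entity ends in a semicolon, and that the straight apostrophe and the curly quote map to different entities.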