6

Here's what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>> 

How can I simply remove the troublesome Unicode characters from the HTML?
Or is there a cleaner solution?

asked Mar 8, 2011 at 18:04

4 Answers

10

Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
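
A minimal sketch of what the 'ignore' handler does, assuming html is a byte string that contains a byte which is not valid UTF-8. The sample bytes and variable names are made up for illustration, and the Latin-1 line is only a guess at what the page's real encoding might be:

```python
# Hypothetical sample: 0xae is the Latin-1 byte for the registered sign,
# and it is not a valid UTF-8 sequence on its own.
html = b'<span>amazon.com\xae</span>'

# 'ignore' silently drops bytes that cannot be decoded as UTF-8.
dropped = html.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD, so the loss stays visible in the text.
marked = html.decode('utf-8', 'replace')

# If the page is really Latin-1 (an assumption), nothing is lost.
kept = html.decode('latin-1')
```

If the real encoding is known, decoding with it is probably preferable to 'ignore', since 'ignore' discards data without any warning.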

Jonas Byström
answered Mar 8, 2011 at 18:46

2 Comments

Didn't work! Here's what happened:

>>> html.decode('utf-8', 'strip')
Traceback (most recent call last):
.....
LookupError: unknown error handler name 'strip'
>>> html.decode('utf-8')
Traceback (most recent call last):
.....
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 98071: unexpected code byte

I am very sorry, I meant 'ignore' instead of 'strip'. I also recommend reading the Unicode HOWTO: docs.python.org/howto/unicode.html
2

The error you see occurs because repr(soup) tries to mix Unicode and bytestrings. Mixing the two frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
 File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
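
One way to avoid the implicit ASCII conversion is to decode bytes explicitly before combining them with Unicode text. This is a small sketch, assuming the bytes are UTF-8; the variable names are illustrative only:

```python
text = u'1'
raw = b'\xc2\xa9'   # UTF-8 bytes for the copyright sign

# Decode first, then concatenate: both operands are now Unicode text,
# so no implicit ASCII conversion takes place.
combined = text + raw.decode('utf-8')

# Encode once, at the point where bytes are actually needed.
as_bytes = combined.encode('utf-8')
```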

Here's an example for classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
 File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
 File "<input>", line 1, in <module>
 File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

A similar thing happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
 File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To work around it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
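
If the goal is ASCII-only output without losing characters, one further option (not part of the answer above, just a sketch using the standard 'xmlcharrefreplace' error handler) is to encode the Unicode markup to ASCII and let non-ASCII characters become numeric character references. The markup variable is a stand-in for unicode(soup):

```python
# Stand-in for unicode(soup); the value matches the example above.
markup = u'<p>\xa9</p>'

# Non-ASCII characters become numeric character references,
# which browsers render as the original character.
ascii_html = markup.encode('ascii', 'xmlcharrefreplace')
```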
answered Mar 9, 2011 at 12:39

1

First of all, "troubling" Unicode characters could be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question: How do I convert a file's format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (that I didn't know about beforehand).
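
A sketch of that approach, under the assumption that the input is a byte string: unicodedata.normalize() only accepts Unicode text, so the bytes have to be decoded first. The sample bytes and the Latin-1 encoding are assumptions for illustration:

```python
import unicodedata

raw = b'caf\xe9 \xae'   # hypothetical Latin-1 bytes

# Bytes -> Unicode text; normalize() rejects byte strings.
text = raw.decode('latin-1')

# NFKD splits accented letters into a base letter plus a combining mark.
decomposed = unicodedata.normalize('NFKD', text)

# Drop anything that is still not ASCII (combining marks, symbols).
ascii_only = decomposed.encode('ascii', 'ignore')
```

Accented letters keep their base letter, while symbols that have no ASCII decomposition are dropped entirely.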

answered Mar 8, 2011 at 18:13

1 Comment

That solution isn't working for me because html is not unicode, it's just str:

>>> unicodedata.normalize('NFKD', html).encode('ascii','ignore')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: normalize() argument 2 must be unicode, not str
0

I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display content: the interpreter tries to convert the output to ASCII, which fails for non-ASCII characters. Take a look at the top answer here:

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
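
As a rough illustration of the display problem (not taken from the linked answer), the text can be encoded explicitly before it is shown, so that unencodable characters are replaced instead of raising. The helper name encode_for_console is hypothetical:

```python
import sys

def encode_for_console(text, encoding=None):
    """Encode Unicode text for display, replacing unencodable characters."""
    # Fall back to ASCII when the stream does not report an encoding.
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(target, 'replace')

# With an ASCII console, the registered sign becomes '?' instead of raising.
safe = encode_for_console(u'amazon.com\xae', 'ascii')
```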

answered Jan 2, 2012 at 22:21
