python UnicodeEncodeError > How can I simply remove troubling unicode characters?

Question 1

Here's what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove troubling Unicode characters from the HTML?
Or is there a cleaner solution?

Question 2

Try this way: soup = BeautifulSoup (html.decode('utf-8', 'ignore'))

Question 3

Didn't work! Here's what happened:

>>> html.decode('utf-8', 'strip')
Traceback (most recent call last):
 .....
LookupError: unknown error handler name 'strip'
>>>
>>> html.decode('utf-8')
Traceback (most recent call last):
 .....
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 98071: unexpected code byte
>>>

Question 4

I am very sorry, I meant 'ignore' instead of 'strip'. I also recommend reading the Unicode HOWTO: docs.python.org/howto/unicode.html
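
The UnicodeDecodeError quoted above (byte 0xae rejected by the UTF-8 decoder) hints that the page may not be UTF-8 at all. Here is a minimal sketch comparing the standard error handlers. `raw` is a made-up byte string standing in for the real page, and the Latin-1 guess is only an assumption about what the source encoding might be.

```python
# Hypothetical sample: a lone 0xae byte is invalid UTF-8, but it is the
# registered-trademark sign in Latin-1.
raw = b'amazon\xae'

dropped = raw.decode('utf-8', 'ignore')   # silently drops the undecodable byte
marked = raw.decode('utf-8', 'replace')   # substitutes U+FFFD so the loss is visible
guessed = raw.decode('latin-1')           # keeps the character, if the guess is right
```

'ignore' is the quickest way to get rid of the bytes, but it throws data away. If the real encoding can be found (for example from the HTTP headers or a meta tag), decoding with that is probably the cleaner option.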

Question 5

The error you see occurs because repr(soup) tries to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
 File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'

Here's an example for classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
 File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
 File "<input>", line 1, in <module>
 File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

Similar thing happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
 File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To work around it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
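
If the goal really is plain ASCII output, the error handler passed to encode() decides what happens to characters that don't fit. A minimal sketch using the same markup as above as a Unicode string; which handler is right depends on whether losing the characters is acceptable:

```python
text = u'<p>\xa9</p>'

stripped = text.encode('ascii', 'ignore')            # drops the character
question = text.encode('ascii', 'replace')           # substitutes '?'
entity = text.encode('ascii', 'xmlcharrefreplace')   # keeps it as &#169;
```

For HTML, 'xmlcharrefreplace' is often the gentlest choice, since a browser still renders the numeric entity as the original character.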

Question 6

First of all, "troubling" Unicode characters could be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question: How do I convert a file's format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (that I didn't know about beforehand).

Question 7

That solution isn't working for me because html is not unicode, it's just str:

>>> unicodedata.normalize('NFKD', html).encode('ascii','ignore')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: normalize() argument 2 must be unicode, not str
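
The TypeError above comes from handing a bytestring to unicodedata.normalize(), which only accepts Unicode text, so the bytes have to be decoded first. A rough sketch of that order of operations; `to_ascii` is a hypothetical helper name, and the Latin-1 source encoding in the usage line is an assumption, not something known about the actual page:

```python
import unicodedata

def to_ascii(raw_bytes, source_encoding):
    """Decode bytes, split accented letters apart, and drop what isn't ASCII."""
    # Step 1: bytes -> Unicode; undecodable bytes are skipped.
    text = raw_bytes.decode(source_encoding, 'ignore')
    # Step 2: NFKD separates base letters from combining accents.
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 3: anything still outside ASCII is dropped.
    return decomposed.encode('ascii', 'ignore')

# Made-up input: e-acute and the registered sign, encoded as Latin-1.
result = to_ascii(b'caf\xe9 \xae', 'latin-1')
```

Here the accented letter survives as a plain "e", while the registered sign has no ASCII equivalent and simply disappears.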

Question 8

I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display content: the interpreter tries to convert the output to ASCII, and that conversion fails. Take a look at the top answer here:

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

esv 1242 bronze badges · Accepted Answer · 2011-03-08 18:46:28Z

