UnicodeEncodeError while parsing HTML

Asked 11 years, 10 months ago

Viewed 134 times

I'm trying to parse an HTML page with Python 3's HTMLParser.

Edit:
Trying to print the character using:

print('\u25bc')  # Prints the '▼' character

throws the UnicodeEncodeError.

The code is the one supplied in the documentation samples:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        # Named entity such as &gt; -> '>'
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        # Numeric reference: &#x25bc; (hex) or &#9660; (decimal)
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

and when feeding an HTML document (UTF-8 string) I'm getting the error:

UnicodeEncodeError
'ascii' codec can't encode character '\u25bc' in position 0: ordinal not in range(128)
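
For reference, the same message can be reproduced without a terminal or a parser. This is a minimal sketch, assuming that the output stream is using the ASCII codec (which is typically what Python 3 picked under the POSIX locale at the time); the variable names are illustrative only.

```python
# Encoding U+25BC with the ASCII codec fails with the message quoted above.
char = "\u25bc"

try:
    char.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# The same character encodes without trouble as UTF-8 (three bytes).
utf8_bytes = char.encode("utf-8")
```

If `error_message` matches the traceback, the problem is the codec chosen for output, not the character itself.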

The offending line, as reported by the parser's getpos() method, is:

# |-- Parser stopped here.
 <li><a href="#" class="dir">&#9660; Community</a>

The bytes read are correctly decoded as a UTF-8 string, which is then passed to the parser's feed() method. For some reason, the parser seems to try to encode it to ASCII.
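
One way to check where the encoding actually happens is to keep the parser from printing anything. The sketch below assumes Python 3.5 or later, where `convert_charrefs` defaults to `True` and character references arrive through `handle_data()`; `CollectingParser` is a hypothetical name, not part of the documentation sample.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: stores text chunks instead of printing them."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default on Python 3.5+
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = CollectingParser()
parser.feed('<li><a href="#" class="dir">&#9660; Community</a>')
parser.close()

# No output stream is involved up to this point, so nothing gets encoded.
text = "".join(parser.chunks)
```

If this runs cleanly, `feed()` is handling the string without trouble, which would suggest that the encode step is in `print()` when it writes to `sys.stdout`.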

The system locale is set to 'POSIX' by default, but I set it to en_US.UTF-8 for the current session using:

export LANG=en_US.UTF-8
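
To see which encodings Python has actually picked up from the environment, a small diagnostic along these lines may help. It is only a sketch: `describe_encodings` is a hypothetical helper name, and the values it reports depend on the interpreter version and on how the process was started.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python derived from the current environment."""
    return {
        # Codec used by print(); None if stdout is not a normal text stream.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding implied by the locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "=", value)
```

If `stdout` reports an ASCII codec even after the `export`, the variable is probably not reaching the Python process, or the en_US.UTF-8 locale is not generated on the system.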

How can I solve this issue?

edited Mar 4, 2014 at 1:48 by Charles

asked Mar 4, 2014 at 1:01 by NeonMan

1 Answer

I solved this issue by reconfiguring the system locales.

On Debian:

sudo dpkg-reconfigure locales

Select the locale en_US.UTF-8, then choose it as the default system locale.

answered Mar 4, 2014 at 1:24 by NeonMan

1 Comment

jfs (2014-03-04T12:17:41Z):

If you need to redirect to a file or a pipe, then you could use: PYTHONIOENCODING=utf-8 python3 -c'print("\u25bc")' | cat
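
When neither the locale nor the environment can be changed, a similar effect can be had from inside the script by wrapping a binary stream in a UTF-8 text layer. This is a sketch under assumptions, not the approach used in the answer or the comment: `utf8_writer` is a hypothetical helper, and the demonstration writes to an in-memory buffer rather than the real stdout.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    # newline="\n" keeps line endings identical on every platform.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration against an in-memory buffer instead of the real stdout.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("\u25bc", file=out)
out.flush()
encoded = buffer.getvalue()
```

For real output the same wrapper could be applied to `sys.stdout.buffer`, and on Python 3.7 or later `sys.stdout.reconfigure(encoding="utf-8")` is another option.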
