
I'm trying to parse an HTML page with Python 3's HTMLParser.


Edit:
Trying to print the character using:

print('\u25bc')  # prints the '▼' character

throws the UnicodeEncodeError.
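
For what it's worth, the same exception can be triggered without a terminal by encoding the character with the ASCII codec explicitly, which is roughly what print() does when stdout is configured as ASCII. A minimal sketch (`can_encode` is a hypothetical helper, not part of the original code):

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


arrow = '\u25bc'  # BLACK DOWN-POINTING TRIANGLE

# Printing booleans is safe even on an ASCII-only stdout.
print("ascii:", can_encode(arrow, 'ascii'))
print("utf-8:", can_encode(arrow, 'utf-8'))
```

If the ASCII check fails while the UTF-8 check succeeds, the character itself is fine and the problem lies in the encoding chosen for the output stream.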


The code is the one supplied in the documentation samples:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    def handle_entityref(self, name):
        # Named entity such as &gt;
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        # Numeric entity, hexadecimal (&#x25bc;) or decimal (&#9660;)
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)

When I feed it an HTML document (a string decoded from UTF-8), I get this error:

UnicodeEncodeError
'ascii' codec can't encode character '\u25bc' in position 0: ordinal not in range(128)

The offending line, according to the parser's getpos() method, is:

# |-- Parser stopped here.
 <li><a href="#" class="dir">&#9660; Community</a>

The bytes I read are correctly decoded as UTF-8 into a string, which is then passed to the parser's feed() method, which for some reason tries to encode it to ASCII.

The system locale is 'POSIX' by default, but I set it locally to en_US.UTF-8 using

export LANG=en_US.UTF-8

How can I solve this issue?

Charles
asked Mar 4, 2014 at 1:01

1 Answer

I solved this issue by reconfiguring the locales.

On Debian:

sudo dpkg-reconfigure locales

Select the locale:

en_US.UTF-8

Then select this locale as the default system locale.
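
To confirm that the new locale is actually picked up, it can help to ask Python which encodings it selected. The following is only a sketch; `describe_encodings` is a hypothetical helper name, and the values it reports depend on the environment and the Python version:

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the environment."""
    return {
        # Encoding print() uses when writing to standard output
        # (may be None if stdout was replaced by a non-standard object).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding implied by the current locale settings.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(name, "=", value)
```

If stdout still reports an ASCII codec (for example ANSI_X3.4-1968), the locale has probably not taken effect in the current shell.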

answered Mar 4, 2014 at 1:24

1 Comment

If you need to redirect the output to a file or a pipe, you can set PYTHONIOENCODING, for example: PYTHONIOENCODING=utf-8 python3 -c'print("\u25bc")' | cat
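
When changing the environment is not an option, a similar effect can be approximated inside the script by wrapping a binary stream in a UTF-8 text writer. This is a sketch under assumptions: `utf8_writer` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for `sys.stdout.buffer` so that the example does not depend on the terminal:

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        newline="\n",         # write '\n' unchanged on every platform
        line_buffering=True,  # flush whenever a newline is written
    )


# In a real script, sys.stdout.buffer would take the place of this buffer.
buffer = io.BytesIO()
out = utf8_writer(buffer)

print("\u25bc Community", file=out)
out.flush()

encoded = buffer.getvalue()
print(len(encoded), "bytes written")
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is a shorter alternative when stdout is the standard text stream.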
