I'm trying to parse an HTML page with Python 3's HTMLParser.
Edit:
Simply printing the character with
print('\u25bc')  # should print the '▼' character
also raises a UnicodeEncodeError.
The parser code is the one supplied in the documentation examples:
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        # Named entity such as &gt; -> look up its code point
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        # Numeric entity: hexadecimal (&#x3E;) or decimal (&#62;)
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)
When I feed it an HTML document (a string decoded from UTF-8), I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u25bc' in position 0: ordinal not in range(128)
The offending line, according to the parser's getpos() method, is:
# |-- Parser stopped here.
<li><a href="#" class="dir">▼ Community</a>
The bytes are decoded correctly as UTF-8, and the resulting string is then passed to the parser's feed() method, which for some reason tries to encode it to ASCII.
The system locale is 'POSIX' by default, but I set it to en_US.UTF-8 in my shell using
export LANG=en_US.UTF-8
How can I solve this issue?
1 Answer
I solved this issue by reconfiguring the locales.
On Debian:
sudo dpkg-reconfigure locales
Select the en_US.UTF-8 locale in the list, then choose it as the default system locale.
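To check whether the change took effect, it can help to look at the encoding Python actually picked for standard output, since that is what print() uses. The snippet below is a minimal sketch and not part of the original fix; can_encode is a hypothetical helper name introduced for illustration. If the en_US.UTF-8 locale has not been generated on the system, exporting LANG alone presumably leaves Python on its ASCII fallback, which would match the error in the question.

```python
import locale
import sys


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # Unencodable character, or an encoding name Python does not know.
        return False
    return True


# Encoding that print() uses for standard output. A value such as
# 'ascii' or 'ANSI_X3.4-1968' typically means no UTF-8 locale is in effect.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

# Only ASCII text is printed here, so this runs even under the POSIX locale.
print("stdout encoding :", stdout_encoding)
print("locale encoding :", locale.getpreferredencoding(False))
print("can print U+25BC:", can_encode("\u25bc", stdout_encoding))
```

If the last line reports False, printing '▼' will fail in the same way as in the question.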
1 Comment
PYTHONIOENCODING=utf-8 python3 -c'print("\u25bc")' | cat
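The effect of that environment variable can also be approximated from inside a script by wrapping the binary output stream in a UTF-8 text writer. This is only a sketch under stated assumptions: utf8_writer is a hypothetical helper, and an in-memory buffer stands in for sys.stdout.buffer so the example behaves the same regardless of the locale.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text writer that always encodes as UTF-8.

    Hypothetical helper; newline="\n" disables platform newline translation.
    """
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", newline="\n", line_buffering=True
    )


# Stand-in for sys.stdout.buffer (assumption made for the demonstration).
buf = io.BytesIO()
out = utf8_writer(buf)

# The text is encoded as UTF-8 no matter what the locale says.
print("\u25bc Community", file=out)
out.flush()

# Show the raw bytes that were written, using ASCII-only output.
print(repr(buf.getvalue()))
```

In a real script, sys.stdout.buffer would be passed instead of the BytesIO object. On Python 3.7 or later, sys.stdout.reconfigure(encoding="utf-8") should achieve the same result.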