This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: HTMLParser decode issue
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: HTMLParser cannot handle '&' and non-ascii characters in attribute names (#3932)
Assigned To: ezio.melotti
Nosy List: eric.araujo, ezio.melotti, rednaks
Priority: normal
Keywords:

Created on 2012-03-11 02:23 by rednaks, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
patch.txt rednaks, 2012-03-11 02:23 patch
Messages (7)
msg155366 - Author: rednaks (rednaks) Date: 2012-03-11 02:23
Hello!
While parsing some HTML I got a decode error (see the traceback below).
This issue can be fixed by replacing the last string with s.decode(), as in
the attached diff file.
I also tried to run my script under Python 3.2, and it does not parse anything.
  File "/usr/lib/python2.7/HTMLParser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 155, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 260, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 410, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1: ordinal not in range(128)
msg155367 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-03-11 02:32
Can you provide a minimal example to reproduce this error?
On Python 2 it's always better to decode the HTML first and then pass unicode to the parser. Even though on Python 2 the parser accepts byte strings too, there are a few corner cases where it fails.
On Python 3 the parser only accepts unicode, and it should work fine with it (especially if you have an updated clone of cpython). Can you show what failure you get with Python 3? Also, can you reproduce the error if you use strict=False?
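The "decode first, then parse" advice can be sketched as follows. This is only an illustration written for current Python 3, not code from this issue: the `LinkCollector` class, the sample markup, and the choice of UTF-8 as the page encoding are all assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already unescaped
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.links.append(value)


# Pretend these bytes came from the network; UTF-8 is an assumed encoding.
raw = '<a href="/caf\u00e9?a=1&amp;b=2">menu</a>'.encode("utf-8")

# Decode the whole page once, then hand text (not bytes) to the parser.
text = raw.decode("utf-8")

parser = LinkCollector()
parser.feed(text)
parser.close()
```

In real code the encoding would come from the HTTP headers or the page's own declaration rather than being hard-coded.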
msg155368 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-03-11 02:35
See also #3932.
msg155400 - Author: rednaks (rednaks) Date: 2012-03-11 18:12
So we can't decode by default?!
Concerning Python 3, it seems that it's not reading tags and attributes. I didn't get any error, but I don't get any result either.
The example I used is here: http://docs.python.org/library/htmlparser.html#module-HTMLParser 
Of course, I replaced HTMLParser with html.parser.
msg155403 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-03-11 18:32
I don't think the patch can be applied as is -- in order to work s should be an ascii-only str. I will look at this again as soon as I have some time and see if something can be done.
FTR the Python 3 doc for html.parser can be found here: http://docs.python.org/py3k/library/html.parser.html#example-html-parser-application 
msg155412 - Author: rednaks (rednaks) Date: 2012-03-11 21:53
Thank you for giving me a little of your time!
Yes, that's what I tested. I used the html.parser module and I get no result!
msg155533 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-03-13 00:11
I tested this again and indeed a bare s.decode() is not enough to fix the problem. The attribute might contain non-ascii characters, and that will result in an error (see for example the "test.py" script attached to #3932). The correct solution is to decode the page before passing it to the parser.
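A small sketch of why an ASCII-only decode breaks on such attributes. It is written for Python 3, where the codec must be named explicitly to mimic Python 2's default `s.decode()`; the sample value and the variable names are assumptions for illustration, not taken from the patch or from "test.py".

```python
# An attribute value containing a non-ASCII character, stored as UTF-8 bytes.
attr_bytes = "5 \u00d7 3".encode("utf-8")

try:
    # Mimics Python 2's bare s.decode(), which used the ASCII codec.
    attr_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    # The first byte that the ASCII codec could not handle.
    failing_byte = exc.object[exc.start]

# Decoding with the page's real encoding (assumed UTF-8 here) succeeds.
decoded = attr_bytes.decode("utf-8")
```

The ASCII decode fails as soon as it meets a byte outside the range 0 to 127, which is the same class of error shown in the traceback of the first message.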
History
Date User Action Args
2022-04-11 14:57:27 admin set github: 58459
2012-03-13 00:11:30 ezio.melotti set status: open -> closed; versions: - Python 3.2; superseder: HTMLParser cannot handle '&' and non-ascii characters in attribute names; messages: + msg155533; resolution: duplicate; stage: resolved
2012-03-11 21:53:43 rednaks set messages: + msg155412
2012-03-11 18:32:53 ezio.melotti set messages: + msg155403
2012-03-11 18:12:26 rednaks set messages: + msg155400
2012-03-11 10:21:55 eric.araujo set nosy: + eric.araujo; title: [PATCH]HTMLParser decode issue -> HTMLParser decode issue
2012-03-11 02:35:16 ezio.melotti set messages: + msg155368
2012-03-11 02:32:20 ezio.melotti set nosy: + ezio.melotti; messages: + msg155367; assignee: ezio.melotti; type: crash -> behavior
2012-03-11 02:23:14 rednaks create
