Message117762
| Author |
yotam |
| Recipients |
Hunanyan, cpalmer, ezio.melotti, fantoozler, fdrake, georg.brandl, gsf, momat, yotam |
| Date |
2010-09-30.21:50:03 |
| SpamBayes Score |
2.802493e-05 |
| Marked as misclassified |
No |
| Message-id |
<1285883406.35.0.460129064114.issue670664@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
HTMLParser.py fails when parsing the content inside
<script> ... </script>
where it can be fooled by JavaScript that uses the less-than operator '<' in conditional expressions.
In the attached example:
$ tar tvzf lt-in-script-example.tgz | cut -c24-
796 2010-09-30 16:52 h2t.py
23678 2010-09-30 16:39 t.html
here's what happens:
$ python h2t.py t.html /tmp/t.txt
HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
Traceback (most recent call last):
  File "h2t.py", line 31, in <module>
    text = html2text(f_html.read())
  File "h2t.py", line 23, in html2text
    te = TextExtractor(html)
  File "h2t.py", line 15, in __init__
    self.feed(html)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332
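For anyone who wants to try this without the tarball, here is a minimal sketch of the kind of input that triggers the problem. The names TextExtractor and html2text are taken from the traceback above, but their bodies and the SAMPLE string are assumptions, not the contents of the attached h2t.py or t.html. The sketch is written against Python 3's html.parser module rather than the Python 2 HTMLParser module shown in the traceback. Current Python 3 releases treat script content as raw text, so there it should run without raising.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that lies outside <script> and <style> elements.

    Hypothetical reconstruction: only the class name comes from the
    traceback; the body is an assumption for illustration.
    """

    def __init__(self, html):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not part of a script or stylesheet.
        if not self._skip_depth:
            self.chunks.append(data)


def html2text(html):
    """Return the visible text of an HTML string (assumed helper)."""
    return "".join(TextExtractor(html).chunks)


# Assumed sample: JavaScript with '<' comparisons inside <script>,
# the pattern the report says confuses the start-tag parser.
SAMPLE = """<html><body><p>before</p>
<script>
for (var i = 0; i<n; i++) { if (a<b && c>d) { run(); } }
</script>
<p>after</p></body></html>"""

text = html2text(SAMPLE)
print(text)
```

If the parser mistakes "<n" or "<b" for the start of a tag, feed() fails as in the traceback above. A parser that handles script content as opaque text passes it to handle_data() instead, and the extractor simply skips it.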
I have a suggested patch,
HTMLParser.diff,
that fixes this problem; I will attach it soon.
-- yotam |
|