Message95162
| Author |
pluskid |
| Recipients |
pluskid |
| Date |
2009-11-12 16:25:41 |
| SpamBayes Score |
4.7186185e-07 |
| Marked as misclassified |
No |
| Message-id |
<1258043150.41.0.372876851796.issue7311@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
Hi all,
I'm using BeautifulSoup to parse an HTML page and found that it refused to
parse the page. Looking at the backtrace, I found that the problem is in
Python's built-in HTMLParser.py. The web page I'm parsing contains some
Chinese characters; there is a tag like <img
src=/foo/bar.png alt=中文>. Note that this is a legacy HTML page where the
attribute values are not quoted. However, the regexp defined in
HTMLParser.py is:
attrfind = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
Note that the character class for unquoted values does not include Chinese
characters (or any other non-ASCII characters), so the parser raises an
error on this tag. I'm not sure whether the HTML standard allows unquoted
non-ASCII characters in attribute values. If it does, this seems to be a
bug, and the character class for unquoted values had better be [^>\s] IMHO.
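To make the mismatch concrete, here is a minimal sketch that runs the quoted pattern against the attribute from the tag above and compares it with a relaxed variant. It assumes Python 3 (where str is Unicode); the names attrfind_original, attrfind_relaxed and unquoted_value are made up for this illustration, and the relaxed pattern is only one reading of the [^>\s] suggestion, not a tested patch for HTMLParser.py.

```python
import re

# Pattern as quoted in this report.
attrfind_original = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant: an unquoted value is any run of characters
# that are neither whitespace nor '>'.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def unquoted_value(pattern, text):
    """Return the attribute value (group 3) matched at the start of text."""
    match = pattern.match(text)
    return match.group(3) if match else None


attr = ' alt=中文>'

# The ASCII-only character class matches zero characters here, so the
# value comes back empty and the Chinese text is left unconsumed.
original_value = unquoted_value(attrfind_original, attr)

# The relaxed class consumes everything up to the closing '>'.
relaxed_value = unquoted_value(attrfind_relaxed, attr)
```

With the original pattern the match stops right after "alt=", so the parser next sees 中文 where it expects the end of the tag, which would explain the error. One trade-off of the relaxed class is that it also accepts stray quote characters inside an unquoted value.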
BTW: It seems that something like:
<script>
var st = "<a></";
</script>
cannot be parsed. :-/ |
|
History
|
|---|
| Date |
User |
Action |
Args |
| 2009-11-12 16:25:50 | pluskid | set | recipients:
+ pluskid |
| 2009-11-12 16:25:50 | pluskid | set | messageid: <1258043150.41.0.372876851796.issue7311@psf.upfronthosting.co.za> |
| 2009-11-12 16:25:42 | pluskid | link | issue7311 messages |
| 2009-11-12 16:25:42 | pluskid | create |
|