Created on 2010-08-12 15:53 by Hunanyan, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (3) | |||
|---|---|---|---|
| msg113688 | Author: Arman (Hunanyan) | Date: 2010-08-12 15:53 | |
When HTMLParser reaches a CDATA element, it enters CDATA mode by calling set_cdata_mode (file html/parser.py, line 270). This method assigns the self.interesting member a new value, r'<(/|\Z)'. But this is not correct. Consider the following case:
<script language="javascript">
<!--
if (window.adgroupid == undefined) {
window.adgroupid = Math.round(Math.random() * 1000);
}
document.write('<scr'+'ipt language="javascript1.1" src="http://adserver.adtech.de/addyn|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');
//-->
</script>
The fragment </scri'+'pt> matches r'<(/|\Z)', so the parser gets confused and produces wrong results. Real pages containing such HTML include:
www.ahram.org.eg
www.chefkoch.de
www.chemieonline.de
www.eip.gov.eg
www.rezepte.li
www.scienceworld.com
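The false match described above can be reproduced with the `re` module alone. This is a minimal sketch, not the parser's own code: `script_body` is a shortened, hypothetical stand-in for the script in the report, and only the pattern r'<(/|\Z)' is taken from it.

```python
import re

# The generic CDATA pattern quoted in the report.
interesting_cdata = re.compile(r'<(/|\Z)')

# Shortened, hypothetical stand-in for the script body in the report.
script_body = "document.write('<scr'+'ipt src=\"x.js\"></scri'+'pt>');\n//-->\n"
html = script_body + "</script>"

# The first "</" the pattern finds is inside the JavaScript string literal,
# well before the real closing tag.
match = interesting_cdata.search(html)
first_hit = html[match.start():match.start() + 6]
real_end = html.index("</script>")

print(first_hit)                  # the text where the pattern stopped
print(match.start() < real_end)   # whether it stopped before the real end tag
```

The pattern stops at the first `</` it sees, regardless of which tag name follows, which is why the split `</scri'+'pt>` string ends CDATA mode early.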
One solution is to keep
interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')
instead of
interesting_cdata = re.compile(r'<(/|\Z)')
and, depending on which tag begins the CDATA section (script or style), set_cdata_mode can assign the matching regexp to the self.interesting member.
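A rough sketch of that proposal follows, under stated assumptions: `CDATA_PATTERNS` and `pick_interesting` are hypothetical names used only for illustration, the patterns use `\Z` as in the existing interesting_cdata regexp, and this is not presented as the code that was eventually committed to the standard library.

```python
import re

# Tag-specific patterns in the spirit of the proposal (hypothetical names).
CDATA_PATTERNS = {
    "script": re.compile(r'<(/|\Z)script'),
    "style": re.compile(r'<(/|\Z)style'),
}

def pick_interesting(tag):
    """Return the pattern that ends CDATA mode for the given opening tag."""
    return CDATA_PATTERNS[tag.lower()]

# Shortened, hypothetical script content followed by the real closing tag.
html = "document.write('<scr'+'ipt></scri'+'pt>');\n</script>"

match = pick_interesting("script").search(html)
print(match.start() == html.index("</script>"))  # stops only at the real end tag
```

Because the tag name is part of the pattern, the split string `</scri'+'pt>` no longer matches. The patterns here are case-sensitive, as written in the report; handling `</SCRIPT>` would need an extra flag such as `re.I`.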
Please contact me via email if you need more details.
arman.hunanyan@gmail.com
|
|||
| msg113713 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2010-08-12 22:32 | |
I believe this is a duplicate of Issue670664. If you disagree please reopen with additional information. |
|||
| msg113743 | Author: Arman (Hunanyan) | Date: 2010-08-13 05:21 | |
Yes, I agree. This is the same issue.
On Fri, Aug 13, 2010 at 3:32 AM, R. David Murray <report@bugs.python.org> wrote:
> R. David Murray <rdmurray@bitdance.com> added the comment:
>
> I believe this is a duplicate of Issue670664. If you disagree please
> reopen with additional information.
>
> ----------
> nosy: +r.david.murray
> resolution: -> duplicate
> stage: -> committed/rejected
> status: open -> closed
> superseder: -> HTMLParser.py - more robust SCRIPT tag parsing
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue9577>
> _______________________________________
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:05 | admin | set | github: 53786 |
| 2010-08-13 11:51:30 | r.david.murray | set | files: - unnamed |
| 2010-08-13 05:21:50 | Hunanyan | set | files: + unnamed messages: + msg113743 |
| 2010-08-12 22:32:32 | r.david.murray | set | status: open -> closed superseder: HTMLParser.py - more robust SCRIPT tag parsing nosy: + r.david.murray messages: + msg113713 resolution: duplicate stage: resolved |
| 2010-08-12 15:53:02 | Hunanyan | create | |