This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2012-07-22 15:01 by AliDD, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (12) | |||
|---|---|---|---|
| msg166139 - (view) | Author: (AliDD) | Date: 2012-07-22 15:01 | |
An attempt to parse the string
<script type="text/javascript">document.write('<bb></b>');</script>
by calling minidom.parseString()
raises ExpatError: mismatched tag: line 1, column 53
in lib\xml\dom\expatbuilder.py at line 223.
It looks like the parser detects an opening/closing tag mismatch while trying to parse <bb></b>.
|
|||
| msg166140 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2012-07-22 15:06 | |
Why do you think this is a bug? To my understanding, XML does not treat the <script> tag as anything special, so to an XML parser, those *are* mismatched tags. |
|||
| msg166141 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2012-07-22 15:09 | |
I'm sorry, my phrasing there was awkward. I didn't mean "Why do you think this is a bug", I meant "I don't think this is a bug, am I missing something?" |
|||
| msg166145 - (view) | Author: (AliDD) | Date: 2012-07-22 15:46 | |
> Why do you think this is a bug?
IMHO an XML parser must not and cannot parse the content of a <script> tag, because it doesn't "understand" JavaScript or any other scripting language. I think the content of <script> should be returned as is.
> so to an XML parser, those *are* mismatched tags.
Here they are not tags at all, but part of a JavaScript string. |
|||
| msg166146 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2012-07-22 16:16 | |
You are exactly correct that an XML parser doesn't know how to parse javascript. That's my point. To my understanding <script> has no special meaning to XML. Thus a <script> tag follows all the same content rules as any arbitrary other tag (<h1>, <bold>, <foo>). An XML parser will not (again, to my understanding) treat the content of any of these tags as anything other than XML. I believe that in XML the CDATA construct is used for embedding arbitrary content inside a tag. Unless you have a reference to an XML standard that indicates the script tag should be treated specially, I'm going to close this as invalid. |
|||
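The CDATA point above can be illustrated with a short sketch. It is not part of the original thread: the helper name `try_parse` and the variable names are hypothetical, and it assumes Python 3 with the standard `xml.dom.minidom` module. It parses the reporter's string once as posted and once with the script body wrapped in a CDATA section.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The string from the report: the JavaScript literal contains '<bb></b>',
# which an XML parser reads as a mismatched pair of tags.
RAW = """<script type="text/javascript">document.write('<bb></b>');</script>"""

# The same script body wrapped in a CDATA section, so the parser treats
# it as character data rather than markup.
WRAPPED = (
    '<script type="text/javascript"><![CDATA['
    "document.write('<bb></b>');"
    "]]></script>"
)


def try_parse(text):
    """Return (document, None) on success or (None, message) on ExpatError."""
    try:
        return minidom.parseString(text), None
    except ExpatError as exc:
        return None, str(exc)


doc_raw, err_raw = try_parse(RAW)
doc_cdata, err_cdata = try_parse(WRAPPED)

# Join the data of all child nodes in case the text is split across nodes.
script_text = "".join(
    node.data for node in doc_cdata.documentElement.childNodes
)

print("raw input:", err_raw)
print("CDATA input:", script_text)
```

This only helps when the input can be changed to well-formed XML; it does not make minidom accept arbitrary HTML.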
| msg166151 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2012-07-22 17:07 | |
R. David is correct. The XML specification *clearly* defines that your XML document is ill-formed, since the closing tag for '<bb>' is '</b>', which doesn't match. XML is not HTML: the XML parser must not treat the script element specially. |
|||
| msg166154 - (view) | Author: (AliDD) | Date: 2012-07-22 17:26 | |
The Document Object Model (DOM) Level 1 Specification at http://www.w3.org/TR/REC-DOM-Level-1/ states that "The Document Object Model provides a standard set of objects for representing HTML and XML documents". That's why I tried to use xml.dom.minidom to parse some HTML, and my HTML sample is valid HTML. Possibly I misunderstand the purpose of xml.dom.minidom and/or expat, and they should be used only with XML. Or I should use minidom.parseString() with some parser other than the default expat. If so, could you recommend a built-in one? If xml.dom.minidom should process only XML and not HTML (which is not always valid XML), then I'm sorry for the disturbance; please feel free to close this issue as invalid. |
|||
| msg166158 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2012-07-22 19:40 | |
*xml*.dom.minidom parses only XML. The documentation mentions only XML, not HTML. I suppose that confusion could arise from the fact that the W3C DOM API model can also be provided by things that parse HTML. I'm not sure whether it is worth adding a documentation note to that effect, but if so, that would be a different issue. |
|||
| msg166159 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2012-07-22 19:57 | |
> That's why I try to use xml.dom.minidom to parse some HTML.
Have you tried HTMLParser? |
|||
| msg166177 - (view) | Author: (AliDD) | Date: 2012-07-22 22:44 | |
@Ezio Melotti: Do you mean as the parser for minidom.parseString(), or standalone? If the latter, then the result will be a sequence of handler calls instead of a DOM. This is not bad, but it is an entirely different story. |
|||
| msg166178 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2012-07-22 22:51 | |
> Do you mean as parser with minidom.parseString() or stand alone?
Standalone.
> If second, than, result will be the sequence of handler calls instead of DOM. This is not bad, but entirely different story.
Indeed. If you want a tree you could try BeautifulSoup or lxml; there's no tool in the stdlib that specifically parses an HTML document and builds a tree. |
|||
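To make the "sequence of handler calls" versus "tree" distinction concrete, here is a rough sketch, not taken from the thread, of how handler calls could be assembled into minidom nodes by hand. It assumes Python 3, where the parser lives in `html.parser`; the class name `MiniDomBuilder`, the `VOID_ELEMENTS` set, and the helper `parse_html_fragment` are hypothetical names introduced for illustration. It makes no attempt at real HTML error recovery, which is where BeautifulSoup or lxml would be the better choice.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never have a closing tag in HTML (partial, illustrative list).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class MiniDomBuilder(HTMLParser):
    """Collect HTMLParser handler calls into a minidom DocumentFragment."""

    def __init__(self):
        super().__init__()
        self.document = minidom.Document()
        # A fragment may hold several top-level elements and text nodes.
        self.fragment = self.document.createDocumentFragment()
        self._stack = [self.fragment]

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            # Valueless attributes arrive as None; store an empty string.
            element.setAttribute(name, value or "")
        self._stack[-1].appendChild(element)
        if tag not in VOID_ELEMENTS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].tagName == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        self._stack[-1].appendChild(self.document.createTextNode(data))


def parse_html_fragment(text):
    """Feed text through the builder and return the resulting fragment."""
    builder = MiniDomBuilder()
    builder.feed(text)
    builder.close()
    return builder.fragment


SAMPLE = """<script type="text/javascript">document.write('<bb></b>');</script>"""

fragment = parse_html_fragment(SAMPLE)
script = fragment.firstChild
script_text = "".join(node.data for node in script.childNodes)
print(script.tagName, script.getAttribute("type"), script_text)
```

Because HTMLParser treats the body of a script element as raw text, the `<bb></b>` inside the JavaScript string reaches `handle_data` as plain data instead of being reported as tags.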
| msg166180 - (view) | Author: (AliDD) | Date: 2012-07-22 23:06 | |
I will give some of them a try later. Thanks for the info. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:33 | admin | set | github: 59628 |
| 2012-07-22 23:06:38 | AliDD | set | messages: + msg166180 |
| 2012-07-22 22:51:17 | ezio.melotti | set | messages: + msg166178 |
| 2012-07-22 22:44:29 | AliDD | set | messages: + msg166177 |
| 2012-07-22 19:57:07 | ezio.melotti | set | resolution: not a bug messages: + msg166159 nosy: + ezio.melotti |
| 2012-07-22 19:40:27 | r.david.murray | set | status: open -> closed messages: + msg166158 |
| 2012-07-22 17:26:09 | AliDD | set | status: closed -> open resolution: not a bug -> (no value) messages: + msg166154 |
| 2012-07-22 17:07:37 | loewis | set | status: pending -> closed nosy: + loewis messages: + msg166151 |
| 2012-07-22 16:16:44 | r.david.murray | set | status: open -> pending resolution: not a bug messages: + msg166146 stage: resolved |
| 2012-07-22 15:46:56 | AliDD | set | messages: + msg166145 |
| 2012-07-22 15:09:29 | r.david.murray | set | messages: + msg166141 |
| 2012-07-22 15:06:32 | r.david.murray | set | nosy: + r.david.murray messages: + msg166140 |
| 2012年07月22日 15:01:19 | AliDD | create | |