Created on 2009-03-17 11:19 by once-off, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| sgml_error.py | once-off, 2009-03-17 11:19 | |
| Messages (3) | | |
|---|---|---|
| msg83667 | Author: (once-off) | Date: 2009-03-17 11:19 |
The attached script (sgml_error.py) was designed to output XML files unchanged, other than expanding <empty/> tags into an opening and closing tag, such as <empty></empty>. It seems the SGMLParser class recognizes an empty tag, but does not emit the closing tag until the NEXT forward slash it sees. So everything from the forward slash in <empty/> (even the closing angle bracket) until the next forward slash is considered to be textual data. See the following output.

Have I missed something here (like a conscious design limitation on the class, an error on my part, etc.), or is this really a bug with the class?

    C:\Python24\Lib>python sgmllib.py H:\input.xml
    start tag: <root>
    data: '\n '
    start tag: <tag1>
    end tag: </tag1>
    data: '\n '
    start tag: <tag2>
    data: '>\n <tag3>hello<'
    end tag: </tag2>
    data: 'tag3>\n'
    end tag: </root>

    C:\Python24\Lib>python
    ActivePython 2.4.3 Build 12 (ActiveState Software Inc.) based on
    Python 2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sgml_error
    Input:
    <root>
    <tag1></tag1>
    <tag2/>
    <tag3>hello</tag3>
    </root>
    Output:
    <root>
    <tag1></tag1>
    <tag2>>
    <tag3>hello<</tag2>tag3>
    </root>
    Expected:
    <root>
    <tag1></tag1>
    <tag2></tag2>
    <tag3>hello</tag3>
    </root>
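For comparison, the transformation described above (expanding empty-element tags) can be sketched with an XML parser instead of SGMLParser. This is only a minimal Python 3 sketch, assuming well-formed XML input; `expand_empty_tags` and `SAMPLE` are hypothetical names and are not taken from the attached script. It relies on the `short_empty_elements` argument of `xml.etree.ElementTree.tostring` (Python 3.4 and later).

```python
import xml.etree.ElementTree as ET

# Sample input modelled on the report; indentation is an assumption.
SAMPLE = """<root>
<tag1></tag1>
<tag2/>
<tag3>hello</tag3>
</root>"""


def expand_empty_tags(xml_text):
    """Hypothetical helper: re-serialize XML so <tag/> becomes <tag></tag>."""
    root = ET.fromstring(xml_text)
    # short_empty_elements=False forces an explicit closing tag
    # for elements that have no content.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


if __name__ == "__main__":
    print(expand_empty_tags(SAMPLE))
```

The output is not guaranteed to be byte-for-byte identical to the input apart from the expanded tags: with the default parser, ElementTree does not keep comments, processing instructions, or the doctype.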
| msg98424 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-01-27 13:40 |
Hello,

XML-style empty tags of the form <tag/> are an SGML hack, or more precisely the combination of two features of SGML. The forward slash closes the start tag, and the following angle bracket is character data, not part of the tag. The element then stays open until the next forward slash, which is why everything up to that slash is reported as data. The W3C validator uses a real SGML parser for HTML doctypes, and fails on XML-like /> constructs: http://validator.w3.org/check?uri=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fhtml4%2Fstrict.dtd%22%3E+%3Chtml%3E+%3Chead%3E+++%3Ctitle%3ETest%3C%2Ftitle%3E+++%3Cmeta+name%3Dtest+content%3Done%2F%3E+++%3Cmeta+name%3Dbug+content%3Dtwo%3E+%3C%2Fhead%3E+%3Cbody%3E+++%3Cp%3ETest%3C%2Fp%3E+%3C%2Fbody%3E+%3C%2Fhtml%3E&charset=%28detect+automatically%29&doctype=Inline&group=1&verbose=1

The complete explanation can be read at http://www.cs.tut.fi/~jkorpela/html/empty.html

In conclusion, sgmllib is right. Use an XML parser for XML or an HTML5 parser for HTML.

Kind regards
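The rule described in this message can be illustrated with a toy tokenizer. The sketch below is not sgmllib's implementation and supports no attributes, comments, or entities; it only assumes that `<name/` opens an element whose content runs up to the next `/`. All names in it (`tokenize`, `render`, `SAMPLE`) are hypothetical.

```python
import re

NAME = r"[A-Za-z][-.A-Za-z0-9]*"
# "<name/" opens the element; everything up to the next "/" is its data.
NET_ELEMENT = re.compile(r"<(%s)/([^/]*)/" % NAME)
START_TAG = re.compile(r"<(%s)>" % NAME)
END_TAG = re.compile(r"</(%s)>" % NAME)

# Input modelled on the report; indentation is an assumption.
SAMPLE = "<root>\n<tag1></tag1>\n<tag2/>\n<tag3>hello</tag3>\n</root>"


def tokenize(text):
    """Toy tokenizer returning ('start' | 'data' | 'end', value) events."""
    events, pending, pos = [], [], 0

    def flush():
        # Emit any buffered character data as a single event.
        if pending:
            events.append(("data", "".join(pending)))
            pending.clear()

    while pos < len(text):
        net = NET_ELEMENT.match(text, pos)
        start = START_TAG.match(text, pos)
        end = END_TAG.match(text, pos)
        if net:
            flush()
            events.append(("start", net.group(1)))
            if net.group(2):
                events.append(("data", net.group(2)))
            events.append(("end", net.group(1)))
            pos = net.end()
        elif start:
            flush()
            events.append(("start", start.group(1)))
            pos = start.end()
        elif end:
            flush()
            events.append(("end", end.group(1)))
            pos = end.end()
        else:
            pending.append(text[pos])
            pos += 1
    flush()
    return events


def render(events):
    """Write events back out with explicit opening and closing tags."""
    parts = []
    for kind, value in events:
        if kind == "start":
            parts.append("<%s>" % value)
        elif kind == "end":
            parts.append("</%s>" % value)
        else:
            parts.append(value)
    return "".join(parts)


if __name__ == "__main__":
    print(render(tokenize(SAMPLE)))
```

On the sample input, the `>` after `<tag2/` and the text up to the slash in `</tag3>` end up as data inside `tag2`, which matches the shape of the output in the original report.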
| msg98425 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-01-27 13:45 |
Damn, the URI got fubared :/ Anyway, I just wanted to give an example of the verbose error message, but the second link will contain enough explanation.

Regards
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:46 | admin | set | github: 49748 |
| 2010-02-05 16:23:31 | ezio.melotti | set | nosy: + ezio.melotti |
| 2010-02-05 16:00:44 | ezio.melotti | set | status: open -> closed; priority: normal; resolution: not a bug; stage: test needed -> resolved |
| 2010-01-27 13:45:03 | eric.araujo | set | messages: + msg98425 |
| 2010-01-27 13:40:36 | eric.araujo | set | nosy: + eric.araujo; messages: + msg98424 |
| 2009-04-22 14:38:05 | ajaksu2 | set | keywords: + easy; stage: test needed; versions: + Python 2.6, - Python 2.5, Python 2.4, 3rd party |
| 2009-03-17 11:19:34 | once-off | create | |