This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Classification
Title: Can SGMLParser properly handle tags?
Type: behavior
Stage: resolved
Components: Extension Modules, Library (Lib), XML
Versions: Python 2.6

Process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.araujo, ezio.melotti, once-off
Priority: normal
Keywords: easy

Created on 2009-03-17 11:19 by once-off, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name      Uploaded                     Description
sgml_error.py  once-off, 2009-03-17 11:19
Messages (3)
msg83667 - Author: once-off Date: 2009-03-17 11:19
The attached script (sgml_error.py) was designed to output XML files
unchanged, other than expanding <empty/> tags into an opening and
closing tag, such as <empty></empty>.
It seems the SGMLParser class recognizes an empty tag, but does not emit
the closing tag until the NEXT forward slash it sees. So everything from
the forward slash in <empty/> (even the closing angle bracket) until the
next forward slash is considered to be textual data. See the following
command-line output.
Have I missed something here (like a conscious design limitation on the
class, an error on my part, etc), or is this really a bug with the class?
C:\Python24\Lib>python sgmllib.py H:\input.xml
start tag: <root>
data: '\n '
start tag: <tag1>
end tag: </tag1>
data: '\n '
start tag: <tag2>
data: '>\n <tag3>hello<'
end tag: </tag2>
data: 'tag3>\n'
end tag: </root>
C:\Python24\Lib>python
ActivePython 2.4.3 Build 12 (ActiveState Software Inc.) based on
Python 2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sgml_error
Input:
<root>
 <tag1></tag1>
 <tag2/>
 <tag3>hello</tag3>
</root>
Output:
<root>
 <tag1></tag1>
 <tag2>>
 <tag3>hello<</tag2>tag3>
</root>
Expected:
<root>
 <tag1></tag1>
 <tag2></tag2>
 <tag3>hello</tag3>
</root>
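
The behaviour reported above (data running from the slash in <tag2/> up to the next slash) is what one would expect if the parser reads <tag2/ as the opening half of an SGML short tag of the form <name/data/. The sketch below only illustrates that reading. The SHORTTAG pattern and the first_short_tag helper are assumptions made for this example, written for Python 3, and are not code taken from sgmllib.

```python
import re

# Assumed pattern for an SGML short tag, <name/data/ (illustrative only):
# a tag name, a slash, any run of non-slash characters, then a closing slash.
SHORTTAG = re.compile(r"<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/")

# The input from the report.
SAMPLE = "<root>\n <tag1></tag1>\n <tag2/>\n <tag3>hello</tag3>\n</root>\n"


def first_short_tag(text):
    """Return (tag name, enclosed data) for the first short tag, or None."""
    match = SHORTTAG.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(first_short_tag(SAMPLE))
```

On the sample input this yields tag2 with the data '>\n <tag3>hello<', which matches the data line between the tag2 start and end tags in the sgmllib output above.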
msg98424 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-01-27 13:40
Hello
XML constructs of the form <tag/> are an SGML hack, or more precisely the combination of two features of SGML. The forward slash closes the tag, and the following angle bracket is character data, not part of the tag.
The W3C validator uses a real SGML parser for HTML doctypes, and fails on XML-like /> constructs: http://validator.w3.org/check?uri=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fhtml4%2Fstrict.dtd%22%3E+%3Chtml%3E+%3Chead%3E+++%3Ctitle%3ETest%3C%2Ftitle%3E+++%3Cmeta+name%3Dtest+content%3Done%2F%3E+++%3Cmeta+name%3Dbug+content%3Dtwo%3E+%3C%2Fhead%3E+%3Cbody%3E+++%3Cp%3ETest%3C%2Fp%3E+%3C%2Fbody%3E+%3C%2Fhtml%3E&charset=%28detect+automatically%29&doctype=Inline&group=1&verbose=1
The complete explanation can be read at http://www.cs.tut.fi/~jkorpela/html/empty.html
In conclusion, sgmllib is right. Use an XML parser for XML or an HTML5 parser for HTML.
Kind regards
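
The advice to use an XML parser for XML can be sketched as follows for the original goal of expanding <empty/> tags. This is a minimal Python 3 example, assuming the standard library's xml.etree.ElementTree and its short_empty_elements option (available since Python 3.4). The expand_empty_tags helper is a hypothetical name introduced here, not something from the attached script.

```python
import xml.etree.ElementTree as ET


def expand_empty_tags(xml_text):
    """Re-serialize XML so that <tag/> is written as <tag></tag>."""
    # Parse with a real XML parser, which understands <tag/> as an empty element.
    root = ET.fromstring(xml_text)
    # short_empty_elements=False forces an explicit closing tag for empty elements.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


if __name__ == "__main__":
    sample = "<root>\n <tag1></tag1>\n <tag2/>\n <tag3>hello</tag3>\n</root>"
    print(expand_empty_tags(sample))
```

Note that ElementTree does not preserve everything outside the element tree: the XML declaration, comments, and any DOCTYPE are dropped by this approach, so the output is not guaranteed to be byte-for-byte identical to the input apart from the expanded tags.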
msg98425 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-01-27 13:45
Damn, the URI got fubared :/ Anyway, I just wanted to give an example of the verbose error message, but the second link will contain enough explanation.
Regards
History
Date                 User          Action  Args
2022-04-11 14:56:46  admin         set     github: 49748
2010-02-05 16:23:31  ezio.melotti  set     nosy: + ezio.melotti
2010-02-05 16:00:44  ezio.melotti  set     status: open -> closed
                                           priority: normal
                                           resolution: not a bug
                                           stage: test needed -> resolved
2010-01-27 13:45:03  eric.araujo   set     messages: + msg98425
2010-01-27 13:40:36  eric.araujo   set     nosy: + eric.araujo
                                           messages: + msg98424
2009-04-22 14:38:05  ajaksu2       set     keywords: + easy
                                           stage: test needed
                                           versions: + Python 2.6, - Python 2.5, Python 2.4, 3rd party
2009-03-17 11:19:34  once-off      create
