Issue 5498: Can SGMLParser properly handle <empty/> tags?

This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49748

classification

Title:	Can SGMLParser properly handle <empty/> tags?
Type:	behavior	Stage:	resolved
Components:	Extension Modules, Library (Lib), XML	Versions:	Python 2.6

process

Dependencies:	Superseder:
Status:	closed	Resolution:	not a bug
Assigned To:	Nosy List:	eric.araujo, ezio.melotti, once-off
Priority:	normal	Keywords:	easy

Created on 2009-03-17 11:19 by once-off, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
sgml_error.py	once-off, 2009-03-17 11:19

Messages (3)
msg83667	Author: (once-off)	Date: 2009-03-17 11:19
The attached script (sgml_error.py) was designed to output XML files unchanged, other than expanding <empty/> tags into an opening and closing tag, such as <empty></empty>. It seems the SGMLParser class recognizes an empty tag, but does not emit the closing tag until the NEXT forward slash it sees. So everything from the forward slash in <empty/> (even the closing angle bracket) until the next forward slash is considered to be textual data. See the following line output. Have I missed something here (like a conscious design limitation on the class, an error on my part, etc), or is this really a bug with the class?

C:\Python24\Lib>python sgmllib.py H:\input.xml
start tag: <root>
data: '\n '
start tag: <tag1>
end tag: </tag1>
data: '\n '
start tag: <tag2>
data: '>\n <tag3>hello<'
end tag: </tag2>
data: 'tag3>\n'
end tag: </root>

C:\Python24\Lib>python
ActivePython 2.4.3 Build 12 (ActiveState Software Inc.) based on
Python 2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sgml_error

Input:
<root>
 <tag1></tag1>
 <tag2/>
 <tag3>hello</tag3>
</root>

Output:
<root>
 <tag1></tag1>
 <tag2>>
 <tag3>hello<</tag2>tag3>
</root>

Expected:
<root>
 <tag1></tag1>
 <tag2></tag2>
 <tag3>hello</tag3>
</root>
msg98424	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2010-01-27 13:40
Hello,

XML constructs of the form <tag/> are an SGML hack, or more precisely the combination of two features of SGML. The forward slash closes the tag, and the following angle bracket is character data, not part of the tag. The W3C validator uses a real SGML parser for HTML doctypes, and fails on XML-like /> constructs:
http://validator.w3.org/check?uri=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fhtml4%2Fstrict.dtd%22%3E+%3Chtml%3E+%3Chead%3E+++%3Ctitle%3ETest%3C%2Ftitle%3E+++%3Cmeta+name%3Dtest+content%3Done%2F%3E+++%3Cmeta+name%3Dbug+content%3Dtwo%3E+%3C%2Fhead%3E+%3Cbody%3E+++%3Cp%3ETest%3C%2Fp%3E+%3C%2Fbody%3E+%3C%2Fhtml%3E&charset=%28detect+automatically%29&doctype=Inline&group=1&verbose=1

The complete explanation can be read at http://www.cs.tut.fi/~jkorpela/html/empty.html

In conclusion, sgmllib is right. Use an XML parser for XML or an HTML5 parser for HTML.

Kind regards
msg98425	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2010-01-27 13:45
Damn, the URI got fubared :/ Anyway, I just wanted to give an example of the verbose error message, but the second link will contain enough explanation. Regards
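
Following the advice in msg98424 to use an XML parser for XML, here is a minimal sketch of how the expansion requested in msg83667 might look with the standard library's expat bindings. It is not the attached sgml_error.py, which is not reproduced on this page. The function name expand_empty_tags and the sample string (with assumed one-space indentation) are illustrative only. The sketch drops comments, processing instructions, and the XML declaration, and does not handle namespaces specially.

```python
import io
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def expand_empty_tags(xml_text):
    """Return xml_text with every <tag/> written out as <tag></tag>.

    Hypothetical helper for illustration; only elements, attributes and
    character data inside the root element are echoed.
    """
    out = io.StringIO()

    def start(name, attrs):
        # An XML parser reports <tag/> as a start event immediately
        # followed by an end event, so no special casing is needed.
        rendered = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        out.write("<%s%s>" % (name, rendered))

    def end(name):
        out.write("</%s>" % name)

    def text(data):
        # Re-escape &, < and > so the output stays well-formed.
        out.write(escape(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(xml_text, True)
    return out.getvalue()


# Sample modelled on the input in msg83667 (indentation assumed).
sample = "<root>\n <tag1></tag1>\n <tag2/>\n <tag3>hello</tag3>\n</root>"
print(expand_empty_tags(sample))
```

Because the XML parser treats /> as the end of an empty element rather than applying SGML short-tag rules, <tag2/> comes out as <tag2></tag2> and the following <tag3> element is left intact.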

History
Date	User	Action	Args
2022-04-11 14:56:46	admin	set	github: 49748
2010-02-05 16:23:31	ezio.melotti	set	nosy: + ezio.melotti
2010-02-05 16:00:44	ezio.melotti	set	status: open -> closed priority: normal resolution: not a bug stage: test needed -> resolved
2010-01-27 13:45:03	eric.araujo	set	messages: + msg98425
2010-01-27 13:40:36	eric.araujo	set	nosy: + eric.araujo messages: + msg98424
2009-04-22 14:38:05	ajaksu2	set	keywords: + easy stage: test needed versions: + Python 2.6, - Python 2.5, Python 2.4, 3rd party
2009-03-17 11:19:34	once-off	create
