[Python-Dev] sgmllib Comments

Mon Jun 12 06:01:23 CEST 2006

Fred L. Drake, Jr. wrote:
> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
> > Planet is a feed aggregator written in Python. It depends heavily on
> > sgmllib. A recent bug report turned out to be a deficiency in sgmllib,
> > and I've submitted a test case and a patch[1] (use or discard the patch,
> > it is the test that I care about).
>
> And it's a nice aggregator to use, indeed!
>
> > While looking around, a few things surfaced. For starters, it would
> > seem that the version of sgmllib in SVN HEAD will selectively unescape
> > certain character references that might appear in an attribute. I say
> > selectively, as:
> >
> > * it will unescape &amp;
> > * it won't unescape &copy;
> > * it will unescape &#38;
> > * it won't unescape &#x26;
> > * it will unescape &#146;
> > * it won't unescape &#8217;
>
> And just why would you use sgmllib to handle RSS or ATOM feeds? Neither is
> defined in terms of SGML. The sgmllib documentation also notes that it isn't 
> really a fully general SGML parser (it isn't), but that it exists primarily 
> as a foundation for htmllib.

The feed itself is read first with SAX (then with a fallback using 
sgmllib if the feed is not well formed, but that's beside the point). 
The embedded HTML portions are then processed with subclasses of
sgmllib.
> > There are a number of issues here. While not unescaping anything is
> > suboptimal, at least the recipient is aware of exactly which characters
> > have been unescaped (i.e., none of them). The proposed solution makes
> > it impossible for the recipient to know which characters are unescaped,
> > and which are original. (Note: feeds often contain such abominations as
> > &amp;copy; which the new code will treat indistinguishably from &copy;)
>
> My suspicion is that the "right" thing to do at the sgmllib level is to
> categorize the markup and call a method depending on what the entity 
> reference is, and let that handle whatever it is. For SGML, that means we 
> have things like &name; (entity references), &#123; (character references), 
> and that's it. &#x123; isn't legal SGML under any circumstance; 
> the "&#x<number>;" syntax was introduced with XML.

... but it effectively is valid HTML. And as you point out below 
sgmllib's raison d’être is to support htmllib.
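
To make the categorize-and-dispatch idea concrete, here is a minimal sketch. It is standalone Python 3 rather than a patch to sgmllib, and the names REF and classify are hypothetical, chosen only for illustration. It recognises the three forms discussed above (named entity references, decimal character references, and the XML-style hex form) and reports which kind each one is, leaving the decision about what to do with it to the caller.

```python
import re

# Hypothetical pattern: &#xHEX; | &#DEC; | &name;
REF = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));")

def classify(text):
    """Yield a (kind, value) pair for each reference found in text."""
    for match in REF.finditer(text):
        hex_digits, dec_digits, name = match.groups()
        if hex_digits is not None:
            yield ("hex_charref", int(hex_digits, 16))
        elif dec_digits is not None:
            yield ("charref", int(dec_digits))
        else:
            yield ("entityref", name)

if __name__ == "__main__":
    for kind, value in classify("&amp; &copy; &#38; &#x26; &#146; &#8217;"):
        print(kind, value)
```

A parser built this way could call one overridable method per kind, so a subclass decides whether a given reference is unescaped or passed through untouched.
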
> > Additionally, there is a unicode issue here - one that is shared by
> > handle_charref, but at least that method is overrideable. If unescaping
> > remains, do it for hex character references and for values greater than
> > 8 bits, i.e., use unichr instead of chr if the value is greater than 127.
>
> For SGML, it's worse than that, since the document character set is defined in
> the SGML declaration, which is a far hairier beast than an XML 
> declaration. :-)

Understood.
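
For what it's worth, the consistent behaviour I have in mind can be sketched as follows. This is an illustration only, not what sgmllib does: it is written for Python 3, where chr covers the full Unicode range (the role unichr plays in Python 2), and the helper name unescape_attr is hypothetical. Every recognised reference is unescaped in a single pass, and anything unrecognised is left exactly as written.

```python
import re
from html.entities import name2codepoint

# Hypothetical pattern: &#xHEX; | &#DEC; | &name;
REF = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));")

def unescape_attr(value):
    """Unescape named, decimal and hex references in one pass."""
    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits is not None:
                return chr(int(hex_digits, 16))
            if dec_digits is not None:
                return chr(int(dec_digits))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: leave it untouched.
            return match.group(0)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Unknown entity name: leave it untouched.
        return match.group(0)
    return REF.sub(replace, value)
```

Because the substitution is a single pass, &amp;copy; comes out as the literal text &copy; while &copy; comes out as the copyright sign, so the two stay distinguishable.
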
> It really sounds like sgmllib is the wrong foundation for this. While the 
> module has some questionable behaviors, none of them are significant in
> its intended context (support for htmllib). Now, I understand that
> RSS has historical issues, with HTML-as-practiced getting embedded as payload 
> data with various flavors of escaping applied, and I'm not an expert in the 
> details of that. Have you looked at HTMLParser as an alternative to sgmllib?
> It has better support for XHTML constructs.

HTMLParser is less forgiving, and generally less suitable for consuming 
HTML as practiced.

- Sam Ruby