html-parsing: Permissive Parsing of HTML to SXML

Neil Van Dyke

License: LGPLv3 Web: http://www.neilvandyke.org/racket/html-parsing/

(require html-parsing)    package: html-parsing

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing extends this by attempting to recover syntactic structure.
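Because the emitted SXML/xexp is ordinary nested list data, it can also be walked with plain Racket, without any XML tooling. The following is a minimal sketch, assuming the html-parsing package is installed. The helper collect-hrefs is hypothetical (it is not part of html-parsing); it gathers the href attribute values of all a elements. An SXPath query would likely express the same thing more concisely.

```racket
#lang racket/base
(require html-parsing racket/list racket/match)

;; collect-hrefs : xexp -> (listof string)
;; Hypothetical helper for illustration; not provided by html-parsing.
;; Walks the tree and returns the href values of every a element, in order.
(define (collect-hrefs x)
  (match x
    ;; An a element that carries an attribute list: (a (@ (name value) ...) kid ...)
    [(list 'a (list '@ attrs ...) kids ...)
     (append (for/list ([attr (in-list attrs)]
                        #:when (and (pair? attr)
                                    (eq? (car attr) 'href)
                                    (pair? (cdr attr))))
               (cadr attr))
             (append-map collect-hrefs kids))]
    ;; Any other element, including *TOP*: recur into its children.
    [(list (? symbol?) kids ...) (append-map collect-hrefs kids)]
    ;; Strings and anything else contribute nothing.
    [_ '()]))

(define page
  (html->xexp "<p><a href=\"one.html\">1</a> and <a href=\"two.html\">2</a></p>"))

(write (collect-hrefs page))
(newline)
```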

The html-parsing parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML.
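As a small illustration of this permissiveness, the sketch below feeds the parser a deliberately malformed fragment (unclosed li and p elements, plus a stray end tag). The input string is made up for this example. The exact shape of the recovered tree depends on the parser's quirks handling, so the result is printed rather than assumed; the only property relied on is that the result is a list headed by *TOP*, as in the other examples in this document.

```racket
#lang racket/base
(require html-parsing)

;; Invalid HTML: li and p are never closed, and </b> has no matching start tag.
(define broken "<ul><li>one<li>two</b></ul><p>tail")

;; No parse error is expected here; the parser recovers some structure instead.
(define result (html->xexp broken))

(write result)
(newline)
```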

html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.

2 Interface

procedure
(html->xexp input) → xexp
  input : (or/c input-port? string?)

Parse HTML permissively from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:

> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still &lt; bold </b>"
    "</body><P> But not done yet..."))

(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))

Note that, in the emitted SXML/xexp, the text token "still < bold" is not inside the b element. Some old Web browsers would keep that text inside the b element; this is one instance of browser quirks handling of invalid HTML that this parser does not try to emulate.
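Since input may be either a string or an input port, the two forms can be compared directly. The following is a minimal sketch, assuming the html-parsing package is installed; the HTML fragment is made up for illustration. Reading from a file would use a port in the same way, for example via call-with-input-file with a file name of your own.

```racket
#lang racket/base
(require html-parsing)

(define html "<a href=\"url\">link</a>")

;; Parse the same text once as a string and once through an input port.
(define from-string (html->xexp html))
(define from-port (html->xexp (open-input-string html)))

;; Both calls are expected to produce the same xexp.
(write (equal? from-string from-port))
(newline)
```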

3 History

Version 11:0 — 2022-07-19
- An object element is no longer considered always-empty. Incrementing major version again, because this could break parses.
Version 10:0 — 2022-07-19
- To pass a "microformats" test suite (impliedname.html), an area element can now be a child of a span element. In the future, we might be even more flexible about where span elements are permitted. (Thanks to Jacob Hall for discussing.)
Version 9:0 — 2022-04-16
- Header elements may once again appear as children of li elements (which we broke in the previous version), as we see how far we can stretch a 20-year-old hack for invalid HTML. (Thanks to Simon Budinsky for reporting.)
Version 8:0 — 2022-04-03
- The original "H" elements (h1, h2, etc.) are now parsed with "parent constraints" for handling invalid HTML, to accommodate a need to parse mid-1990s HTML in which <p> was used as a separator or terminator, rather than a start delimiter. There is a chance that this change will break a real-world scraper or other tool.
Version 7:1 — 2022-04-02
- Include a test case #:fail that was unsaved in DrRacket.
Version 7:0 — 2022-04-02
- Fixed a quirks-handling bug in which p elements would be (directly or indirectly) nested under other p elements in cases in which there was no body element, but there was an html element. (Thanks to Jonathan Simpson for reporting.)
Version 6:1 — 2022-01-22
- Permit details element to be parent of p element in quirks handling. (Thanks to Jacder for reporting.)
Version 6:0 — 2018-05-22
- Fix to permit p elements as children of blockquote elements. Incrementing major version number because this breaks behavior that had stood for 17 years, but it seems an appropriate change for modern HTML, and it fixes a particular real-world problem. (Thanks to Sorawee Porncharoenwase for reporting.)
Version 5:0 — 2018-05-15
- In a breaking change of handling invalid HTML, most named character entity references that are invalid because (possibly among multiple reasons) they are not terminated by semicolon are now treated as literal strings (including the ampersand), rather than as named character entities. For example, parser input string "A&B Co." will now parse as (p "A&B Co.") rather than as (p "A" (& B) " Co."). (Thanks to Greg Hendershott for suggesting this, and discussing.)
- For support of historical quirks handling, five early HTML named character entities (specifically, amp, apos, lt, gt, quot) do not need to be terminated with a semicolon, and will even be recognized if followed immediately by an alphabetic character. For example, "a&ltz" will now parse as (p "a<z"), rather than as (p (& ltz)).
- Invalid character entity references that are terminated by EOF rather than semicolon may now be parsed as literal strings, rather than as entity references.
Version 4:3 — 2016-12-15
- Error message “%html-parsing:parse-html: invalid input type:” now abbreviates the invalid value, to avoid possibly huge messages. (Thanks to John B. Clements.)
Version 4:2 — 2016-03-02
- Tweaked info.rkt, filenames.
Version 4:1 — 2016-02-25
- Updated deps.
- Documentation tweaks.
Version 4:0 — 2016-02-21
- Moving from PLaneT to new package system.
- Moved unit tests into main source file.
Version 3:0 — 2015-04-24
- Numeric character entities now parse to Racket strings instead of Racket characters, to bring SXML/xexp back closer to SXML. (Thanks to John Clements for reporting.)
Version 2:0 — 2012-06-13
- Converted to McFly.
Version 0.3 — Version 1:2 — 2011-08-27
- Converted test suite from Testeez to Overeasy.
Version 0.2 — Version 1:1 — 2011-08-27
- Fixed embarrassing bug due to code tidying. (Thanks to Danny Yoo for reporting.)
Version 0.1 — Version 1:0 — 2011-08-21
- Part of forked development from HtmlPrag, parser originally written 2001-04.

4 Legal

Copyright 2001–2012, 2015–2016, 2018, 2022 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.
