>
This is a preliminary document which will be updated over time. Upates will be available for downloading.
Introduction and a simple example
LHTML parse output format
Case mode notes
Parsing HTML comments
Parsing <SCRIPT> and <STYLE> tags
Parsing SGML <! tags
Parsing Illegal and Deprecated Tags
Default Attribute Values
Parsing Interleaved Character Formatting Tags
parse-html reference
methods
phtml-internal
The parse-html generic function processes HTML input, returning a list of HTML tags, attributes, and text.
The :phtml module is loaded with the form (require :phtml). Symbols naming functionality in the module are in the net.html.parser package. Examples in this document assume (use-package :net.html.parser) has been evaluated.
Here is a simple example (we have added carriage returns in the string for readability):
(pprint (parse-html "<HTML> <HEAD> <TITLE>Example HTML input</TITLE> <BODY> <P>Here is some text with a <B>bold</B> word <br>and a <A HREF=\"help.html\">link</A></P> </HTML>"))
generates:
((:html (:head (:title "Example HTML input")) (:body (:p "Here is some text with a " (:b "bold") "word" :br "and a " ((:a :href "help.html") "link")))))
The output format is known as LHTML (Lisp HTML) format; it is the same format that the aserve htmlgen macro accepts. As the example shows, it is a nested collection of lists where the first element of each list is a keyword associated with an HTML marker, or a list whose first element is a keyword associated with an HTML marker and the other elements are attribute names and values. (The link is an example where the first element is a list.)
LHTML is a list representation of HTML tags and content.
Each list member may be:
More on the list (# 3) member: if the HTML tag does not have associated attributes, then the first list member will be a keyword package symbol representing the HTML tag, and the other elements will represent the content, which can be a string (text content), a keyword package symbol (HTML tag with no attributes or content), or list (nested HTML tag with associated attributes and/or content). If there are associated attributes, then the first list member will be a list containing a keyword package symbol followed by two list members for each associated attribute; the first member is a keyword package symbol representing the attribute, and the next member is a string corresponding to the attribute value.
If excl:*current-case-mode*
is :CASE-INSENSITIVE-UPPER
,
keyword package symbols will be in upper case; otherwise, they will be in lower case.
HTML comments are represented with a :comment
symbol. For example,
(parse-html "<!-- this is a comment-->") --> ((:comment " this is a comment"))
All <SCRIPT> and <STYLE> content is not parsed; it is returned as text content.
For example,
(parse-html "<SCRIPT>this <B>will not</B> be parsed</SCRIPT>") --> ((:script "this <B>will not</B> be parsed"))
Since, some HTML pages contain special XML/SGML tags, non-comment tags starting with '<!' are treated specially:
(parse-html "<!doctype this is some text>") --> ((:!doctype " this is some text"))
There is plenty of illegal and deprecated HTML on the web that popular browsers nonetheless successfully display. The parse-html parser is generous - it will not raise an error condition upon encountering most input. In particular, it does not maintain a list of legal HTML tags and will successfully parse nonsense input.
For example,
(parse-html "<this> <is> <some> <nonsense> <input>") --> ((:this (:is (:some (:nonsense :input)))))
In some situations, you may prefer a two-pass parse that results in a parse where deep nesting related to unrecognized tags is minimized:
(let ((string "<this> <is> <some> <nonsense> </some> <input>")) (multiple-value-bind (res rogues) (parse-html string :collect-rogue-tags t) (declare (ignorable res)) (parse-html string :no-body-tags rogues))) --> (:this :is (:some (:nonsense)) :input)
See the descriptions of the collect-rogue-tags and no-body-tags keyword arguments to parse-html descriptions in the reference section below for more information.
As per the HTML 4.0 specification, attributes without specified values are given a lower case string value that matches the attribute name.
For example,
(parse-html "<P here ARE some attributes>") --> (((:p :here "here" :are "are" :some "some" :attributes "attributes")))
Existing HTML pages often have character format tags that are interleaved among other tags. Such interleaving is removed in a manner consistent with the HTML 4.0 specification.
For example,
(parse-html "<P>Here is <B>bold text<P>that spans</B>two paragraphs") --> ((:p "Here is " (:b "bold text")) (:p (:b "that spans") "two paragraphs"))
Arguments: input-source &key callbacks callback-only collect-rogue-tags no-body-tags eliminate-blank-strings parse-entities
Returns LHTML output, as described above in this document.
parse-html Methods
Methods specializing on the required argument are defined for stream
(reads output from the stream) and string
(parses the argument string).
parse-html (p stream) &key callbacks callback-only collect-rogue-tags no-body-tagsparse-html (str string) &key callbacks callback-only collect-rogue-tags no-body-tagsparse-html (file t) &key callbacks callback-only collect-rogue-tags no-body-tags
The t method assumes the argument is a pathname suitable for use with the with-open-file macro.
phtml-internal [Function]
Arguments: stream read-sequence-func callback-only callbacks collect-rogue-tags no-body-tags
This function may be used when more control is needed for supplying the HTML input. The read-sequence-func argument, if non-nil, should be a function object or a symbol naming a function. When phtml-internal requires another buffer of HTML input, it will invoke the read-sequence-func function with two arguments - the first argument is an internal buffer character array and the second argument is the phtml-internal stream argument. If read-sequence-fun is nil, phtml-internal will invoke read-sequence to fill the buffer. The read-sequence-func function must return the number of character array elements successfully stored in the buffer.
Copyright (c) Franz Inc. Lafayette, CA. All rights reserved.