
webscraperhelper: Generating SXPath Queries from SXML Examples

Neil Van Dyke

License: LGPLv3 Web: http://www.neilvandyke.org/racket/webscraperhelper/

1 Introduction

The webscraperhelper package is intended as a programmer’s aid for crafting SXPath queries to extract information (e.g., news items, prices) from HTML Web pages that have been parsed by the html-parsing package. webscraperhelper accepts an example SXML document and an example “goal” subtree of the document, and yields up to three different SXPath queries. A generated query can often be incorporated into a Web-scraping program as-is, for extracting information from documents with very similar formatting. Generated queries can also be used as starting points for hand-crafted queries.
For example, given the SXML document doc:
(define doc
  '(*TOP* (html (head (title "My Title"))
                (body (@ (bgcolor "white"))
                      (p "Summary: This is a document.")
                      (div (@ (id "ResultsSection"))
                           (h2 "Results")
                           (p "These are the results.")
                           (table (@ (id "ResultTable"))
                                  (tr (td (b "Input:"))
                                      (td "2 + 2"))
                                  (tr (td (b "Output:"))
                                      (td "Four")))
                           (p "Lookin' good!"))))))
evaluating the expression:

> (webscraperhelper '(td "Four") doc)

will display generated queries like:
Absolute SXPath: (html body div table (tr 2) (td 2))
Absolute SXPath with IDs: (html body
                                (div (@ (equal? (id "ResultsSection"))))
                                (table (@ (equal? (id "ResultTable"))))
                                (tr 2) (td 2))
Relative SXPath with IDs: (// (table (@ (equal? (id "ResultTable"))))
                              (tr 2) (td 2))
 
The queries can then be compiled with the sxpath procedure of the SXPath library:
> (define query
    (sxpath '(// (table (@ (equal? (id "ResultTable"))))
                 (tr 2) (td 2))))
> (query doc)
((td "Four"))
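
To make the positional steps in the absolute query easier to read, here is a minimal sketch of one way such steps could be derived from an example document. It is not the package's implementation; the helper names element?, step-for and steps-to are hypothetical, and the rule it uses (a bare tag when the tag is unique among its siblings, otherwise (tag n) with a 1-based index) is an assumption inferred from the example output above.

```racket
#lang racket/base
(require racket/list) ; for index-of

;; An SXML element is a list headed by a tag symbol; (@ ...) is an attribute list.
(define (element? x)
  (and (pair? x) (symbol? (car x)) (not (eq? (car x) '@))))

;; Step selecting `child` under `parent`: a bare tag if no sibling element
;; shares the tag, otherwise (tag n) where n is the 1-based position among
;; the same-named siblings.
(define (step-for child parent)
  (define same
    (filter (lambda (k) (and (element? k) (eq? (car k) (car child))))
            (cdr parent)))
  (if (null? (cdr same))
      (car child)
      (list (car child) (add1 (index-of same child eq?)))))

;; Depth-first search below `node`. Yields the list of steps leading to the
;; first descendant element that is equal? to `goal`, or #f if there is none.
(define (steps-to goal node)
  (for/or ([child (in-list (cdr node))])
    (and (element? child)
         (if (equal? child goal)
             (list (step-for child node))
             (let ([rest (steps-to goal child)])
               (and rest (cons (step-for child node) rest)))))))

(define doc
  '(*TOP* (html (head (title "My Title"))
                (body (@ (bgcolor "white"))
                      (p "Summary: This is a document.")
                      (div (@ (id "ResultsSection"))
                           (h2 "Results")
                           (p "These are the results.")
                           (table (@ (id "ResultTable"))
                                  (tr (td (b "Input:"))
                                      (td "2 + 2"))
                                  (tr (td (b "Output:"))
                                      (td "Four")))
                           (p "Lookin' good!"))))))

(steps-to '(td "Four") doc) ; => (html body div table (tr 2) (td 2))
```

The sketch starts below the *TOP* node, so the first step is html, as in the generated absolute query. Positional steps like (tr 2) break as soon as a page gains or loses a row, which is one reason the ID-based variants are often the better starting point.
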
webscraperhelper comes with an advertising jingle (with apologies to greasy ground bovine additive Americana):

Webscraperhelper
helps a programmer
scrape the
Web a great deal!

This package was originally written for R5RS Scheme with SRFI-11 and SRFI-16.

2 Interactive Interface

In this version, the “interactive” interface is a procedure intended to be invoked manually from a REPL.

procedure

(webscraperhelper goal sxml [ids]) → void/c

  goal : any/c
  sxml : sxml?
  ids : (listof symbol?) = '(id)
Displays some SXPath queries yielding SXML goal from document sxml.
goal is the desired SXML element node.
sxml is the document in SXML First Normal Form (1NF). Some nested nodelists emitted by SXML transformation tools, such as attributes nested in extra list levels, are not permitted.
The optional ids is a list of name symbols for element attributes that can be treated as unique identifiers. If ids is not given, then the default is '(id).
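
As an illustration of what ids controls, the following sketch builds the kind of attribute-predicate step seen in the generated "with IDs" queries. It is an assumption about the behavior rather than the package's code, and the helper names attributes and id-step are hypothetical.

```racket
#lang racket/base

;; Attribute entries of an SXML 1NF element: the contents of a leading
;; (@ ...) child, or the empty list when the element has no attributes.
(define (attributes elem)
  (define kids (cdr elem))
  (if (and (pair? kids) (pair? (car kids)) (eq? (car (car kids)) '@))
      (cdr (car kids))
      '()))

;; If `elem` carries an attribute whose name is listed in `ids`, yield an
;; SXPath step of the form (tag (@ (equal? (name "value")))); otherwise #f.
(define (id-step elem ids)
  (for/or ([attr (in-list (attributes elem))])
    (and (memq (car attr) ids)
         `(,(car elem) (@ (equal? ,attr))))))

(id-step '(table (@ (id "ResultTable")) (tr (td "x"))) '(id))
;; => (table (@ (equal? (id "ResultTable"))))

(id-step '(a (@ (name "top") (href "#")) "Top") '(id))
;; => #f

(id-step '(a (@ (name "top") (href "#")) "Top") '(id name))
;; => (a (@ (equal? (name "top"))))
```

Under this reading, passing something like '(id name) as ids would let name attributes anchor a query on pages that use them as identifiers.
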

3 Programmatic Interface

The following procedures were exposed only for tinkering, and were documented badly.

procedure

(find-wsh-path goal sxml) → (or/c #f wsh-path?)

  goal : any/c
  sxml : sxml?
Yields a wsh-path? to goal within sxml, or #f if no path could be found. The yielded path might share structure with sxml.

procedure

(wsh-path->sxpath-abs path) → any

  path : any/c
(wsh-path->sxpath-absids+relids path) → any
  path : any/c
(wsh-path->sxpath-abs+absids+relids path) → any
  path : any/c
Translate a wsh-path? to various SXPath queries. The yielded SXPath query lists should be considered immutable, as they might share structure with the original SXML from which path was generated, or multiple queries might share structure with each other.
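
A hedged sketch of how these procedures might be combined follows. The module path in the require form is an assumption based on the package name, and because the result contracts are only given as any, the sketch collects whatever values come back into a list instead of assuming a single return value.

```racket
#lang racket/base
(require webscraperhelper) ; assumed module path, not confirmed by this document

(define doc
  '(*TOP* (html (body (table (@ (id "ResultTable"))
                             (tr (td "Input:") (td "2 + 2"))
                             (tr (td "Output:") (td "Four")))))))

;; find-wsh-path yields #f when the goal cannot be located.
(define path (find-wsh-path '(td "Four") doc))

;; Gather every value a translator returns into one list, since the number
;; of returned values is not stated in the documentation.
(define (queries-from translate)
  (call-with-values (lambda () (translate path)) list))

(when path
  (for ([translate (in-list (list wsh-path->sxpath-abs
                                  wsh-path->sxpath-absids+relids
                                  wsh-path->sxpath-abs+absids+relids))])
    (write (queries-from translate))
    (newline)))
```

Since the yielded query lists may share structure with doc, treat them as read-only and copy them before making any modification.
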

4 History

  • Version 2:0 (2016-02-28)
    • Moving from PLaneT to new package system.

  • Version 1:2 (2009-03-14)
    • Minor documentation change.

  • Version 1:1 (2009-02-24)
    • License now LGPL 3.

    • Converted to author’s new Scheme administration system.

  • Version 1:0 (2005-07-04)
    • Documentation update, plus get it into PLaneT 299/3xx.

  • Version 0.2 (2004-08-16)
    • Corrected typographical error in attributions.

  • Version 0.1 (2004-07-31)
    • Initial version.

5 Legal

Copyright 2004, 2005, 2009, 2016 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.
