brandonrobertz/autoscrape-py

AutoScrape

A project of Artificial Informer Labs.

AutoScrape is an automated scraper of structured data from interactive web pages. You point this scraper at a site, give it a little information and structured data can then be extracted. No brittle, site-specific programming necessary.

This is an implementation of the web scraping framework described in the paper, Robust Web Scraping in the Public Interest with AutoScrape and presented at Computation + Journalism Symposium 2019. This is an experimental work in progress!

Currently there are a few ways to use AutoScrape:

  • via a full Web interface for scraping (see bottom of page, make sure to pull in the submodule!)
  • as a local CLI python script
  • as a simplified web scraping framework

Installation and running instructions for each are provided below.

Quickstart

Two ways, easiest first.

pip install autoscrape[all]
autoscrape --backend requests --output outdir --maxdepth 2 https://bxroberts.org

This will install all dependencies for all backends and various options.

Or:

git clone https://github.com/brandonrobertz/autoscrape-py
cd autoscrape-py/
pip install .[all]
autoscrape --backend requests --output outdir --maxdepth 2 https://bxroberts.org

Either way, you can now use autoscrape from the command line.
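
If you would rather launch the quickstart crawl from a Python script, the following is a minimal sketch that shells out to the CLI with the standard library. The helper names build_autoscrape_command and run_autoscrape are hypothetical and not part of AutoScrape; only the flags shown in the quickstart are used.

```python
import shlex
import subprocess


def build_autoscrape_command(url, output="outdir", backend="requests", maxdepth=2):
    """Assemble the argv list for the quickstart command (hypothetical helper)."""
    return [
        "autoscrape",
        "--backend", backend,
        "--output", output,
        "--maxdepth", str(maxdepth),
        url,
    ]


def run_autoscrape(url, **options):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    command = build_autoscrape_command(url, **options)
    print("Running:", shlex.join(command))
    return subprocess.run(command, check=True)
```

Calling run_autoscrape("https://bxroberts.org") would then be equivalent to the quickstart command, assuming the autoscrape executable is on your PATH.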

Usage Examples

Here are some straightforward use cases for AutoScrape and how you'd use the CLI tool to execute them. These, of course, assume you have the dependencies installed.

Crawler Backends

There are three backends available for driving AutoScrape: requests, selenium and warc. The requests backend (the default) is based on the Python requests library and is only capable of crawling sites and submitting simple HTTP forms. For richer interaction with forms or JavaScript-powered buttons, you'll need to use the selenium backend.

You can control the backend with the --backend option:

autoscrape \
 --backend requests \
 --output requests_crawled_site \
 'https://some.page/to-crawl'

In order to use backends other than requests, you need to install the proper dependencies. pip install autoscrape[all] will install everything required for all backends/functionality, but you can also install dependencies in isolation:

  • Selenium backend: pip install autoscrape[selenium-backend]
  • Crawl graph builder (for use with --save-graph): pip install autoscrape[graph]
  • WARC backend: pip install autoscrape[warc-backend]

Note that for the Selenium backend, you need to install geckodriver or chromedriver, depending on whether you're using Firefox or Chrome, respectively. More information is below in the External Dependencies section.

Crawl

Crawl an entire website, saving all HTML and stylesheets (no screenshots):

autoscrape \
 --backend requests \
 --maxdepth -1 \
 --output crawled_site \
 'https://some.page/to-crawl'
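
To make the --maxdepth values concrete (0 stays on the starting page, -1 removes the limit), here is a small sketch of a depth-limited crawl over a hypothetical in-memory site. It is not AutoScrape's crawler: the crawl function, the SITE dictionary and the breadth-first visiting order are all assumptions chosen for illustration.

```python
from collections import deque


def crawl(links, start, maxdepth=10):
    """Return pages reachable from `start` in visit order, honoring maxdepth.

    `links` is a hypothetical site map: page -> list of pages it links to.
    maxdepth=0 visits only the start page; maxdepth=-1 means no depth limit.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if maxdepth != -1 and depth >= maxdepth:
            continue  # do not follow links out of pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited


# A made-up site used only for this example.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": ["/"],
}

print(crawl(SITE, "/", maxdepth=0))   # ['/']
print(crawl(SITE, "/", maxdepth=1))   # ['/', '/a', '/b']
print(crawl(SITE, "/", maxdepth=-1))  # ['/', '/a', '/b', '/a/1', '/a/1/x']
```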

Archive Page (Screenshot & Code)

Archive a single webpage, both code and full-content screenshot (PNG), for future reference:

autoscrape \
 --backend selenium \
 --full-page-screenshots \
 --load-images --maxdepth 0 \
 --save-screenshots --driver Firefox \
 --output archived_webpage \
 'https://some.page/to-archive'

Search Forms and Crawl Result Pages

Query a web form, identified by its containing the text "I'm a search form", by entering "NAME" into the first (0th) text input field and selecting January 20th, 1992 in the second (1st) date field. Then click all buttons with the text "Next ->" to get all results pages:

autoscrape \
 --backend selenium \
 --output search_query_data \
 --form-match "I'm a search form" \
 --input "i:0:NAME,d:1:1992-01-20" \
 --next-match "Next ->" \
 'https://some.page/search?s=newquery'
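
The --input value is a comma-separated list of type:index:value triples. As a rough illustration of how such a descriptor breaks down, here is a small parser sketch. It is not AutoScrape's own parsing code: FormInput, INPUT_TYPES and parse_input_descriptor are hypothetical names, and the sketch assumes values never contain commas.

```python
from typing import List, NamedTuple

# Type codes documented for --input: checkbox, input box, option select, date.
INPUT_TYPES = {"c": "checkbox", "i": "input box", "s": "option select", "d": "date"}


class FormInput(NamedTuple):
    kind: str    # human-readable input type
    index: int   # position within the list of inputs of that type
    value: str   # what to enter, select, or check


def parse_input_descriptor(descriptor: str) -> List[FormInput]:
    """Split a descriptor such as 'i:0:NAME,d:1:1992-01-20' into actions."""
    actions = []
    for part in descriptor.split(","):
        # Split on the first two colons only, so the value may contain colons.
        code, index, value = part.split(":", 2)
        if code not in INPUT_TYPES:
            raise ValueError(f"unknown input type {code!r} in {part!r}")
        actions.append(FormInput(INPUT_TYPES[code], int(index), value))
    return actions


for action in parse_input_descriptor("i:0:NAME,d:1:1992-01-20"):
    print(action)
```

Each input type keeps its own index list, which is why the text input and the date input here are addressed as 0 and 1 independently of any checkboxes or selects on the form.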

Setup for Standalone Local CLI

External Dependencies

If you want to use the selenium backend for interactive crawling, you need to have geckodriver installed. You can download it here:

https://github.com/mozilla/geckodriver/releases

Or through your package manager:

apt install firefox-geckodriver

Your geckodriver needs to be compatible with your current version of Firefox or you will get errors. If you install Firefox and the driver through your package manager, you should be okay, but it's not guaranteed. We have specific versions of both pinned in the Dockerfile.

If you prefer to use Chrome, you will need the ChromeDriver (we've tested using v2.41). It can be found in your distribution's package manager or here:

https://sites.google.com/a/chromium.org/chromedriver/downloads

Installing the remaining Python dependencies can be done using pip.

Pip Install Method

Next you need to set up your python virtual environment (Python 3.6 required) and install the Python dependencies:

pip install -r requirements.txt

Running Standalone Scraper

Environment Test Crawler

You can run a test to ensure your webdriver is set up correctly by running the test crawler:

./autoscrape --backend selenium --show-browser [SITE_URL]

The test crawler will just do a depth-first click-only crawl of an entire website. It will not interact with forms or POST data. Data will be saved to ./autoscrape-data/ (the default output directory).

Manual Config-Based Scraper

AutoScrape has a manually controlled mode, similar to wget, except that it has interactive capabilities: it can input data to search forms, follow "next page"-type buttons, etc. This functionality can be used either as a standalone crawler/scraper or as a method to build a training set for the automated scrapers.

AutoScrape manual-mode full options:

AUTOSCRAPE - Interactively crawl, find searchable forms,
input data to them and scrape data on the results, from an
initial BASEURL.
Usage:
 autoscrape [options] BASEURL
General Options:
 --backend BACKEND
 The backend to use. Currently one of "selenium", "requests" or
 "warc". The requests browser is only capable of crawling, but
 is approximately 2-3.5x faster. WARC is for emulating browsing
 through Common Crawl archival data.
 [default: selenium]
 --loglevel LEVEL
 Loglevel, note that DEBUG is extremely verbose.
 [default: INFO]
 --quiet
 This will silence all logging to console.
Crawl-Specific Options:
 --maxdepth DEPTH
 Maximum depth to crawl a site (in search of form
 if the option --form-match STRING is specified,
 see below). Setting to 0 means don't crawl at all,
 all operations are limited to the BASEURL page.
 Setting to -1 means unlimited maximum crawl depth.
 [default: 10]
 --max-pages NUM
 Maximum number of unique pages, in total, to fetch.
 AutoScrape will stop crawling once this is hit.
 --leave-host
 By default, autoscrape will not leave the host given
 in the BASEURL. This option lets the scraper leave
 the host.
 --only-links MATCH_STRING
 A whitelist of links to follow. All others will
 be ignored. Can be a string or a regex with
 multiple strings to match separated by a pipe
 (|) character.
 --ignore-links MATCH_STRING
 This option can be used to remove any links matching
 MATCH_STRING (can be a regex or just a string match)
 from consideration for clicking. Accepts the same
 argument format as --only-links.
 --link-priority SORT_STRING
 A string to sort the links by. In this case, any link
 containing "SORT_STRING" will be clicked before any other
 links. In most cases you probably want to use the
 whitelist, --only-links, option.
 --ignore-extensions IGNORE_EXTENSIONS
 Don't click on or download URLs pointing to files with
 these extensions.
 --result-page-links MATCH_STRINGS_LIST
 If specified, AutoScrape will click on any links matching
 this string when it arrives on a search result page.
Interactive Form Search Options:
 --form-match SEARCH_STRING
 The crawler will identify a form to search/scrape if it
 contains the specified string. If matched, it will be
 interactively scraped using the below instructions.
 --input INPUT_DESCRIPTION
 Interactive search descriptor. This describes how to
 interact with a matched form. The inputs are
 described in the following format:
 "c:0:True,i:0:atext,s:1:France,d:0:1991-01-20"
 A single-input type can be one of four types:
 checkbox ("c"), input box ("i"), option select
 ("s"), and date inputs ("d", with inputs in the
 "YYYY-MM-DD" format). The type is separated by a
 colon, and the input index position is next. (Each
 input type has its own list, so a form with one
 input, one checkbox, and one option select, will all
 be at index 0.) The final command, separated by
 another colon, describes what to do with the input.
 Multiple inputs are separated by a comma, so you can
 interact with multiple inputs before submitting the
 form.
 To illustrate this, the above command does the following:
 - first input checkbox is checked (uncheck is False)
 - first input box gets filled with the string "atext"
 - second select input gets the "France" option chosen
 - first date input gets set to Jan 20, 1991
 --next-match NEXT_BTN_STRING
 A string to match a "next" button with, after
 searching a form. The scraper will continue to
 click "next" buttons after a search until no matches
 are found, unless limited by the --formdepth option
 (see below). [default: next page]
 --formdepth DEPTH
 How deep the scraper will iterate, by clicking
 "next" buttons. Zero means infinite depth.
 [default: 0]
 --form-submit-natural-click
 Some webpages make clicking a link element difficult
 due to JavaScript onClick events. In cases where a
 click does nothing, you can use this option to get
 the scraper to emulate a mouse click over the link's
 position on the page, activating any higher level JS
 interactions.
 --form-submit-wait SECONDS
 How many seconds to force wait after a submit to a form.
 This should be used in cases where the builtin
 wait-for-page-load isn't working properly (JS-heavy
 pages, etc). [default: 5]
Webdriver-Specific and General Options:
 --load-images
 By default, images on a page will not be fetched.
 This speeds up scrapes on sites and lowers bandwidth
 needs. This option fetches all images on a page.
 --show-browser
 By default, we hide the browser during operation.
 This option displays a browser window, mostly
 for debugging purposes.
 --driver DRIVER
 Which browser to use. Current support for "Firefox",
 "Chrome", and "remote". [default: Firefox]
 --browser-binary PATH_TO_BROWSER
 Path to a specific browser binary. If left blank
 selenium will pull the browser found on your path.
 --remote-hub URI
 If using "remote" driver, specify the hub URI to
 connect to. Needs the proto, address, port, and path.
 [default: http://localhost:4444/wd/hub]
WARC Options:
 --warc-directory PATH_TO_WARCS
 Path to the folder containing GZipped WARC files. These can be
 downloaded from Common Crawl. Required when using the "warc"
 backend.
 --warc-index-file PATH_TO_LEVELDB
 Path to the level DB database holding the URL-to-file
 index: URL => (filename, record_number)
 This will be generated from the WARCS in the --warc-directory
 specified if it's not already. Required when using the "warc"
 backend.
Data Saving Options:
 --output DIRECTORY_OR_URL
 If specified, this indicates where to save pages during a
 crawl. This directory will be created if it does not
 currently exist. This directory will have several
 sub-directories that contain the different types of pages
 found (i.e., search_pages, data_pages, screenshots).
 This can also accept a URL (i.e., http://localhost:5000/files)
 and AutoScrape will POST to that endpoint with each
 file scraped.
 [default: autoscrape-data]
 --keep-filename
 By default, we hash the files in a scrape in order to
 account for dynamic content under a single-page app
 (SPA) website implementation. This option will force
 the scraper to retain the original filename, from the
 URL when saving scrape data.
 --save-screenshots
 This option makes the scraper save screenshots of each
 page, interaction, and search. Screenshots will be
 saved to the screenshots folder of the output dir.
 --full-page-screenshots
 By default, we only save the first displayed part of the
 webpage. The remaining portion that you can only see
 by scrolling down isn't captured. Setting this option
 forces AutoScrape to scroll down and capture the entire
 web content. This can fail in certain circumstances, like
 in API output mode and should be used with care.
 --save-graph
 This option allows the scraper to build a directed graph
 of the entire scrape and will save it to the "graph"
 subdirectory under the output dir. The output file
 is a timestamped networkx pickled graph.
 --disable-style-saving
 By default, AutoScrape saves the stylesheets associated
 with a scraped page. To save storage, you can disable this
 functionality by using this option.
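
The link-filtering options (--only-links, --ignore-links and --ignore-extensions) can be pictured with the following sketch. It is only an approximation under stated assumptions, not AutoScrape's implementation: should_follow is a hypothetical function, patterns are applied with re.search, extensions are passed as a tuple, and the order in which the three checks run is a guess.

```python
import re


def should_follow(url, only_links=None, ignore_links=None, ignore_extensions=()):
    """Decide whether a link would be clicked under the given filters."""
    # Compare extensions against the URL without its query string.
    path = url.split("?", 1)[0].lower()
    if any(path.endswith("." + ext.lower()) for ext in ignore_extensions):
        return False
    # Blacklist: drop links matching the ignore pattern.
    if ignore_links and re.search(ignore_links, url):
        return False
    # Whitelist: when given, only matching links survive.
    if only_links and not re.search(only_links, url):
        return False
    return True


# Pipe-separated alternatives behave like a regex "or".
print(should_follow("https://some.page/reports/2019", only_links="reports|filings"))  # True
print(should_follow("https://some.page/about", only_links="reports|filings"))         # False
print(should_follow("https://some.page/logout", ignore_links="logout"))               # False
print(should_follow("https://some.page/file.PDF", ignore_extensions=("pdf",)))        # False
```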

AutoScrape Web UI (Docker)

AutoScrape can be run as a containerized cluster environment, where scrapes can be triggered and stopped via API calls and data can be streamed to this server.

This requires the autoscrape-www submodule to be pulled:

git submodule init
git submodule update

This will pull the browser-based UI into the www/ folder.

You need docker-ce and docker-compose. Once you have these dependencies installed, simply run:

docker-compose build --pull
docker-compose up

This will build the containers and launch an API server running on local port 5000. More information about the API calls can be found in autoscrape-server.py.

If you have make installed, you can simply run make start.
