A versatile tool to crawl dozens of URLs from a given source, like a sitemap or a URL list.
Useful for:
- Warming site caches
- Checking response times
- Identifying dead or broken pages
# Linux (Debian/Ubuntu) & macOS
$ go build -o crab cmd/crab/main.go
# Windows
$ go build -o crab.exe cmd/crab/main.go
$ docker pull atomicptr/crab
# Example
$ docker run --rm atomicptr/crab --help
$ docker run --rm atomicptr/crab crawl:sitemap https://domain.com/sitemap.xml
Not available in nixpkgs but I have my own nix repository which you can use:
let
  atomicptr = import (fetchTarball "https://github.com/atomicptr/nix/archive/refs/heads/master.tar.gz") {};
in
{
  environment.systemPackages = with pkgs; [
    atomicptr.crab
  ];
}
$ brew install atomicptr/tools/crab
$ scoop bucket add atomicptr https://github.com/atomicptr/scoop-bucket
$ scoop install crab
Crawl individual URLs:
$ crab crawl https://domain.com https://domain.com/test
{"status": 200, "url": "https://domain.com", ...}
...
Crawl through a sitemap:
$ crab crawl:sitemap https://domain.com/sitemap.xml
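The crawl results are printed as JSON objects with at least a `status` and a `url` field. If you want to post-process them, something like the following sketch could work. It assumes one JSON object per line, which the example output suggests but does not guarantee, and the helper names `parse_results` and `broken_urls` are hypothetical, not part of crab.

```python
import json


def parse_results(lines):
    """Parse crawl output, ASSUMING one JSON object per line."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        results.append(json.loads(line))
    return results


def broken_urls(results):
    """Return the URLs whose status code is 400 or above."""
    return [r["url"] for r in results if r["status"] >= 400]


# Hypothetical sample data; real output contains further fields.
sample = [
    '{"status": 200, "url": "https://domain.com"}',
    '{"status": 404, "url": "https://domain.com/test"}',
]
print(broken_urls(parse_results(sample)))  # ['https://domain.com/test']
```

To use it on real output, you could pipe crab into the script and pass `sys.stdin` to `parse_results` instead of the sample list.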
Replace the base URL of every crawled URL with a different one:
$ crab crawl:sitemap https://domain.com/sitemap.xml --prefix-url=https://staging.domain.com
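To illustrate what such a rewrite might look like, here is a rough sketch. It assumes that `--prefix-url` swaps the scheme and host of each URL and keeps the path and query untouched; this is a guess at the behavior, not crab's actual implementation, and `apply_prefix` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def apply_prefix(url, prefix_url):
    """Swap scheme and host of url for those of prefix_url (assumed behavior)."""
    target = urlsplit(url)
    prefix = urlsplit(prefix_url)
    return urlunsplit(
        (prefix.scheme, prefix.netloc, target.path, target.query, target.fragment)
    )


print(apply_prefix("https://domain.com/blog/post?page=2", "https://staging.domain.com"))
# https://staging.domain.com/blog/post?page=2
```

This is handy when the production sitemap should drive a crawl of a staging site.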
Add some cookies/headers:
$ crab crawl:sitemap https://domain.com/sitemap.xml --cookie auth_token=12345 --header X-Bypass-Cache=1
You can filter the output by its status code:
# This will only return responses with a 200 OK
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=200
# This will only return responses that are not 200 OK
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=!200
# This will only return responses between 500-599 (range)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=500-599
# This will only return responses with 200 or 404 (multiple conditions; a response matches if any one of them is true)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=200,404
# This will only return responses with a code greater than 500 (quoted so the shell does not treat > as a redirection)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=">500"
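The filter rules above can be sketched as a small matcher. This is only an illustration of the documented semantics (exact code, `!` negation, range, comma-separated alternatives, `>`), not crab's own parser, and the function names are hypothetical.

```python
def _matches_one(cond, status):
    """Check a single condition such as '200', '!200', '500-599' or '>500'."""
    if cond.startswith("!"):
        return status != int(cond[1:])
    if cond.startswith(">"):
        return status > int(cond[1:])
    if "-" in cond:
        low, high = cond.split("-", 1)
        return int(low) <= status <= int(high)
    return status == int(cond)


def matches_status(expr, status):
    """Return True if status satisfies ANY comma-separated condition in expr."""
    return any(_matches_one(part.strip(), status) for part in expr.split(","))


print(matches_status("200,404", 404))  # True
print(matches_status(">500", 500))     # False
```

Note that the comma acts as a logical OR in this sketch, so combining `!200` with another condition would match almost every response.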
You can save the URL list to a file:
# This will save the output to a file called output.txt
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-file ./output/output.txt
You can save the output to a JSON file
# This will save the output to a file called output.json
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-json ./output/output.json