Name	Name	Last commit message	Last commit date
Latest commit History 86 Commits
.github/workflows	.github/workflows
docs	docs
src/secretscraper	src/secretscraper
tests	tests
.editorconfig	.editorconfig
.gitattributes	.gitattributes
.gitignore	.gitignore
.pre-commit-config.yaml	.pre-commit-config.yaml
DevelopJournal.md	DevelopJournal.md
LICENSE	LICENSE
README.md	README.md
pyproject.toml	pyproject.toml
requirements.txt	requirements.txt
settings.yml	settings.yml
tox.ini	tox.ini
uv.lock	uv.lock

SecretScraper

Overview

SecretScraper is a highly configurable web scrape tool that crawl links from target websites and scrape sensitive data via regular expression.

Shows an illustrated sun in light mode and a moon with stars in dark mode.

Feature

Web crawler: extract links via both DOM hierarchy and regex
Support domain white list and black list
Support multiple targets, input target URLs from a file
Support local file scan
Scalable customization: header, proxy, timeout, cookie, scrape depth, follow redirect, etc.
Built-in per-domain rate limiting and HTTP connection pool limits
Built-in regex to search for sensitive information
Flexible configuration in yaml format

Prerequisite

Platform: Test on MacOS, Ubuntu and Windows.
Python Versions: 3.9 - 3.14

Usage

Install

pip install secretscraper

Update

pip install --upgrade secretscraper

Note that, since Secretscraper generates a default configuration under the work directory if settings.yml is absent, so remember to update the settings.yml to the latest version(just copy from Customize Configuration).

Development

uv sync
uv run tox

Basic Usage

Start with single target:

secretscraper -u https://scrapeme.live/shop/

Start with multiple targets:

secretscraper -f urls

# urls
http://scrapeme.live/1
http://scrapeme.live/2
http://scrapeme.live/3
http://scrapeme.live/4
http://scrapeme.live/1

Sample output: image

image

All supported options:

> secretscraper --help
Usage: secretscraper [OPTIONS]
 Main commands
Options:
 -V, --version Show version and exit.
 --debug Enable debug.
 -a, --ua TEXT Set User-Agent
 -c, --cookie TEXT Set cookie
 -d, --allow-domains TEXT Domain white list, wildcard(*) is supported,
 separated by commas, e.g. *.example.com,
 example*
 -D, --disallow-domains TEXT Domain black list, wildcard(*) is supported,
 separated by commas, e.g. *.example.com,
 example*
 -f, --url-file FILE Target urls file, separated by line break
 -i, --config FILE Set config file, defaults to settings.yml
 -m, --mode [1|2] Set crawl mode, 1(normal) for max_depth=1,
 2(thorough) for max_depth=2, default 1
 --max-page INTEGER Max page number to crawl, default 100000
 --max-depth INTEGER Max depth to crawl, default 1
 --max-connections INTEGER Max total HTTP connections
 --max-keepalive-connections INTEGER
 Max keep-alive HTTP connections
 --max-concurrent-per-domain INTEGER
 Max simultaneous requests per domain
 --min-request-interval FLOAT
 Minimum seconds between requests to the same
 domain
 -o, --outfile FILE Output result to specified file in csv format
 -s, --status TEXT Filter response status to display, seperated by
 commas, e.g. 200,300-400
 -x, --proxy TEXT Set proxy, e.g. http://127.0.0.1:8080,
 socks5://127.0.0.1:7890
 -H, --hide-regex Hide regex search result
 -F, --follow-redirects Follow redirects
 -u, --url TEXT Target url
 --detail Show detailed result
 --validate Validate the status of found urls
 -l, --local PATH Local file or directory, scan local
 file/directory recursively
 --help Show this message and exit.

Advanced Usage

Validate the Status of Links

Use --validate option to check the status of found links, this helps reduce invalid links in the result.

secretscraper -u https://scrapeme.live/shop/ --validate --max-page=10

Thorough Crawl

The max depth is set to 1, which means only the start urls will be crawled. To change that, you can specify via --max-depth <number>. Or in a simpler way, use -m 2 to run the crawler in thorough mode which is equivalent to --max-depth 2. By default the normal mode -m 1 is adopted with max depth set to 1.

secretscraper -u https://scrapeme.live/shop/ -m 2

Write Results to Csv File

secretscraper -u https://scrapeme.live/shop/ -o result.csv

Tune Crawl Rate Limits

Use these options to reduce pressure on a target domain and cap local socket usage:

secretscraper -u https://scrapeme.live/shop/ \
 --max-connections 100 \
 --max-keepalive-connections 50 \
 --max-concurrent-per-domain 5 \
 --min-request-interval 0.2

Domain White/Black List

Support wildcard(*), white list:

secretscraper -u https://scrapeme.live/shop/ -d *scrapeme*

Black list:

secretscraper -u https://scrapeme.live/shop/ -D *.gov

Hide Regex Result

Use -H option to hide regex-matching results. Only found links will be displayed.

secretscraper -u https://scrapeme.live/shop/ -H

Extract secrets from local file

secretscraper -l <dir or file>

Switch to hyperscan

I have implemented the regex matching functionality with both hyperscan and re module, re module is used as default, if you purse higher performance, you can switch to hyperscan by changing the handler_type to hyperscan in settings.yml.

There are some pitfalls of hyperscan which you have to take caution to use it:

not support regex group: you can not extract content by parentheses.
different syntax from re

You'd better write regex separately for the two regex engine.

Customize Configuration

The built-in config is shown as below. You can assign custom configuration via -i settings.yml.

verbose: false
debug: false
loglevel: critical
logpath: log
handler_type: re
proxy: "" # http://127.0.0.1:7890
max_depth: 1 # 0 for no limit
max_page_num: 1000 # 0 for no limit
timeout: 5
follow_redirects: true
workers_num: 1000
max_connections: 100 # total HTTP connection pool size
max_keepalive_connections: 50 # keep-alive connections retained in the pool
max_concurrent_per_domain: 5 # simultaneous requests allowed per domain
min_request_interval: 0.2 # seconds between requests to the same domain
headers:
 Accept: "*/*"
 Cookie: ""
 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0
urlFind:
 - "[\"'‘"`]\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250}?)\\s{0,6}[\"'‘"`]"
 - "=\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})"
 - "[\"'‘"`]\\s{0,6}([#,.]{0,2}/[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250}?)\\s{0,6}[\"'‘"`]"
 - "\"([-a-zA-Z0-9()@:%_\\+.~#?&//={}]+?[/]{1}[-a-zA-Z0-9()@:%_\\+.~#?&//={}]+?)\""
 - "href\\s{0,6}=\\s{0,6}[\"'‘"`]{0,1}\\s{0,6}([-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})|action\\s{0,6}=\\s{0,6}[\"'‘"`]{0,1}\\s{0,6}([-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})"
jsFind:
 - (https{0,1}:[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)
 - '["''‘"`]\s{0,6}(/{0,1}[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)'
 - =\s{0,6}[",',’,"]{0,1}\s{0,6}(/{0,1}[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)
dangerousPath:
 - logout
 - update
 - remove
 - insert
 - delete
rules:
 - name: Swagger
 regex: \b[\w/]+?((swagger-ui.html)|(\"swagger\":)|(Swagger UI)|(swaggerUi)|(swaggerVersion))\b
 loaded: true
 - name: ID Card
 regex: \b((\d{8}(0\d|10|11|12)([0-2]\d|30|31)\d{3})|(\d{6}(18|19|20)\d{2}(0[1-9]|10|11|12)([0-2]\d|30|31)\d{3}(\d|X|x)))\b
 loaded: true
 - name: Phone
 regex: "['\"](1(3([0-35-9]\\d|4[1-8])|4[14-9]\\d|5([\\d]\\d|7[1-79])|66\\d|7[2-35-8]\\d|8\\d{2}|9[89]\\d)\\d{7})['\"]"
 loaded: true
 - name: JS Map
 regex: \b([\w/]+?\.js\.map)
 loaded: true
 - name: URL as a Value
 regex: (\b\w+?=(https?)(://|%3a%2f%2f))
 loaded: false
 - name: Email
 regex: "['\"]([\\w]+(?:\\.[\\w]+)*@(?:[\\w](?:[\\w-]*[\\w])?\\.)+[\\w](?:[\\w-]*[\\w])?)['\"]"
 loaded: true
 - name: Internal IP
 regex: '[^0-9]((127\.0\.0\.1)|(10\.\d{1,3}\.\d{1,3}\.\d{1,3})|(172\.((1[6-9])|(2\d)|(3[01]))\.\d{1,3}\.\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3}))'
 loaded: true
 - name: Cloud Key
 regex: \b((accesskeyid)|(accesskeysecret)|\b(LTAI[a-z0-9]{12,20}))\b
 loaded: true
 - name: Shiro
 regex: (=deleteMe|rememberMe=)
 loaded: true
 - name: Suspicious API Key
 regex: "[\"'][0-9a-zA-Z]{32}['\"]"
 loaded: true
 - name: Jwt
 regex: "['\"](ey[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9._-]{10,}|ey[A-Za-z0-9_\\/+-]{10,}\\.[A-Za-z0-9._\\/+-]{10,})['\"]"
 loaded: true

TODO

Support headless browser
Add regex doc reference
Fuzz path that are 404
Separate subdomains in the result
Optimize url collector [//]: # (- [ ] Employ jsbeautifier)
Generate configuration file
Detect dangerous paths and avoid requesting them
Support url-finder output format, add --detail option
Support windows
Scan local file
Extract links via regex

Change Log

2024年5月25日 Version 1.4

Support csv output
Set re module as regex engine by default
Support to select regex engine by configuration handler_type

2024年4月30日 Version 1.3.9

Add --validate option: Validate urls after the crawler finish, which helps reduce useless links
Optimize url collector
Optimize built-in regex

2024年4月29日 Version 1.3.8

Optimize log output
Optimize the performance of --debug option

2024年4月29日 Version 1.3.7

Test on multiple python versions
Support python 3.9~3.11

2024年4月29日 Version 1.3.6

Repackage

2024年4月28日 Version 1.3.5

New Features
- Support windows
- Optimize crawler
- Prettify output, add --detail option
- Generate default configuration to settings.yml
- Avoid requesting dangerous paths

2024年4月28日 Version 1.3.2

New Features
- Extract links via regex

2024年4月26日 Version 1.3.1

New Features
- Support scan local files

2024年4月15日

Add status to url result
All crawler test passed

Folders and files

Latest commit

History

Repository files navigation

SecretScraper

Overview

Feature

Prerequisite

Usage

Install

Update

Development

Basic Usage

Advanced Usage

Validate the Status of Links

Thorough Crawl

Write Results to Csv File

Tune Crawl Rate Limits

Domain White/Black List

Hide Regex Result

Extract secrets from local file

Switch to hyperscan

Customize Configuration

TODO

Change Log

2024年5月25日 Version 1.4

2024年4月30日 Version 1.3.9

2024年4月29日 Version 1.3.8

2024年4月29日 Version 1.3.7

2024年4月29日 Version 1.3.6

2024年4月28日 Version 1.3.5

2024年4月28日 Version 1.3.2

2024年4月26日 Version 1.3.1

2024年4月15日

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages