Ariadne
Ariadne is the web crawler for the Clew search engine.
User Agent and robots.txt
Ariadne crawls using the following User Agent, where "{node_id}" is replaced with the ID of the crawler node making the request:
Ariadne (web crawler for Clew; https://clew.se/about/; crawler node {node_id})
You can block it in your robots.txt as Ariadne.
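If you want to confirm that a robots.txt rule matches before deploying it, the following is a minimal sketch using Python's standard `urllib.robotparser`. It is not Ariadne's own robots.txt handling, which may differ; the node ID `example-node`, the `example.com` URL, and the `is_blocked` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# User Agent template as given above; "example-node" is a hypothetical node ID.
USER_AGENT_TEMPLATE = (
    "Ariadne (web crawler for Clew; https://clew.se/about/; "
    "crawler node {node_id})"
)
full_user_agent = USER_AGENT_TEMPLATE.format(node_id="example-node")

# A robots.txt that blocks Ariadne from the whole site.
ROBOTS_TXT = """\
User-agent: Ariadne
Disallow: /
"""


def is_blocked(robots_txt: str, agent: str, url: str = "https://example.com/page") -> bool:
    """Return True if the given robots.txt text disallows `agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)
```

Calling `is_blocked(ROBOTS_TXT, "Ariadne")` should report the crawler as blocked, while a different agent name is still allowed because the file has no rule for it.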
Installation
The binaries for all three parts of the Ariadne architecture can be installed with a simple call to pipx: pipx install git+https://codeberg.org/Clew/ariadne.
If running ariadne, you will need the external package dependencies listed in the external-dependencies file in this repository to be able to parse webpages. If not running ariadne, you should only need to install libicu-dev.
When running the ariadne and daedalus services, you will need to configure a PostgreSQL database (as shown below) and run ariadne setup-database. You can create icarus nodes using daedalus create-node <name> <email>.
To run an icarus node, you do not need a database; you can simply create the configuration file as shown below and run the icarus command. To get a TOTP code independently for use with the Daedalus dashboard, run icarus totp.
Configuration
By default, Ariadne reads its configuration from ~/.config/ariadne/config.toml. This one file is used by all three parts of the architecture. You can change the location of the configuration file by setting the ARIADNE_CONFIG_PATH environment variable.
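The lookup order described above can be sketched as follows. This is an illustration of the documented behaviour, not Ariadne's actual code; the function name `resolve_config_path` is hypothetical, and treating an empty ARIADNE_CONFIG_PATH as unset is an assumption.

```python
import os
from pathlib import Path

# Default location stated in this README.
DEFAULT_CONFIG_PATH = Path("~/.config/ariadne/config.toml")


def resolve_config_path(environ=None) -> Path:
    """Return the config path, preferring ARIADNE_CONFIG_PATH when it is set."""
    environ = os.environ if environ is None else environ
    override = environ.get("ARIADNE_CONFIG_PATH")
    if override:  # assumption: an empty value falls back to the default
        return Path(override).expanduser()
    return DEFAULT_CONFIG_PATH.expanduser()
```

For example, `resolve_config_path({"ARIADNE_CONFIG_PATH": "/etc/ariadne/config.toml"})` returns that path, while `resolve_config_path({})` falls back to the default under the home directory.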
An example configuration file:
[logging]
verbosity="INFO"
[database]
name="clew_index"
host="localhost"
port="5432"
user="ariadne"
password="<password goes here>"
timeout=500
[ariadne]
task_processors=3
discovery_cap=100000
[daedalus]
max_parcel_size=80
host="127.0.0.1"
port=5400
user_agent="Ariadne (web crawler for Clew; https://clew.se/about/; crawler node {node_id})"
user_agent_short="Ariadne"
[icarus]
daedalus_instance="https://daedalus.clew.se"
simultaneous_requests = 5
name="<node name goes here>"
secret="<node secret goes here>"
If you are hosting an Icarus node, you do not need the database, ariadne, or daedalus sections of the configuration.
License
Ariadne is licensed under the AGPLv3 License.