jacobevermore/webcrawler


# SC WebCrawler

To run on Linux (Java must be installed):

```shell
chmod a+x ./run_crawler.sh
./run_crawler.sh URL_TO_CRAWL
```

A small default site is already set in the script, so the URL argument can be omitted. Results are written to `sitemap.txt`.

## Remote Execution

If you don't have Java or a Linux-based system, connect with the private key shared in the email and then follow the terminal instructions.

## Logic

- Get all links from the given domain.
- Add the domain to the set of visited links.
- While the unvisited set has links, take a link from it and extract all of its links.
- If an extracted link follows the rules and is not already in the unvisited or visited sets, add it to unvisited.
- There is also a maximum loop count: `MAX_PAGES_TO_LOAD = 2000`.
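The loop above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code; the link extractor is injected as a function (an assumption made here) so the logic runs without any network access:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Minimal sketch of the crawl loop described above.
// The real crawler fetches pages over HTTP; here the link
// extractor is injected so the logic is testable offline.
public class CrawlSketch {
    static final int MAX_PAGES_TO_LOAD = 2000;

    public static Set<String> crawl(String domain,
                                    Function<String, List<String>> extractLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> unvisited = new ArrayDeque<>();
        unvisited.add(domain);
        while (!unvisited.isEmpty() && visited.size() < MAX_PAGES_TO_LOAD) {
            String page = unvisited.poll();
            if (!visited.add(page)) continue;  // already visited
            for (String link : extractLinks.apply(page)) {
                // The "follows rules" check (same domain, etc.) would go here.
                if (!visited.contains(link) && !unvisited.contains(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Tiny in-memory "site" standing in for real HTTP fetches.
        Map<String, List<String>> site = Map.of(
                "/",  List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/b", List.of("/"),
                "/c", List.<String>of());
        Set<String> sitemap = crawl("/", p -> site.getOrDefault(p, List.of()));
        System.out.println(sitemap);  // prints [/, /a, /b, /c]
    }
}
```

Checking both the visited and unvisited sets before enqueueing keeps each page from being fetched or queued twice, and the `MAX_PAGES_TO_LOAD` cap bounds the loop even on very large sites.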

## Would be nice to have

- Tests; they are not completed yet.
- An HTTP endpoint that executes the crawler and returns the sitemap.
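The wished-for HTTP endpoint could look something like the sketch below, using the JDK's built-in `com.sun.net.httpserver`. Everything here is an assumption about a feature that does not exist yet: the path `/crawl`, the `url` query parameter, and the `crawlToSitemap` stub (which a real version would replace with a call into the existing crawler):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of a possible endpoint: GET /crawl?url=... runs the crawler
// and returns the sitemap as plain text. The crawl itself is stubbed.
public class CrawlerEndpoint {
    // Placeholder for the real crawl; echoes the URL as a one-line sitemap.
    static String crawlToSitemap(String url) {
        return url + "\n";
    }

    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/crawl", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // e.g. url=http://...
            String url = (query != null && query.startsWith("url="))
                    ? query.substring(4) : "";
            byte[] body = crawlToSitemap(url).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0);  // port 0 = pick any free port
        System.out.println("Listening on http://localhost:"
                + server.getAddress().getPort() + "/crawl?url=...");
        server.stop(0);
    }
}
```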
