jacobevermore/webcrawler


# SC WebCrawler

To run on Linux (Java must be installed):

```shell
chmod a+x ./run_crawler.sh
./run_crawler.sh URL_TO_CRAWL
```

A small default site is already set in the script, so the URL argument can be omitted. Results are written to `sitemap.txt`.

## Remote Execution

If you don't have Java or a Linux-based system, connect with the private key shared in the email and then follow the terminal instructions.

## Logic

- Get all links from the given domain.
- Add the domain to the set of visited links.
- While the unvisited set has links, take a link from it and extract all of its links.
- If an extracted link follows the rules and is not already in the unvisited or visited sets, add it to unvisited.
- There is also a maximum loop count: `MAX_PAGES_TO_LOAD = 2000`.
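The loop above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code; the link extractor is injected as a function (an assumption made here) so the logic runs without any network access:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Minimal sketch of the crawl loop described above.
// The real crawler fetches pages over HTTP; here the link
// extractor is injected so the logic is testable offline.
public class CrawlSketch {
    static final int MAX_PAGES_TO_LOAD = 2000;

    public static Set<String> crawl(String domain,
                                    Function<String, List<String>> extractLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> unvisited = new ArrayDeque<>();
        unvisited.add(domain);
        while (!unvisited.isEmpty() && visited.size() < MAX_PAGES_TO_LOAD) {
            String page = unvisited.poll();
            if (!visited.add(page)) continue;  // already visited
            for (String link : extractLinks.apply(page)) {
                // The "follows rules" check (same domain, etc.) would go here.
                if (!visited.contains(link) && !unvisited.contains(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Tiny in-memory "site" standing in for real HTTP fetches.
        Map<String, List<String>> site = Map.of(
                "/",  List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/b", List.of("/"),
                "/c", List.<String>of());
        Set<String> sitemap = crawl("/", p -> site.getOrDefault(p, List.of()));
        System.out.println(sitemap);  // prints [/, /a, /b, /c]
    }
}
```

Checking both the visited and unvisited sets before enqueueing keeps each page from being fetched or queued twice, and the `MAX_PAGES_TO_LOAD` cap bounds the loop even on very large sites.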

## Would be nice to have

- Tests; they are not completed yet.
- An HTTP endpoint that executes the crawler and returns the sitemap.
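The wished-for HTTP endpoint could look something like the sketch below, using the JDK's built-in `com.sun.net.httpserver`. Everything here is an assumption about a feature that does not exist yet: the path `/crawl`, the `url` query parameter, and the `crawlToSitemap` stub (which a real version would replace with a call into the existing crawler):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of a possible endpoint: GET /crawl?url=... runs the crawler
// and returns the sitemap as plain text. The crawl itself is stubbed.
public class CrawlerEndpoint {
    // Placeholder for the real crawl; echoes the URL as a one-line sitemap.
    static String crawlToSitemap(String url) {
        return url + "\n";
    }

    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/crawl", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // e.g. url=http://...
            String url = (query != null && query.startsWith("url="))
                    ? query.substring(4) : "";
            byte[] body = crawlToSitemap(url).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0);  // port 0 = pick any free port
        System.out.println("Listening on http://localhost:"
                + server.getAddress().getPort() + "/crawl?url=...");
        server.stop(0);
    }
}
```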
