64bit/web-crawler

Intro

Crawls the web within a single domain: for example, if the start URL is https://www.google.com, it will not crawl https://maps.google.com. The crawler visits at most 2000 web pages and runs a breadth-first search (BFS) from the start URL.
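The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `crawl`, `same_domain`, and the `get_links` callable are hypothetical names, and the sketch assumes "same domain" means an exact host match, which is consistent with the maps.google.com example.

```python
from collections import deque
from urllib.parse import urlparse

MAX_PAGES = 2000  # page limit stated in this README


def same_domain(url, start_url):
    """Return True when both URLs have exactly the same host."""
    return urlparse(url).netloc == urlparse(start_url).netloc


def crawl(start_url, get_links, max_pages=MAX_PAGES):
    """Breadth-first crawl starting at start_url.

    get_links is a hypothetical callable that takes a page URL and
    returns the absolute URLs linked from that page.
    """
    visited = []               # pages in the order they were fetched
    seen = {start_url}         # URLs already queued, to avoid repeats
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        visited.append(url)
        for link in get_links(url):
            if link not in seen and same_domain(link, start_url):
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the link extractor in as a parameter keeps the traversal logic independent of how pages are fetched and parsed.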

#### Usage

The command below prints the URLs of the static assets (js, css, image, txt, pdf, doc, and docx files) found on each crawled page.

./crawl https://www.google.com

Output of the above command:

[
  {
    "url": "http://www.google.com/history/optout?hl=en",
    "assets": [
      "https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_light_color_74x24dp.png",
      "http://www.gstatic.com/history/static/myactivity_20170215-0135_1/angular-material.css",
      "https://fonts.googleapis.com/css?family=RobotoDraft:400,500"
    ]
  },
  ...
  ...
  {
    "url": "http://www.google.com/preferences?hl=en",
    "assets": [
      "https://www.google.com/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif",
      "https://www.google.com/images/warning.gif"
    ]
  }
]
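One way a record of this shape could be built for a single page is sketched below. It is an assumption-laden illustration: the real crawler uses beautifulsoup, whereas this sketch swaps in the standard library's `html.parser` so that it runs without extra installs, and the rule for what counts as an asset (script and image sources, stylesheet links, and anchors pointing at txt/pdf/doc/docx files) is a guess based on the file types listed above. `AssetParser`, `page_record`, and `to_json` are hypothetical names.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Document types named in this README; the matching rule itself is assumed.
DOC_EXTENSIONS = (".txt", ".pdf", ".doc", ".docx")


class AssetParser(HTMLParser):
    """Collects absolute asset URLs from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        candidate = None
        if tag in ("script", "img"):
            candidate = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            candidate = attrs.get("href")
        elif tag == "a":
            href = attrs.get("href") or ""
            if urlparse(href).path.lower().endswith(DOC_EXTENSIONS):
                candidate = href
        if candidate:
            absolute = urljoin(self.page_url, candidate)
            if absolute not in self.assets:  # keep first occurrence only
                self.assets.append(absolute)


def page_record(page_url, html):
    """Return one {"url": ..., "assets": [...]} entry for a page."""
    parser = AssetParser(page_url)
    parser.feed(html)
    return {"url": page_url, "assets": parser.assets}


def to_json(records):
    """Serialize a list of page records as indented JSON."""
    return json.dumps(records, indent=2)
```

Calling `to_json([page_record(url, html)])` for each fetched page would yield a list with the same `url` and `assets` keys as the output shown above.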

To run in verbose mode and see what the crawler is doing at any given moment:

DEBUG=true ./crawl https://www.google.com

Snippet of the intermediate output of the above command. "Queue Size" is the size of the BFS queue at that moment, and the "converted" lines show relative URLs being converted to absolute URLs:

...
Queue Size: 334
Fetching: https://www.google.com/intl/en/about/products/products/
converted: //www.google.com/ --> https://www.google.com/
converted: //www.google.com/ --> https://www.google.com/
Queue Size: 333
Fetching: https://www.google.com/intl/en/about/products/products/assistant/
converted: //www.google.com/ --> https://www.google.com/
converted: //www.google.com/ --> https://www.google.com/
Queue Size: 332
Fetching: https://www.google.com/intl/en/about/products/products/pixel/
converted: //www.google.com/ --> https://www.google.com/
converted: //www.google.com/ --> https://www.google.com/
Queue Size: 331
Fetching: https://www.google.com/intl/en/about/products/products/allo-duo/
converted: //www.google.com/ --> https://www.google.com/
converted: //www.google.com/ --> https://www.google.com/
Queue Size: 330
...
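The relative-to-absolute conversion logged above can be reproduced with the standard library's `urljoin`. This is a small sketch rather than the crawler's own code: `to_absolute` is a hypothetical helper, and reading the `DEBUG` environment variable this way is an assumption based on the command shown earlier.

```python
import os
from urllib.parse import urljoin

# Assumed convention: verbose logging when DEBUG=true is set in the environment.
DEBUG = os.environ.get("DEBUG") == "true"


def to_absolute(page_url, href, debug=DEBUG):
    """Resolve href against the URL of the page it was found on."""
    absolute = urljoin(page_url, href)
    if debug and absolute != href:
        print("converted: %s --> %s" % (href, absolute))
    return absolute
```

A scheme-relative link such as `//www.google.com/` inherits the scheme of the page it appears on, which is why it becomes `https://www.google.com/` in the log.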

Running Unit Tests

The unit tests run against an actual web server (Flask) instead of mocking requests.get(url).

./run_tests.sh
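The idea of testing against a live server rather than a mock can be sketched as below. Note the substitutions: to stay free of third-party installs, this sketch uses the standard library's `http.server` and `urllib.request` in place of Flask and requests, so it illustrates the approach and is not the repository's test suite. `PageHandler`, `LiveServerTest`, and the served page are hypothetical.

```python
import threading
import unittest
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b'<html><body><a href="/next">next</a></body></html>'


class PageHandler(BaseHTTPRequestHandler):
    """Serves the same small HTML page for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep test output quiet


class LiveServerTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Port 0 asks the OS for any free port.
        cls.server = HTTPServer(("127.0.0.1", 0), PageHandler)
        cls.base_url = "http://127.0.0.1:%d" % cls.server.server_address[1]
        cls.thread = threading.Thread(target=cls.server.serve_forever, daemon=True)
        cls.thread.start()
        # An empty ProxyHandler bypasses any system proxy for local requests.
        cls.opener = build_opener(ProxyHandler({}))

    @classmethod
    def tearDownClass(cls):
        cls.server.shutdown()
        cls.server.server_close()
        cls.thread.join()

    def test_fetch_returns_served_html(self):
        response = self.opener.open(self.base_url + "/", timeout=5)
        try:
            body = response.read()
        finally:
            response.close()
        self.assertEqual(response.status, 200)
        self.assertIn(b'href="/next"', body)
```

Saved as a module, a test like this can be run with `python -m unittest`. Exercising real HTTP requests covers header handling and URL resolution that a mocked fetch function would skip.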

#### Dependencies

Python 2.7, requests, beautifulsoup4, flask (required by the unit tests)

#### Install Dependencies

pip install requests
pip install beautifulsoup4
pip install flask

Install any other required modules with pip in the same way.
