
The crawler needs a mechanism that dispatches threads based on network latency and system load. How can I keep track of network latency in Python without using system tools like ping?

import sys
import re
import urllib2
import urlparse
import requests
import socket
import threading
import gevent
from gevent import monkey
import time

monkey.patch_all(
 socket=True,
 dns=True,
 time=True,
 select=True,
 thread=True,
 os=True,
 ssl=True,
 httplib=False,
 subprocess=False,
 sys=False,
 aggressive=True,
 Event=False)

# The stack
tocrawl = set([sys.argv[1]])
crawled = set([])
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

# Reduce every discovered link to its domain and queue domains not seen yet.
def Update(links):
 if links != None:
  for link in (links.pop(0) for _ in xrange(len(links))):
   link = ( "http://%s" %(urlparse.urlparse(link).netloc) )
   if link not in crawled:
    tocrawl.add(link)

# Fetch a page and feed the links it contains back into the stack.
def getLinks(crawling):
 crawled.add(crawling)
 try:
  Update(linkregex.findall(requests.get(crawling).content))
 except:
  return None

# Pop one URL from the stack and crawl it in a greenlet.
def crawl():
 try:
  print "%d Threads running" % (threading.activeCount())
  crawling = tocrawl.pop()
  print crawling
  print len(crawled)
  walk = gevent.spawn(getLinks, crawling)
  walk.run()
 except:
  quit()

# Start a new crawler thread every second.
def dispatcher():
 while True:
  T = threading.Thread(target=crawl)
  T.start()
  time.sleep(1)

dispatcher()
asked Apr 12, 2014 at 9:16
  • The idea is to hit n domains and, after n domains have been hit, stop crawling and check each domain in the set of crawled domains for RSS feeds. Commented Apr 12, 2014 at 15:55
  • 1
    \$\begingroup\$ Rolled back Rev 5 → 2. Please don't edit code in the question after it has been answered; you have several options for follow-ups. \$\endgroup\$ Commented Jun 19, 2014 at 17:57

1 Answer


I see a flurry of downloading activity, but I don't see that you do anything with the pages that you download except parse some URLs for more downloading. There's no rate limiting or any attempt to check robots.txt, making your web crawler a poor Internet citizen.
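For instance, the standard library's robotparser module plus a short pause between requests would already make the crawler considerably more polite. A rough sketch, in which the wildcard user agent, the one-second delay, and the lack of per-host caching are all simplifications:

import robotparser
import time
import urlparse
import requests

rp = robotparser.RobotFileParser()

def polite_get(url):
    # Re-reads robots.txt on every call; a real crawler would cache it per host.
    parts = urlparse.urlparse(url)
    rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
    rp.read()
    if not rp.can_fetch("*", url):
        return None            # disallowed by robots.txt
    time.sleep(1.0)            # crude rate limit
    return requests.get(url)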

PEP 8 mandates four spaces per level of indentation. Since whitespace is significant in Python, you should stick to the convention. Furthermore, function names should be lower_case(), so Update() and getLinks() should be renamed.

Just a simple call to gevent.monkey.patch_all() will do. There is no need to from gevent import monkey, nor is there any need to list all of the keyword parameters, since you're accepting all of the defaults.
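In other words, the whole patching block at the top can shrink to something like:

import gevent.monkey
gevent.monkey.patch_all()   # the defaults already cover socket, dns, time, select, thread, os and ssl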

Your linkregex fails if the <a> tag contains any intervening attributes before href. For example, <a target="_blank" href="..."> will cause a link to be skipped.
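A slightly more forgiving pattern would tolerate other attributes before href, although a real HTML parser (HTMLParser, lxml, BeautifulSoup) is the robust fix. For example:

linkregex = re.compile(r'<a\s[^>]*?href=[\'"](.*?)[\'"]', re.IGNORECASE)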

I don't believe that your code is a well behaved multithreaded program. For one thing, you indiscriminately spawn and start one thread per second. If the average processing time per request exceeds one second, you'll end up with an uncontrolled proliferation of threads.
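Since gevent is already in the picture, one way to bound the concurrency is to let a gevent.pool.Pool cap the number of greenlets instead of starting a new thread every second. A sketch only; the pool size of 10 is arbitrary, and the loop ignores the case where tocrawl is momentarily empty while fetches are still in flight:

from gevent.pool import Pool

pool = Pool(10)                          # at most 10 concurrent fetches
while tocrawl:
    pool.spawn(getLinks, tocrawl.pop())  # blocks when the pool is full
pool.join()                              # wait for the remaining greenlets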

Another issue is that you add() and pop() tocrawl elements without any kind of locking. Also, if one thread fails to pop() anything (probably when the tocrawl list becomes empty), you rudely call quit() without giving other threads a chance to finish what they are doing.
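If you keep the threaded design, every access to the shared sets needs a lock; purely as an illustration:

lock = threading.Lock()

def next_url():
    # Returns None instead of killing the whole process when nothing is left.
    with lock:
        return tocrawl.pop() if tocrawl else None

The same lock would have to guard tocrawl.add() and crawled.add() as well.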

Finally, you process the URLs using a stack. Web crawling is usually done using a queue, to avoid processing clusters of closely related URLs together and concentrating the load on one unfortunate webserver at a time.
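The standard Queue module gives you a thread-safe FIFO for free, which also removes the need for manual locking around the work list. A minimal sketch, assuming a fixed pool of five workers and a hypothetical get_links() helper that puts newly discovered URLs back on the queue:

import sys
import threading
import Queue

to_crawl = Queue.Queue()           # FIFO and thread-safe
to_crawl.put(sys.argv[1])

def worker():
    while True:
        url = to_crawl.get()       # blocks until a URL is available
        get_links(url)             # hypothetical: fetch, parse, to_crawl.put() new URLs
        to_crawl.task_done()

for _ in range(5):                 # fixed number of workers, not one per second
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

to_crawl.join()                    # returns once every queued URL has been processed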

answered Apr 12, 2014 at 9:55
