
The crawler needs a mechanism that dispatches threads based on network latency and system load. How can I keep track of network latency in Python without using system tools like ping?

import sys
import re
import urllib2
import urlparse
import requests
import socket
import threading
import gevent
from gevent import monkey
import time

monkey.patch_all(
 socket=True,
 dns=True,
 time=True,
 select=True,
 thread=True,
 os=True,
 ssl=True,
 httplib=False,
 subprocess=False,
 sys=False,
 aggressive=True,
 Event=False)

# The stack
tocrawl = set([sys.argv[1]])
crawled = set([])
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

# Reduce every discovered link to its domain and queue domains not seen yet.
def Update(links):
 if links != None:
  for link in (links.pop(0) for _ in xrange(len(links))):
   link = ( "http://%s" %(urlparse.urlparse(link).netloc) )
   if link not in crawled:
    tocrawl.add(link)

# Fetch a page and feed the links it contains back into the stack.
def getLinks(crawling):
 crawled.add(crawling)
 try:
  Update(linkregex.findall(requests.get(crawling).content))
 except:
  return None

# Pop one URL from the stack and crawl it in a greenlet.
def crawl():
 try:
  print "%d Threads running" % (threading.activeCount())
  crawling = tocrawl.pop()
  print crawling
  print len(crawled)
  walk = gevent.spawn(getLinks, crawling)
  walk.run()
 except:
  quit()

# Start a new crawler thread every second.
def dispatcher():
 while True:
  T = threading.Thread(target=crawl)
  T.start()
  time.sleep(1)

dispatcher()
asked Apr 12, 2014 at 9:16
  • The idea is to hit n domains and, after n domains have been hit, stop crawling and check each domain in the set of crawled domains for RSS feeds. Commented Apr 12, 2014 at 15:55
  • 1
    \$\begingroup\$ Rolled back Rev 5 → 2. Please don't edit code in the question after it has been answered; you have several options for follow-ups. \$\endgroup\$ Commented Jun 19, 2014 at 17:57

1 Answer


I see a flurry of downloading activity, but I don't see that you do anything with the pages that you download except parse some URLs for more downloading. There's no rate limiting or any attempt to check robots.txt, making your web crawler a poor Internet citizen.
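For instance, the standard library's robotparser module plus a short pause between requests would already make the crawler considerably more polite. A rough sketch, in which the wildcard user agent, the one-second delay, and the lack of per-host caching are all simplifications:

import robotparser
import time
import urlparse
import requests

rp = robotparser.RobotFileParser()

def polite_get(url):
    # Re-reads robots.txt on every call; a real crawler would cache it per host.
    parts = urlparse.urlparse(url)
    rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
    rp.read()
    if not rp.can_fetch("*", url):
        return None            # disallowed by robots.txt
    time.sleep(1.0)            # crude rate limit
    return requests.get(url)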

PEP 8 mandates four spaces per level of indentation. Since whitespace is significant in Python, you should stick to the convention. Furthermore, function names should be lower_case(), so Update() and getLinks() should be renamed.

Just a simple call to gevent.monkey.patch_all() will do. There is no need to from gevent import monkey, nor is there any need to list all of the keyword parameters, since you're accepting all of the defaults.
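In other words, the whole patching block at the top can shrink to something like:

import gevent.monkey
gevent.monkey.patch_all()   # the defaults already cover socket, dns, time, select, thread, os and ssl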

Your linkregex fails if the <a> tag contains any intervening attributes before href. For example, <a target="_blank" href="..."> will cause a link to be skipped.
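A slightly more forgiving pattern would tolerate other attributes before href, although a real HTML parser (HTMLParser, lxml, BeautifulSoup) is the robust fix. For example:

linkregex = re.compile(r'<a\s[^>]*?href=[\'"](.*?)[\'"]', re.IGNORECASE)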

I don't believe that your code is a well behaved multithreaded program. For one thing, you indiscriminately spawn and start one thread per second. If the average processing time per request exceeds one second, you'll end up with an uncontrolled proliferation of threads.
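Since gevent is already in the picture, one way to bound the concurrency is to let a gevent.pool.Pool cap the number of greenlets instead of starting a new thread every second. A sketch only; the pool size of 10 is arbitrary, and the loop ignores the case where tocrawl is momentarily empty while fetches are still in flight:

from gevent.pool import Pool

pool = Pool(10)                          # at most 10 concurrent fetches
while tocrawl:
    pool.spawn(getLinks, tocrawl.pop())  # blocks when the pool is full
pool.join()                              # wait for the remaining greenlets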

Another issue is that you add() and pop() tocrawl elements without any kind of locking. Also, if one thread fails to pop() anything (probably when the tocrawl list becomes empty), you rudely call quit() without giving other threads a chance to finish what they are doing.
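If you keep the threaded design, every access to the shared sets needs a lock; purely as an illustration:

lock = threading.Lock()

def next_url():
    # Returns None instead of killing the whole process when nothing is left.
    with lock:
        return tocrawl.pop() if tocrawl else None

The same lock would have to guard tocrawl.add() and crawled.add() as well.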

Finally, you process the URLs using a stack. Web crawling is usually done using a queue, to avoid processing clusters of closely related URLs together and concentrating the load on one unfortunate webserver at a time.
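The standard Queue module gives you a thread-safe FIFO for free, which also removes the need for manual locking around the work list. A minimal sketch, assuming a fixed pool of five workers and a hypothetical get_links() helper that puts newly discovered URLs back on the queue:

import sys
import threading
import Queue

to_crawl = Queue.Queue()           # FIFO and thread-safe
to_crawl.put(sys.argv[1])

def worker():
    while True:
        url = to_crawl.get()       # blocks until a URL is available
        get_links(url)             # hypothetical: fetch, parse, to_crawl.put() new URLs
        to_crawl.task_done()

for _ in range(5):                 # fixed number of workers, not one per second
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

to_crawl.join()                    # returns once every queued URL has been processed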

answered Apr 12, 2014 at 9:55
