-- Edit: As Guimo pointed out, the two statements in point 6 below are not equivalent: a `requests.Response` with an error status code is falsy even though it is not `None`.

So I would suggest changing

def get_links(self, url):
    req = self.get_req(url)
    if(req is not None):
        html = req.text  # the html of the page crawled
        urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)  # extracting a list with the urls
        return urls
    return []

to

def get_links(self, url):
    req = self.get_req(url)
    if req is None:
        return []
    html = req.text  # the html of the page crawled
    urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)  # extracting a list with the urls
    return urls

I.e., remove the parentheses and add an early return.

1.- Yes, you can use a semicolon to put more than one statement on the same line, but...

Compound statements (multiple statements on the same line) are generally discouraged.

if(not urls): print('[-] Provide start urls'); return  # compound statement: ';' chains statements on one line

Source: PEP-8
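
Spelled out on separate lines, the same check reads:

if not urls:
    print('[-] Provide start urls')
    return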


2.- Wait some milliseconds inside all those `while True:` loops (i.e. `wait_kill_threads`, `print_manager`, maybe in `producer` too); your processor will thank you.
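
A minimal sketch, assuming a `terminate` flag like the one used in `verify_and_add` (the rest of the loop body is hypothetical):

import time

def wait_kill_threads(self):
    while True:
        if self.terminate:  # hypothetical exit condition
            break
        time.sleep(0.05)  # yield the CPU instead of busy-waiting
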
3.- Avoid magic numbers (a.k.a. hardcoded numbers). On the last line in `producer`:

.....threading.active_count() - 3))  # -3 because of the main thread and the daemon thread

a better practice is to define a CONSTANT_VALUE at the beginning of the file:

QTY_THREAD_OFFSET = 3
.....threading.active_count() - QTY_THREAD_OFFSET))

(maybe you can think of a better constant name, but you get the idea, right?)

The same for:

while(threading.active_count() > 2): # the main thread (this one) counts as 1 and the daemon too

using the previous constant it becomes:

while(threading.active_count() >= QTY_THREAD_OFFSET):

4.- Regex: if you are going to use a regex multiple times, a good practice is to compile it once beforehand. Instead of:
re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls

you can do this:

URL_RE_PATTERN = '(?<=href=["\'])https?://.+?(?=["\'])'
...

class Crawler():
    regex_urls = re.compile(URL_RE_PATTERN)

    def __init__(self):
        ...

    def get_links(self, url):
        ...
        urls = self.regex_urls.findall(html)  # reuse the compiled pattern

Also, you can use BeautifulSoup to get the urls.
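
For instance, a minimal sketch with BeautifulSoup (assuming `bs4` is installed and `html` holds the page source, as in `get_links`):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
urls = [a['href'] for a in soup.find_all('a', href=True)
        if a['href'].startswith(('http://', 'https://'))]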


5.- In `verify_and_add` you can try to avoid nested `if`s:

if(not self.terminate):
    if(url not in self.data):
        self.data.add(url)

It could be replaced by:

if self.terminate or url in self.data:
    return False
self.data.add(url)
...

6.- A little detail:

if(req is not None):

It could be:

if req:

(But note the edit at the top: for a `requests.Response` these two checks are not equivalent.)
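
A quick demonstration of the difference (httpbin.org is just a convenient test endpoint):

import requests

resp = requests.get('https://httpbin.org/status/404')
print(resp is not None)  # True: a Response object came back
print(bool(resp))        # False: requests treats error responses as falsy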

7.- Are you removing the visited urls from `list_urls` after they are visited? In `renew_url` you are getting a random url, but I can't see whether you are removing it from that list.
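
A hedged sketch of what that removal could look like (assuming `list_urls` is a list; the body of `renew_url` is otherwise my guess):

import random

def renew_url(self):
    url = random.choice(self.list_urls)
    self.list_urls.remove(url)  # drop it so the same url is not picked twice
    return url
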
8.- Wait a few random seconds between requests for pages on the same server; you are crawling a website, and most of them aren't happy with back-to-back requests.
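
A minimal sketch, with arbitrary bounds:

import random
import time

time.sleep(random.uniform(1.0, 3.0))  # polite, randomized pause before hitting the same host again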

That is what I see on a first pass. Hope it helps!
