-- Edit: As Guimo pointed out, the two statements in point 6 below are not equivalent: a `requests.Response` with an error status code is falsy even though it is not `None`.

So I would suggest changing

def get_links(self, url):
    req = self.get_req(url)
    if(req is not None):
        html = req.text  # the html of the page crawled
        urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)  # extracting a list with the urls
        return urls
    return []

to

def get_links(self, url):
    req = self.get_req(url)
    if req is None:
        return []
    html = req.text  # the html of the page crawled
    urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)  # extracting a list with the urls
    return urls

I.e., remove the parentheses and add an early return.

1.- Yes, you can use a semicolon to put more than one statement on the same line, but...

Compound statements (multiple statements on the same line) are generally discouraged.

if(not urls): print('[-] Provide start urls'); return  # compound statement: ';' chains statements on one line

Source: PEP-8
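
Spelled out on separate lines, the same check reads:

if not urls:
    print('[-] Provide start urls')
    return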


2.- Wait some milliseconds inside all those `while True:` loops (i.e. `wait_kill_threads`, `print_manager`, maybe in `producer` too); your processor will thank you.
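
A minimal sketch, assuming a `terminate` flag like the one used in `verify_and_add` (the rest of the loop body is hypothetical):

import time

def wait_kill_threads(self):
    while True:
        if self.terminate:  # hypothetical exit condition
            break
        time.sleep(0.05)  # yield the CPU instead of busy-waiting
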
3.- Avoid magic numbers (a.k.a. hardcoded numbers). On the last line in `producer`:

.....threading.active_count() - 3))  # -3 because of the main thread and the daemon thread

a better practice is to define a CONSTANT_VALUE at the beginning of the file:

QTY_THREAD_OFFSET = 3
.....threading.active_count() - QTY_THREAD_OFFSET))

(maybe you can think of a better constant name, but you get the idea, right?)

The same for:

while(threading.active_count() > 2): # the main thread (this one) counts as 1 and the daemon too

using the previous constant it becomes:

while(threading.active_count() >= QTY_THREAD_OFFSET):

4.- Regex: if you are going to use a regex multiple times, a good practice is to compile it once beforehand. Instead of:
re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls

you can do this:

URL_RE_PATTERN = '(?<=href=["\'])https?://.+?(?=["\'])'
...

class Crawler():
    regex_urls = re.compile(URL_RE_PATTERN)

    def __init__(self):
        ...

    def get_links(self, url):
        ...
        urls = self.regex_urls.findall(html)  # reuse the compiled pattern

Also, you can use BeautifulSoup to get the urls.
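
For instance, a minimal sketch with BeautifulSoup (assuming `bs4` is installed and `html` holds the page source, as in `get_links`):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
urls = [a['href'] for a in soup.find_all('a', href=True)
        if a['href'].startswith(('http://', 'https://'))]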


5.- In `verify_and_add` you can try to avoid nested `if`s:

if(not self.terminate):
    if(url not in self.data):
        self.data.add(url)

It could be replaced by:

if self.terminate or url in self.data:
    return False
self.data.add(url)
...

6.- A little detail:

if(req is not None):

It could be:

if req:

(But note the edit at the top: for a `requests.Response` these two checks are not equivalent.)
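
A quick demonstration of the difference (httpbin.org is just a convenient test endpoint):

import requests

resp = requests.get('https://httpbin.org/status/404')
print(resp is not None)  # True: a Response object came back
print(bool(resp))        # False: requests treats error responses as falsy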

7.- Are you removing the visited urls from `list_urls` after they are visited? In `renew_url` you are getting a random url, but I can't see whether you are removing it from that list.
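
A hedged sketch of what that removal could look like (assuming `list_urls` is a list; the body of `renew_url` is otherwise my guess):

import random

def renew_url(self):
    url = random.choice(self.list_urls)
    self.list_urls.remove(url)  # drop it so the same url is not picked twice
    return url
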
8.- Wait a few random seconds between requests for pages on the same server; you are crawling a website, and most of them aren't happy with back-to-back requests.
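
A minimal sketch, with arbitrary bounds:

import random
import time

time.sleep(random.uniform(1.0, 3.0))  # polite, randomized pause before hitting the same host again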

That is what I see on a first pass. Hope it helps!
