-- Edit: As Guimo pointed out, the two statements in point 6 are not equivalent (a `requests` response with an error status code is falsy even though it is not `None`).
So I would suggest changing
def get_links(self, url):
    req = self.get_req(url)
    if(req is not None):
        html = req.text # the html of the page crawled
        urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
        return urls
    return []
to
def get_links(self, url):
    req = self.get_req(url)
    if req is None:
        return []
    html = req.text # the html of the page crawled
    urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
    return urls
I.e., remove the parentheses and return early when there is no response.
1.- Yes, you can use a semicolon to put more than one statement on the same line, but...
Compound statements (multiple statements on the same line) are generally discouraged.
if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line
Source: PEP-8
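For example, that line reads better split into separate statements (a small sketch keeping your original message):
if not urls:
    print('[-] Provide start urls')
    return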
2.- Wait some milliseconds inside all those `while True:` loops (i.e. `wait_kill_threads`, `print_manager`, maybe in `producer` too); your processor will thank you.
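A minimal sketch of the idea (the loop body is only a placeholder for your existing checks, and the interval is a made-up value):
import time

POLL_INTERVAL = 0.05  # seconds; pick whatever granularity you need

def wait_kill_threads(self):
    while True:
        # ... your existing termination checks ...
        time.sleep(POLL_INTERVAL)  # yield the CPU instead of busy-waiting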
3.- Avoid [Magic Numbers][2] or [Hardcoded numbers][3]. On the last line in `producer`:
.....threading.active_count()-3)) # -3 because mainthread and deamon thread
a better practice is to define a constant at the beginning of the file:
QTY_THREAD_OFFSET = 3
.....threading.active_count() - QTY_THREAD_OFFSET))
(maybe you can think of a better constant name, but you get the idea, right?)
The same for:
while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon too
using the previous constant it becomes:
while threading.active_count() >= QTY_THREAD_OFFSET:
4.- Regex: If you are going to use a regex multiple times, a good practice is to compile it once up front. Instead of:
re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
you can do this:
URL_RE_PATTERN = '(?<=href=["\'])https?://.+?(?=["\'])'
...
class Crawler():
    regex_urls = re.compile(URL_RE_PATTERN)

    def __init__(self):
        ...

    def get_links(self, url):
        ....
        urls = self.regex_urls.findall(html)
You can also use BeautifulSoup to extract the urls.
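A minimal sketch with BeautifulSoup (assuming `get_req` returns a `requests` response or `None`, as in your code):
from bs4 import BeautifulSoup

def get_links(self, url):
    req = self.get_req(url)
    if req is None:
        return []
    soup = BeautifulSoup(req.text, 'html.parser')
    # keep only absolute http(s) links, like the regex does
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith(('http://', 'https://'))]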
5.- In `verify_and_add` you can avoid the nested `if`s:
if(not self.terminate):
    if(url not in self.data):
        self.data.add(url)
It could be replaced by:
if self.terminate or url in self.data:
    return False
self.data.add(url)
...
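Putting it together, the whole method could look like this (a sketch; I am assuming it should report with a boolean whether the url was added, as the `return False` above suggests):
def verify_and_add(self, url):
    # guard clause instead of nested ifs
    if self.terminate or url in self.data:
        return False
    self.data.add(url)
    return True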
6.- A little detail:
if(req is not None):
It could be:
if req:
7.- Are you removing the visited urls from `list_urls` after they are visited? In `renew_url` you are getting a random url, but I can't see whether you remove it from that list.
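If not, you can pick and remove the url in one step; a sketch, assuming `list_urls` is a plain list:
import random

def renew_url(self):
    # pop a random url so it cannot be handed out twice
    index = random.randrange(len(self.list_urls))
    return self.list_urls.pop(index)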
8.- Wait a few random seconds between requests to pages on the same server; you are crawling a website, and most of them aren't happy with aggressive crawling.
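Something as simple as this before each request would already help (the delay bounds are made-up numbers):
import random
import time

MIN_DELAY, MAX_DELAY = 1.0, 3.0  # seconds

def get_req(self, url):
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))  # be polite to the server
    # ... your existing request code ...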
That is what I see on a first pass. Hope it helps!