5
\$\begingroup\$

I've created a crawler using class. This crawler is able to scrape a certain webpage. Total data out there is 249 and the data are displayed there through different pages. I tried to make it accurately. Here is what I did.

import requests
from lxml import html
class wiseowl:
 def __init__(self, start_url):
 self.start_url = start_url
 self.links = [self.start_url] # a list of links to crawl
 self.storage = []
 def crawl(self): # calling get_link for every link in self.links 
 for link in self.links : 
 self.get_link(link)
 def get_link(self,link):
 print('Crawling: ' + link)
 url = "http://www.wiseowl.co.uk"
 response = requests.get(link)
 tree = html.fromstring(response.text)
 for items in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
 name = items.xpath(".//a/text()")[0]
 urls = url + items.xpath(".//a/@href")[0]
 docs = name , urls
 self.storage.append(docs)
 next_page = tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//*[@class='woPagingItem' or @class='woPagingNext']/@href") # get links form 'woPagingItem' or 'woPagingNext' # 
 for npage in next_page:
 if npage and url + npage not in self.links : # avoid getting the same link twice 
 self.links += [url + npage]
 def __str__(self):
 return "{}".format(self.storage)
crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
 print(item)
200_success
146k22 gold badges190 silver badges479 bronze badges
asked Jun 5, 2017 at 20:03
\$\endgroup\$

1 Answer 1

4
\$\begingroup\$

General

  • Imports should be grouped, and groups should be separated by a single blank line.1

  • Class names should use CamelCase.2

  • There shouldn't be a blank line following the class signature.

  • Top-level function and class definitions should be separated by two blank lines.3 You already correctly separate method definitions by a single blank line. :)

  • Assignment operators should be separated by whitespace.4

  • In wiseowl, no method ever needs access to self.start_url (the only exception being __init__(), of course). You might as well get rid of it.

  • If you want to cast an object to str, just pass the object to the str constructor:

    str_obj = str(obj)
    

Rewrite

I've removed redundant code and improved code style.

from lxml import html
import requests
class WiseOwl:
 def __init__(self, start_url):
 self.links = [start_url]
 self.storage = []
 def crawl(self): 
 # Calling get_link for every link in self.links 
 for link in self.links : 
 self.get_link(link)
 def get_link(self, link):
 print('Crawling: ' + link)
 url = "http://www.wiseowl.co.uk"
 response = requests.get(link)
 tree = html.fromstring(response.text)
 for items in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
 name = items.xpath(".//a/text()")[0]
 urls = url + items.xpath(".//a/@href")[0]
 docs = name , urls
 self.storage.append(docs)
 next_page = tree.xpath("//div[contains(concat(' ', @class, ' '), ' 
woPaging ')]//*[@class='woPagingItem' or @class='woPagingNext']/@href") 
 for npage in next_page:
 if npage and url + npage not in self.links: 
 # Avoid getting the same link twice 
 self.links += [url + npage]
 def __str__(self):
 return str(self.storage)
crawler = WiseOwl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
 print(item)

References

1 PEP-8: Imports

2 PEP-8: Naming Conventions: Descriptive: Naming Styles

3 PEP-8: Blank Lines

4 PEP-8: Other Recommendations

answered Jun 5, 2017 at 21:08
\$\endgroup\$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.