
I am trying to improve my programming and program design skills (they are poor at the moment). I created a small Amazon scraper; it is a working program. I would be very grateful if you could review my design. What could be improved? Does my class design make sense?

main.py:

if __name__ == "__main__":
    from Scraper import Scraper
    scraper = Scraper("Schach", 19, 30)
    productLinks = scraper.scrapeProductLinks()
    for link in productLinks:
        try:
            product = scraper.scrapeProductDetails(link)
            price = product[1]
            alternative_price = product[2]
            reviews = product[3]
            if (price < 20) or (alternative_price < 20) or (reviews) < 30:
                continue
            scraper.saveProductListToCSV(product)
        except Exception as err:
            print(err)

Extractor.py:

from lxml import html as xmlhtml

class Extractor:
    def __init__(self, html, link=""):
        self.html = html
        self.link = link
        self.xml_tree = xmlhtml.fromstring(html)

    def castPrice(self, price):
        if not isinstance(price, str):
            price = "-1"
        return float(price.replace("EUR ", "").replace(",", "."))

    def extractTitle(self):
        try:
            self.title = self.xml_tree.xpath('//*[@id="productTitle"]/text()')[0]
        except IndexError:
            return "NA"
        return self.title

    def extractPrice(self):
        try:
            price = self.xml_tree.xpath('//*[@id="priceblock_ourprice"]/text()')
            alternative_price = self.xml_tree.xpath('//*[@id="buyNewSection"]/div/div/div/div[2]/a/div/div[2]/span[1]/span/text()')
            if price:
                price = price[0]
            elif alternative_price:
                price = alternative_price[0]
        except IndexError:
            return "-1"
        return self.castPrice(price)

    def extractCompetitionPrice(self):
        try:
            competition_price = self.xml_tree.xpath('//*[@id="olp_feature_div"]/div/span[1]/span/text()')[0]
            competition_price = self.castPrice(competition_price)
        except IndexError:
            return "-1"
        return competition_price

    def extractReviews(self):
        try:
            reviews = self.xml_tree.xpath('//*[@id="acrCustomerReviewText"]/text()')[0]
            reviews = reviews.replace(" Kundenrezensionen", "")
        except IndexError:
            return 0
        return int(reviews)

    def getExtractedProduct(self):
        return [self.extractTitle(), self.extractPrice(), self.extractCompetitionPrice(), self.extractReviews(), self.link]

scraper.py:

import requests
from lxml import html as xmlhtml
import csv
from Extractor import Extractor

class Scraper:
    def __init__(self, keyword, price_span, min_review):
        self.csv_header = "Titel\tPreis\tAnbieter\tBewertungen\tLink\n"
        self.startUrl = "http://www.amazon.de/s/field-keywords={0}&page={1}"
        self.keyword = keyword
        self.price_span = float(price_span)
        self.min_review = int(min_review)
        self.links_list = list()

    def getHtml(self, urlAdress):
        r = requests.get(urlAdress)
        return r.text

    def saveProductListToCSV(self, productList):
        with open("amazon.csv", "a", newline='') as f:
            writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
            writer.writerow(productList)

    def scrapeProductLinks(self):
        x = 1
        while x < 2:
            html = self.getHtml(self.startUrl . format(self.keyword, x))
            tree = xmlhtml.fromstring(html)
            self.links_list += tree.xpath("//a[@class='a-link-normal s-access-detail-page a-text-normal']/@href")
            x += 1
        return self.links_list

    def scrapeProductDetails(self, url):
        extractor = Extractor(self.getHtml(url), url)
        return extractor.getExtractedProduct()
asked Feb 27, 2015 at 9:07

2 Answers


Starting in your main and following the path the code takes:

if __name__ == "__main__":
    # large block of code

This is something I usually don't like to see. I'd expect your module to be usable from the outside, even when I'm not invoking it directly. As written, that's impossible.

I'd have expected instead:

def main():
    """ docstring """
    # large block of code

if __name__ == "__main__":
    main()

This allows me to call your program from somewhere else like:

from module import main as amazon_scraper
amazon_scraper()

This gets me to my next point:

Import at the top of your file if you can. It's much simpler to keep track of your imports that way, and it's also the recommended practice in the official Python style guide (PEP 8).
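Applied to your main.py, that just means pulling the import out of the __main__ block; a minimal sketch (the body is elided):

from Scraper import Scraper  # module-level import instead of importing inside the __main__ block

def main():
    """Entry point: scrape, filter and save products."""
    ...  # the body of your current __main__ block goes here

if __name__ == "__main__":
    main()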

The next thing referenced is your Scraper. In the main method, four things are done with it:

scraper = Scraper("Schach", 19, 30)
scraper.scrapeProductLinks()
scraper.scrapeProductDetails(link)
scraper.saveProductListToCSV(product)

<interjection>You are violating Python's naming conventions here: methods are supposed to be named in snake_case (just like your variables are).</interjection>

First you construct your scraper with a string and two numbers. The numbers seem completely arbitrary, and I have absolutely no idea what they are there for. Numbers like that are called magic numbers.

It's considered good practice to extract such magic numbers to appropriately named constants.
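For example (the constant names are only my guesses at what 19 and 30 mean, based on the parameter names in Scraper.__init__):

SEARCH_KEYWORD = "Schach"
PRICE_SPAN = 19        # passed to Scraper as price_span
MIN_REVIEW_COUNT = 30  # passed to Scraper as min_review

scraper = Scraper(SEARCH_KEYWORD, PRICE_SPAN, MIN_REVIEW_COUNT)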

Then you have your scraper scrape_product_links, which seems to give you a list. So far so good.

Next, you have your scraper scrape_product_details, which seems to give you a fixed-size list or tuple. Nothing terribly wrong with that either. No wait, there is a piece missing.

Right after that you access the tuple's data with the indices 1, 2 and 3. These are again magic numbers. You should instead define a class that can hold the product data:

class Product:
    def __init__(self, price, alt_price, reviews):
        self.price = price
        self.alt_price = alt_price
        self.reviews = reviews
    # getters for immutable properties

This would completely eliminate the magic numbers and allow you to extend the product class without having to fear breaking calling code.
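Assuming scrape_product_details were changed to return such a Product (and using the snake_case names suggested above), the filtering in your main would read like this instead of juggling indices:

# inside the `for link in productLinks:` loop
product = scraper.scrape_product_details(link)
if product.price < 20 or product.alt_price < 20 or product.reviews < 30:
    continue
scraper.save_product_list_to_csv(product)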

And last but not least: You call your scraper to save_product_list_to_csv. And this is where I have to intervene.

You're violating the single responsibility principle here. Your scraper should only scrape_* things. It mustn't be responsible for saving that data to some CSV; that should be the job of something else.
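One way to split that out is a small, dedicated writer class; a sketch (the class name and default filename are made up, the CSV code is just your existing saveProductListToCSV body):

import csv

class ProductCsvWriter:
    """Appends product rows to a tab-separated CSV file."""

    def __init__(self, filename="amazon.csv"):
        self.filename = filename

    def write_row(self, product_row):
        with open(self.filename, "a", newline="") as f:
            writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
            writer.writerow(product_row)

The main creates one writer and hands each scraped product to it, and the Scraper is left with nothing but scraping.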

On the other hand your Extractor is clean, simple and has a single responsibility. But there's one thing I'm completely missing overall:

Documentation:

It's critical to provide documentation alongside your code. You should document assumptions made and parameter responsibilities in all your classes and methods.

While that might sound dumb right now, Mr. Maintainer (a.k.a. your future self) will thank you when they look at this code in two months and understand zilch of what happens where. It saves them headaches and a lot of time.
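In Python that usually means docstrings; for example, on your scrapeProductDetails (the wording is of course up to you, the body is unchanged):

def scrapeProductDetails(self, url):
    """Download the product page at `url` and return a list of
    [title, price, competition_price, review_count, url]."""
    extractor = Extractor(self.getHtml(url), url)
    return extractor.getExtractedProduct()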

Overall:

  • You have no documentation whatsoever. That's bad.
  • You also do not respect some major Python conventions. That's not quite as bad, but alas.
  • Your Scraper isn't quite clean, since it's responsible for "too much". That's meh.
  • Your code cannot be cleanly invoked from the outside. That's bad.
  • Your Extractor is clean and rather simple to understand. That's good!
  • Your main is easy to follow, and your program is nicely abstracted (well, when skipping the magic number stuff). That's also good!

In summary: You can definitely do better ;)

answered Feb 27, 2015 at 9:56
  • Many thanks for your thorough review. I will process your recommendations this evening. – Commented Feb 27, 2015 at 10:48
def scrapeProductLinks(self):
    x = 1
    while x < 2:
        html = self.getHtml(self.startUrl . format(self.keyword, x))
        tree = xmlhtml.fromstring(html)
        self.links_list += tree.xpath("//a[@class='a-link-normal s-access-detail-page a-text-normal']/@href")
        x += 1
    return self.links_list

What's going on here?

What's stopping you from doing this?

def scrapeProductLinks(self):
    html = self.getHtml(self.startUrl . format(self.keyword, 1))
    tree = xmlhtml.fromstring(html)
    self.links_list += tree.xpath("//a[@class='a-link-normal s-access-detail-page a-text-normal']/@href")
    return self.links_list

It's a loop with only 1 iteration. So you don't need a loop...
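And if the hard-coded bound is really a stand-in for "more pages later", making the page count an explicit parameter says so much more clearly; a sketch (max_pages is a name I made up):

def scrapeProductLinks(self, max_pages=1):
    # fetch each search results page and collect the product links it contains
    for page in range(1, max_pages + 1):
        html = self.getHtml(self.startUrl.format(self.keyword, page))
        tree = xmlhtml.fromstring(html)
        self.links_list += tree.xpath("//a[@class='a-link-normal s-access-detail-page a-text-normal']/@href")
    return self.links_list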

answered Feb 27, 2015 at 9:37
  • Many thanks. scrapeProductLinks() fetches a page with 16 product links. These links are stored in a list. It then goes to the next page and repeats the process. It only loops once for testing purposes, so it only fetches one page. Normally it would fetch many more pages than that. – Commented Feb 27, 2015 at 9:42
