\$\begingroup\$

I've written a script in Python using Scrapy to parse the "name" and "price" of different products from a website. First, it scrapes the links of the different categories from the top navigation bar on the main page. It then follows each category link and parses the links of the sub-categories from that page. Finally, it reaches the target page and parses the aforementioned data from there. I tried to do the whole thing slightly differently from the conventional method, in which it is necessary to set rules. It works the way I expected with the logic I applied here. If any improvement can be made, I'll be glad to apply it. Here is what I've tried:

"sth.py" (the spider) contains:

import scrapy


class SephoraSpider(scrapy.Spider):
    name = "sephorasp"

    def start_requests(self):
        yield scrapy.Request(url = "https://www.sephora.ae/en/stores/", callback = self.parse_pages)

    def parse_pages(self, response):
        for link in response.xpath('//ul[@class="nav-primary"]//a[contains(@class,"level0")]/@href').extract():
            yield scrapy.Request(url = link, callback = self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for links in response.xpath('//li[contains(@class,"amshopby-cat")]/a/@href').extract():
            yield scrapy.Request(url = links, callback = self.target_page)

    def target_page(self, response):
        for titles in response.xpath('//div[@class="product-info"]'):
            product = titles.xpath('.//div[contains(@class,"product-name")]/a/text()').extract_first()
            rate = titles.xpath('.//span[@class="price"]/text()').extract_first()
            yield {'name':product,'price':rate}

"items.py" includes:

import scrapy


class SephoraItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

The command I used to run the spider and write the results to a CSV file is:

scrapy crawl sephorasp -o items.csv -t csv
asked Aug 5, 2017 at 16:33
\$\endgroup\$

1 Answer

\$\begingroup\$

The code is quite clean and easy to read, good job!

I would only focus on a couple of things:

  • remove the extra spaces around = when it is used to pass a keyword argument
  • CSS selectors are more appropriate and reliable than XPath expressions when it comes to handling multi-valued class attributes. Plus, they are more concise and generally faster
  • naming - "for links in" should actually be "for link in", since each iteration handles a single link
  • as for the target_page method, I don't think you need a loop - if I understand correctly, there should be a single "product" parsed at this point
  • you may use start_urls instead of start_requests(); the responses are then passed to the default parse() callback
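
To make the point about multi-valued class attributes more concrete, here is a small Scrapy-free sketch in plain Python. The three helper functions and the sample class strings are hypothetical; they only mimic the matching rules and are not part of Scrapy or of the site's actual markup.

```python
def xpath_exact(class_attr, name):
    """Mimics XPath @class="name": the whole attribute string must equal name."""
    return class_attr == name


def xpath_contains(class_attr, name):
    """Mimics XPath contains(@class, "name"): a plain substring search."""
    return name in class_attr


def css_class(class_attr, name):
    """Mimics the CSS selector .name: name must be one whitespace-separated token."""
    return name in class_attr.split()


# Hypothetical class attribute values, for illustration only.
samples = ["level0", "level0 first parent", "level01"]

for attr in samples:
    print(
        repr(attr),
        "exact:", xpath_exact(attr, "level0"),
        "contains:", xpath_contains(attr, "level0"),
        "css:", css_class(attr, "level0"),
    )
```

The exact comparison misses an element as soon as a second class is added, and the substring check also accepts unrelated classes that merely start with the same characters. The token-based check, which is how a CSS class selector behaves, avoids both problems.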

All things taken into account:

import scrapy


class SephoraSpider(scrapy.Spider):
    name = "sephorasp"
    start_urls = ["https://www.sephora.ae/en/stores/"]

    def parse(self, response):
        for link in response.css('ul.nav-primary a.level0::attr(href)').extract():
            yield scrapy.Request(url=link, callback=self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for link in response.css('li.amshopby-cat > a::attr(href)').extract():
            yield scrapy.Request(url=link, callback=self.target_page)

    def target_page(self, response):
        name = response.css('.product-name > a::text').extract_first()
        price = response.css('span.price::text').extract_first()
        yield {'name': name, 'price': price}

(not tested)
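
If the final page turns out to list several products rather than a single one (the original spider iterates over every div.product-info block, which hints at this), the loop can be kept while still using CSS selectors. A minimal sketch, not tested against the site: iter_products is a hypothetical helper name, and the selectors are assumed to match the markup the question targets.

```python
def iter_products(response):
    """Yield one {'name', 'price'} dict per product block found on the page.

    `response` is expected to behave like a Scrapy response: .css() returns
    selectors that themselves support .css() and .extract_first().
    """
    # Assumption: each product sits in its own <div class="product-info">.
    for product in response.css('div.product-info'):
        # Queries on `product` only search inside that one block.
        name = product.css('.product-name > a::text').extract_first()
        price = product.css('span.price::text').extract_first()
        yield {'name': name, 'price': price}
```

Inside the spider, target_page could then delegate to it with "yield from iter_products(response)", so that every product on the page produces its own row in the output.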

answered Aug 5, 2017 at 18:59
\$\endgroup\$
  • Thanks, alecxe, for the review and the clean demo. I had no trouble running your script; if I find any, I'll let you know. By the way, there is one thing I'd like to know: you used "a::text" to get the title and "a::attr(href)" to get the links in this example. When I use the same expressions elsewhere, I get an error. Is it always possible to get the text value with an expression like this when using CSS selectors outside of Scrapy? So far I have used "name.text" and the like. Thanks again. Commented Aug 5, 2017 at 19:22
  • @Shahin could it be because of the removed loop in the last callback? Unfortunately, I don't have a chance to test at the moment. Thanks. Commented Aug 5, 2017 at 21:04
  • No way. This is exactly what you said. It was because of the last loop. Commented Aug 5, 2017 at 21:10
