Scrapy crawler to parse data recursively

Question 1

I've written a spider with Python's Scrapy to parse the "name" and "price" of different products from a website. First, it scrapes the category links from the navigation bar at the top of the main page. It then follows each category link, parses the sub-category links from those pages, and finally reaches the target pages and parses the aforementioned data from them. I tried to do the whole thing slightly differently from the conventional approach, in which it is necessary to set rules. I got it working the way I expected using the logic applied here. If any improvements can be made, I'd be glad to hear them. Here is what I've tried:

"sth.py" (the spider) contains:

import scrapy

class SephoraSpider(scrapy.Spider):
    name = "sephorasp"

    def start_requests(self):
        yield scrapy.Request(url = "https://www.sephora.ae/en/stores/", callback = self.parse_pages)

    def parse_pages(self, response):
        for link in response.xpath('//ul[@class="nav-primary"]//a[contains(@class,"level0")]/@href').extract():
            yield scrapy.Request(url = link, callback = self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for links in response.xpath('//li[contains(@class,"amshopby-cat")]/a/@href').extract():
            yield scrapy.Request(url = links, callback = self.target_page)

    def target_page(self, response):
        for titles in response.xpath('//div[@class="product-info"]'):
            product = titles.xpath('.//div[contains(@class,"product-name")]/a/text()').extract_first()
            rate = titles.xpath('.//span[@class="price"]/text()').extract_first()
            yield {'name':product,'price':rate}

"items.py" includes:

import scrapy

class SephoraItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

The command I used to get the result along with a csv output is:

scrapy crawl sephorasp -o items.csv -t csv

Answer 1

The code is quite clean and easy to read, good job!

I would only focus on a couple of things:

remove the extra spaces around "=" when it is used to pass a keyword argument (see PEP 8)
CSS selectors are more appropriate and reliable than XPath expressions when it comes to handling multi-valued class attributes. Plus, they are more concise. (Scrapy translates them to XPath internally, so do not count on a speed gain.)
naming: "for links in" should actually be "for link in", since each iteration handles a single link
as for the target_page method, I don't think you need a loop; if I understand correctly, there should be a single "product" parsed at this point
you may use start_urls instead of start_requests(); the default callback is then parse()
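To make the point about multi-valued class attributes concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without Scrapy installed, and the markup, class names other than "amshopby-cat", and URLs are made up for illustration. It compares an exact match on the class attribute, a substring test in the spirit of contains(@class, ...), and a token-based test, which is how a CSS class selector such as "li.amshopby-cat" behaves.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only "amshopby-cat" comes from the question,
# the other class names and all URLs are invented for this sketch.
MARKUP = """
<ul>
  <li class="amshopby-cat"><a href="/en/makeup">Makeup</a></li>
  <li class="amshopby-cat level1"><a href="/en/skincare">Skincare</a></li>
  <li class="amshopby-cat-hidden"><a href="/en/hidden">Hidden</a></li>
  <li class="other"><a href="/en/help">Help</a></li>
</ul>
""".strip()

root = ET.fromstring(MARKUP)
TARGET = "amshopby-cat"


def href_of(li):
    """Return the href of the first <a> inside the given <li>."""
    return li.find("a").get("href")


# 1. Exact string comparison, like @class="amshopby-cat":
#    misses the element that carries a second class token.
exact = [href_of(li) for li in root.findall(".//li[@class='amshopby-cat']")]

# 2. Substring test, like contains(@class, "amshopby-cat"):
#    also matches the unrelated "amshopby-cat-hidden".
by_substring = [href_of(li) for li in root.iter("li")
                if TARGET in li.get("class", "")]

# 3. Token test, which is what a CSS class selector does:
#    matches whole whitespace-separated class names only.
by_token = [href_of(li) for li in root.iter("li")
            if TARGET in li.get("class", "").split()]

print(exact)
print(by_substring)
print(by_token)
```

Only the token test returns exactly the two category links, which is the behaviour the CSS selectors in the rewritten spider rely on.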

All things taken into account:

import scrapy

class SephoraSpider(scrapy.Spider):
    name = "sephorasp"
    start_urls = ["https://www.sephora.ae/en/stores/"]

    def parse(self, response):
        for link in response.css('ul.nav-primary a.level0::attr(href)').extract():
            yield scrapy.Request(url=link, callback=self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for link in response.css('li.amshopby-cat > a::attr(href)').extract():
            yield scrapy.Request(url=link, callback=self.target_page)

    def target_page(self, response):
        name = response.css('.product-name > a::text').extract_first()
        price = response.css('span.price::text').extract_first()
        yield {'name': name, 'price': price}

(not tested)

Comment 1

Thanks, alecxe, for the endorsement along with the cleanest demo. I never had any trouble executing your script. Anyway, if I find any, I'll let you know. By the way, one thing I'd like to know: you used "a::text" to get the title and "a::attr(href)" to get the links in this particular example. When I use the same expressions elsewhere, I get an error. Is it always possible to get the text value with a similar expression when using CSS selectors outside of Scrapy? So far I have used "name.text" etc. Thanks again.
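On the question raised in this comment: "::text" and "::attr(name)" are non-standard pseudo-elements that Scrapy's selectors add on top of CSS, so other libraries generally do not accept them. Outside Scrapy, the usual pattern is to select the element itself and then read its text or attribute, which is what "name.text" already does. A minimal sketch, using the standard library's xml.etree.ElementTree as a stand-in for such a library; the markup and URL are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical product markup, invented for this sketch.
MARKUP = '<div class="product-name"><a href="/en/p/lipstick">Lipstick</a></div>'

div = ET.fromstring(MARKUP)

link = div.find("a")      # select the <a> element itself
name = link.text          # counterpart of Scrapy's "a::text"
href = link.get("href")   # counterpart of Scrapy's "a::attr(href)"

print(name, href)
```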

Comment 2

@Shahin Could it be because of the removed loop in the last callback? Unfortunately, I don't have a chance to test at the moment. Thanks.

Comment 3

No way. This is exactly what you said. It was because of the last loop.
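The exchange above suggests that each target page lists several products, so a single lookup against the whole page returns only the first one. A minimal sketch of the difference, again using xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without Scrapy installed; the listing markup, product names, and prices are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with two products (all values invented).
LISTING = """
<div class="products">
  <div class="product-info">
    <div class="product-name"><a href="/p/1">Lipstick</a></div>
    <span class="price">AED 95</span>
  </div>
  <div class="product-info">
    <div class="product-name"><a href="/p/2">Mascara</a></div>
    <span class="price">AED 120</span>
  </div>
</div>
""".strip()

page = ET.fromstring(LISTING)


def first_only(page):
    """Mirror the loop-free callback: one lookup against the whole page."""
    name = page.find(".//div[@class='product-name']/a").text
    price = page.find(".//span[@class='price']").text
    return [{"name": name, "price": price}]


def per_product(page):
    """Mirror the original callback: one item per product container."""
    items = []
    for product in page.findall(".//div[@class='product-info']"):
        # Lookups are relative to the current product, not the whole page.
        name = product.find("./div[@class='product-name']/a").text
        price = product.find("./span[@class='price']").text
        items.append({"name": name, "price": price})
    return items


print(len(first_only(page)), len(per_product(page)))
```

In the spider itself, the equivalent would presumably be to iterate over response.css('div.product-info') in target_page and run the name and price selectors on each product rather than on the response.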

alecxe (17.5k reputation, 8 gold badges, 52 silver badges, 93 bronze badges) · Accepted Answer · 2017-08-05 18:59:15Z
