Extracting certain products from a webpage using Scrapy

Question 1

I've written a script using Python's Scrapy to harvest product names and prices from books.toscrape. I'm submitting this small piece of code to Code Review because, in Python 3, the CSV output that Scrapy produces looks awkward when it is generated with the default command (scrapy crawl toscrapesp -o items.csv -t csv): there is a blank line between every two rows. I fixed this with the script below. Instead of using the default command to get the CSV output, I wrote a few lines of code in the spider class and got the desired output.

Although it runs smoothly, I'm not sure it is the ideal way of doing this. I would appreciate any suggestions on how to improve the script.

"items.py" includes:

import scrapy


class ToscrapeItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()

Spider contains:

import csv
import scrapy

outfile = open("various_pro.csv", "w", newline='')
writer = csv.writer(outfile)


class ToscrapeSpider(scrapy.Spider):
    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.collect_data)

    def collect_data(self, response):
        global writer
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first()
            value = item.css('.price_color::text').extract_first()
            yield {'Name': product, 'Price': value}
            writer.writerow([product, value])

Please follow this link to see the output I was getting earlier. When I run the script above, I get CSV output with no line gaps or blank rows.
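One detail of the script above is that the module-level file handle is opened at import time and never explicitly closed. The following is only a minimal sketch of the same CSV-writing idea with the file managed by a `with` block. It assumes the rows have already been collected as dicts with the keys Name and Price; the helper name write_products is hypothetical and not part of the original script. In a Scrapy project, logic like this would more naturally live in an item pipeline, which is not shown here.

```python
import csv


def write_products(path, rows):
    """Write dicts with 'Name' and 'Price' keys to a CSV file.

    newline='' stops Python from translating the line endings that
    the csv module already writes, and the with block closes the file.
    """
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=["Name", "Price"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_products("various_pro.csv", rows) with a list of such dicts should produce one header line followed by one line per product, with no blank rows in between.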

Answer 1 (alecxe)

I don't think you should reinvent the wheel and provide your own CSV export. The following works for me as is (note the addition of .strip() calls - though I don't think they are necessary at all):

import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.collect_data)

    def collect_data(self, response):
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first().strip()
            value = item.css('.price_color::text').extract_first().strip()
            yield {'Name': product, 'Price': value}

Running it with scrapy runspider spider.py -o output.csv -t csv produces a CSV file with no blank lines:

Price,Name
$53.74,Tipping the Velvet
$29.69,Forever and Forever: The ...
$55.53,A Flight of Arrows ...
$36.95,The House by the ...
$30.25,Mrs. Houdini
$28.08,The Marriage of Opposites
...

Comment by MITHU

Thank you for your kind reply, sir. The thing is, I have been suffering from this "line gap" issue in the CSV output for the last two years. I tried several different approaches with no luck until I used the one shown above. I just ran the script with the .strip() changes you suggested and got the output with the same issue again. I don't know whether this happens only in my case or also for other people using Python 3.5. In any case, this is why I used the customized portion in my spider. It looks a bit awkward, but it works. Thanks, sir.

Comment by alecxe

@Mithu got it. Okay, but can you be sure the problem is not with your CSV editor? What if you open the CSV file with a simple text editor - do you still see these blank lines there? Thanks.
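The check suggested in this comment can also be done without any editor by looking at the raw bytes of the file. The following is a small sketch, not something from the thread: the function name classify_line_endings and the returned keys are hypothetical. It counts how many lines end in a doubled carriage return (\r\r\n), a normal CRLF (\r\n), or a bare LF (\n).

```python
def classify_line_endings(data):
    """Count line-ending styles in the raw bytes of a CSV file."""
    doubled = data.count(b"\r\r\n")          # CR CR LF: shows up as blank rows
    crlf = data.count(b"\r\n") - doubled     # plain CRLF, excluding doubled ones
    lf = data.count(b"\n") - data.count(b"\r\n")  # bare LF
    return {"doubled_cr": doubled, "crlf": crlf, "lf": lf}


# Usage (the file name is an assumption):
# with open("items.csv", "rb") as f:
#     print(classify_line_endings(f.read()))
```

A non-zero doubled_cr count would suggest that the blank rows are really in the file rather than being an artifact of the program used to view it.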

Comment by MITHU

Sorry for the delayed response, sir alecxe; I was not around. For your inspection, I have uploaded a CSV file produced by Scrapy using the command you suggested. What you asked is a little tricky for me, which is why I uploaded the file. Perhaps you can work out what the underlying problem is. Thanks, sir. Here is the link: dropbox.com/s/xv7wfnzivshlu5m/items.csv?dl=0

Comment by alecxe

@Mithu ah, of course, this is the Windows-specific problem. You should probably patch an item exporter as suggested here. Hope that helps.
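For context on why the problem would be Windows-specific: Python's csv module ends each row with \r\n by default, and a text stream that translates \n into \r\n (the default behaviour of text-mode files on Windows) turns that into \r\r\n, which many tools display as a blank row. The following is a minimal standard-library sketch that simulates this on any platform. It assumes this newline translation is the cause in the asker's case, which the thread does not confirm in detail; the helper name csv_bytes is hypothetical, and this is not Scrapy's own exporter code.

```python
import csv
import io


def csv_bytes(rows, newline):
    """Write rows with csv.writer through a text wrapper; return the raw bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    csv.writer(text).writerows(rows)
    text.flush()
    return raw.getvalue()


rows = [["Name", "Price"], ["Mrs. Houdini", "30.25"]]

# newline="\r\n" mimics Windows text mode: every "\n" becomes "\r\n".
windows_like = csv_bytes(rows, newline="\r\n")
# newline="" disables translation, as the csv documentation recommends.
untranslated = csv_bytes(rows, newline="")

print(windows_like)   # b'Name,Price\r\r\nMrs. Houdini,30.25\r\r\n'
print(untranslated)   # b'Name,Price\r\nMrs. Houdini,30.25\r\n'
```

This is also why the script in the question, which opens its file with newline='', does not show the gap.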

Comment by MITHU

One last thing, sir: do I have to create this myself (I mean scrapy.exporters), or is it located somewhere within a Scrapy project, like settings.py, middleware.py, etc.? Thanks, sir.

Attribution: answer by alecxe, 2017-09-18 02:15:27Z; comments by MITHU and alecxe, 2017-09-18.

