I've written a Scrapy spider in Python to harvest product names and prices from books.toscrape.com. I'm submitting this small script to Code Review because, in Python 3, the CSV output that Scrapy produces looks awkward when it is generated with the default command, as in scrapy crawl toscrapesp -o items.csv -t csv. The resulting CSV file has a blank line between every two rows. I've fixed this with the script below: instead of using the default command to get the CSV output, I've written a few lines of code in the spider module and got the desired output.
Although it runs smoothly, I'm not sure it is the ideal way of doing this. I'd appreciate any suggestions on how I can improve this script.
"items.py" includes:
import scrapy


class ToscrapeItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
Spider contains:
import csv
import scrapy

outfile = open("various_pro.csv", "w", newline='')
writer = csv.writer(outfile)


class ToscrapeSpider(scrapy.Spider):
    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.collect_data)

    def collect_data(self, response):
        global writer
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first()
            value = item.css('.price_color::text').extract_first()
            yield {'Name': product, 'Price': value}
            writer.writerow([product,value])
Please click this link to see the output I was getting earlier. After running the script above, I get CSV output with no line gaps or blank rows.
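A minimal sketch of why newline='' plausibly removes the blank rows, assuming the script runs on Windows, where a text-mode file opened without newline='' turns every '\n' it is given into '\r\n'. The csv module already ends each row with '\r\n', so the translated stream ends up with '\r\r\n', which many spreadsheet programs show as an extra empty row. The helper name write_rows and the sample rows are made up for illustration, and io.TextIOWrapper with newline='\r\n' is used here only to imitate the Windows translation on any platform:

```python
import csv
import io


def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    # newline='\r\n' imitates the Windows default ('\n' becomes '\r\n');
    # newline='' switches the translation off, as in the spider above.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    csv.writer(text).writerows([["Name", "Price"], ["Mrs. Houdini", "30.25"]])
    text.flush()
    text.detach()  # release the buffer so it is not closed with the wrapper
    return raw.getvalue()


translated = write_rows("\r\n")   # rows end with b'\r\r\n'
untranslated = write_rows("")     # rows end with b'\r\n'

print(translated)
print(untranslated)
```

If the first result contains the doubled carriage return and the second does not, that matches the behaviour described in the question: the data is the same, only the row terminators differ.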
1 Answer
I don't think you should reinvent the wheel and provide your own CSV export. The following works for me as is (note the addition of the .strip() calls, though I don't think they are necessary at all):
import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.collect_data)

    def collect_data(self, response):
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first().strip()
            value = item.css('.price_color::text').extract_first().strip()
            yield {'Name': product, 'Price': value}
Running it with scrapy runspider spider.py -o output.csv -t csv produces a CSV file with no blank lines:
Price,Name
$53.74,Tipping the Velvet
$29.69,Forever and Forever: The ...
$55.53,A Flight of Arrows ...
$36.95,The House by the ...
$30.25,Mrs. Houdini
$28.08,The Marriage of Opposites
...
Thanks, sir, for your kind reply. The thing is, I have been suffering from this "line gap" issue in the CSV output for the last two years. I tried several different ways, with no luck until I used the approach shown above. I just ran the script with the .strip() changes you suggested and got the output with the same issue again. I don't know whether it happens only in my case or also for other people using Python 3.5. However, this is the reason I used that customized portion in my spider. It looks a bit awkward, but it works. Thanks, sir. – MITHU, Sep 18, 2017 at 6:22
@Mithu got it. Okay, but can you be sure the problem is not with your CSV editor? What if you open the CSV file with a simple text editor - do you still see these blank lines there? Thanks. – alecxe, Sep 18, 2017 at 13:36
Sorry, sir alecxe, for this delayed response; I was not around. For your observation, I just uploaded a CSV file produced by Scrapy using the command you suggested. What you said is a little bit tricky for me, which is why I uploaded it. Perhaps you can work out what the underlying problem is. Thanks, sir. Here is the link: dropbox.com/s/xv7wfnzivshlu5m/items.csv?dl=0 – MITHU, Sep 18, 2017 at 17:04
@Mithu ah, of course, this is the Windows-specific problem. You should probably patch an item exporter as suggested here. Hope that helps. – alecxe, Sep 18, 2017 at 20:07
One last thing, sir: do I have to create scrapy.exporters myself, or is it located somewhere within the Scrapy project, like settings.py, middleware.py, etc.? Thanks, sir. – MITHU, Sep 18, 2017 at 20:44
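The check suggested in the comments (look at the file itself rather than at how a spreadsheet program renders it) can also be done with a few lines of Python. This is only a sketch: the function name count_line_endings and the sample file written to a temporary directory are hypothetical, and it assumes the blank rows come from rows ending in '\r\r\n' instead of '\r\n'.

```python
import tempfile
from pathlib import Path


def count_line_endings(path):
    """Count each kind of row terminator found in the raw bytes of a file."""
    data = Path(path).read_bytes()
    doubled = data.count(b"\r\r\n")
    # Every b'\r\r\n' also contains one b'\r\n', so subtract those.
    crlf = data.count(b"\r\n") - doubled
    # Whatever b'\n' is left over is a bare line feed.
    lf = data.count(b"\n") - crlf - doubled
    return {"cr_cr_lf": doubled, "cr_lf": crlf, "lf": lf}


# Demo on a small sample file that imitates the problematic output.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "items.csv"
    sample.write_bytes(b"Price,Name\r\r\n30.25,Mrs. Houdini\r\r\n")
    sample_counts = count_line_endings(sample)

print(sample_counts)
```

A non-zero cr_cr_lf count for the real items.csv would suggest the extra rows are in the file itself and not an artifact of the program used to open it.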