I've written a script in Scrapy to grab different names and links from different pages of a website and write those parsed items to a CSV file. When I run my script, I get the results as expected and end up with a CSV file filled with data. I'm using Python 3.5, so when I use Scrapy's built-in command to write data to a CSV file, I get a CSV file with a blank line in every alternate row. Eventually, I tried the approach below to achieve flawless output (with no blank lines in between), and it now produces a CSV file without the blank-line issue. I hope I did it the right way. However, if there is anything I can or should do to make it more robust, I'm happy to comply.
This is my script, which gives me flawless output in a CSV file:
import scrapy, csv
from scrapy.crawler import CrawlerProcess

class GetInfoSpider(scrapy.Spider):
    name = "infrarail"
    start_urls = ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65,70)]

    def __init__(self):
        self.infile = open("output.csv","w",newline="")

    def parse(self, response):
        for q in response.css("article.contentslim"):
            name = q.css("h1::text").extract_first()
            link = q.css("p a::attr(href)").extract_first()
            yield {'Name':name,'Link':link}
            writer = csv.writer(self.infile)
            writer.writerow([name,link])

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(GetInfoSpider)
c.start()
Btw, I used CrawlerProcess() to be able to run my spider from the Sublime Text editor.
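For reference, the built-in export I mentioned can also be driven from the same kind of script instead of the command line, by passing feed settings to CrawlerProcess. This is only a sketch reusing the GetInfoSpider defined above (the FEED_FORMAT/FEED_URI settings are Scrapy's feed-export options, not something from my script), and with it the manual csv.writer lines in parse would be unnecessary:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',      # let Scrapy's feed exporter handle the CSV
    'FEED_URI': 'output.csv',  # file the exporter writes to
})
c.crawl(GetInfoSpider)
c.start()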
Welcome to Code Review! That's quite a well-written question, especially for a new user. Well done. – Mast ♦ Jun 30, 2018 at 8:00
3 Answers
I'd like to mention that there is a dedicated way of producing output files in Scrapy: item pipelines. So, in order to do it right, you should write your own pipeline (or modify a standard one via subclassing).
Also, you do not close the file once you're done, and you keep it open most of the time. Both problems are handled nicely with pipelines.
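To illustrate, a minimal sketch of such a pipeline (the class name, file name and column names here are placeholders, not taken from your post) could look like this:

import csv

class CsvWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts, so the file is opened exactly once
        self.file = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["Name", "Link"])

    def close_spider(self, spider):
        # called once when the spider finishes, so the file is always closed
        self.file.close()

    def process_item(self, item, spider):
        # every dict the spider yields passes through here
        self.writer.writerow([item.get("Name"), item.get("Link")])
        return item

You would then enable it through the ITEM_PIPELINES setting, e.g. {'myproject.pipelines.CsvWriterPipeline': 300} (the module path is hypothetical).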
UPD: Well, you asked for a better way, and there it is. If that is not acceptable for some hard-to-explain reason (which is understandable), here are other ways to make it better:
- Don't leave the file open. There is a method, __del__(), which is called when the spider object is destroyed; add the code that closes the file there.
- Another option is to store only the filename in a variable and open/close the file each time you write to it.
- Another option is to use a NoSQL database, which does not need to be opened/closed, and to produce the output file from it after scraping is done.
- If you only have a few values to scrape, you can store them in an object variable and export them before the __del__() method runs.
All the ways above are NOT welcomed by the wider developer community and may lead to serious problems in the future. Use them carefully. Sometimes it's easier (in the long run) to read up on and understand how it really should be done. Maybe this is exactly such a case?
I think you put those lines in the answer section by mistake, whereas they should be a comment. If it were about item pipelines, I would not have used the keyword customized in the title of my post. Thanks. – SIM Jun 29, 2018 at 21:01
Actually, I cannot comment on others' posts yet. – Dmitry Arkhipenko Jun 29, 2018 at 21:04
@asmitu Is there a reason you went for a customized approach? As in, why is the suggested approach not acceptable for you? – Jun 30, 2018 at 8:04
Check out this link to see why the __del__() method should be avoided. – SIM Jun 30, 2018 at 14:44
You should ensure that the file is closed, and in addition you should avoid creating a new writer object in every loop iteration; both can be achieved using the with statement:
import os
import csv

import scrapy


class GetInfoSpider(scrapy.Spider):
    name = "infrarail"
    start_urls = ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65, 70)]
    output = "output.csv"

    def __init__(self):
        # empty output file
        open(self.output, "w").close()
        # alternative:
        # if os.path.isfile(self.output):
        #     os.remove(self.output)

    def parse(self, response):
        with open(self.output, "a", newline="") as f:
            writer = csv.writer(f)
            for q in response.css("article.contentslim"):
                name = q.css("h1::text").extract_first()
                link = q.css("p a::attr(href)").extract_first()
                writer.writerow([name, link])
                yield {'Name': name, 'Link': link}
Note that I also added some spaces after commas to improve readability, according to Python's official style guide, PEP 8.
It also recommends importing only one module per line (so while from random import random, randint is fine, import scrapy, csv is not).
Also note that each item is only written to the file when the next one is requested, because a generator pauses after the yield. That means that if you, for example, itertools.islice it, your last item won't be written to the file. That is why I swapped those two lines.
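To see that effect in isolation, here is a toy example (unrelated to the spider itself) where the code placed after the yield never runs for the last consumed item:

import itertools

def gen():
    for i in range(3):
        yield i
        # this line only runs when the *next* item is requested
        print("write happens for", i)

for value in itertools.islice(gen(), 2):
    print("got", value)

# Output:
# got 0
# write happens for 0
# got 1
# -> the "write" for the last consumed item (1) never happens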
@asmitu That will still leave the file open. In addition, if you ever have more than one GetInfoSpider object, they will share the same file object, because the file is actually opened at the time of the class definition, not at instance creation. See my updated answer for two ways to use append mode and still make sure that the file is overwritten on each new run. – Graipher Jun 30, 2018 at 11:36
You should opt for the closed() method, as I've done below. This method is called automatically once your spider is closed; it is a shortcut to signals.connect() for the spider_closed signal.
import csv

import scrapy


class InfraRailSpider(scrapy.Spider):
    name = "infrarail"
    start_urls = ['https://www.infrarail.com/2020/english/exhibitor-list/2018/']

    def __init__(self):
        self.outfile = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.outfile)
        self.writer.writerow(['title'])
        print("***" * 20, "opened")

    def closed(self, reason):
        self.outfile.close()
        print("***" * 20, "closed")

    def parse(self, response):
        for item in response.css('#exhibitor_list > [class^="e"]'):
            name = item.css('p.basic > b::text').get()
            self.writer.writerow([name])
            yield {'name': name}
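For reference, closed() is shorthand for connecting a handler to the spider_closed signal yourself; the longer, explicit form would use the standard from_crawler pattern. A sketch (not part of my code above, with __init__ and parse left out for brevity):

import scrapy
from scrapy import signals

class InfraRailSpider(scrapy.Spider):
    name = "infrarail"
    # start_urls, __init__ and parse stay exactly as above

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # run self.spider_closed once the spider_closed signal fires
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # called once the crawl has finished, the same moment closed() would run
        self.outfile.close()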