I've written a script in Scrapy to grab different names and links from different pages of a website and write those parsed items to a CSV file. When I run my script, I get the results as expected and end up with a CSV file filled with data. I'm using Python 3.5, so when I use Scrapy's built-in command to write data to a CSV file, I get a CSV file with a blank line in every alternate row. Eventually, I tried the approach below to achieve flawless output (with no blank lines in between), and it now produces a CSV file without the blank-line issue. I hope I did it the right way. However, if there is anything I can or should do to make it more robust, I'm happy to comply.
This is my script, which gives me flawless output in a CSV file:
import scrapy, csv
from scrapy.crawler import CrawlerProcess

class GetInfoSpider(scrapy.Spider):
    name = "infrarail"
    start_urls = ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65,70)]

    def __init__(self):
        self.infile = open("output.csv","w",newline="")

    def parse(self, response):
        for q in response.css("article.contentslim"):
            name = q.css("h1::text").extract_first()
            link = q.css("p a::attr(href)").extract_first()
            yield {'Name':name,'Link':link}
            writer = csv.writer(self.infile)
            writer.writerow([name,link])

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(GetInfoSpider)
c.start()
Btw, I used CrawlerProcess() to be able to run my spider from the Sublime Text editor.
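For reference, the built-in export I mentioned can also be driven from the same kind of script instead of the command line, by passing feed settings to CrawlerProcess. This is only a sketch reusing the GetInfoSpider defined above (the FEED_FORMAT/FEED_URI settings are Scrapy's feed-export options, not something from my script), and with it the manual csv.writer lines in parse would be unnecessary:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',      # let Scrapy's feed exporter handle the CSV
    'FEED_URI': 'output.csv',  # file the exporter writes to
})
c.crawl(GetInfoSpider)
c.start()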
Welcome to Code Review! That's quite a well-written question, especially for a new user. Well done. – Mast ♦ Jun 30, 2018 at 8:00
3 Answers
I'd like to mention that there is a dedicated way of producing output files in Scrapy: item pipelines. So, in order to do it right, you should write your own pipeline (or modify a standard one via subclassing).
Also, you do not close the file once you're done, and you keep it open most of the time. Both problems are handled nicely with pipelines.
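To illustrate, a minimal sketch of such a pipeline (the class name, file name and column names here are placeholders, not taken from your post) could look like this:

import csv

class CsvWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts, so the file is opened exactly once
        self.file = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["Name", "Link"])

    def close_spider(self, spider):
        # called once when the spider finishes, so the file is always closed
        self.file.close()

    def process_item(self, item, spider):
        # every dict the spider yields passes through here
        self.writer.writerow([item.get("Name"), item.get("Link")])
        return item

You would then enable it through the ITEM_PIPELINES setting, e.g. {'myproject.pipelines.CsvWriterPipeline': 300} (the module path is hypothetical).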
UPD: Well, you asked for a better way, and there it is. If that is not acceptable for some hard-to-explain reason (which is understandable), here are other ways to make it better:
- Don't leave the file open. There is a method, __del__(), which is called when the spider object is destroyed; add the code that closes the file there.
- Another option is to store only the filename in a variable and open/close the file each time you write to it.
- Another option is to use a NoSQL database, which does not need to be opened/closed, and to produce the output file from it after scraping is done.
- If you only have a few values to scrape, you can store them in an object variable and export them before the __del__() method runs.
All the ways above are NOT welcomed by the wider developer community and may lead to serious problems in the future. Use them carefully. Sometimes it's easier (in the long run) to read up on and understand how it really should be done. Maybe this is exactly such a case?
I think you put those lines in the answer section by mistake, whereas they should be a comment. If it were about item pipelines, I would not have used the keyword customized in the title of my post. Thanks. – SIM Jun 29, 2018 at 21:01
Actually, I cannot comment on others' posts yet. – Dmitry Arkhipenko Jun 29, 2018 at 21:04
@asmitu Is there a reason you went for a customized approach? As in, why is the suggested approach not acceptable for you? – Jun 30, 2018 at 8:04
Check out this link to see why the __del__() method should be avoided. – SIM Jun 30, 2018 at 14:44
You should ensure that the file is closed, and in addition you should avoid creating a new writer object in every loop iteration; both can be achieved using the with statement:
import os
import csv

import scrapy


class GetInfoSpider(scrapy.Spider):
    name = "infrarail"
    start_urls = ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65, 70)]
    output = "output.csv"

    def __init__(self):
        # empty output file
        open(self.output, "w").close()
        # alternative:
        # if os.path.isfile(self.output):
        #     os.remove(self.output)

    def parse(self, response):
        with open(self.output, "a", newline="") as f:
            writer = csv.writer(f)
            for q in response.css("article.contentslim"):
                name = q.css("h1::text").extract_first()
                link = q.css("p a::attr(href)").extract_first()
                writer.writerow([name, link])
                yield {'Name': name, 'Link': link}
Note that I also added some spaces after commas to improve readability, according to Python's official style guide, PEP 8.
It also recommends importing only one module per line (so while from random import random, randint is fine, import scrapy, csv is not).
Also note that each item is only written to the file when the next one is requested, because a generator pauses after the yield. That means that if you, for example, itertools.islice it, your last item won't be written to the file. That is why I swapped those two lines.
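To see that effect in isolation, here is a toy example (unrelated to the spider itself) where the code placed after the yield never runs for the last consumed item:

import itertools

def gen():
    for i in range(3):
        yield i
        # this line only runs when the *next* item is requested
        print("write happens for", i)

for value in itertools.islice(gen(), 2):
    print("got", value)

# Output:
# got 0
# write happens for 0
# got 1
# -> the "write" for the last consumed item (1) never happens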
@asmitu That will still leave the file open. In addition, if you ever have more than one GetInfoSpider object, they will share the same file object, because the file is actually opened at the time of the class definition, not at instance creation. See my updated answer for two ways to use append mode and still make sure that the file is overwritten on each new run. – Graipher Jun 30, 2018 at 11:36
You should opt for the closed() method, as I've done below. This method is called automatically once your spider is closed; it is a shortcut to signals.connect() for the spider_closed signal.
import csv

import scrapy


class InfraRailSpider(scrapy.Spider):
    name = "infrarail"
    start_urls = ['https://www.infrarail.com/2020/english/exhibitor-list/2018/']

    def __init__(self):
        self.outfile = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.outfile)
        self.writer.writerow(['title'])
        print("***" * 20, "opened")

    def closed(self, reason):
        self.outfile.close()
        print("***" * 20, "closed")

    def parse(self, response):
        for item in response.css('#exhibitor_list > [class^="e"]'):
            name = item.css('p.basic > b::text').get()
            self.writer.writerow([name])
            yield {'name': name}
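For reference, closed() is shorthand for connecting a handler to the spider_closed signal yourself; the longer, explicit form would use the standard from_crawler pattern. A sketch (not part of my code above, with __init__ and parse left out for brevity):

import scrapy
from scrapy import signals

class InfraRailSpider(scrapy.Spider):
    name = "infrarail"
    # start_urls, __init__ and parse stay exactly as above

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # run self.spider_closed once the spider_closed signal fires
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # called once the crawl has finished, the same moment closed() would run
        self.outfile.close()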