I have written a Python scraping program using the Scrapy framework. The following code reads a list of URLs that need to be crawled, extracts product highlights from each page, and follows the "next page" button.
The resulting products are automatically written to an output file, as I have set FEED_URI in settings.py (Scrapy's settings file), as below:
FEED_URI = 'product_export__%(time)s.xml'
This is the code of my crawler/spider:
import scrapy

# list of wanted organisations is in config/organisation.txt
def get_wanted_organisations():
    org_file = open('config/organisation.txt', 'r')
    return org_file.read().splitlines()

# list of urls to crawl is in config/url.txt
def get_start_urls():
    url_file = open('config/url.txt', 'r')
    return url_file.read().splitlines()

class ProductSpider(scrapy.Spider):
    name = 'product_spider'

    def __init__(self):
        self.product_highlight_dict = {}
        self.wanted_organisations = get_wanted_organisations()  # ignore products listed by any other organisation

    def start_requests(self):
        start_urls = get_start_urls()
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # get a list of product highlights on the page
        product_highlights = response.css('div > article[data-product-id]')
        for product_highlight in product_highlights:
            organisation = product_highlight.css('a[data-company="ListingCompany"]::text').get()
            if organisation and organisation.lower() in self.wanted_organisations:
                # we are interested in the following 4 fields; they are written to the
                # output file via the FEED_URI setting of the Scrapy framework
                yield {
                    'product_id': product_highlight.css('[data-id="ProductId"] span::text').get(),
                    'title': product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
                    'price': response.css('[data-price="ProductPrice"] span span::text').get(),
                    'organisation': organisation,
                }

        # follow pagination link
        next_page = response.css('a[data-pager="NextPage"]::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
I come from a .NET background and this is my first Python program. I'd appreciate any feedback. I'm not sure if I am following the correct coding conventions for naming variables, functions, etc.
Also, the entire program is in one file. Would it be better to move the first two functions into a different file named something like settings_reader?
Comment: Looks good! And it's great you are using Scrapy. Otherwise beginners tend to fall towards slow Selenium scripts. – Vishesh Mangla, Jul 14, 2020
1 Answer
Splitting lines
Don't call splitlines() here:
org_file = open('config/organisation.txt', 'r')
return org_file.read().splitlines()
The file object itself is an iterator over its lines. Also, use a context manager to ensure file closure:
with open('config/organisation.txt', 'r') as f:
    return {line.rstrip() for line in f}
This is a set comprehension. You want a set because you're only ever checking membership, and a set lookup is O(1) versus O(n) for a list.
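Putting both points together, a sketch of the revised helper. I'm also lowercasing each line, on the assumption that organisation.txt may use arbitrary case: your parse method compares organisation.lower() against this set, so the file entries need the same normalisation.

def get_wanted_organisations():
    # membership tests on a set are O(1); the with-block closes the file
    with open('config/organisation.txt', 'r') as f:
        return {line.rstrip().lower() for line in f}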
Generator simplification
You don't really need to yield here:
for url in start_urls:
    yield scrapy.Request(url=url, callback=self.parse)
Instead,
return (
    scrapy.Request(url=url, callback=self.parse)
    for url in start_urls
)
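With that change the whole method collapses to (a sketch, reusing your existing get_start_urls helper):

def start_requests(self):
    # a generator expression is just as lazy as the yield loop was
    return (
        scrapy.Request(url=url, callback=self.parse)
        for url in get_start_urls()
    )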
Strongly-typed results
Since you're more experienced in .NET, consider how you would traditionally represent this:
yield {
    'product_id': product_highlight.css('[data-id="ProductId"] span::text').get(),
    'title': product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
    'price': response.css('[data-price="ProductPrice"] span span::text').get(),
    'organisation': organisation,
}
Hint: it's not a dictionary. You would do better to make a class (perhaps a @dataclass) with well-defined members, and set the return type hint for this method to -> Iterable[ResultType].
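Here is a minimal sketch of that idea. ProductHighlight is a name I'm inventing for illustration; note that Scrapy 2.2+ accepts dataclass objects as items, so feed exports handle them just like dicts. Pagination (which also yields Requests, not products) is omitted so the type hint stays accurate for the code shown.

from dataclasses import dataclass
from typing import Iterable, Optional

import scrapy


@dataclass
class ProductHighlight:
    # Optional because .get() returns None when a selector matches nothing
    product_id: Optional[str]
    title: Optional[str]
    price: Optional[str]
    organisation: str


class ProductSpider(scrapy.Spider):
    name = 'product_spider'

    # __init__, start_requests, and the pagination logic stay as in your version
    def parse(self, response) -> Iterable[ProductHighlight]:
        for product_highlight in response.css('div > article[data-product-id]'):
            organisation = product_highlight.css('a[data-company="ListingCompany"]::text').get()
            if organisation and organisation.lower() in self.wanted_organisations:
                yield ProductHighlight(
                    product_id=product_highlight.css('[data-id="ProductId"] span::text').get(),
                    title=product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
                    price=response.css('[data-price="ProductPrice"] span span::text').get(),
                    organisation=organisation,
                )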