I have written a Python scraping program using the Scrapy framework. The following code reads a list of URLs that need to be crawled, extracts product highlights from each page, and follows the "next page" button.
The resulting products are automatically written to an output file, as I have set FEED_URI in settings.py (Scrapy's settings file), as below:
FEED_URI = 'product_export__%(time)s.xml'
This is the code of my crawler/spider:
import scrapy

# list of wanted organisations is in config/organisation.txt
def get_wanted_organisations():
    org_file = open('config/organisation.txt', 'r')
    return org_file.read().splitlines()

# list of urls to crawl is in config/url.txt
def get_start_urls():
    url_file = open('config/url.txt', 'r')
    return url_file.read().splitlines()

class ProductSpider(scrapy.Spider):
    name = 'product_spider'

    def __init__(self):
        self.product_highlight_dict = {}
        self.wanted_organisations = get_wanted_organisations()  # ignore products listed by any other organisation

    def start_requests(self):
        start_urls = get_start_urls()
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # get a list of product highlights on the page
        product_highlights = response.css('div > article[data-product-id]')
        for product_highlight in product_highlights:
            organisation = product_highlight.css('a[data-company="ListingCompany"]::text').get()
            if organisation and organisation.lower() in self.wanted_organisations:
                # we are interested in the following 4 fields; they are written to the
                # output file via the FEED_URI setting of the Scrapy framework
                yield {
                    'product_id': product_highlight.css('[data-id="ProductId"] span::text').get(),
                    'title': product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
                    'price': response.css('[data-price="ProductPrice"] span span::text').get(),
                    'organisation': organisation,
                }

        # follow pagination link
        next_page = response.css('a[data-pager="NextPage"]::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
I come from a .NET background and this is my first Python program. I'd appreciate any feedback. I'm not sure if I am following the correct coding conventions for naming variables, functions, etc.
Also, the entire program is in one file. Would it be better to move the first two functions into a different file named something like settings_reader?
Comment: Looks good! And it's great you are using Scrapy. Otherwise beginners tend to fall towards slow Selenium scripts. – Vishesh Mangla, Jul 14, 2020
1 Answer
Splitting lines
Don't call splitlines() here:
org_file = open('config/organisation.txt', 'r')
return org_file.read().splitlines()
The file object itself is an iterator over its lines. Also, use a context manager to ensure file closure:
with open('config/organisation.txt', 'r') as f:
    return {line.rstrip() for line in f}
This is a set comprehension. You want a set because you're only ever checking membership, and a set lookup is O(1) versus O(n) for a list.
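Putting both points together, a sketch of the revised helper. I'm also lowercasing each line, on the assumption that organisation.txt may use arbitrary case: your parse method compares organisation.lower() against this set, so the file entries need the same normalisation.

def get_wanted_organisations():
    # membership tests on a set are O(1); the with-block closes the file
    with open('config/organisation.txt', 'r') as f:
        return {line.rstrip().lower() for line in f}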
Generator simplification
You don't really need to yield here:
for url in start_urls:
    yield scrapy.Request(url=url, callback=self.parse)
Instead,
return (
    scrapy.Request(url=url, callback=self.parse)
    for url in start_urls
)
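With that change the whole method collapses to (a sketch, reusing your existing get_start_urls helper):

def start_requests(self):
    # a generator expression is just as lazy as the yield loop was
    return (
        scrapy.Request(url=url, callback=self.parse)
        for url in get_start_urls()
    )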
Strongly-typed results
Since you're more experienced in .NET, consider how you would traditionally represent this:
yield {
    'product_id': product_highlight.css('[data-id="ProductId"] span::text').get(),
    'title': product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
    'price': response.css('[data-price="ProductPrice"] span span::text').get(),
    'organisation': organisation,
}
Hint: it's not a dictionary. You would do better to make a class (perhaps a @dataclass) with well-defined members, and set the return type hint for this method to -> Iterable[ResultType].
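Here is a minimal sketch of that idea. ProductHighlight is a name I'm inventing for illustration; note that Scrapy 2.2+ accepts dataclass objects as items, so feed exports handle them just like dicts. Pagination (which also yields Requests, not products) is omitted so the type hint stays accurate for the code shown.

from dataclasses import dataclass
from typing import Iterable, Optional

import scrapy


@dataclass
class ProductHighlight:
    # Optional because .get() returns None when a selector matches nothing
    product_id: Optional[str]
    title: Optional[str]
    price: Optional[str]
    organisation: str


class ProductSpider(scrapy.Spider):
    name = 'product_spider'

    # __init__, start_requests, and the pagination logic stay as in your version
    def parse(self, response) -> Iterable[ProductHighlight]:
        for product_highlight in response.css('div > article[data-product-id]'):
            organisation = product_highlight.css('a[data-company="ListingCompany"]::text').get()
            if organisation and organisation.lower() in self.wanted_organisations:
                yield ProductHighlight(
                    product_id=product_highlight.css('[data-id="ProductId"] span::text').get(),
                    title=product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
                    price=response.css('[data-price="ProductPrice"] span span::text').get(),
                    organisation=organisation,
                )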