\$\begingroup\$

I have written a Python scraping program using the Scrapy framework. The following code reads a list of URLs that need to be crawled and extracts the product highlights from each page. It also follows the next-page button.

The resulting products are automatically written to an output file, because I have set FEED_URI in settings.py (Scrapy's settings file), as below:

FEED_URI = 'product_export__%(time)s.xml'

This is the code of my crawler/spider:

import scrapy


# list of wanted organisations is written in: organisation.txt
def get_wanted_organisations():
    org_file = open('config/organisation.txt', 'r')
    return org_file.read().splitlines()


# list of urls to crawl is written in url.txt
def get_start_urls():
    org_file = open('config/url.txt', 'r')
    return org_file.read().splitlines()


class ProductSpider(scrapy.Spider):
    name = 'product_spider'

    def __init__(self):
        self.product_highlight_dict = {}
        self.wanted_organisations = get_wanted_organisations()  # ignore products listed by any other organisation

    def start_requests(self):
        org_file = open('config/url.txt', 'r')
        start_urls = get_start_urls()
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # get a list of product highlights in the page
        product_highlights = response.css('div > article[data-product-id]')
        for product_highlight in product_highlights:
            organisation = product_highlight.css('a[data-company="ListingCompany"]::text').get()
            if organisation and organisation.lower() in self.wanted_organisations:

                # we are interested in the following 4 fields; they will be written to the output file via Scrapy's FEED_URI setting
                yield {
                    'product_id': product_highlight.css('[data-id="ProductId"] span::text').get(),
                    'title': product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
                    'price': response.css('[data-price="ProductPrice"] span span::text').get(),
                    'organisation': organisation,
                }

        # follow pagination link
        next_page = response.css('a[data-pager="NextPage"]::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

I come from a .NET background and this is my first Python program. I would appreciate any feedback. I am not sure whether I am following the correct coding conventions for naming variables, functions, etc.

Also, the entire program is in one file. Would it be better to move the first two functions into a different file and name the file something like settings_reader?

Reinderien
70.9k5 gold badges76 silver badges256 bronze badges
asked Jul 14, 2020 at 10:31
\$\endgroup\$
  • \$\begingroup\$ Looks good! And it's great that you are using Scrapy. Otherwise, beginners tend to fall back on slow Selenium scripts. \$\endgroup\$ Commented Jul 14, 2020 at 19:54

1 Answer

\$\begingroup\$

Splitting lines

Don't call splitlines() here:

org_file = open('config/organisation.txt', 'r')
return org_file.read().splitlines()

The file object itself is an iterator over its lines. Also, use a context manager to ensure file closure:

with open('config/organisation.txt', 'r') as f:
    return {line.rstrip() for line in f}

This is a set comprehension. You want a set because you're only ever checking for membership: a set lookup takes constant time on average, whereas a list lookup scans the elements one by one.
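As a minimal sketch of the idea, the helper below reads a file into a set and then tests membership. The function name `read_lines_as_set`, the temporary file and the sample organisation names are assumptions made for illustration and are not part of the original program; skipping blank lines is an extra precaution, not something the original code does.

```python
import os
import tempfile


def read_lines_as_set(path):
    """Return the non-empty lines of a text file as a set of stripped strings."""
    # The context manager closes the file even if an exception is raised.
    with open(path, 'r') as f:
        # Iterating over the file object yields one line at a time.
        return {line.strip() for line in f if line.strip()}


# Hypothetical sample data standing in for config/organisation.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('acme\nglobex\n\nacme\n')

wanted = read_lines_as_set(tmp.name)
os.remove(tmp.name)

# Membership tests on a set are hash lookups; duplicates collapse automatically.
print('acme' in wanted)
print('initech' in wanted)
```

Because the spider compares against `organisation.lower()`, the entries in the file would need to be lower case as well for this check to match.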

Generator simplification

You don't really need to yield here:

for url in start_urls:
    yield scrapy.Request(url=url, callback=self.parse)

Instead,

return (
    scrapy.Request(url=url, callback=self.parse)
    for url in start_urls
)
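To see that the two forms produce the same sequence, here is a small sketch that runs without Scrapy installed. `FakeRequest`, `parse` and the example URLs are hypothetical stand-ins introduced only for this comparison; in the spider you would keep using `scrapy.Request`.

```python
from collections import namedtuple

# Hypothetical stand-in for scrapy.Request so the sketch runs without Scrapy.
FakeRequest = namedtuple('FakeRequest', ['url', 'callback'])


def parse(response):
    """Placeholder callback; a real spider would extract data here."""
    return None


def requests_with_yield(urls):
    # Generator function: one request is created per iteration.
    for url in urls:
        yield FakeRequest(url=url, callback=parse)


def requests_with_return(urls):
    # Plain function returning a generator expression: requests are
    # still created one at a time, as the caller iterates.
    return (FakeRequest(url=url, callback=parse) for url in urls)


urls = ['https://example.com/a', 'https://example.com/b']
from_yield = list(requests_with_yield(urls))
from_return = list(requests_with_return(urls))
print(from_yield == from_return)
```

Which form to use is largely a matter of taste; the generator expression simply avoids an explicit loop when the body is a single expression.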

Strongly-typed results

Since you're more experienced in .NET, consider how you would traditionally represent this:

yield {
    'product_id': product_highlight.css('[data-id="ProductId"] span::text').get(),
    'title': product_highlight.css('[data-title="ProductTitle"] h1::text').get(),
    'price': response.css('[data-price="ProductPrice"] span span::text').get(),
    'organisation': organisation,
}

Hint: it's not a dictionary. You would do better to make a class (perhaps a @dataclass) with well-defined members, and set your return type hint for this method to -> Iterable[ResultType].
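A rough sketch of what that could look like follows. The class name `ProductHighlight`, the helper `build_highlights` and the sample rows are assumptions for illustration; the rows stand in for values that the spider would obtain from `response.css(...).get()`, which may return `None`, hence the `Optional` fields.

```python
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass
class ProductHighlight:
    """Hypothetical result type; field names mirror the dictionary keys above."""
    product_id: Optional[str]
    title: Optional[str]
    price: Optional[str]
    organisation: str


def build_highlights(rows: Iterable[dict], wanted: set) -> Iterable[ProductHighlight]:
    """Yield a typed result for each row listed by a wanted organisation."""
    for row in rows:
        organisation = row.get('organisation')
        if organisation and organisation.lower() in wanted:
            yield ProductHighlight(
                product_id=row.get('product_id'),
                title=row.get('title'),
                price=row.get('price'),
                organisation=organisation,
            )


# Made-up rows standing in for values extracted from a page.
rows = [
    {'product_id': '1', 'title': 'Widget', 'price': '9.99', 'organisation': 'Acme'},
    {'product_id': '2', 'title': 'Gadget', 'price': '4.50', 'organisation': 'Other'},
]
results = list(build_highlights(rows, {'acme'}))
print(results)
print(asdict(results[0]))
```

Scrapy 2.2 and later accept dataclass instances as items, so a spider's `parse` method can yield such objects directly; on older versions a `scrapy.Item` subclass would be the closer equivalent.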

Hooman Bahreini
6331 gold badge7 silver badges22 bronze badges
answered Jul 15, 2020 at 2:47
\$\endgroup\$