I am scraping the names of the directors from a website using Python / ScraPy. I am very new to coding (under a year and after work) - any views would be appreciated.
The reason I have a for
loop with count from 0 to 100 is that not all the names on the website have a date of birth, and hence where there are blanks I need to return a value ("n/a" in this case) otherwise the lists of names / namerefs / roles / dateofbirths will get out of order.
import scrapy
import re
from CompaniesHouse.items import CompanieshouseItem
class CompaniesHouseSpider(scrapy.Spider):
name = "companieshouse"
allowed_domains = ["companieshouse.gov.uk"]
start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
]
def parse(self, response):
for count in range(0,100):
for sel in response.xpath('//*[@id="content-container"]'):
companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
string1 = "officer-name-" + str(count)
names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
names = [name.strip() for name in names]
namerefs = sel.xpath('//*[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).*?(?=/appointments)')
namerefs = [nameref.strip() for nameref in namerefs]
string2 = "officer-role-" + str(count)
roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
roles = [role.strip() for role in roles]
string3 = "officer-date-of-birth-" + str(count)
if sel.xpath('//*[@id="%s"]/text()' %string3):
dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
else:
dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
result = zip(companys, names, namerefs, roles, dateofbirths)
for company, name, nameref, role, dateofbirth in result:
item = CompanieshouseItem()
item['company'] = company
item['name'] = name
item['nameref'] = "'" + nameref
item['role'] = role
item['dateofbirth'] = dateofbirth
yield item
next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
if next_page:
next_href = next_page[0]
next_page_url = "https://beta.companieshouse.gov.uk" + next_href
request = scrapy.Request(url=next_page_url)
yield request
1 Answer 1
DRY
Reduce duplicated logic using helper functions. Currently you have 2 lines of code for each field you extract, for example:
companys = sel.xpath('//*[@id="company-name"]/text()').extract() companys = [company.strip() for company in companys] names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract() names = [name.strip() for name in names]
This is tedious. You could capture the common logic in a helper function, for example:
def to_list(xpath):
return [v.strip() for v in xpath.extract()]
With that, much of the code can be simplified:
companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())
Repeated operations
Here, an xpath lookup is performed twice:
if sel.xpath('//*[@id="%s"]/text()' % string3): dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract() else: dateofbirths = ["n/a"] dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
It would be better to avoid that:
dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
dateofbirths = ["n/a"]
Use "...".format(...)
The "%s" % ...
style formatting is old, it's recommended to use the format
function instead, for example:
names = sel.xpath('//*[@id="{}"]/a/text()'.format(string1)).extract()
Formatting
Python has a style guide called PEP8, I suggest to follow it.
-
1\$\begingroup\$ Regarding the
yield
statements: the code posted by the OP is a web spider built using the Scrapy framework. Callback methods in scrapy can generate both requests and data items. Scrapy engine is the one who will consume those items. If the object is a request, the engine sends it to scrapy's scheduler. If it's a data item, the engine sends it to the item pipelines. Reference: doc.scrapy.org/en/latest/topics/architecture.html \$\endgroup\$Valdir Stumm Junior– Valdir Stumm Junior2016年12月20日 11:25:27 +00:00Commented Dec 20, 2016 at 11:25 -
\$\begingroup\$ Thanks @ValdirStummJunior, I suspected that there was a good explanation for this, thanks for the enlightenment :-) \$\endgroup\$janos– janos2016年12月20日 12:01:07 +00:00Commented Dec 20, 2016 at 12:01
-
\$\begingroup\$ Thanks @Janos - I learned a lot from all that and have implemented it all. \$\endgroup\$nevster– nevster2016年12月20日 20:31:43 +00:00Commented Dec 20, 2016 at 20:31