
I am scraping the names of the directors from a website using Python / Scrapy. I am fairly new to coding (under a year, in my spare time after work), so any views would be appreciated.

The reason I have a for loop with count from 0 to 100 is that not all the entries on the website have a date of birth. Where a date of birth is missing I need to return a placeholder value ("n/a" in this case), otherwise the lists of names / namerefs / roles / dateofbirths get out of sync.

import scrapy
import re

from CompaniesHouse.items import CompanieshouseItem


class CompaniesHouseSpider(scrapy.Spider):
    name = "companieshouse"
    allowed_domains = ["companieshouse.gov.uk"]
    start_urls = [
        "https://beta.companieshouse.gov.uk/company/OC361003/officers",
    ]

    def parse(self, response):
        for count in range(0, 100):
            for sel in response.xpath('//*[@id="content-container"]'):
                companys = sel.xpath('//*[@id="company-name"]/text()').extract()
                companys = [company.strip() for company in companys]

                string1 = "officer-name-" + str(count)
                names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
                names = [name.strip() for name in names]
                namerefs = sel.xpath('//*[@id="%s"]/a/@href' % string1).re(r'(?<=/officers/).*?(?=/appointments)')
                namerefs = [nameref.strip() for nameref in namerefs]

                string2 = "officer-role-" + str(count)
                roles = sel.xpath('//*[@id="%s"]/text()' % string2).extract()
                roles = [role.strip() for role in roles]

                string3 = "officer-date-of-birth-" + str(count)
                if sel.xpath('//*[@id="%s"]/text()' % string3):
                    dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
                else:
                    dateofbirths = ["n/a"]
                dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]

                result = zip(companys, names, namerefs, roles, dateofbirths)
                for company, name, nameref, role, dateofbirth in result:
                    item = CompanieshouseItem()
                    item['company'] = company
                    item['name'] = name
                    item['nameref'] = "'" + nameref
                    item['role'] = role
                    item['dateofbirth'] = dateofbirth
                    yield item

        next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = "https://beta.companieshouse.gov.uk" + next_href
            request = scrapy.Request(url=next_page_url)
            yield request
asked Dec 19, 2016 at 21:28

1 Answer


DRY

Reduce duplicated logic using helper functions. Currently you have 2 lines of code for each field you extract, for example:

companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]

This is tedious. You could capture the common logic in a helper function, for example:

def to_list(xpath):
 return [v.strip() for v in xpath.extract()]

With that, much of the code can be simplified:

companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1))

Repeated operations

Here, an xpath lookup is performed twice:

if sel.xpath('//*[@id="%s"]/text()' % string3):
 dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
 dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]

It would be better to avoid that:

dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
 dateofbirths = ["n/a"]

Use "...".format(...)

The "%s" % ... style formatting is old, it's recommended to use the format function instead, for example:

names = sel.xpath('//*[@id="{}"]/a/text()'.format(string1)).extract()

Formatting

Python has a style guide called PEP 8; I suggest following it.
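
For example, PEP 8 calls for 4-space indentation, spaces around operators, and descriptive lowercase_with_underscores names. A small, purely illustrative rewrite of the ID-building lines (the new variable names are only suggestions):

name_id = "officer-name-{}".format(count)
role_id = "officer-role-{}".format(count)
dob_id = "officer-date-of-birth-{}".format(count)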

answered Dec 19, 2016 at 23:32
  • Regarding the yield statements: the code posted by the OP is a web spider built with the Scrapy framework. Callback methods in Scrapy can generate both requests and data items, and the Scrapy engine consumes them: if the yielded object is a request, the engine sends it to the scheduler; if it's a data item, the engine sends it to the item pipelines (a minimal sketch follows below). Reference: doc.scrapy.org/en/latest/topics/architecture.html Commented Dec 20, 2016 at 11:25
  • Thanks @ValdirStummJunior, I suspected that there was a good explanation for this; thanks for the enlightenment :-) Commented Dec 20, 2016 at 12:01
  • Thanks @janos - I learned a lot from all that and have implemented it all. Commented Dec 20, 2016 at 20:31
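
To illustrate the point in the first comment, here is a minimal, purely illustrative sketch of a Scrapy callback that yields both kinds of objects (the spider name, URL and CSS selectors are placeholders, not part of the original code):

import scrapy

class DemoSpider(scrapy.Spider):
    # Placeholder spider: the name, URL and selectors are illustrative only.
    name = "demo"
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        # Yielding a data item (a dict or an Item): the engine sends it
        # to the item pipelines.
        for title in response.css("h2::text").extract():
            yield {"title": title}

        # Yielding a Request: the engine sends it to the scheduler, and this
        # callback is invoked again with the response for the next page.
        next_href = response.css("a.next::attr(href)").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)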
