Scraping names of directors from a website

Question 1

I am scraping the names of the directors from a website using Python / ScraPy. I am very new to coding (under a year and after work) - any views would be appreciated.

The reason I have a for loop with count from 0 to 100 is that not all the names on the website have a date of birth, and hence where there are blanks I need to return a value ("n/a" in this case) otherwise the lists of names / namerefs / roles / dateofbirths will get out of order.

import scrapy
import re
from CompaniesHouse.items import CompanieshouseItem
class CompaniesHouseSpider(scrapy.Spider):
 name = "companieshouse"
 allowed_domains = ["companieshouse.gov.uk"]
 start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
]
 def parse(self, response):
 for count in range(0,100):
 for sel in response.xpath('//*[@id="content-container"]'):
 companys = sel.xpath('//*[@id="company-name"]/text()').extract()
 companys = [company.strip() for company in companys]
 string1 = "officer-name-" + str(count)
 names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
 names = [name.strip() for name in names]
 namerefs = sel.xpath('//*[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).*?(?=/appointments)')
 namerefs = [nameref.strip() for nameref in namerefs]
 string2 = "officer-role-" + str(count)
 roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
 roles = [role.strip() for role in roles]
 string3 = "officer-date-of-birth-" + str(count)
 if sel.xpath('//*[@id="%s"]/text()' %string3):
 dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
 else:
 dateofbirths = ["n/a"]
 dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
 result = zip(companys, names, namerefs, roles, dateofbirths)
 for company, name, nameref, role, dateofbirth in result:
 item = CompanieshouseItem()
 item['company'] = company
 item['name'] = name
 item['nameref'] = "'" + nameref
 item['role'] = role
 item['dateofbirth'] = dateofbirth 
 yield item
 next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
 if next_page:
 next_href = next_page[0]
 next_page_url = "https://beta.companieshouse.gov.uk" + next_href
 request = scrapy.Request(url=next_page_url)
 yield request

Question 2

DRY

Reduce duplicated logic using helper functions. Currently you have 2 lines of code for each field you extract, for example:

companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]

This is tedious. You could capture the common logic in a helper function, for example:

def to_list(xpath):
 return [v.strip() for v in xpath.extract()]

With that, much of the code can be simplified:

companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())

Repeated operations

Here, an xpath lookup is performed twice:

if sel.xpath('//*[@id="%s"]/text()' % string3):
 dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
 dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]

It would be better to avoid that:

dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
 dateofbirths = ["n/a"]

Use `"...".format(...)`

The "%s" % ... style formatting is old, it's recommended to use the format function instead, for example:

names = sel.xpath('//*[@id="{}"]/a/text()'.format(string1)).extract()

Formatting

Python has a style guide called PEP8, I suggest to follow it.

Question 3

Regarding the yieldstatements: the code posted by the OP is a web spider built using the Scrapy framework. Callback methods in scrapy can generate both requests and data items. Scrapy engine is the one who will consume those items. If the object is a request, the engine sends it to scrapy's scheduler. If it's a data item, the engine sends it to the item pipelines. Reference: doc.scrapy.org/en/latest/topics/architecture.html

Question 4

Thanks @ValdirStummJunior, I suspected that there was a good explanation for this, thanks for the enlightenment :-)

Question 5

Thanks @Janos - I learned a lot from all that and have implemented it all.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2016-12-19 23:32:19Z

DRY

Reduce duplicated logic using helper functions. Currently you have 2 lines of code for each field you extract, for example:

companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]

This is tedious. You could capture the common logic in a helper function, for example:

def to_list(xpath):
 return [v.strip() for v in xpath.extract()]

With that, much of the code can be simplified:

companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())

Repeated operations

Here, an xpath lookup is performed twice:

if sel.xpath('//*[@id="%s"]/text()' % string3):
 dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
 dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]

It would be better to avoid that:

dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
 dateofbirths = ["n/a"]

Use `"...".format(...)`

The "%s" % ... style formatting is old, it's recommended to use the format function instead, for example:

names = sel.xpath('//*[@id="{}"]/a/text()'.format(string1)).extract()

Formatting

Python has a style guide called PEP8, I suggest to follow it.

Regarding the yieldstatements: the code posted by the OP is a web spider built using the Scrapy framework. Callback methods in scrapy can generate both requests and data items. Scrapy engine is the one who will consume those items. If the object is a request, the engine sends it to scrapy's scheduler. If it's a data item, the engine sends it to the item pipelines. Reference: doc.scrapy.org/en/latest/topics/architecture.html
Thanks @ValdirStummJunior, I suspected that there was a good explanation for this, thanks for the enlightenment :-)
Thanks @Janos - I learned a lot from all that and have implemented it all.

Stack Exchange Network

Scraping names of directors from a website

1 Answer 1

DRY

Repeated operations

Use `"...".format(...)`

Formatting

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Scraping names of directors from a website

1 Answer 1

DRY

Repeated operations

Use "...".format(...)

Formatting

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use `"...".format(...)`