I've written a scraper in Python Scrapy, combined with Selenium, to scrape 1000 company names and their revenue from a website. The site uses lazy loading, so it is not possible to get all the items to load unless the scraper scrolls all the way to the bottom of the page. My scraper can reach the lowest portion of the webpage and parse those fields flawlessly. I've used an explicit wait instead of any hardcoded delay so that it doesn't take longer than necessary.
As this is my first time working with Selenium together with Scrapy, there may be room to improve this script and make it more robust.
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class ProductSpider(scrapy.Spider):
    name = "productsp"
    start_urls = ['http://fortune.com/fortune500/list/']

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        self.driver.get(response.url)

        # Keep scrolling to the bottom until the page height stops growing,
        # i.e. until all lazy-loaded rows are present.
        check_height = self.driver.execute_script("return document.body.scrollHeight;")
        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            try:
                self.wait.until(lambda driver: self.driver.execute_script("return document.body.scrollHeight;") > check_height)
                check_height = self.driver.execute_script("return document.body.scrollHeight;")
            except:
                break

        for item in self.driver.find_elements_by_css_selector(".row"):
            name = item.find_element_by_css_selector(".company-title").text
            revenue = item.find_element_by_css_selector(".company-revenue").text
            yield {"Title": name, "Revenue": revenue}
1 Answer
The spider is readable and understandable. I would only extract some of the things into separate methods for readability. For example, the "infinite scroll" should probably be defined in a separate method.
Also, the bare except can be replaced with handling the more specific TimeoutException:
from selenium.common.exceptions import TimeoutException


def scroll_until_loaded(self):
    # Scroll to the bottom repeatedly until the page height stops growing.
    check_height = self.driver.execute_script("return document.body.scrollHeight;")
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            self.wait.until(lambda driver: self.driver.execute_script("return document.body.scrollHeight;") > check_height)
            check_height = self.driver.execute_script("return document.body.scrollHeight;")
        except TimeoutException:
            # The height did not grow within the wait period - we have reached the end.
            break

def parse(self, response):
    self.driver.get(response.url)
    self.scroll_until_loaded()

    for item in self.driver.find_elements_by_css_selector(".row"):
        name = item.find_element_by_css_selector(".company-title").text
        revenue = item.find_element_by_css_selector(".company-revenue").text
        yield {"Title": name, "Revenue": revenue}
- SIM (Sep 25, 2017): Thanks sir for the input. I'll let you know if I encounter any issue executing the script. Your solution always rocks, though!
- alecxe (Sep 25, 2017): @Shahin thanks for the tests. This error looks unrelated to our changes at first glance. Do you remember if you had this problem before?
- alecxe (Sep 25, 2017): @Shahin try adjusting your CSS selector locator to be .company-list li.row rather than .row. Let me know if it helped, thanks.
- SIM (Sep 26, 2017): This is exactly it, sir. Sometimes slim is not smart. Btw, with .row in the script along with try/except it also works. Thanks, sir.
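Putting the suggestion from the comments into the spider, the parsing loop with the narrower locator would look roughly like this (the .company-list li.row selector is taken from the comment above):

for item in self.driver.find_elements_by_css_selector(".company-list li.row"):
    name = item.find_element_by_css_selector(".company-title").text
    revenue = item.find_element_by_css_selector(".company-revenue").text
    yield {"Title": name, "Revenue": revenue}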