I've written a scraper in Python Scrapy, combined with Selenium, to scrape 1000 company names and their revenue from a website. The site uses lazy loading, so it is not possible to get all the items to load unless the scraper scrolls all the way to the bottom of the page. My scraper can reach the lowest portion of the webpage and parse those fields flawlessly. I've used an explicit wait instead of any hardcoded delay so that it doesn't take longer than necessary.
As this is my first time working with Selenium together with Scrapy, there may be room to improve this script and make it more robust.
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class ProductSpider(scrapy.Spider):
    name = "productsp"
    start_urls = ['http://fortune.com/fortune500/list/']

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        self.driver.get(response.url)

        # Keep scrolling to the bottom until the page height stops growing,
        # i.e. until all lazy-loaded rows are present.
        check_height = self.driver.execute_script("return document.body.scrollHeight;")
        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            try:
                self.wait.until(lambda driver: self.driver.execute_script("return document.body.scrollHeight;") > check_height)
                check_height = self.driver.execute_script("return document.body.scrollHeight;")
            except:
                break

        for item in self.driver.find_elements_by_css_selector(".row"):
            name = item.find_element_by_css_selector(".company-title").text
            revenue = item.find_element_by_css_selector(".company-revenue").text
            yield {"Title": name, "Revenue": revenue}
1 Answer
The spider is readable and understandable. I would only extract some of the things into separate methods for readability. For example, the "infinite scroll" should probably be defined in a separate method.
Also, the bare except can be replaced with handling the more specific TimeoutException:
from selenium.common.exceptions import TimeoutException


def scroll_until_loaded(self):
    # Scroll to the bottom repeatedly until the page height stops growing.
    check_height = self.driver.execute_script("return document.body.scrollHeight;")
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            self.wait.until(lambda driver: self.driver.execute_script("return document.body.scrollHeight;") > check_height)
            check_height = self.driver.execute_script("return document.body.scrollHeight;")
        except TimeoutException:
            # The height did not grow within the wait period - we have reached the end.
            break

def parse(self, response):
    self.driver.get(response.url)
    self.scroll_until_loaded()

    for item in self.driver.find_elements_by_css_selector(".row"):
        name = item.find_element_by_css_selector(".company-title").text
        revenue = item.find_element_by_css_selector(".company-revenue").text
        yield {"Title": name, "Revenue": revenue}
- SIM (Sep 25, 2017): Thanks sir for the input. I'll let you know if I encounter any issue executing the script. Your solution always rocks, though!
- alecxe (Sep 25, 2017): @Shahin thanks for the tests. This error looks unrelated to our changes at first glance. Do you remember if you had this problem before?
- alecxe (Sep 25, 2017): @Shahin try adjusting your CSS selector locator to be .company-list li.row rather than .row. Let me know if it helped, thanks.
- SIM (Sep 26, 2017): This is exactly it, sir. Sometimes slim is not smart. Btw, with .row in the script along with try/except it also works. Thanks, sir.
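Putting the suggestion from the comments into the spider, the parsing loop with the narrower locator would look roughly like this (the .company-list li.row selector is taken from the comment above):

for item in self.driver.find_elements_by_css_selector(".company-list li.row"):
    name = item.find_element_by_css_selector(".company-title").text
    revenue = item.find_element_by_css_selector(".company-revenue").text
    yield {"Title": name, "Revenue": revenue}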