The other week I faced the struggle of scraping dynamically generated content, so I used the selenium library in combination with requests & bs4. The thing is, I am unsure about the quality of the implementation, since I just learned how to use those tools. I would like general feedback on the way I used the libraries, the quality of my code and the logic behind it.
Link to GitHub README.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import selenium.common.exceptions
import requests
from bs4 import BeautifulSoup
from time import sleep
def scraper():
    opts = Options()
    opts.add_argument('--headless')
    driver = webdriver.Chrome(r'C:\Users\leagu\chromedriver.exe', options=opts)
    pos = input('Enter your desired position: ')
    URL = 'https://remote.co/remote-jobs/search/?search_keywords='+pos.replace(' ', '+')
    driver.get(URL)

    # Scroll to the bottom of the page
    while True:
        try:
            WebDriverWait(driver, 5).until(
                ec.text_to_be_present_in_element(
                    (By.CLASS_NAME, 'load_more_jobs'),
                    'Load more listings')
            )
            loadMore = driver.find_element_by_class_name('load_more_jobs')
            loadMore.click()
        except:
            try:  # can't locate element - click the close on the popup add
                WebDriverWait(driver, 5).until(
                    ec.presence_of_element_located((By.CLASS_NAME, 'portstlucie-CloseButton'))
                )
                addClose = driver.find_element_by_xpath('//*[@id="om-oqulaezshgjig4mgnmcn-optin"]/div/button')
                addClose.click()
            except:  # Timeout / can't locate add - break
                break

    # Find all the job listings
    listings = driver.find_elements_by_class_name('job_listing')
    if len(listings) == 0:
        print(f'There are 0 jobs found by {pos} criteria. Please use different wording.')
        sleep(5)
        scraper()
    else:
        scrapeN = input(f"There are {len(listings)} number of jobs for the {pos} position. If u wish to view a portion of them enter the number of the jobs to view else write 'max': ")
        if scrapeN.lower() == 'max':
            scrapeN = len(listings)
        print('\n')

        for i in range(int(scrapeN)):  # Iterate trough all the job listings
            URL = listings[i].find_element_by_tag_name('a').get_attribute('href')
            html = requests.get(URL)
            soup = BeautifulSoup(html.content, 'html.parser')

            jobT = soup.find('h1', class_='font-weight-bold').text
            jobPD = soup.find('time').text
            link = soup.find('a', class_='application_button')['href']

            print(f'Job - {jobT}. This job was {jobPD}.\nMore information about the job at {URL}. \nLink for application - {link}', end='\n\n')

if __name__ == '__main__':
    scraper()
1 Answer
What I like:
- usage of f-strings, but you are not using them everywhere, eg (an f-string version is sketched right after this list):
URL = 'https://remote.co/remote-jobs/search/?search_keywords='+pos.replace(' ', '+')
- usage of
if __name__ == '__main__':
- code is fairly easy to grasp
- effort to provide a decent working script with a minimum of code (70 lines)
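For instance, the search URL from the first bullet could be written as an f-string; same behaviour, just a sketch:

# Same URL, built with an f-string instead of concatenation
URL = f"https://remote.co/remote-jobs/search/?search_keywords={pos.replace(' ', '+')}"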
What I like less:
- usage of exception handling: I would be more specific and not catch any type of exception, only handle those that are relevant to the operation being performed in the try/except block (selenium.common.exceptions); a sketch of this follows the list
- lack of global exception handling in the rest of the code
- also think about logging full exception details to a file so you don't have to guess what has gone wrong
- I would also avoid nesting of try/except blocks, try to separate each operation from the others (see below)
- mixing lower case and upper case in variable names: remember that variable names are case-sensitive in Python. jobPD and jobpd are two different variables that can get assigned different values, so this could be a nasty source of bugs
- variable names should be more descriptive and meaningful: jobPD does not give a clear hint about what it represents. More descriptive names would be job_title, job_posted_time etc
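As a sketch of what more targeted handling could look like for the "load more" step - this is only an illustration, assuming the imports already present in your script (WebDriverWait, ec, By) and an arbitrary log file name:

import logging
from selenium.common.exceptions import NoSuchElementException, TimeoutException

logging.basicConfig(filename='scraper.log', level=logging.INFO)

def click_load_more(driver):
    # Returns True if the "load more listings" button was clicked, False otherwise
    try:
        WebDriverWait(driver, 5).until(
            ec.text_to_be_present_in_element(
                (By.CLASS_NAME, 'load_more_jobs'), 'Load more listings')
        )
        driver.find_element_by_class_name('load_more_jobs').click()
        return True
    except (TimeoutException, NoSuchElementException):
        logging.exception('Could not click the "load more listings" button')
        return False

The popup-closing step can then live in a separate helper with its own narrow except clause, so the two operations are no longer nested.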
Regarding the scraping process, make sure that the DOM elements you are expecting are really present: websites change their layout more or less often and you must spot changes that could break your application. You can either check with Selenium, or with BS4 if you have already retrieved the HTML. But it seems logical to use Selenium. If you use BS, note the behavior of the different functions:
From the BS4 doc (emphasis is mine):
If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None
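In practice that means checking for None before dereferencing, otherwise a silent layout change surfaces as an AttributeError. A sketch against one of the fields your script extracts (the selector is taken from your code, not re-verified against the site):

title_tag = soup.find('h1', class_='font-weight-bold')
if title_tag is None:
    print(f'No job title found at {URL} - the page layout may have changed.')
else:
    job_title = title_tag.text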
You have this block (the nested try/except block):
try:  # can't locate element - click the close on the popup add
    WebDriverWait(driver, 5).until(
        ec.presence_of_element_located((By.CLASS_NAME, 'portstlucie-CloseButton'))
    )
    addClose = driver.find_element_by_xpath('//*[@id="om-oqulaezshgjig4mgnmcn-optin"]/div/button')
    addClose.click()
except:  # Timeout / can't locate add - break
    break
It is better to anticipate and avoid exceptions rather than handle them. So it seems to me that you should verify the existence of the element if possible, rather than trigger an exception.
Instead, you could use the more generic findElements function. Note that findElement is different than findElements. The difference:
findElements will return an empty list if no matching elements are found instead of an exception (NoSuchElementException)
Reference: Find Element and FindElements in Selenium WebDriver
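In the Python bindings the same idea is find_elements (or find_elements_by_class_name in the older API your script uses), so the popup check can become a plain truthiness test - a sketch, assuming the class name from your script is still the right locator:

popup_close = driver.find_elements_by_class_name('portstlucie-CloseButton')
if popup_close:
    # The popup is present - close it; no exception is raised when it is absent
    popup_close[0].click()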
However, if you stick to the current approach, you should not blindly catch all exceptions: in this context the one relevant exception that may occur is NoSuchElementException.
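If you do keep the try/except, catching only that exception keeps genuine errors visible - a short sketch using the XPath from your script:

from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath('//*[@id="om-oqulaezshgjig4mgnmcn-optin"]/div/button').click()
except NoSuchElementException:
    # The popup simply is not there - nothing to close
    pass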
There is one thing that is not okay: you are using the requests module in parallel to Selenium. That is unnecessary, since you already have an instance of Selenium that you can use. To fetch the whole HTML just use:
html = driver.page_source
then feed it to BS.
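A sketch of what the loop body then looks like (collecting the listing hrefs into a plain list before navigating avoids stale element references):

driver.get(URL)  # navigate with the existing Selenium session instead of requests.get
soup = BeautifulSoup(driver.page_source, 'html.parser')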
Final thought: have you considered logging the results to a file, a table or a CSV perhaps? The console buffer may be too small if you retrieve lots of results.
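A minimal sketch with the standard library, assuming each job is collected as a (title, posted, url, application_link) tuple before writing:

import csv

def write_jobs_csv(jobs, path='jobs.csv'):
    # jobs: iterable of (title, posted, url, application_link) tuples
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'posted', 'url', 'application_link'])
        writer.writerows(jobs)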