The other week I faced the struggle of scraping dynamically generated content, so I used the selenium library in combination with requests & bs4. The thing is, I am unsure about the quality of the implementation, since I just learned how to use those tools. I would like general feedback on the way I used the libraries, the quality of my code and the logic behind it.
Link to GitHub README.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import selenium.common.exceptions
import requests
from bs4 import BeautifulSoup
from time import sleep
def scraper():
    opts = Options()
    opts.add_argument('--headless')
    driver = webdriver.Chrome(r'C:\Users\leagu\chromedriver.exe', options=opts)
    pos = input('Enter your desired position: ')
    URL = 'https://remote.co/remote-jobs/search/?search_keywords='+pos.replace(' ', '+')
    driver.get(URL)

    # Scroll to the bottom of the page
    while True:
        try:
            WebDriverWait(driver, 5).until(
                ec.text_to_be_present_in_element(
                    (By.CLASS_NAME, 'load_more_jobs'),
                    'Load more listings')
            )
            loadMore = driver.find_element_by_class_name('load_more_jobs')
            loadMore.click()
        except:
            try:  # can't locate element - click the close on the popup add
                WebDriverWait(driver, 5).until(
                    ec.presence_of_element_located((By.CLASS_NAME, 'portstlucie-CloseButton'))
                )
                addClose = driver.find_element_by_xpath('//*[@id="om-oqulaezshgjig4mgnmcn-optin"]/div/button')
                addClose.click()
            except:  # Timeout / can't locate add - break
                break

    # Find all the job listings
    listings = driver.find_elements_by_class_name('job_listing')
    if len(listings) == 0:
        print(f'There are 0 jobs found by {pos} criteria. Please use different wording.')
        sleep(5)
        scraper()
    else:
        scrapeN = input(f"There are {len(listings)} number of jobs for the {pos} position. If u wish to view a portion of them enter the number of the jobs to view else write 'max': ")
        if scrapeN.lower() == 'max':
            scrapeN = len(listings)
        print('\n')

        for i in range(int(scrapeN)):  # Iterate trough all the job listings
            URL = listings[i].find_element_by_tag_name('a').get_attribute('href')
            html = requests.get(URL)
            soup = BeautifulSoup(html.content, 'html.parser')

            jobT = soup.find('h1', class_='font-weight-bold').text
            jobPD = soup.find('time').text
            link = soup.find('a', class_='application_button')['href']

            print(f'Job - {jobT}. This job was {jobPD}.\nMore information about the job at {URL}. \nLink for application - {link}', end='\n\n')

if __name__ == '__main__':
    scraper()
1 Answer
What I like:
- usage of f-strings, but you are not using them everywhere, eg (an f-string version is sketched right after this list):
URL = 'https://remote.co/remote-jobs/search/?search_keywords='+pos.replace(' ', '+')
- usage of
if __name__ == '__main__':
- code is fairly easy to grasp
- effort to provide a decent working script with a minimum of code (70 lines)
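For instance, the search URL from the first bullet could be written as an f-string; same behaviour, just a sketch:

# Same URL, built with an f-string instead of concatenation
URL = f"https://remote.co/remote-jobs/search/?search_keywords={pos.replace(' ', '+')}"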
What I like less:
- usage of exception handling: I would be more specific and not catch any type of exception, only handle those that are relevant to the operation being performed in the try/except block (selenium.common.exceptions); a sketch of this follows the list
- lack of global exception handling in the rest of the code
- also think about logging full exception details to a file so you don't have to guess what has gone wrong
- I would also avoid nesting of try/except blocks, try to separate each operation from the others (see below)
- mixing lower case and upper case in variable names: remember that variable names are case-sensitive in Python. jobPD and jobpd are two different variables that can get assigned different values, so this could be a nasty source of bugs
- variable names should be more descriptive and meaningful: jobPD does not give a clear hint about what it represents. More descriptive names would be job_title, job_posted_time etc
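As a sketch of what more targeted handling could look like for the "load more" step - this is only an illustration, assuming the imports already present in your script (WebDriverWait, ec, By) and an arbitrary log file name:

import logging
from selenium.common.exceptions import NoSuchElementException, TimeoutException

logging.basicConfig(filename='scraper.log', level=logging.INFO)

def click_load_more(driver):
    # Returns True if the "load more listings" button was clicked, False otherwise
    try:
        WebDriverWait(driver, 5).until(
            ec.text_to_be_present_in_element(
                (By.CLASS_NAME, 'load_more_jobs'), 'Load more listings')
        )
        driver.find_element_by_class_name('load_more_jobs').click()
        return True
    except (TimeoutException, NoSuchElementException):
        logging.exception('Could not click the "load more listings" button')
        return False

The popup-closing step can then live in a separate helper with its own narrow except clause, so the two operations are no longer nested.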
Regarding the scraping process, make sure that the DOM elements you are expecting are really present: websites change their layout more or less often and you must spot changes that could break your application. You can either check with Selenium, or with BS4 if you have already retrieved the HTML. But it seems logical to use Selenium. If you use BS, note the behavior of the different functions:
From the BS4 doc (emphasis is mine):
If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None
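In practice that means checking for None before dereferencing, otherwise a silent layout change surfaces as an AttributeError. A sketch against one of the fields your script extracts (the selector is taken from your code, not re-verified against the site):

title_tag = soup.find('h1', class_='font-weight-bold')
if title_tag is None:
    print(f'No job title found at {URL} - the page layout may have changed.')
else:
    job_title = title_tag.text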
You have this block (the nested try/except block):
try:  # can't locate element - click the close on the popup add
    WebDriverWait(driver, 5).until(
        ec.presence_of_element_located((By.CLASS_NAME, 'portstlucie-CloseButton'))
    )
    addClose = driver.find_element_by_xpath('//*[@id="om-oqulaezshgjig4mgnmcn-optin"]/div/button')
    addClose.click()
except:  # Timeout / can't locate add - break
    break
It is better to anticipate and avoid exceptions rather than handle them. So it seems to me that you should verify the existence of the element if possible, rather than trigger an exception.
Instead, you could use the more generic findElements function. Note that findElement is different than findElements. The difference:
findElements will return an empty list if no matching elements are found instead of an exception (NoSuchElementException)
Reference: Find Element and FindElements in Selenium WebDriver
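In the Python bindings the same idea is find_elements (or find_elements_by_class_name in the older API your script uses), so the popup check can become a plain truthiness test - a sketch, assuming the class name from your script is still the right locator:

popup_close = driver.find_elements_by_class_name('portstlucie-CloseButton')
if popup_close:
    # The popup is present - close it; no exception is raised when it is absent
    popup_close[0].click()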
However, if you stick to the current approach, you should not blindly catch all exceptions: in this context the one relevant exception that may occur is NoSuchElementException.
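If you do keep the try/except, catching only that exception keeps genuine errors visible - a short sketch using the XPath from your script:

from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath('//*[@id="om-oqulaezshgjig4mgnmcn-optin"]/div/button').click()
except NoSuchElementException:
    # The popup simply is not there - nothing to close
    pass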
There is one thing that is not okay: you are using the requests module in parallel to Selenium. That is unnecessary, since you already have an instance of Selenium that you can use. To fetch the whole HTML just use:
html = driver.page_source
then feed it to BS.
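A sketch of what the loop body then looks like (collecting the listing hrefs into a plain list before navigating avoids stale element references):

driver.get(URL)  # navigate with the existing Selenium session instead of requests.get
soup = BeautifulSoup(driver.page_source, 'html.parser')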
Final thought: have you considered logging the results to a file, a table or a CSV perhaps? The console buffer may be too small if you retrieve lots of results.
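A minimal sketch with the standard library, assuming each job is collected as a (title, posted, url, application_link) tuple before writing:

import csv

def write_jobs_csv(jobs, path='jobs.csv'):
    # jobs: iterable of (title, posted, url, application_link) tuples
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'posted', 'url', 'application_link'])
        writer.writerows(jobs)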