
I'm new to web scraping. I've made something that works, but it takes hours and hours to get everything I need. I read something about using parallel processes to handle the URLs, but I have no clue how to go about it or how to incorporate it into what I already have. Help is much appreciated!

Here is my (still extremely messy) code. I'm still learning :)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import random
import pprint
import itertools
import csv
import pandas as pd

start_url = "https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO"

driver = webdriver.Firefox()
driver.set_page_load_timeout(20)
driver.get(start_url)
driver.find_element_by_xpath('//*[@id="form_save"]').click()  # accepts cookies

wait = WebDriverWait(driver, random.randint(1500, 3200) / 1000.0)
j = random.randint(1500, 3200) / 1000.0
time.sleep(j)

num_jobs = int(driver.find_element_by_xpath('/html/body/div[3]/div/main/div[2]/div[3]/div/header/h2/span').text)
num_pages = int(num_jobs / 102)

urls = []
list_of_links = []

for i in range(num_pages + 1):
    try:
        elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="search-results-container"]//article/job/a')))
        for i in elements:
            list_of_links.append(i.get_attribute('href'))

        j = random.randint(1500, 3200) / 1000.0
        time.sleep(j)

        if 'page=3' not in driver.current_url:
            driver.find_element_by_xpath('//html/body/div[3]/div/main/div[2]/div[3]/div/paginator/div/nav[1]/ul/li[6]/a').click()
        else:
            driver.find_element_by_xpath('//html/body/div[3]/div/main/div[2]/div[3]/div/paginator/div/nav[1]/ul/li[5]/a').click()

        url = driver.current_url
        if url not in urls:
            print(url)
            urls.append(url)
        else:
            break
    except:
        continue

set_list_of_links = list(set(list_of_links))
print(len(set_list_of_links), "results")
driver.close()


def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk


def remove_empty_lists(l):
    keep_going = True
    prev_l = l
    while keep_going:
        new_l = remover(prev_l)
        # are they identical objects?
        if new_l == prev_l:
            keep_going = False
        # set prev to new
        prev_l = new_l
    # return the result
    return new_l


def remover(l):
    newlist = []
    for i in l:
        if isinstance(i, list) and len(i) != 0:
            newlist.append(remover(i))
        if not isinstance(i, list):
            newlist.append(i)
    return newlist


vacatures = []
chunks = grouper(100, set_list_of_links)
chunk_count = 0

for chunk in chunks:
    chunk_count += 1
    print(chunk_count)
    j = random.randint(1500, 3200) / 1000.0
    time.sleep(j)
    for url in chunk:
        driver = webdriver.Firefox()
        driver.set_page_load_timeout(20)
        try:
            driver.get(url)
            driver.find_element_by_xpath('//*[@id="form_save"]').click()  # accepts cookies
            vacature = []
            vacature.append(url)
            j = random.randint(1500, 3200) / 1000.0
            time.sleep(j)
            elements = driver.find_elements_by_tag_name('dl')
            p_elements = driver.find_elements_by_tag_name('p')
            li_elements = driver.find_elements_by_tag_name('li')
            for i in elements:
                if "Salaris:" not in i.text:
                    vacature.append(i.text)
            running_text = list()
            for p in p_elements:
                running_text.append(p.text)
            text = [''.join(running_text)]
            remove_ls = ['vacatures', 'carrièretips', 'help', 'inloggen', 'inschrijven', 'Bezoek website', 'YouTube',
                         'Over Nationale Vacaturebank', 'Werken bij de Persgroep', 'Persberichten', 'Autotrack', 'Tweakers',
                         'Tweakers Elect', 'ITBanen', 'Contact', 'Carrière Mentors', 'Veelgestelde vragen',
                         'Vacatures, stages en bijbanen', 'Bruto Netto Calculator', 'Salariswijzer', 'Direct vacature plaatsen',
                         'Kandidaten zoeken', 'Bekijk de webshop', 'Intermediair', 'Volg ons op Facebook']
            for li in li_elements:
                if li.text not in remove_ls:
                    text.append(li.text)
            text = ''.join(text)
            vacature.append(text)
            vacatures.append(vacature)
            driver.close()
        except TimeoutException as ex:
            isrunning = 0
            print("Exception has been thrown. " + str(ex))
            driver.close()
        except NoSuchElementException:
            continue
asked Oct 31, 2018 at 9:58

5 Comments

  • Please note reviewers will not comment on code that has not yet been written (how to parallel-process), but the rest of the question seems fine. Welcome to Code Review! Commented Oct 31, 2018 at 11:09
  • On Stack Overflow I was just told that it's probably Selenium that slows things down, and that I should use requests with Beautiful Soup or lxml. I don't have experience with either and dread working with lxml. How can I rework what I have now so it uses requests and bs4/lxml? Commented Oct 31, 2018 at 13:02
  • Something that's relatively easy to implement is a thread-pool solution: one (or several) threads produce the URLs that need to be visited, and several threads consume those URLs to extract the needed info (a sketch in that spirit follows these comments). However, this approach usually scales poorly with Selenium, since each browser instance uses quite a lot of memory. Commented Oct 31, 2018 at 15:18
  • I understand that a thread pool is easier with requests. But to begin with I don't have a list of links, since I gather them by clicking the "next" button. I haven't been able to get a working version of my code above with requests and lxml. Commented Oct 31, 2018 at 15:34
  • Welcome to Code Review! I changed the title so that it describes what the code does, per the site goals: "State what your code does in your title, not your main concerns about it." Feel free to edit and give it a different title if there is something more appropriate. Commented Oct 31, 2018 at 16:11
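
To make the thread-pool suggestion from the comments concrete, here is a minimal sketch under loudly stated assumptions: it swaps Selenium for requests plus BeautifulSoup; both helpers, collect_links and fetch_vacancy, are hypothetical names invented for illustration; the page= query parameter is inferred from the code's own 'page=3' check; and the CSS selector simply mirrors the search-results XPath in the question. One big caveat: the <job> and <paginator> tags in those XPaths suggest a JavaScript-rendered front end, in which case plain requests will not see this markup at all. Check the browser's network tab for an underlying JSON endpoint before going down this road.

import concurrent.futures

import requests
from bs4 import BeautifulSoup

BASE = ('https://www.nationalevacaturebank.nl/vacature/zoeken'
        '?query=&location=&distance=city&limit=100&sort=relevance'
        '&filters%5BcareerLevel%5D%5B%5D=Starter'
        '&filters%5BeducationLevel%5D%5B%5D=MBO')


def collect_links(page):
    """Hypothetical helper: fetch one results page, return the vacancy URLs on it."""
    response = requests.get(BASE + '&page=' + str(page), timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # CSS equivalent of the question's search-results XPath.
    return [a['href'] for a in
            soup.select('#search-results-container article job a[href]')]


def fetch_vacancy(url):
    """Hypothetical worker: download one vacancy page and collect its text."""
    response = requests.get(url, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    vacature = [url]
    # Mirror the Selenium version: keep every <dl> except the salary block.
    vacature.extend(dl.get_text() for dl in soup.find_all('dl')
                    if 'Salaris:' not in dl.get_text())
    vacature.append(''.join(p.get_text() for p in soup.find_all('p')))
    return vacature


# Gather all vacancy links page by page, sequentially and politely.
links = set()
page = 1
while True:
    page_links = collect_links(page)
    # Assumes an out-of-range page returns no new links; if the site keeps
    # serving the last page instead, this stops because nothing new appears.
    if not page_links or set(page_links) <= links:
        break
    links.update(page_links)
    page += 1

# Fetch the vacancy pages in parallel; the work is I/O-bound, so threads fit.
vacatures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    future_to_url = {executor.submit(fetch_vacancy, url): url for url in links}
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            vacatures.append(future.result())
        except requests.RequestException as ex:
            print('Failed:', future_to_url[future], ex)

Because the work is I/O-bound (the program mostly waits on the network), a thread pool rather than a process pool is usually the right tool here, and 8 workers is an arbitrary, deliberately polite starting point; the memory caveats from the comments apply if you keep Selenium instead.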

1 Answer

This is a small point and likely won't lead to much of a performance gain, but it can simplify finding elements.

Instead of finding elements by their id attribute using XPath, the method find_element_by_id could be used. So instead of a line like:

driver.find_element_by_xpath('//*[@id="form_save"]').click() #accepts cookies

This could be simplified to:

driver.find_element_by_id('form_save').click() #accepts cookies
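
A side note for anyone picking this up with a current toolchain: the find_element_by_* helpers used here were deprecated and later removed in Selenium 4, so on recent versions the same lookup is spelled with By.ID (the question already imports By):

from selenium.webdriver.common.by import By

driver.find_element(By.ID, 'form_save').click()  # accepts cookies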
answered Oct 31, 2018 at 16:08