I've written a Python script that uses Selenium to harvest some coffee shop names from Yellow Pages. Although the page isn't JavaScript-injected, I used Selenium as an experiment, especially to learn how to handle multiple pages without clicking the next button.
Upon execution, the script runs flawlessly and parses the names from each page. However, it does all this very slowly. Is there any way to make it perform faster while staying within Selenium?
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

for page_num in range(1, 3):
    driver.get('https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA&page={0}'.format(page_num))
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.info"))):
        try:
            name = item.find_element_by_css_selector('a.business-name span[itemprop=name]').text
        except:
            name = ''
        print(name)

driver.quit()
1 Answer
The code is pretty much straightforward and understandable, but I would still work on the following stylistic and readability issues:
- extract constants: it might be a good idea to pull the URL template and the maximum page number out into separate constants, or pass them in as arguments of a function
- put your main execution logic under an if __name__ == '__main__' guard, to avoid it being executed if the module is imported
- avoid bare exception clauses; be specific about the exceptions you are handling. In this case, NoSuchElementException is the right exception to handle when an item name is not found
- use try/finally to safely close the driver if something fails before quitting; this way you eliminate the situation where "ghost" browser windows are left open from failed scraping runs
All things applied:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape(url, max_page_number):
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        for page_number in range(1, max_page_number + 1):
            driver.get(url.format(page_number))
            for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".info"))):
                try:
                    name = item.find_element_by_css_selector('a.business-name span[itemprop=name]').text
                except NoSuchElementException:
                    name = ''
                print(name)
    finally:
        driver.quit()

if __name__ == '__main__':
    url_template = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA&page={0}'
    max_page_number = 2
    scrape(url_template, max_page_number)
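One compatibility note: the find_element_by_css_selector helper used above existed in Selenium 3 but was removed in Selenium 4. If you are on a current release, the equivalent lookup is the generic find_element with a By locator. A minimal sketch of that one change inside the loop:

# Selenium 4.x style: the find_element_by_* helpers no longer exist,
# so the per-item name lookup becomes:
name = item.find_element(By.CSS_SELECTOR, 'a.business-name span[itemprop=name]').text

Everything else in the script stays the same, since the By import is already in place.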