I've written a Python script that uses Selenium to harvest some coffee shop names from Yellow Pages. Although the page isn't JavaScript-injected, I used Selenium as an experiment, especially to learn how to handle multiple pages without clicking the next button.
Upon execution, the script runs flawlessly and parses the names from each page. However, it does all this very slowly. Is there any way to make it perform faster while staying within Selenium?
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

for page_num in range(1, 3):
    driver.get('https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA&page={0}'.format(page_num))
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.info"))):
        try:
            name = item.find_element_by_css_selector('a.business-name span[itemprop=name]').text
        except:
            name = ''
        print(name)

driver.quit()
1 Answer
The code is pretty much straightforward and understandable, but I would still work on the following stylistic and readability issues:
- extract constants: it might be a good idea to pull the URL template and the maximum page number out into separate constants, or pass them in as arguments of a function
- put your main execution logic under an if __name__ == '__main__' guard, to avoid it being executed if the module is imported
- avoid bare exception clauses; be specific about the exceptions you are handling. In this case, NoSuchElementException is the right exception to handle when an item name is not found
- use try/finally to safely close the driver if something fails before quitting; this way you eliminate the situation where "ghost" browser windows are left open from failed scraping runs
All things applied:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape(url, max_page_number):
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        for page_number in range(1, max_page_number + 1):
            driver.get(url.format(page_number))
            for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".info"))):
                try:
                    name = item.find_element_by_css_selector('a.business-name span[itemprop=name]').text
                except NoSuchElementException:
                    name = ''
                print(name)
    finally:
        driver.quit()

if __name__ == '__main__':
    url_template = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA&page={0}'
    max_page_number = 2
    scrape(url_template, max_page_number)
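One compatibility note: the find_element_by_css_selector helper used above existed in Selenium 3 but was removed in Selenium 4. If you are on a current release, the equivalent lookup is the generic find_element with a By locator. A minimal sketch of that one change inside the loop:

# Selenium 4.x style: the find_element_by_* helpers no longer exist,
# so the per-item name lookup becomes:
name = item.find_element(By.CSS_SELECTOR, 'a.business-name span[itemprop=name]').text

Everything else in the script stays the same, since the By import is already in place.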