Python + selenium scraper to grab results using reverse search

Question 1

I've written some code in Python with Selenium to scrape the populated results from a website after performing a reverse search.

My scraper opens the site, clicks the "search by address" button, takes the street number and address from the "Original.csv" file, puts them in the search box, and hits the search button.

Once the result is populated, my scraper grabs it and writes it to a new CSV file, adding new columns alongside the existing columns from "Original.csv".

It is necessary to switch through two iframes to get to the result. To get results for all searches, I had to write complex XPaths that look in two different locations, because the data is not always in the same place.

I've used try/except blocks in my script to handle results with no value. I tried to write all the data to the "Number" and "City" columns, but as I'm very weak at handling try/except, I created extra columns named "Number1" and "City1" so that no data is missing. "Number1" and "City1" are located by different XPaths, though!

However, my script runs without errors and fetches the desired results. Any input on this will be highly appreciated.

Here is what I've written to get the job done:

import csv

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def get_info(driver, wait):
    with open("Original.csv", "r") as f, open('Updated.csv', 'w', newline='') as g:
        reader = csv.DictReader(f)
        newfieldnames = reader.fieldnames + ['Number', 'City', 'Number1', 'City1']
        # Assign only to a local name; the original also overwrote csv.writer.
        writer = csv.DictWriter(g, fieldnames=newfieldnames)
        writer.writeheader()

        for item in reader:
            # First iframe: the search form.
            driver.get('http://hcad.org/quick-search/')
            driver.switch_to_frame(driver.find_element_by_tag_name("iframe"))
            driver.find_element_by_id("s_addr").click()
            wait.until(EC.presence_of_element_located((By.NAME, 'stnum')))
            driver.find_element_by_name('stnum').send_keys(item["Street"])
            driver.find_element_by_name('stname').send_keys(item["Address"])
            driver.find_element_by_xpath("//input[@value='Search']").click()

            try:
                # Second iframe: the search result.
                driver.switch_to_frame(driver.find_element_by_id("quickframe"))

                try:
                    element = driver.find_element_by_xpath("//td[@class='data']/table//th")
                    name = driver.execute_script("return arguments[0].childNodes[10].textContent", element).strip() or driver.execute_script("return arguments[0].childNodes[12].textContent", element).strip()
                except:
                    name = ""

                try:
                    element = driver.find_element_by_xpath("//td[@class='data']/table//th")
                    pet = driver.execute_script("return arguments[0].childNodes[16].textContent", element).strip() or driver.execute_script("return arguments[0].childNodes[18].textContent", element).strip()
                except:
                    pet = ""

                # Alternative location of the same data.
                try:
                    name1 = driver.find_element_by_xpath("//table[@class='bgcolor_1']//tr[2]/td[3]").text
                except Exception:
                    name1 = ""

                try:
                    pet1 = driver.find_element_by_xpath("//table[@class='bgcolor_1']//tr[2]/td[4]").text
                except Exception:
                    pet1 = ""

                item["Number"] = name
                item["City"] = pet
                item["Number1"] = name1
                item["City1"] = pet1
                print(item)
                writer.writerow(item)

            except Exception as e:
                print(e)


if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        get_info(driver, wait)
    finally:
        driver.quit()

Here is the link to the CSV file I used for the searches: "https://www.dropbox.com/s/etgj0bbsav4ex4y/Original.csv?dl=0"

Question 2

Bare except clauses, generally speaking, should be avoided.
I would apply the "Extract Method" refactoring to, at least, move the complexity of getting numbers and cities into a separate function.
I also don't really like the extra Number1 and City1 columns. I think you can still use just Number and City, but provide multiple ways to locate them on the page and fall back to an empty string only after all of them have failed.
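
The last two points could be sketched roughly as follows. This is a minimal sketch, not the reviewer's code: `first_text` and its parameters are hypothetical names, and each finder is assumed to be a zero-argument callable that either returns text or raises one of the expected exceptions.

```python
def first_text(finders, expected=(LookupError,), default=""):
    """Return the first non-empty, stripped text produced by any finder.

    `finders` is an iterable of zero-argument callables (hypothetical helpers,
    e.g. lambdas that each wrap one XPath lookup). Only the exception types in
    `expected` are swallowed; anything else propagates, so no bare `except:`
    is needed.
    """
    for finder in finders:
        try:
            text = finder()
        except expected:
            continue  # this location did not hold the value; try the next one
        if text and text.strip():
            return text.strip()
    return default


# Stand-ins for page lookups so the sketch runs without a browser.
def missing():
    raise LookupError("element not found")


number = first_text([missing, lambda: "  ", lambda: " 1234 "])
city = first_text([missing, missing])
print(repr(number), repr(city))
```

With Selenium, `expected` would presumably be `(NoSuchElementException,)` from `selenium.common.exceptions`, and each finder would wrap one of the XPath lookups from the question, so a single Number and a single City column would be enough.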

You can replace:

 driver.switch_to_frame(driver.find_element_by_tag_name("iframe"))

with just:

 driver.switch_to_frame(0)

This will switch to the first frame in the HTML tree.

f and g are not descriptive variable names; how about input_file and output_file?
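
With those names, the CSV plumbing might read as below. This is only a sketch: `copy_with_columns` and `lookup` are hypothetical names, and in-memory files stand in for "Original.csv" and "Updated.csv" so that it runs on its own.

```python
import csv
import io


def copy_with_columns(input_file, output_file, extra_fields, lookup):
    """Copy rows from input_file to output_file, adding extra_fields.

    `lookup` is a hypothetical callable that takes a row dict and returns a
    dict with one value per name in `extra_fields`.
    """
    reader = csv.DictReader(input_file)
    writer = csv.DictWriter(output_file, fieldnames=reader.fieldnames + extra_fields)
    writer.writeheader()
    for row in reader:
        row.update(lookup(row))
        writer.writerow(row)


# In-memory stand-ins for open("Original.csv") and open("Updated.csv", "w", newline="").
input_file = io.StringIO("Street,Address\n15535,CAMPDEN HILL RD\n")
output_file = io.StringIO(newline="")
copy_with_columns(input_file, output_file, ["Number", "City"],
                  lambda row: {"Number": "", "City": ""})
print(output_file.getvalue())
```

Passing the file objects in, rather than opening them inside the function, also keeps the scraping logic separate from the file handling.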

Alternative Solution

You can avoid using a real browser and all the related overhead by switching to requests and BeautifulSoup. This should dramatically improve the overall performance.

Here is sample working code for a single search:

import requests
from bs4 import BeautifulSoup


search_parameters = {
    'TaxYear': '2017',
    'stnum': '15535',
    'stname': 'CAMPDEN HILL RD'
}

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}

    session.post('https://public.hcad.org/records/QuickSearch.asp', data={'search': 'addr'},
                 headers={'Content-Type': 'application/x-www-form-urlencoded',
                          'Referer': 'https://public.hcad.org/records/quicksearch.asp'})

    response = session.post('https://public.hcad.org/records/QuickRecord.asp', data=search_parameters,
                            headers={'Content-Type': 'application/x-www-form-urlencoded',
                                     'Referer': 'https://public.hcad.org/records/QuickSearch.asp'}, allow_redirects=True)

    soup = BeautifulSoup(response.content, "lxml")
    print(soup.select_one("td.data > table th"))
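
To run the search for every row of the question's CSV instead of one hard-coded address, the form data could be built per row. This is a sketch under assumptions: `build_search_parameters` is a hypothetical helper, the `Street` and `Address` column names come from the question's script, and the default tax year simply mirrors the sample above.

```python
import csv
import io


def build_search_parameters(row, tax_year="2017"):
    """Map one CSV row to the form fields used in the single-search sample."""
    return {
        "TaxYear": tax_year,
        "stnum": row["Street"].strip(),
        "stname": row["Address"].strip(),
    }


# In-memory stand-in for Original.csv so the sketch runs without the real file.
sample_csv = io.StringIO("Street,Address\n15535,CAMPDEN HILL RD\n")
all_parameters = [build_search_parameters(row) for row in csv.DictReader(sample_csv)]
print(all_parameters)
```

Each dict would then be passed as `data=` to the second `session.post` call, reusing one session for all rows.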

Question 3

Thanks, sir alecxe, for your review. What an easy way you have shown to achieve the same target that I had made unnecessarily complicated. I'm not at my PC right now; as soon as I am, I will give you feedback.

Question 4

It is definitely the best approach I've come across so far for dealing with a site like this. Two things come to mind on seeing your solution: 1. How did you get past the iframe barrier? 2. What does "soup.select_one" mean: is it for selecting the first match? Thanks for everything, sir.

Question 5

@Shahin Well, if we use requests and bs4 as suggested, I think iframes don't matter anymore, if I understand you correctly. select_one() is the same as select()[0], except that it returns None if there are no results. Thanks!
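
The difference between the two calls can be seen on a small made-up snippet. This is only an illustration: the markup is invented, and `html.parser` is used instead of `lxml` purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Invented markup that loosely mimics a nested results table.
html = (
    "<table><tr><td class='data'>"
    "<table><tr><th>first</th><th>second</th></tr></table>"
    "</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("td.data > table th")    # first match, or None
all_matches = soup.select("td.data > table th")  # list of every match
missing = soup.select_one("td.nothing")          # no match: None, no exception

print(first.get_text())
print(len(all_matches))
print(missing)
```

By contrast, `soup.select("td.nothing")[0]` would raise an IndexError, because `select()` returns an empty list when nothing matches.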

Question 6

You are a great coder, sir. You always come up with something new. Thanks for the clarity, by the way.

Answer by alecxe (accepted), 2017-08-11 04:34:41Z
