I've written some code in Python with Selenium to scrape the results a website shows after a reverse search.
My scraper opens the site, clicks the "search by address" button, takes the street number and address from the "Original.csv" file, puts them in the search box, and hits the search button.
Once the result is populated, my scraper grabs it and writes it to a new CSV file, adding new columns alongside the existing columns from "Original.csv".
It is necessary to switch into two iframes to reach the result. To get results for all searches, I had to write complex XPaths that look in two different locations, because the data are not always in the same place.
I've used try/except blocks in my script to handle results with no value. I tried to write all the data to the "Number" and "City" columns, but as I am very weak at handling try/except, I created extra columns named "Number1" and "City1" so that no data are missing. "Number1" and "City1" both come from different XPaths, though!
However, my script runs without errors and fetches the desired results. Any input on this will be highly appreciated.
Here is what I've written to get the job done:
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_info(driver, wait):
    with open("Original.csv", "r") as f, open('Updated.csv', 'w', newline='') as g:
        reader = csv.DictReader(f)
        newfieldnames = reader.fieldnames + ['Number','City','Number1','City1']
        writer = csv.writer = csv.DictWriter(g, fieldnames = newfieldnames)
        writer.writeheader()
        for item in reader:
            driver.get('http://hcad.org/quick-search/')
            driver.switch_to_frame(driver.find_element_by_tag_name("iframe"))
            driver.find_element_by_id("s_addr").click()
            wait.until(EC.presence_of_element_located((By.NAME, 'stnum')))
            driver.find_element_by_name('stnum').send_keys(item["Street"])
            driver.find_element_by_name('stname').send_keys(item["Address"])
            driver.find_element_by_xpath("//input[@value='Search']").click()
            try:
                driver.switch_to_frame(driver.find_element_by_id("quickframe"))
                try:
                    element = driver.find_element_by_xpath("//td[@class='data']/table//th")
                    name = driver.execute_script("return arguments[0].childNodes[10].textContent", element).strip() or driver.execute_script("return arguments[0].childNodes[12].textContent", element).strip()
                except:
                    name = ""
                try:
                    element = driver.find_element_by_xpath("//td[@class='data']/table//th")
                    pet = driver.execute_script("return arguments[0].childNodes[16].textContent", element).strip() or driver.execute_script("return arguments[0].childNodes[18].textContent", element).strip()
                except:
                    pet = ""
                try:
                    name1 = driver.find_element_by_xpath("//table[@class='bgcolor_1']//tr[2]/td[3]").text
                except Exception:
                    name1 = ""
                try:
                    pet1 = driver.find_element_by_xpath("//table[@class='bgcolor_1']//tr[2]/td[4]").text
                except Exception:
                    pet1 = ""
                item["Number"] = name
                item["City"] = pet
                item["Number1"] = name1
                item["City1"] = pet1
                print(item)
                writer.writerow(item)
            except Exception as e:
                print(e)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        get_info(driver, wait)
    finally:
        driver.quit()
Here is the link to the CSV file I used for the searches: "https://www.dropbox.com/s/etgj0bbsav4ex4y/Original.csv?dl=0"
1 Answer
- Bare exception clauses, generally speaking, should be avoided.
- I would apply the "Extract Method" refactoring to, at least, move the complexity of getting numbers and cities into a separate function.
- I also don't really like these extra `Number1` and `City1` columns and, I think, you can still use just `Number` and `City`, but provide multiple ways to locate them on a page and fall back to an empty string only after all of them have failed.
- You can replace:
driver.switch_to_frame(driver.find_element_by_tag_name("iframe"))
with just:
driver.switch_to_frame(0)
This will switch to the first frame in the HTML tree.
- `f` and `g` are not descriptive variable names, how about `input_file` and `output_file`?
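The points above about bare `except` clauses, "Extract Method", and falling back through several locations can be sketched as one small helper. This is only a minimal sketch under stated assumptions, not the original script: `first_non_empty`, `getters`, `ignored`, and `fake_page` are hypothetical names, and a plain dictionary stands in for the page so the example runs without a browser.

```python
def first_non_empty(getters, ignored=(LookupError,)):
    """Return the first non-empty, stripped string the getters produce.

    `getters` is a sequence of zero-argument callables, each trying one
    location. Only the exception types listed in `ignored` are swallowed,
    so unrelated bugs still surface instead of being hidden by a bare except.
    """
    for getter in getters:
        try:
            value = getter()
        except ignored:
            continue  # this location is missing, try the next one
        if value and value.strip():
            return value.strip()
    return ""  # every location failed or was empty


# Hypothetical stand-in for a page: the first location is missing,
# the second one holds the value.
fake_page = {"table_cell": "  HOUSTON  "}

city = first_non_empty([
    lambda: fake_page["header_cell"],  # raises KeyError, a LookupError
    lambda: fake_page["table_cell"],
])
print(city)  # HOUSTON
```

With Selenium, each getter would wrap one of the existing lookups, and `ignored` would probably be `(NoSuchElementException,)` from `selenium.common.exceptions`, so a single `City` column can be filled from either location.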
Alternative Solution
You can avoid using a real browser and all the related overhead and switch to `requests` and `BeautifulSoup`; this should dramatically improve the overall performance.
Here is working sample code for a single search:
import requests
from bs4 import BeautifulSoup

search_parameters = {
    'TaxYear': '2017',
    'stnum': '15535',
    'stname': 'CAMPDEN HILL RD'
}

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
    session.post('https://public.hcad.org/records/QuickSearch.asp', data={'search': 'addr'},
                 headers={'Content-Type': 'application/x-www-form-urlencoded',
                          'Referer': 'https://public.hcad.org/records/quicksearch.asp'})
    response = session.post('https://public.hcad.org/records/QuickRecord.asp', data=search_parameters,
                            headers={'Content-Type': 'application/x-www-form-urlencoded',
                                     'Referer': 'https://public.hcad.org/records/QuickSearch.asp'}, allow_redirects=True)
    soup = BeautifulSoup(response.content, "lxml")
    print(soup.select_one("td.data > table th"))
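Because the search result may not contain the expected cell, it helps to know how `select_one()` behaves when nothing matches. The snippet below is a small illustration on a made-up HTML string (the markup and variable names are assumptions, not the real HCAD page), using the built-in `html.parser` so it needs nothing beyond `bs4`.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page is structured differently.
html = "<table><tr><th>Owner</th><td>First</td><td>Second</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element matching the CSS selector.
first_cell = soup.select_one("td")
print(first_cell.get_text())

# When nothing matches, select_one() returns None instead of raising.
missing = soup.select_one("td.data")
print(missing is None)

# select() returns a list of all matches, so indexing an empty result fails.
try:
    soup.select("td.data")[0]
except IndexError:
    print("select()[0] raised IndexError")
```

Checking the result of `select_one()` against `None` before reading its text is one way to fall back to an empty string for addresses that return no data.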
- Thanks, sir alecxe, for your review. What an easy way you have shown to achieve the same target that I had made unnecessarily complicated. I'm not around my PC. As soon as I'm near my PC I will give you feedback. – SIM, Aug 11, 2017 at 9:00
- It is definitely the best approach I've come across so far for dealing with a site like this. Two things come to mind on seeing your solution. 1. How could you get past the barrier of the iframes? 2. What does "soup.select_one" mean: is it for selecting the first one? Thanks for everything, sir. – SIM, Aug 11, 2017 at 15:24
- @Shahin well, if we use requests and bs4 as suggested, I think iframes don't matter anymore, if I understand you correctly. `select_one()` is the same as `select()[0]` except that it will return `None` if there are no results. Thanks! – alecxe, Aug 11, 2017 at 16:16
- You are a great coder, sir. You always come up with something new. Thanks for the clarity, by the way. – SIM, Aug 11, 2017 at 16:25