I've written a script that performs a reverse search on the website using the Name and Lid values from a predefined CSV file. Once the search is done, it writes the results (Address and Phone Number) next to the corresponding Name and Lid in a new CSV file. It currently works without errors, and I've tried to keep the overall process clean. Any suggestions for improving this script would be highly appreciated. Here is the code I have tried:
import csv
import requests
from lxml import html
with open("predefined.csv", "r") as f, open('newly_created.csv', 'w', newline='') as g:
reader = csv.DictReader(f)
newfieldnames = reader.fieldnames + ['Address', 'Phone']
writer = csv.writer = csv.DictWriter(g, fieldnames = newfieldnames)
writer.writeheader()
for entry in reader:
Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
response = requests.get(Page)
tree = html.fromstring(response.text)
titles = tree.xpath('//article[contains(@class,"business-card")]')
for title in tree.xpath('//article[contains(@class,"business-card")]'):
Address= title.xpath('.//p[@class="address"]/span/text()')[0]
Contact = title.xpath('.//p[@class="phone"]/text()')[0]
print(Address,Contact)
new_row = entry
new_row['Address'] = Address
new_row['Phone'] = Contact
writer.writerow(new_row)
Here is the link to the search criteria in the "predefined.csv" file.
Here is the link to the results.
1 Answer
There are multiple things we can do to improve the code:
- variable naming - try to be consistent with PEP8 naming suggestions - for instance: Page should probably be page, or even better url; Address would be address; Contact would be contact; f can be input_file; g can be output_file
- the titles variable is never used
- move the url format string into a constant
- you don't need writer = csv.writer = csv.DictWriter(...) - just assign writer to the DictWriter instance directly
- since you are crawling the same domain, re-using a requests.Session() instance should have a positive impact on performance
- use the .findtext() method instead of xpath() followed by taking the first item (see the short comparison after this list)
- I would also create a separate crawl function to keep the web-scraping logic separate
Here is the modified code with the above and other improvements combined:
import csv
import requests
from lxml import html
URL_TEMPLATE = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"
def crawl(entries):
    with requests.Session() as session:
        for entry in entries:
            url = URL_TEMPLATE.format(entry["Name"].replace(" ", "-"), entry["Lid"])
            response = session.get(url)
            tree = html.fromstring(response.text)

            titles = tree.xpath('//article[contains(@class,"business-card")]')
            for title in titles:
                address = title.findtext('.//p[@class="address"]/span')
                contact = title.findtext('.//p[@class="phone"]')
                print(address, contact)

                entry['Address'] = address
                entry['Phone'] = contact
                yield entry

if __name__ == '__main__':
    with open("predefined.csv", "r") as input_file, open('newly_created.csv', 'w', newline='') as output_file:
        reader = csv.DictReader(input_file)
        field_names = reader.fieldnames + ['Address', 'Phone']
        writer = csv.DictWriter(output_file, fieldnames=field_names)
        writer.writeheader()

        for entry in crawl(reader):
            writer.writerow(entry)
(not tested)
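One small optional addition on top of that (my own suggestion, based on the assumption that some listings might be missing an address or phone element): findtext() accepts a default, so the two lookups inside crawl() can fall back to an empty string instead of None.

# hypothetical defensive variant of the two lookups inside crawl()
address = title.findtext('.//p[@class="address"]/span', default='')
contact = title.findtext('.//p[@class="phone"]', default='')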
- Thanks, alecxe, for your elaborate review and the epic code. I tested it just now and found it working, as your code always does. By the way, is it a good idea to write the results to another CSV file rather than the existing one? – SIM, Jul 9, 2017 at 19:25
- @SMth80 Technically you can do either, but I would probably keep the input and output files separate, just in case something is wrong in the program's logic and I don't want my file left in an intermediate state. Thanks! – alecxe, Jul 9, 2017 at 19:36
- @Shahin Yup, I've already seen this post - nice question. And you've got really good reviews; I actually don't have anything valuable to add. Thanks for the heads up! – alecxe, Aug 16, 2017 at 21:20