I've written a script that performs a reverse search on the website using the Name and Lid values from a predefined CSV file. Once the search is done, it writes the results (Address and Phone Number) next to the corresponding Name and Lid in a new CSV file. It currently works without errors, and I've tried to keep the overall process clean. Any suggestions for improving this script would be highly appreciated. Here is the code I have tried:
import csv
import requests
from lxml import html
with open("predefined.csv", "r") as f, open('newly_created.csv', 'w', newline='') as g:
reader = csv.DictReader(f)
newfieldnames = reader.fieldnames + ['Address', 'Phone']
writer = csv.writer = csv.DictWriter(g, fieldnames = newfieldnames)
writer.writeheader()
for entry in reader:
Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
response = requests.get(Page)
tree = html.fromstring(response.text)
titles = tree.xpath('//article[contains(@class,"business-card")]')
for title in tree.xpath('//article[contains(@class,"business-card")]'):
Address= title.xpath('.//p[@class="address"]/span/text()')[0]
Contact = title.xpath('.//p[@class="phone"]/text()')[0]
print(Address,Contact)
new_row = entry
new_row['Address'] = Address
new_row['Phone'] = Contact
writer.writerow(new_row)
Here is the link to the search criteria in the "predefined.csv" file.
Here is the link to the results.
1 Answer
There are multiple things we can do to improve the code:
- variable naming - try to be consistent with PEP8 naming suggestions - for instance: Page should probably be page, or even better url; Address would be address; Contact would be contact; f can be input_file; g can be output_file
- the titles variable is never used
- move the url format string into a constant
- you don't need writer = csv.writer = csv.DictWriter(...) - just assign writer to the DictWriter instance directly
- since you are crawling the same domain, re-using a requests.Session() instance should have a positive impact on performance
- use the .findtext() method instead of xpath() followed by taking the first item (see the short comparison after this list)
- I would also create a separate crawl function to keep the web-scraping logic separate
Here is the modified code with the above and other improvements combined:
import csv
import requests
from lxml import html
URL_TEMPLATE = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"
def crawl(entries):
    with requests.Session() as session:
        for entry in entries:
            url = URL_TEMPLATE.format(entry["Name"].replace(" ", "-"), entry["Lid"])
            response = session.get(url)
            tree = html.fromstring(response.text)

            titles = tree.xpath('//article[contains(@class,"business-card")]')
            for title in titles:
                address = title.findtext('.//p[@class="address"]/span')
                contact = title.findtext('.//p[@class="phone"]')
                print(address, contact)

                entry['Address'] = address
                entry['Phone'] = contact
                yield entry

if __name__ == '__main__':
    with open("predefined.csv", "r") as input_file, open('newly_created.csv', 'w', newline='') as output_file:
        reader = csv.DictReader(input_file)
        field_names = reader.fieldnames + ['Address', 'Phone']
        writer = csv.DictWriter(output_file, fieldnames=field_names)
        writer.writeheader()

        for entry in crawl(reader):
            writer.writerow(entry)
(not tested)
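One small optional addition on top of that (my own suggestion, based on the assumption that some listings might be missing an address or phone element): findtext() accepts a default, so the two lookups inside crawl() can fall back to an empty string instead of None.

# hypothetical defensive variant of the two lookups inside crawl()
address = title.findtext('.//p[@class="address"]/span', default='')
contact = title.findtext('.//p[@class="phone"]', default='')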
- Thanks, alecxe, for your elaborate review and the epic code. I tested it just now and found it working, as your code always does. By the way, is it a good idea to write the results to another CSV file rather than the existing one? – SIM, Jul 9, 2017 at 19:25
- @SMth80 Technically you can do either, but I would probably keep the input and output files separate, just in case something is wrong in the program's logic and I don't want my file left in an intermediate state. Thanks! – alecxe, Jul 9, 2017 at 19:36
- @Shahin Yup, I've already seen this post - nice question. And you've got really good reviews; I actually don't have anything valuable to add. Thanks for the heads up! – alecxe, Aug 16, 2017 at 21:20