Scraper for parsing email-ID using email sending button

Question 1

I've written a Python script that sends a POST request to submit the form behind each email button, captures the JSON response, and parses the email ID from it. I don't know whether this is the ideal approach, but it does parse the email ID. There are four email buttons on the page and my script handles all of them. Here is what I've tried so far:

import requests
from lxml import html

main_url = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=London&scrambleSeed=2082758402"

def main_page(url):
    response = requests.get(url, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='businessCapsule--callToAction']"):
        title = titles.xpath('.//a[contains(@class,"btn-blue")]/@href')[0] if len(titles.xpath('.//a[contains(@class,"btn-blue")]/@href'))>0 else ""
        process_page(title.replace("/customerneeds/sendenquiry/sendtoone/","").replace("?searchedLocation=London",""))

def process_page(number):
    link = "https://www.yell.com/customerneeds/sendenquiry/sendtoone/"
    payload = {'message':'Testing whether this email really works.','senderPostcode':'GL51 0EX','enquiryTimeframe':'withinOneMonth','senderFirstName':'mth','senderLastName':'iqbal','senderEmail':'[email protected]','senderEmailConfirm':'[email protected]','uniqueAdId':number,'channel':'desktop','ccSender':'true','marketing':'on'}
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}
    response = requests.post(link, data = payload, headers = headers)
    items = response.json()
    item = items['emailToCustomerUUID']
    print(item)

main_page(main_url)

These are the four email IDs I've parsed from that webpage:

1. 468145be-0ac3-4ff0-adf5-aceed10b7e5e
2. f750fa18-72c9-44d1-afaa-475778dd8b47
3. d104eb93-1f35-4bdc-ad67-47b13ea67e55
4. 90d2db3b-6144-4266-a86c-aa39b4f99d9a

P.S. I've put a fake email address in the script so that you can test it without changing anything.

Answer 1

Here are a few improvements I would apply:

1. Match the "Email" links directly via the XPath expression //div[@class='businessCapsule--callToAction']//a[. = 'Email']/@href. Note how I'm matching the link by its "Email" text.
2. Use urlparse to get the "id" number and extract that logic into a separate function.
3. Make a "scraper" class to share the same web-scraping session and persist things like headers, so you don't repeat them every time you make a request.
4. A better name for process_page would probably be send_email.
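
To illustrate the point about persisting headers on a shared session, here is a minimal sketch. It only prepares requests and never sends them, so nothing touches the network. The example.com URLs and the shortened User-Agent string are placeholders of mine, not values from the original script, and headers.update is one possible way to set the header (it keeps the session's default headers, unlike assigning a new dict).

```python
import requests

# Placeholder User-Agent, shortened for the example.
USER_AGENT = "Mozilla/5.0 (example)"

session = requests.Session()
# Set the header once on the session; update() keeps requests' other defaults.
session.headers.update({"User-Agent": USER_AGENT})

# Prepare (but do not send) two requests to inspect the merged headers.
# The URLs are placeholders, not the real endpoints.
get_req = session.prepare_request(
    requests.Request("GET", "https://example.com/search")
)
post_req = session.prepare_request(
    requests.Request("POST", "https://example.com/send", data={"k": "v"})
)

# Both prepared requests carry the session-level User-Agent.
print(get_req.headers["User-Agent"])
print(post_req.headers["User-Agent"])
```

Both prints show the same User-Agent even though neither request passed headers explicitly, which is what removes the repetition from the original two functions.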

Improved Code:

from urllib.parse import urlparse

import requests
from lxml import html


def extract_id(link):
    """Extracts the search result id number from a search result link."""
    return urlparse(link).path.split("/")[-1]


class YellScraper:
    MAIN_URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=London&scrambleSeed=2082758402"
    EMAIL_SEND_URL = "https://www.yell.com/customerneeds/sendenquiry/sendtoone/"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}

    def scrape(self):
        response = self.session.get(self.MAIN_URL).text
        tree = html.fromstring(response)
        for email_link in tree.xpath("//div[@class='businessCapsule--callToAction']//a[. = 'Email']/@href"):
            self.send_email(extract_id(email_link))

    def send_email(self, search_result_id):
        payload = {'message': 'Testing whether this email really works.', 'senderPostcode': 'GL51 0EX',
                   'enquiryTimeframe': 'withinOneMonth', 'senderFirstName': 'mth', 'senderLastName': 'iqbal',
                   'senderEmail': '[email protected]', 'senderEmailConfirm': '[email protected]',
                   'uniqueAdId': search_result_id, 'channel': 'desktop', 'ccSender': 'true', 'marketing': 'on'}
        response = self.session.post(self.EMAIL_SEND_URL, data=payload)
        items = response.json()
        item = items['emailToCustomerUUID']
        print(item)


if __name__ == '__main__':
    scraper = YellScraper()
    scraper.scrape()
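
As a quick check of the urlparse-based helper, the sketch below runs extract_id on a made-up href. The path prefix and query string follow the shape that the original script strips by hand, but the id 1234567 is hypothetical.

```python
from urllib.parse import urlparse


def extract_id(link):
    """Return the last path segment of a link, ignoring any query string."""
    return urlparse(link).path.split("/")[-1]


# Hypothetical href; the id value is invented for illustration.
sample_href = "/customerneeds/sendenquiry/sendtoone/1234567?searchedLocation=London"

# urlparse separates the query string from the path, so only the
# final path segment is returned.
print(extract_id(sample_href))
```

Because the query string is split off by urlparse, the helper does not depend on the searched location being "London", unlike the chained replace calls. One caveat: a link whose path ends in a slash would yield an empty string.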

Comment 1

Thanks, sir alecxe, for your invaluable suggestions and dynamic code. The XPath you used in your script is overwhelming. I still can't figure out how the portion "[. = 'Email']" works. A one-line explanation of how the XPath was built would help me understand it. Thanks again, sir.

Comment 2

@SMth80 glad to help! The . generally refers to the string value of the current node, which for nodes containing only text is equivalent to text(). It is explained in greater detail here. Thanks for another interesting problem.
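
To make the difference between . and text() concrete, here is a small sketch using lxml on a hand-written HTML snippet. The markup and href values are invented for illustration and are not taken from the real page.

```python
from lxml import html

# Hand-written snippet: one link with plain text, one with the text
# wrapped in a child element, and one unrelated link.
snippet = """
<div class="businessCapsule--callToAction">
  <a href="/plain">Email</a>
  <a href="/nested"><span>Email</span></a>
  <a href="/other">Website</a>
</div>
"""

tree = html.fromstring(snippet)

# "." compares the string value of <a>: the concatenated text of the
# element and all of its descendants.
by_string_value = tree.xpath("//a[. = 'Email']/@href")

# "text()" only looks at text nodes that are direct children of <a>.
by_text_node = tree.xpath("//a[text() = 'Email']/@href")

print(by_string_value)
print(by_text_node)
```

The first query matches both the plain and the nested link, while the second matches only the plain one, since the nested link has no text node of its own. For links that contain only text, the two predicates select the same nodes.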

alecxe · Accepted Answer · 2017-07-05 00:16:08Z
