3
\$\begingroup\$

I've written a script in python with POST request which is able to send email using send button and then catch the response of that email and finally parse the ID from that. I don't know whether the way I did this is the ideal one but it does parse the email ID from this process. There are 4 email sending buttons available in that webpage and my script is able to scrape them all. Here is what I've tried so far with:

import requests
from lxml import html
main_url = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=London&scrambleSeed=2082758402"
def main_page(url):
 response = requests.get(url, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}).text
 tree = html.fromstring(response)
 for titles in tree.xpath("//div[@class='businessCapsule--callToAction']"):
 title = titles.xpath('.//a[contains(@class,"btn-blue")]/@href')[0] if len(titles.xpath('.//a[contains(@class,"btn-blue")]/@href'))>0 else ""
 process_page(title.replace("/customerneeds/sendenquiry/sendtoone/","").replace("?searchedLocation=London",""))
def process_page(number): 
 link = "https://www.yell.com/customerneeds/sendenquiry/sendtoone/"
 payload = {'message':'Testing whether this email really works.','senderPostcode':'GL51 0EX','enquiryTimeframe':'withinOneMonth','senderFirstName':'mth','senderLastName':'iqbal','senderEmail':'[email protected]','senderEmailConfirm':'[email protected]','uniqueAdId':number,'channel':'desktop','ccSender':'true','marketing':'on'}
 headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}
 response = requests.post(link, data = payload, headers = headers)
 items = response.json()
 item = items['emailToCustomerUUID']
 print(item)
main_page(main_url)

These are the four email IDs I've parsed from that webpage:

1. 468145be-0ac3-4ff0-adf5-aceed10b7e5e
2. f750fa18-72c9-44d1-afaa-475778dd8b47
3. d104eb93-1f35-4bdc-ad67-47b13ea67e55
4. 90d2db3b-6144-4266-a86c-aa39b4f99d9a

Post Script: I've used my fake email address and put it in this script so that you can test the result without bringing any change to it.

asked Jul 4, 2017 at 20:20
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Here are the few improvements I would apply:

  • match the "Email" links directly via //div[@class='businessCapsule--callToAction']//a[. = 'Email']/@href XPath expression - note how I'm matching the link by Email text
  • use urlparse to get the "id" number and extract that logic into a separate function
  • make a "scraper" class to share the same web-scraping session and persist things like headers to avoid repeating them every time you make a request
  • a better name for process_page would probably be send_email

Improved Code:

from urllib.parse import urlparse
import requests
from lxml import html
def extract_id(link):
 """Extracts the search result id number from a search result link."""
 return urlparse(link).path.split("/")[-1]
class YellScraper:
 MAIN_URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=London&scrambleSeed=2082758402"
 EMAIL_SEND_URL = "https://www.yell.com/customerneeds/sendenquiry/sendtoone/"
 def __init__(self):
 self.session = requests.Session()
 self.session.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}
 def scrape(self):
 response = self.session.get(self.MAIN_URL).text
 tree = html.fromstring(response)
 for email_link in tree.xpath("//div[@class='businessCapsule--callToAction']//a[. = 'Email']/@href"):
 self.send_email(extract_id(email_link))
 def send_email(self, search_result_id):
 payload = {'message': 'Testing whether this email really works.', 'senderPostcode': 'GL51 0EX',
 'enquiryTimeframe': 'withinOneMonth', 'senderFirstName': 'mth', 'senderLastName': 'iqbal',
 'senderEmail': '[email protected]', 'senderEmailConfirm': '[email protected]',
 'uniqueAdId': search_result_id, 'channel': 'desktop', 'ccSender': 'true', 'marketing': 'on'}
 response = self.session.post(self.EMAIL_SEND_URL, data=payload)
 items = response.json()
 item = items['emailToCustomerUUID']
 print(item)
if __name__ == '__main__':
 scraper = YellScraper()
 scraper.scrape()
answered Jul 5, 2017 at 0:16
\$\endgroup\$
2
  • \$\begingroup\$ Thanks sir alecxe for your invaluable suggestions and dynamic code. The xpath you used in your script is overwhelming. I can't still figure out how this portion "[. = 'Email']" works. A one-liner explanation on how the xpath was built-upon will be very helpful for me to understand. Thanks again sir. \$\endgroup\$ Commented Jul 5, 2017 at 4:45
  • \$\begingroup\$ @SMth80 glad to help! The . generally refers to the string value of a current node, which for nodes with texts only is equivalent to text(). It is explained in a greater detail here. Thanks for an another interesting problem. \$\endgroup\$ Commented Jul 5, 2017 at 20:22

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.