I've written a script in python with POST request which is able to send email using send button and then catch the response of that email and finally parse the ID from that. I don't know whether the way I did this is the ideal one but it does parse the email ID from this process. There are 4 email sending buttons available in that webpage and my script is able to scrape them all. Here is what I've tried so far with:
import requests
from lxml import html
main_url = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=London&scrambleSeed=2082758402"
def main_page(url):
response = requests.get(url, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}).text
tree = html.fromstring(response)
for titles in tree.xpath("//div[@class='businessCapsule--callToAction']"):
title = titles.xpath('.//a[contains(@class,"btn-blue")]/@href')[0] if len(titles.xpath('.//a[contains(@class,"btn-blue")]/@href'))>0 else ""
process_page(title.replace("/customerneeds/sendenquiry/sendtoone/","").replace("?searchedLocation=London",""))
def process_page(number):
link = "https://www.yell.com/customerneeds/sendenquiry/sendtoone/"
payload = {'message':'Testing whether this email really works.','senderPostcode':'GL51 0EX','enquiryTimeframe':'withinOneMonth','senderFirstName':'mth','senderLastName':'iqbal','senderEmail':'[email protected]','senderEmailConfirm':'[email protected]','uniqueAdId':number,'channel':'desktop','ccSender':'true','marketing':'on'}
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}
response = requests.post(link, data = payload, headers = headers)
items = response.json()
item = items['emailToCustomerUUID']
print(item)
main_page(main_url)
These are the four email IDs I've parsed from that webpage:
1. 468145be-0ac3-4ff0-adf5-aceed10b7e5e
2. f750fa18-72c9-44d1-afaa-475778dd8b47
3. d104eb93-1f35-4bdc-ad67-47b13ea67e55
4. 90d2db3b-6144-4266-a86c-aa39b4f99d9a
Post Script: I've used my fake email address and put it in this script so that you can test the result without bringing any change to it.
1 Answer 1
Here are the few improvements I would apply:
- match the "Email" links directly via
//div[@class='businessCapsule--callToAction']//a[. = 'Email']/@href
XPath expression - note how I'm matching the link byEmail
text - use
urlparse
to get the "id" number and extract that logic into a separate function - make a "scraper" class to share the same web-scraping session and persist things like headers to avoid repeating them every time you make a request
- a better name for
process_page
would probably besend_email
Improved Code:
from urllib.parse import urlparse
import requests
from lxml import html
def extract_id(link):
"""Extracts the search result id number from a search result link."""
return urlparse(link).path.split("/")[-1]
class YellScraper:
MAIN_URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=London&scrambleSeed=2082758402"
EMAIL_SEND_URL = "https://www.yell.com/customerneeds/sendenquiry/sendtoone/"
def __init__(self):
self.session = requests.Session()
self.session.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}
def scrape(self):
response = self.session.get(self.MAIN_URL).text
tree = html.fromstring(response)
for email_link in tree.xpath("//div[@class='businessCapsule--callToAction']//a[. = 'Email']/@href"):
self.send_email(extract_id(email_link))
def send_email(self, search_result_id):
payload = {'message': 'Testing whether this email really works.', 'senderPostcode': 'GL51 0EX',
'enquiryTimeframe': 'withinOneMonth', 'senderFirstName': 'mth', 'senderLastName': 'iqbal',
'senderEmail': '[email protected]', 'senderEmailConfirm': '[email protected]',
'uniqueAdId': search_result_id, 'channel': 'desktop', 'ccSender': 'true', 'marketing': 'on'}
response = self.session.post(self.EMAIL_SEND_URL, data=payload)
items = response.json()
item = items['emailToCustomerUUID']
print(item)
if __name__ == '__main__':
scraper = YellScraper()
scraper.scrape()
-
\$\begingroup\$ Thanks sir alecxe for your invaluable suggestions and dynamic code. The xpath you used in your script is overwhelming. I can't still figure out how this portion "[. = 'Email']" works. A one-liner explanation on how the xpath was built-upon will be very helpful for me to understand. Thanks again sir. \$\endgroup\$SIM– SIM2017年07月05日 04:45:47 +00:00Commented Jul 5, 2017 at 4:45
-
\$\begingroup\$ @SMth80 glad to help! The
.
generally refers to the string value of a current node, which for nodes with texts only is equivalent totext()
. It is explained in a greater detail here. Thanks for an another interesting problem. \$\endgroup\$alecxe– alecxe2017年07月05日 20:22:49 +00:00Commented Jul 5, 2017 at 20:22
Explore related questions
See similar questions with these tags.