I've written a script in Python to scrape the link within "contact us" or "about us" from a few webpages. The challenge here is to prioritize "contact us" over "about us". For example, if a site contains both of them, my scraper should pick the link within "contact us"; only if "contact us" is not present should the scraper go on to parse the link within "about us". My first attempt used the logic if "contact" in item.text.lower() or "about" in item.text.lower(), but I noticed that with the links below the scraper picks the link within "about us" every time, whereas my first priority is to get the link within "contact us". I then rewrote it with the following approach (using two for loops to get the job done) and found it working.
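For reference, my first attempt was roughly along these lines (a simplified sketch; the name get_link_first_try is just for illustration): a single loop that returns the first anchor matching either keyword, so whichever link happens to appear first in the page wins, with no priority given to "contact us".

def get_link_first_try(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("a[href]"):
        # returns on the first match of either keyword, so an "about us" link
        # that appears earlier in the markup beats a later "contact us" link
        if "contact" in item.text.lower() or "about" in item.text.lower():
            print(urljoin(site, item['href']))
            return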
This is what I've tried to get the links complying with the above criteria:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = (
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "http://www.innovaprint.com.sg/",
    "http://www.cityscape.com.sg/"
)

def Get_Link(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            abslink = urljoin(site, item['href'])
            print(abslink)
            return
    for item in soup.select("a[href]"):
        if "about" in item.text.lower():
            abslink = urljoin(site, item['href'])
            print(abslink)
            return

if __name__ == '__main__':
    for link in links:
        Get_Link(link)
The two for loops defined within the above function look awkward, so I suppose there is a better way to do the same. Thanks in advance for any improvement of this existing code.
1 Answer
You're right that the two for-loops are possibly overkill, though it's not all that bad; how big are the documents really? The alternative is to track a default "about" link and, if a "contact" link appears, to override it. I'll get to that in a moment, but first I should mention that the function should really return a value, not just print it inside the loops.
Having said that, consider the loop:
link = None
for item in soup.select("a[href]"):
    if "contact" in item.text.lower():
        link = urljoin(site, item['href'])
        # terminate the loop, no need to look further.
        break
    if link is None and "about" in item.text.lower():
        # remember the first "about" link only as a fallback, but keep
        # looking in case a "contact" link turns up later.
        link = urljoin(site, item['href'])
return link
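Putting it together, the whole function might look something like the sketch below; it keeps your Get_Link name and the links tuple, but returns the link to the caller instead of printing inside the function.

def Get_Link(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    link = None
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            # a "contact" link always wins, so stop searching
            link = urljoin(site, item['href'])
            break
        if link is None and "about" in item.text.lower():
            # keep the first "about" link only as a fallback
            link = urljoin(site, item['href'])
    return link

if __name__ == '__main__':
    for site in links:
        print(Get_Link(site))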