\$\begingroup\$

I've written a Python script to scrape the link behind "Contact us" or "About us" from a few webpages. The challenge is to prioritize "Contact us" over "About us": if a site contains both, the scraper should pick the "Contact us" link, and only if there is no "Contact us" link should it fall back to the "About us" link. My first attempt used the condition if "contact" in item.text.lower() or "about" in item.text.lower(), but for the links below the scraper always picked the "About us" link, even though "Contact us" is my first priority. I then rewrote it with the following approach (using two for loops), and it works.

This is what I've tried to get the links complying with the above criteria:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = (
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "http://www.innovaprint.com.sg/",
    "http://www.cityscape.com.sg/"
)

def Get_Link(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    # First pass: look for a "contact" link and stop at the first match.
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            abslink = urljoin(site, item['href'])
            print(abslink)
            return
    # Second pass: only reached when no "contact" link was found.
    for item in soup.select("a[href]"):
        if "about" in item.text.lower():
            abslink = urljoin(site, item['href'])
            print(abslink)
            return

if __name__ == '__main__':
    for link in links:
        Get_Link(link)

The two for loops in the function above look awkward, so I suspect there is a better way to do the same thing. Thanks in advance for any improvement to this code.

rolfl
asked Apr 23, 2018 at 22:44
\$\endgroup\$

1 Answer

\$\begingroup\$

You're right that the two for loops are possibly overkill, though it's not all that bad: how big are the documents, really? The alternative is to track a default "about" link and to override it if a "contact" link appears. I'll explain that later, but first I should mention that the function should really return a value, not just print it inside the loops.

Having said that, consider the loop:

# Body of the function, after soup has been built:
link = None
for item in soup.select("a[href]"):
    if "contact" in item.text.lower():
        link = urljoin(site, item['href'])
        # terminate the loop, no need to look further.
        break
    # Remember only the first "about" link as the fallback.
    if link is None and "about" in item.text.lower():
        link = urljoin(site, item['href'])
return link
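
If more than two keywords ever need ranking, the same single-pass idea can be generalised. The following is only a sketch under my own assumptions: the helper name pick_link, its keywords parameter, and the (text, href) pair format are hypothetical and not part of your code. It keeps the best-ranked match seen so far and stops as soon as the top-priority keyword is found.

```python
from urllib.parse import urljoin


def pick_link(anchors, site, keywords=("contact", "about")):
    """Return the absolute URL of the best-ranked matching anchor, or None.

    anchors  -- iterable of (link text, href) pairs (hypothetical input format)
    keywords -- substrings to look for, highest priority first
    """
    best_rank = len(keywords)  # sentinel: nothing matched yet
    best_href = None
    for text, href in anchors:
        lowered = text.lower()
        # Only keywords ranked better than the current best are worth testing,
        # so the first match for a given keyword is the one that is kept.
        for rank, keyword in enumerate(keywords[:best_rank]):
            if keyword in lowered:
                best_rank, best_href = rank, href
                break
        if best_rank == 0:
            # Top-priority keyword found, no need to look further.
            break
    if best_href is None:
        return None
    return urljoin(site, best_href)
```

With BeautifulSoup, the pairs could be produced with a generator such as ((a.text, a["href"]) for a in soup.select("a[href]")), which keeps the selection logic independent of the network request and easy to test on its own.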
answered Apr 24, 2018 at 14:00
\$\endgroup\$