I've written a script in Python to scrape the link within "contact us" or "about us" from a few webpages. The challenge here is to prioritize "contact us" over "about us". For example, if a site contains both of them, my scraper should pick the link within "contact us"; only if "contact us" is not present should the scraper go on to parse the link within "about us". My first attempt used the logic if "contact" in item.text.lower() or "about" in item.text.lower(), but I noticed that with the links below the scraper picks the link within "about us" every time, whereas my first priority is to get the link within "contact us". I then rewrote it with the following approach (using two for loops to get the job done) and found it working.
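For reference, my first attempt was roughly along these lines (a simplified sketch; the name get_link_first_try is just for illustration): a single loop that returns the first anchor matching either keyword, so whichever link happens to appear first in the page wins, with no priority given to "contact us".

def get_link_first_try(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("a[href]"):
        # returns on the first match of either keyword, so an "about us" link
        # that appears earlier in the markup beats a later "contact us" link
        if "contact" in item.text.lower() or "about" in item.text.lower():
            print(urljoin(site, item['href']))
            return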
This is what I've tried to get the links complying with the above criteria:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = (
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "http://www.innovaprint.com.sg/",
    "http://www.cityscape.com.sg/"
)

def Get_Link(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            abslink = urljoin(site, item['href'])
            print(abslink)
            return
    for item in soup.select("a[href]"):
        if "about" in item.text.lower():
            abslink = urljoin(site, item['href'])
            print(abslink)
            return

if __name__ == '__main__':
    for link in links:
        Get_Link(link)
The two for loops defined within the above function look awkward, so I suppose there is a better way to do the same. Thanks in advance for any improvement of this existing code.
1 Answer
You're right that the two for-loops are possibly overkill, though it's not all that bad; how big are the documents really? The alternative is to track a default "about" link and, if a "contact" link appears, to override it. I'll get to that in a moment, but first I should mention that the function should really return a value, not just print it inside the loops.
Having said that, consider the loop:
link = None
for item in soup.select("a[href]"):
    if "contact" in item.text.lower():
        link = urljoin(site, item['href'])
        # terminate the loop, no need to look further.
        break
    if link is None and "about" in item.text.lower():
        # remember the first "about" link only as a fallback, but keep
        # looking in case a "contact" link turns up later.
        link = urljoin(site, item['href'])
return link
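Putting it together, the whole function might look something like the sketch below; it keeps your Get_Link name and the links tuple, but returns the link to the caller instead of printing inside the function.

def Get_Link(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    link = None
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            # a "contact" link always wins, so stop searching
            link = urljoin(site, item['href'])
            break
        if link is None and "about" in item.text.lower():
            # keep the first "about" link only as a fallback
            link = urljoin(site, item['href'])
    return link

if __name__ == '__main__':
    for site in links:
        print(Get_Link(site))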