0

I'm trying to write a script to find out the non-responsive links of a web-page in python. While trying, i find out that python doesn't support multi child nodes. Is it true? or we can access the multi child nodes.

Below is my code snippet:

import httplib2
import requests
from bs4 import BeautifulSoup, SoupStrainer
status = {}
response = {}
output = {}
def get_url_status(url, count):
 global links
 links = []
 print(url)
 print(count)
 if count == 0:
 return output
 else:
 # if url not in output.keys():
 headers = requests.utils.default_headers()
 req = requests.get(url, headers)
 if('200' in str(req)):
 # if url not in output.keys():
 output[url] = '200';
 for link in BeautifulSoup(req.content, parse_only=SoupStrainer('a')):
 if 'href' in str(link):
 links.append(link.get('href'))
 # removing other non-mandotary links
 for link in links[:]:
 if "mi" not in link:
 links.remove(link)
 # removing same url
 for link in links[:]:
 if link.rstrip('/') == url:
 links.remove(link)
 # removing duplicate links
 links = list(dict.fromkeys(links))
 if len(links) > 0:
 for urllink in links:
 return get_url_status(urllink, count-1)
result = get_url_status('https://www.mi.com/in', 5)
print(result)

In this code it's only traversing to only the left nodes of the child and skipping rest. something like this. enter image description here

And the output is not satisfactory and very very less compared to real.

{'https://www.mi.com/in': '200', 'https://in.c.mi.com/': '200', 'https://in.c.mi.com/index.php': '200', 'https://in.c.mi.com/global/': '200', 'https://c.mi.com/index.php': '200'}

I know, i'm lacking at multiple locations but i've never done something of this scale and this is my first time. So please excuse if this is a novice question.

Note: I've used mi.com just for the reference.

asked Jan 7, 2020 at 7:30

1 Answer 1

1

At a glance, there's one obvious problem.

if len(links) > 0:
 for urllink in links:
 return get_url_status(urllink, count-1)

This snippet does not iterate over links. It has return in its iterative body which means it will only run for the first item in links, and immediately return it. There is another bug. The function returns just None instead of output if it encounters a page with no links before count reaches 0. Do the following instead.

if len(links):
 for urllink in links:
 get_url_status(urllink, count-1)
return output

And if('200' in str(req)) is not the right way to check the status code. It will check for a substring '200' in the body, instead of only checking the status code. It should be if req.status_code == 200.

Another thing is that the function only adds responsive links to output. If you want to check for non-responsive links, don't you have to add links that do not return the 200 status code?

import requests
from bs4 import BeautifulSoup, SoupStrainer
status = {}
response = {}
output = {}
def get_url_status(url, count):
 global links
 links = []
 # if url not in output.keys():
 headers = requests.utils.default_headers()
 req = requests.get(url, headers)
 if req.status_code == 200:
 # if url not in output.keys():
 output[url] = '200'
 if count == 0:
 return output
 for link in BeautifulSoup(req.content, parse_only=SoupStrainer('a'), parser="html.parser"):
 if 'href' in str(link):
 links.append(link.get('href'))
 # removing other non-mandotary links
 for link in links:
 if "mi" not in link:
 links.remove(link)
 # removing same url
 for link in links:
 if link.rstrip('/') == url:
 links.remove(link)
 # removing duplicate links
 links = list(dict.fromkeys(links))
 print(links)
 if len(links):
 for urllink in links:
 get_url_status(urllink, count-1)
 return output
result = get_url_status('https://www.mi.com/in', 1)
print(result)
answered Jan 7, 2020 at 9:04
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks @Hurried-Helpful. Will try this and let you know. And yes, i'll add the links for other response code as well. I didn't add this in snippet as the code was already long. I think i'll achieve this by just adding else: output[url] = req.status_code;.
Hi @Hurried-Helpful, After modifications it's not reiterating and only executing for once. Is there a way to share the code and output.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.