Tree traverse in Python

Question

I'm trying to write a Python script to find the non-responsive links on a web page. While trying, I found that Python doesn't seem to support nodes with multiple children. Is that true, or is there a way to access multiple child nodes?

Below is my code snippet:

import httplib2
import requests
from bs4 import BeautifulSoup, SoupStrainer
status = {}
response = {}
output = {}
def get_url_status(url, count):
    global links
    links = []
    print(url)
    print(count)
    if count == 0:
        return output
    else:
        # if url not in output.keys():
        headers = requests.utils.default_headers()
        req = requests.get(url, headers)
        if('200' in str(req)):
            # if url not in output.keys():
            output[url] = '200';
            for link in BeautifulSoup(req.content, parse_only=SoupStrainer('a')):
                if 'href' in str(link):
                    links.append(link.get('href'))
            # removing other non-mandatory links
            for link in links[:]:
                if "mi" not in link:
                    links.remove(link)
            # removing same url
            for link in links[:]:
                if link.rstrip('/') == url:
                    links.remove(link)
            # removing duplicate links
            links = list(dict.fromkeys(links))
            if len(links) > 0:
                for urllink in links:
                    return get_url_status(urllink, count-1)
result = get_url_status('https://www.mi.com/in', 5)
print(result)

With this code, only the leftmost child of each node is traversed and the rest are skipped.

The output is unsatisfactory: it contains far fewer links than the page really has.

{'https://www.mi.com/in': '200', 'https://in.c.mi.com/': '200', 'https://in.c.mi.com/index.php': '200', 'https://in.c.mi.com/global/': '200', 'https://c.mi.com/index.php': '200'}

I know I'm falling short in several places, but I've never done something of this scale and this is my first time, so please excuse me if this is a novice question.

Note: I've used mi.com just for reference.

Answer

At a glance, there's one obvious problem.

if len(links) > 0:
    for urllink in links:
        return get_url_status(urllink, count-1)

This loop never gets past the first link. The return inside the loop body exits the function during the first iteration, so only the first item in links is ever followed. There is a second bug: if the function reaches a page with no links before count reaches 0, it falls off the end and returns None instead of output. Do the following instead:

if len(links):
    for urllink in links:
        get_url_status(urllink, count-1)
return output
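The effect of the early return is easier to see on a toy tree than on a live site. The sketch below is only an illustration: the tree dictionary and both function names are made up for this example and are not part of the original script.

```python
# Hypothetical in-memory tree: each key maps to its list of children.
tree = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}


def visit_with_return(node, seen):
    """Buggy pattern: the return ends the loop on the first child."""
    seen.append(node)
    for child in tree.get(node, []):
        return visit_with_return(child, seen)
    return seen


def visit_all(node, seen):
    """Fixed pattern: recurse into every child, return after the loop."""
    seen.append(node)
    for child in tree.get(node, []):
        visit_all(child, seen)
    return seen


print(visit_with_return("root", []))  # ['root', 'a', 'a1']
print(visit_all("root", []))          # ['root', 'a', 'a1', 'a2', 'b', 'b1']
```

The first function walks only the leftmost branch, which matches the short output in the question. The second visits every node in depth-first order.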

Also, if('200' in str(req)) is not the right way to check the status code. str(req) is just the text "<Response [200]>", so the test is a substring match on that representation rather than a comparison against the status code itself. It should be if req.status_code == 200.

Another thing is that the function only adds responsive links to output. If you want to check for non-responsive links, don't you have to add links that do not return the 200 status code?
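One way to record non-responsive links as well is sketched below. It is a rough illustration under a few assumptions: record_status, FakeResponse and fake_fetch are hypothetical names introduced here, and the fetcher is passed in as a parameter so the example runs without network access.

```python
def record_status(url, fetch, output):
    """Fetch url and store its status code, or an error label, in output."""
    try:
        response = fetch(url)
    except Exception as exc:
        # With requests, catching requests.RequestException would be narrower.
        output[url] = "error: " + type(exc).__name__
        return None
    output[url] = response.status_code
    return response


class FakeResponse:
    """Hypothetical stand-in for a response object with a status_code."""

    def __init__(self, status_code):
        self.status_code = status_code


def fake_fetch(url):
    """Hypothetical fetcher that simulates one dead host and one 404."""
    if url.endswith("/down"):
        raise ConnectionError("no route to host")
    return FakeResponse(404 if url.endswith("/missing") else 200)


output = {}
for url in ["https://example.com/",
            "https://example.com/missing",
            "https://example.com/down"]:
    record_status(url, fake_fetch, output)
print(output)
```

In a real script the fetcher could be something like lambda u: requests.get(u, timeout=10). Every URL then ends up in output with either its numeric status code or an error label, so broken links can be filtered out afterwards.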

import requests
from bs4 import BeautifulSoup, SoupStrainer

output = {}

def get_url_status(url, count):
    headers = requests.utils.default_headers()
    # headers must be passed by keyword; the second positional argument is params
    req = requests.get(url, headers=headers)
    if req.status_code == 200:
        output[url] = '200'
        if count == 0:
            return output
        # the parser name is the second positional argument of BeautifulSoup
        soup = BeautifulSoup(req.content, "html.parser", parse_only=SoupStrainer('a'))
        links = [a['href'] for a in soup.find_all('a', href=True)]
        # keep only links that mention "mi"; building a new list avoids
        # removing items from a list while iterating over it
        links = [link for link in links if "mi" in link]
        # drop links that point back to the current page
        links = [link for link in links if link.rstrip('/') != url]
        # remove duplicate links while preserving order
        links = list(dict.fromkeys(links))
        print(links)
        for urllink in links:
            get_url_status(urllink, count-1)
    return output

result = get_url_status('https://www.mi.com/in', 1)
print(result)
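A crawl like this can also stop early when a page contains relative hrefs such as /in/phone, because requests.get raises an error for URLs without a scheme. The sketch below shows one possible way to clean the link list before recursing, using only the standard library. The function name normalize_links and its domain parameter are assumptions made for this example, not part of the original code.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(page_url, hrefs, domain="mi.com"):
    """Return absolute, de-duplicated http(s) links that belong to domain."""
    seen = {}  # a dict keeps insertion order and drops duplicates
    for href in hrefs:
        # Resolve relative links against the page and strip any #fragment.
        absolute = urldefrag(urljoin(page_url, href)).url
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar
        host = parts.hostname or ""
        if host != domain and not host.endswith("." + domain):
            continue  # a stricter test than a plain "mi" substring check
        if absolute.rstrip("/") == page_url.rstrip("/"):
            continue  # link points back to the current page
        seen[absolute] = None
    return list(seen)


hrefs = ["/in/phone", "https://in.c.mi.com/", "mailto:x@mi.com",
         "#top", "https://example.com/mi", "/in/phone"]
print(normalize_links("https://www.mi.com/in", hrefs))
# ['https://www.mi.com/in/phone', 'https://in.c.mi.com/']
```

Matching on the hostname rather than on the substring "mi" avoids following unrelated sites whose URLs happen to contain those letters.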

Comment 1

Thanks @Hurried-Helpful. I'll try this and let you know. And yes, I'll add the links with other response codes as well; I left that out of the snippet because the code was already long. I think I can achieve it by just adding else: output[url] = req.status_code

Comment 2

Hi @Hurried-Helpful, after the modifications it no longer recurses and executes only once. Is there a way to share the code and output?

Hurried-Helpful (2,020 reputation, 9 silver badges, 15 bronze badges) · Accepted Answer · 2020-01-07 09:04:23Z

