I wrote a crawler that collects the status code of every page it visits. Below is my solution. Can this code be improved?
import urllib

def getfromurl(url):
    start = urllib.urlopen(url)
    raw = ''
    for lines in start.readlines():
        raw += lines
    start.close()
    return raw
def dumbwork(start_link, start_url, text, pattern, counter):
    if counter < 2:
        counter = counter + 1
        while start_link != -1:
            try:
                start_url = text.find('/', start_link)
                end_url = text.find('"', start_url + 1)
                url = 'http:/' + text[start_url + 1 : end_url]
                page_status = str(urllib.urlopen(url).getcode())
                row = url + ', ' + page_status
                t.write(row + '\n')
                temp = str(getfromurl(url))
                print row
                dumbwork(temp.find(pattern), 0, temp, pattern, counter)
                start_link = text.find(pattern, end_url + 1)
            except Exception, e:
                break
    else:
        pass

t = open('inout.txt', 'w')
text = str(getfromurl('http://www.site.it'))
pattern = '<a href="http:/'
start_link = text.find(pattern)
dumbwork(start_link, 0, text, pattern, 0)
t.close()
2 Answers
You're taking for granted that a link will look like '<a href="http:/', which is definitely not always the case. What about https://, for example, or something like '<a class="visited" href="http:/'? That's why you should use a library to parse the DOM instead of relying on raw text parsing.
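For example, even the standard library's HTMLParser pulls out every href, whatever the scheme and whatever other attributes sit before it. This is only a sketch (Python 2, to match your code; the LinkExtractor name is mine), and the second answer below shows the same idea with BeautifulSoup:

from HTMLParser import HTMLParser  # html.parser in Python 3

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so extra attributes
        # such as class="visited" don't get in the way
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a class="visited" href="https://www.site.it/page">link</a>')
# parser.links is now ['https://www.site.it/page']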
Naming:

- Usually a row is related to a database, while a line is related to a text file.
- temp means nothing; it's the new content, so you should use something like new_html_content.
- It takes a bit to understand that counter is actually the max depth that you want to follow, so why not call it depth?
- Function names should explain what they do; dumbwork doesn't, something like recurse_page may be better.
- start_link is good for the first link (almost, see below), but the parameter to the function is actually the current link being parsed, so why not call it current_link?
- You used snake case for start_link, so you should keep using it: get_from_url may be better than getfromurl.
- start_link, start_url and end_url are not links or urls; they're actually indices into the string, so they should be start_link_index, start_url_index and end_url_index.
- text is the HTML content, so just rename it to html_content.
The lines doing something with row should be next to each other, or better yet, in a separate function.

That 2 should be in a constant, so that the first line of the function can be something like if depth < MAX_DEPTH:
You're trapping exceptions but not doing anything with them; you should at least log somewhere what happened.
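A minimal sketch of what that could look like with the standard logging module (the file name and log level are assumptions, and url comes from your loop):

import logging
import urllib

logging.basicConfig(filename='crawler.log', level=logging.WARNING)

try:
    page_status = urllib.urlopen(url).getcode()
except IOError as e:
    # record which url failed and why, instead of silently breaking out
    logging.warning('could not fetch %s: %s', url, e)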
The text.find calls used to get the url are probably better off in a separate function, to improve readability, something like the sketch below.
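One possible shape for that helper, mirroring the find/slice logic of the original (the name get_url matches the loop sketch further down; in practice you would probably also return end_url_index so the caller can keep scanning):

def get_url(html_content, start_index):
    # Same find/slice logic as the original, just isolated and named.
    start_url_index = html_content.find('/', start_index)
    end_url_index = html_content.find('"', start_url_index + 1)
    return 'http:/' + html_content[start_url_index + 1:end_url_index]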
getfromurl already returns a string, no need for the str().
You're always using the same name for the file which, when opened with 'w', will overwrite the contents. You should at least check if the file already exists.

You're opening a file and leaving it open for the whole duration of the process. This is not bad in itself, but I'd probably put a function called append_to_file where I open the file with 'a' instead of 'w', write the line and immediately close it. Inside that function you would also convert the status code to a string.
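A sketch of that helper, keeping the hard-coded file name from the original (whether to check for an existing file first, as suggested above, is left to you):

def append_to_file(url, status_code):
    # 'a' appends instead of truncating, and the with block closes the
    # file right after writing. The status code is converted to a string
    # here, so callers can pass the int returned by getcode().
    with open('inout.txt', 'a') as output_file:
        output_file.write(url + ', ' + str(status_code) + '\n')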
In the end, your worker loop may look something like this:
def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        while current_link_index > -1:
            try:
                url = get_url(html_content, start_index)
                append_to_file(url, urllib.urlopen(url).getcode())
                new_html_content = get_from_url(url)
                recurse_page(new_html_content.find(pattern), 0, new_html_content, pattern, depth)
                current_link_index = html_content.find(pattern, end_url_index + 1)
            except Exception, e:
                # TODO: Proper error handling
                break
It's not complete code, but it should give you an idea of what I mean.
I would recommend using Requests and BeautifulSoup4 for this; it makes it a lot easier.
An example of what you want to do with these two modules:
import requests
from bs4 import BeautifulSoup

resp = requests.get("url")
soup = BeautifulSoup(resp.text, "html.parser")

for item in soup.find_all(href=True):
    link = item.get("href")
    # use link
BeautifulSoup has a lot of other useful searching functionality as well. For example, if you only wanted links in 'a' blocks, you could use:
for item in soup.find_all("a", href=True):
    link = item.get("href")
    # use link
To install these modules, run pip install requests and pip install beautifulsoup4 in your terminal.
Wouldn't return requests.head(url).status_code from the requests module do this for you? I usually use this module as it's straightforward and you get a lot fewer headaches than with urllib.
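For example, a small helper like this (a sketch; allow_redirects=False is my assumption, depending on whether you want the status of the url itself or of its final redirect target):

import requests

def get_status(url):
    # A HEAD request skips the response body, which is enough when you
    # only need the status code.
    return requests.head(url, allow_redirects=False).status_code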