I wrote a crawler that collects the status code of every page it visits. Below is my solution. Can this code be improved?
import urllib

def getfromurl(url):
    start = urllib.urlopen(url)
    raw = ''
    for lines in start.readlines():
        raw += lines
    start.close()
    return raw
def dumbwork(start_link, start_url, text, pattern, counter):
    if counter < 2:
        counter = counter + 1
        while start_link != -1:
            try:
                start_url = text.find('/', start_link)
                end_url = text.find('"', start_url + 1)
                url = 'http:/' + text[start_url + 1 : end_url]
                page_status = str(urllib.urlopen(url).getcode())
                row = url + ', ' + page_status
                t.write(row + '\n')
                temp = str(getfromurl(url))
                print row
                dumbwork(temp.find(pattern), 0, temp, pattern, counter)
                start_link = text.find(pattern, end_url + 1)
            except Exception, e:
                break
    else:
        pass

t = open('inout.txt', 'w')
text = str(getfromurl('http://www.site.it'))
pattern = '<a href="http:/'
start_link = text.find(pattern)
dumbwork(start_link, 0, text, pattern, 0)
t.close()
2 Answers
You're taking for granted that a link will look like '<a href="http:/', which is definitely not always the case. What about https://, for example, or something like '<a class="visited" href="http:/'? That's why you should use a library to parse the DOM instead of relying on raw text parsing.
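For example, even the standard library's HTMLParser pulls out every href, whatever the scheme and whatever other attributes sit before it. This is only a sketch (Python 2, to match your code; the LinkExtractor name is mine), and the second answer below shows the same idea with BeautifulSoup:

from HTMLParser import HTMLParser  # html.parser in Python 3

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so extra attributes
        # such as class="visited" don't get in the way
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a class="visited" href="https://www.site.it/page">link</a>')
# parser.links is now ['https://www.site.it/page']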
Naming:

- Usually a row is related to a database, while a line is related to a text file.
- temp means nothing; it's the new content, so you should use something like new_html_content.
- It takes a bit to understand that counter is actually the max depth that you want to follow, so why not call it depth?
- Function names should explain what they do; dumbwork doesn't, something like recurse_page may be better.
- start_link is good for the first link (almost, see below), but the parameter to the function is actually the current link being parsed, so why not call it current_link?
- You used snake case for start_link, so you should keep using it: get_from_url may be better than getfromurl.
- start_link, start_url and end_url are not links or urls; they're actually indices into the string, so they should be start_link_index, start_url_index and end_url_index.
- text is the HTML content, so just rename it to html_content.
The lines doing something with row should be next to each other, or better yet, in a separate function.

That 2 should be in a constant, so that the first line of the function can be something like if depth < MAX_DEPTH:
You're trapping exceptions but not doing anything with them; you should at least log somewhere what happened.
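A minimal sketch of what that could look like with the standard logging module (the file name and log level are assumptions, and url comes from your loop):

import logging
import urllib

logging.basicConfig(filename='crawler.log', level=logging.WARNING)

try:
    page_status = urllib.urlopen(url).getcode()
except IOError as e:
    # record which url failed and why, instead of silently breaking out
    logging.warning('could not fetch %s: %s', url, e)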
The text.find calls used to get the url are probably better off in a separate function, to improve readability, something like the sketch below.
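One possible shape for that helper, mirroring the find/slice logic of the original (the name get_url matches the loop sketch further down; in practice you would probably also return end_url_index so the caller can keep scanning):

def get_url(html_content, start_index):
    # Same find/slice logic as the original, just isolated and named.
    start_url_index = html_content.find('/', start_index)
    end_url_index = html_content.find('"', start_url_index + 1)
    return 'http:/' + html_content[start_url_index + 1:end_url_index]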
getfromurl already returns a string, no need for the str().
You're always using the same name for the file which, when opened with 'w', will overwrite the contents. You should at least check if the file already exists.

You're opening a file and leaving it open for the whole duration of the process. This is not bad in itself, but I'd probably put a function called append_to_file where I open the file with 'a' instead of 'w', write the line and immediately close it. Inside that function you would also convert the status code to a string.
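A sketch of that helper, keeping the hard-coded file name from the original (whether to check for an existing file first, as suggested above, is left to you):

def append_to_file(url, status_code):
    # 'a' appends instead of truncating, and the with block closes the
    # file right after writing. The status code is converted to a string
    # here, so callers can pass the int returned by getcode().
    with open('inout.txt', 'a') as output_file:
        output_file.write(url + ', ' + str(status_code) + '\n')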
In the end, your worker loop may look something like this:
def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        while current_link_index > -1:
            try:
                url = get_url(html_content, start_index)
                append_to_file(url, urllib.urlopen(url).getcode())
                new_html_content = get_from_url(url)
                recurse_page(new_html_content.find(pattern), 0, new_html_content, pattern, depth)
                current_link_index = html_content.find(pattern, end_url_index + 1)
            except Exception, e:
                # TODO: Proper error handling
                break
It's not complete code, but it should give you an idea of what I mean.
I would recommend using Requests and BeautifulSoup4 for this; it makes it a lot easier.
An example of what you want to do with these two modules:
import requests
from bs4 import BeautifulSoup

resp = requests.get("url")
soup = BeautifulSoup(resp.text, "html.parser")

for item in soup.find_all(href=True):
    link = item.get("href")
    # use link
BeautifulSoup has a lot of other useful searching functionality as well. For example, if you only wanted links in 'a' blocks, you could use:
for item in soup.find_all("a", href=True):
    link = item.get("href")
    # use link
To install these modules, run pip install requests and pip install beautifulsoup4 in your terminal.
Wouldn't return requests.head(url).status_code from the requests module do this for you? I usually use this module as it's straightforward and you get a lot fewer headaches than with urllib.
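For example, a small helper like this (a sketch; allow_redirects=False is my assumption, depending on whether you want the status of the url itself or of its final redirect target):

import requests

def get_status(url):
    # A HEAD request skips the response body, which is enough when you
    # only need the status code.
    return requests.head(url, allow_redirects=False).status_code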