I'm just getting my feet on the ground with Python (maybe 3 days now). The only issue with teaching my self is that I have no one to critique my work.
Correct me if I'm wrong but I think my algorithm/method for solving this problem is quite promising; but, the code not so much.
The program basically strips a web-page and puts it in to the designated directory. My favorite part is the method for deciding the image's extensions :)
import os, re, urllib
def Read(url):
uopen = urllib.urlopen(url)
uread = ''
for line in uopen: uread += line
return uread
def Find(what, where):
found = re.findall(what, where)
match = []
for i in found:
i = i[9:-1]
match.append(i)
return match
def Retrieve(urls, loc):
loc = os.path.realpath(loc)
os.system('mkdir ' + loc +'/picretrieved')
loc += '/picretrieved'
x = 0
exts = ['.jpeg', '.jpg', '.gif', '.png']
for url in urls:
x += 1
for i in exts:
ext = re.search(i, url)
if ext != None:
ext = ext.group()
urllib.urlretrieve(url, loc + '/' + str(x) + ext)
else:
continue
print 'Placed', str(x), 'pictures in:', loc
def main():
url = raw_input('URL to PicRetrieve (google.com): ')
url = 'http://' + url
loc = raw_input('Location for PicRetrieve older ("." for here): ')
html = Read(url)
urls = Find('img src=".*?"', html)
print urls
Retrieve(urls, loc)
if __name__ == '__main__':
main()
1 Answer 1
import os, re, urllib
def Read(url):
Python convention is to name function lowercase_with_underscores
uopen = urllib.urlopen(url)
I recommend against abbreviations
uread = ''
for line in uopen: uread += line
return uread
Use uopen.read()
it'll read the whole file
def Find(what, where):
found = re.findall(what, where)
match = []
for i in found:
avoid single letter variable names
i = i[9:-1]
match.append(i)
return match
Combine short lines of code
for i in re.findall(what, where):
match.append(i[9:-1])
I think its cleaner this way. Also use capturing groups rather then indexes.
def Retrieve(urls, loc):
loc = os.path.realpath(loc)
os.system('mkdir ' + loc +'/picretrieved')
Use os.mkdir
rather than using system
loc += '/picretrieved'
Use os.path.join to construct paths
x = 0
exts = ['.jpeg', '.jpg', '.gif', '.png']
for url in urls:
x += 1
use for x, url in enumerate(urls):
for i in exts:
ext = re.search(i, url)
Don't use regular expressions to search for simple strings. In this case I think you really want to use url.endswith(i)
. Also, stop using i
everywhere.
if ext != None:
ext = ext.group()
urllib.urlretrieve(url, loc + '/' + str(x) + ext)
else:
continue
This continue does nothing
print 'Placed', str(x), 'pictures in:', loc
You don't need to use str when you are using print. It does it automatically.
def main():
url = raw_input('URL to PicRetrieve (google.com): ')
url = 'http://' + url
loc = raw_input('Location for PicRetrieve older ("." for here): ')
html = Read(url)
urls = Find('img src=".*?"', html)
Generally, using an html parser is preferred to using a regex on html. In this simple case you are ok.
print urls
Retrieve(urls, loc)
if __name__ == '__main__':
main()