4
\$\begingroup\$

I'm just getting my feet on the ground with Python (maybe 3 days now). The only issue with teaching my self is that I have no one to critique my work.

Correct me if I'm wrong but I think my algorithm/method for solving this problem is quite promising; but, the code not so much.

The program basically strips a web-page and puts it in to the designated directory. My favorite part is the method for deciding the image's extensions :)

import os, re, urllib
def Read(url):
 uopen = urllib.urlopen(url)
 uread = ''
 for line in uopen: uread += line
 return uread
def Find(what, where):
 found = re.findall(what, where)
 match = []
 for i in found:
 i = i[9:-1]
 match.append(i)
 return match 
def Retrieve(urls, loc):
 loc = os.path.realpath(loc)
 os.system('mkdir ' + loc +'/picretrieved')
 loc += '/picretrieved'
 x = 0
 exts = ['.jpeg', '.jpg', '.gif', '.png']
 for url in urls:
 x += 1
 for i in exts:
 ext = re.search(i, url)
 if ext != None:
 ext = ext.group()
 urllib.urlretrieve(url, loc + '/' + str(x) + ext)
 else:
 continue
 print 'Placed', str(x), 'pictures in:', loc
def main():
 url = raw_input('URL to PicRetrieve (google.com): ')
 url = 'http://' + url
 loc = raw_input('Location for PicRetrieve older ("." for here): ')
 html = Read(url)
 urls = Find('img src=".*?"', html)
 print urls
 Retrieve(urls, loc)
if __name__ == '__main__':
 main()
200_success
145k22 gold badges190 silver badges478 bronze badges
asked Apr 11, 2012 at 22:36
\$\endgroup\$
0

1 Answer 1

3
\$\begingroup\$
import os, re, urllib
def Read(url):

Python convention is to name function lowercase_with_underscores

 uopen = urllib.urlopen(url)

I recommend against abbreviations

 uread = ''
 for line in uopen: uread += line
 return uread

Use uopen.read() it'll read the whole file

def Find(what, where):
 found = re.findall(what, where)
 match = []
 for i in found:

avoid single letter variable names

 i = i[9:-1]
 match.append(i)
 return match 

Combine short lines of code

 for i in re.findall(what, where):
 match.append(i[9:-1])

I think its cleaner this way. Also use capturing groups rather then indexes.

def Retrieve(urls, loc):
 loc = os.path.realpath(loc)
 os.system('mkdir ' + loc +'/picretrieved')

Use os.mkdir rather than using system

 loc += '/picretrieved'

Use os.path.join to construct paths

 x = 0
 exts = ['.jpeg', '.jpg', '.gif', '.png']
 for url in urls:
 x += 1

use for x, url in enumerate(urls):

 for i in exts:
 ext = re.search(i, url)

Don't use regular expressions to search for simple strings. In this case I think you really want to use url.endswith(i). Also, stop using i everywhere.

 if ext != None:
 ext = ext.group()
 urllib.urlretrieve(url, loc + '/' + str(x) + ext)
 else:
 continue

This continue does nothing

 print 'Placed', str(x), 'pictures in:', loc

You don't need to use str when you are using print. It does it automatically.

def main():
 url = raw_input('URL to PicRetrieve (google.com): ')
 url = 'http://' + url
 loc = raw_input('Location for PicRetrieve older ("." for here): ')
 html = Read(url)
 urls = Find('img src=".*?"', html)

Generally, using an html parser is preferred to using a regex on html. In this simple case you are ok.

 print urls
 Retrieve(urls, loc)
if __name__ == '__main__':
 main()
answered Apr 12, 2012 at 0:29
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.