Saving images from a web page

Question 1

I'm just getting my feet on the ground with Python (maybe 3 days now). The only issue with teaching my self is that I have no one to critique my work.

Correct me if I'm wrong but I think my algorithm/method for solving this problem is quite promising; but, the code not so much.

The program basically strips a web-page and puts it in to the designated directory. My favorite part is the method for deciding the image's extensions :)

import os, re, urllib
def Read(url):
 uopen = urllib.urlopen(url)
 uread = ''
 for line in uopen: uread += line
 return uread
def Find(what, where):
 found = re.findall(what, where)
 match = []
 for i in found:
 i = i[9:-1]
 match.append(i)
 return match 
def Retrieve(urls, loc):
 loc = os.path.realpath(loc)
 os.system('mkdir ' + loc +'/picretrieved')
 loc += '/picretrieved'
 x = 0
 exts = ['.jpeg', '.jpg', '.gif', '.png']
 for url in urls:
 x += 1
 for i in exts:
 ext = re.search(i, url)
 if ext != None:
 ext = ext.group()
 urllib.urlretrieve(url, loc + '/' + str(x) + ext)
 else:
 continue
 print 'Placed', str(x), 'pictures in:', loc
def main():
 url = raw_input('URL to PicRetrieve (google.com): ')
 url = 'http://' + url
 loc = raw_input('Location for PicRetrieve older ("." for here): ')
 html = Read(url)
 urls = Find('img src=".*?"', html)
 print urls
 Retrieve(urls, loc)
if __name__ == '__main__':
 main()

Question 2

import os, re, urllib
def Read(url):

Python convention is to name function lowercase_with_underscores

 uopen = urllib.urlopen(url)

I recommend against abbreviations

 uread = ''
 for line in uopen: uread += line
 return uread

Use uopen.read() it'll read the whole file

def Find(what, where):
 found = re.findall(what, where)
 match = []
 for i in found:

avoid single letter variable names

 i = i[9:-1]
 match.append(i)
 return match

Combine short lines of code

 for i in re.findall(what, where):
 match.append(i[9:-1])

I think its cleaner this way. Also use capturing groups rather then indexes.

def Retrieve(urls, loc):
 loc = os.path.realpath(loc)
 os.system('mkdir ' + loc +'/picretrieved')

Use os.mkdir rather than using system

 loc += '/picretrieved'

Use os.path.join to construct paths

 x = 0
 exts = ['.jpeg', '.jpg', '.gif', '.png']
 for url in urls:
 x += 1

use for x, url in enumerate(urls):

 for i in exts:
 ext = re.search(i, url)

Don't use regular expressions to search for simple strings. In this case I think you really want to use url.endswith(i). Also, stop using i everywhere.

 if ext != None:
 ext = ext.group()
 urllib.urlretrieve(url, loc + '/' + str(x) + ext)
 else:
 continue

This continue does nothing

 print 'Placed', str(x), 'pictures in:', loc

You don't need to use str when you are using print. It does it automatically.

def main():
 url = raw_input('URL to PicRetrieve (google.com): ')
 url = 'http://' + url
 loc = raw_input('Location for PicRetrieve older ("." for here): ')
 html = Read(url)
 urls = Find('img src=".*?"', html)

Generally, using an html parser is preferred to using a regex on html. In this simple case you are ok.

 print urls
 Retrieve(urls, loc)
if __name__ == '__main__':
 main()

Winston Ewert Winston Ewert 30.7k4 gold badges52 silver badges79 bronze badges · Answer 1 · 2012-04-12 00:29:36Z

import os, re, urllib
def Read(url):

Python convention is to name function lowercase_with_underscores

 uopen = urllib.urlopen(url)

I recommend against abbreviations

 uread = ''
 for line in uopen: uread += line
 return uread

Use uopen.read() it'll read the whole file

def Find(what, where):
 found = re.findall(what, where)
 match = []
 for i in found:

avoid single letter variable names

 i = i[9:-1]
 match.append(i)
 return match

Combine short lines of code

 for i in re.findall(what, where):
 match.append(i[9:-1])

I think its cleaner this way. Also use capturing groups rather then indexes.

def Retrieve(urls, loc):
 loc = os.path.realpath(loc)
 os.system('mkdir ' + loc +'/picretrieved')

Use os.mkdir rather than using system

 loc += '/picretrieved'

Use os.path.join to construct paths

 x = 0
 exts = ['.jpeg', '.jpg', '.gif', '.png']
 for url in urls:
 x += 1

use for x, url in enumerate(urls):

 for i in exts:
 ext = re.search(i, url)

Don't use regular expressions to search for simple strings. In this case I think you really want to use url.endswith(i). Also, stop using i everywhere.

 if ext != None:
 ext = ext.group()
 urllib.urlretrieve(url, loc + '/' + str(x) + ext)
 else:
 continue

This continue does nothing

 print 'Placed', str(x), 'pictures in:', loc

You don't need to use str when you are using print. It does it automatically.

def main():
 url = raw_input('URL to PicRetrieve (google.com): ')
 url = 'http://' + url
 loc = raw_input('Location for PicRetrieve older ("." for here): ')
 html = Read(url)
 urls = Find('img src=".*?"', html)

Generally, using an html parser is preferred to using a regex on html. In this simple case you are ok.

 print urls
 Retrieve(urls, loc)
if __name__ == '__main__':
 main()

Stack Exchange Network

Saving images from a web page

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Saving images from a web page

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions