Simple image scraping

Question 1

I wrote this code over the last few days and I learned a lot, and it works as I expect. However I suspect it's woefully inefficient:

import requests, bs4, os
os.chdir('../../Desktop')
wallpaperFolder = os.makedirs('Wallpapers', exist_ok=True)
pageCount = 1
GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
linkList = []
while not GalleryUrl.endswith('page=3'):
 # Download the page
 GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
 print('Opening gallary... %s' % (GalleryUrl))
 res = requests.get(GalleryUrl)
 res.raise_for_status()
 soup = bs4.BeautifulSoup(res.text, 'html.parser')
 # find urls to image page
 section = soup.find_all('div', attrs={'class': 'boxgrid'})
 for images in section:
 imageLink = images.find_all('a')
 for a in imageLink:
 imageUrl = a.get('href')
 linkList.append(imageUrl)
 print('Image found at... ' + imageUrl)
 linkCount = 0
# Follow links and download images on page.
 while linkCount != len(linkList):
 print('Downloading image at... %s' % (str(linkList[linkCount])))
 imagePageUrl = ('https://wall.alphacoders.com/' + str(linkList[linkCount]))
 res = requests.get(imagePageUrl)
 wallpaperSoup = bs4.BeautifulSoup(res.text, 'html.parser')
 link = wallpaperSoup.find(id='main_wallpaper')
 imageSrc = link['src']
 # Save image to folder
 res = requests.get(imageSrc)
 imageFile = open(os.path.join(os.getcwd(), 'Wallpapers', os.path.basename(imageSrc)), 'wb')
 for chunk in res.iter_content(12000):
 imageFile.write(chunk)
 imageFile.close()
 linkCount += 1
 # Move to next page and empty link list
 pageCount += 1
 del linkList[:] 
print('Done')

Aside from cutting the print statements and simply letting the program do its thing, how could I optimize this code to run more efficiently?

Question 2

os.chdir('../../Desktop')

Halt. Stop right there. This assumes the script is located at some specific location. Desktop happens to be always in the same place, but this assumes the script to be in a place relative to Desktop. On what OS do you intent to run this? If it's Linux-only, you could replace it by:

os.chdir('~/Desktop')

Now it will work regardless of where the script is located. Of-course, you just made the problem worse for non-Linux systems.

Even better would be to ask where the user wants to drop his files, possibly by using arguments (take a look at argparse).

But why don't you simply drop it instead? A map named Wallpapers will simply be created in the current directory. This would be obvious and expected behaviour.

for images in section:
 imageLink = images.find_all('a')
 for a in imageLink:
 imageUrl = a.get('href')
 linkList.append(imageUrl)
 print('Image found at... ' + imageUrl)

This isn't your bottleneck (not by a long shot), but that second for reeks of bad time complexity for something simple as telling the user where the images are.

One of the major time-consumers appears to be the downloading itself.

I ran cProfile against your code to find the bottleneck(s):

python3 -m cProfile -s tottime CR_133564.py

This automatically sorts the output on the total time taken per function, highest first. Don't forget to import cProfile. Let's take a look at everything with a total time higher than 0.5 seconds:

 8989953 function calls (8984594 primitive calls) in 108.880 seconds
 Ordered by: internal time
 ncalls tottime percall cumtime percall filename:lineno(function)
 20393 59.314 0.003 59.314 0.003 {method 'read' of '_ssl._SSLSocket' objects}
 183 28.853 0.158 28.853 0.158 {method 'do_handshake' of '_ssl._SSLSocket' objects}
 183 9.689 0.053 9.689 0.053 {method 'connect' of '_socket.socket' objects}
 183 1.547 0.008 1.548 0.008 {built-in method _socket.getaddrinfo}
 183 1.005 0.005 1.005 0.005 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
 78129 0.621 0.000 3.383 0.000 parser.py:301(parse_starttag)
 585660 0.580 0.000 0.580 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
 93 0.549 0.006 5.750 0.062 parser.py:134(goahead)

That's responsible for 102.2 seconds out of the 108.9 it takes. If you want to optimize, do it here. The rest is peanuts.

Do you notice anything? You're wasting half a minute on handshakes. That's almost 30 seconds you could be doing something useful instead. Unless you find a trick to carry all files over one handshake, there's not much you can do about it. It also takes almost 60 seconds simply to read (AKA: download) the data. That's only slightly more than I'd expect with my internet connection, so there's not much you can do about that either.

Question 3

Thanks Mast. This introduces a few concepts I'm not yet familiar with, but now I know what direction to look.

Mast ♦Mast 13.8k12 gold badges57 silver badges127 bronze badges · Accepted Answer · 2016-07-14 20:16:09Z

os.chdir('../../Desktop')

Halt. Stop right there. This assumes the script is located at some specific location. Desktop happens to be always in the same place, but this assumes the script to be in a place relative to Desktop. On what OS do you intent to run this? If it's Linux-only, you could replace it by:

os.chdir('~/Desktop')

Now it will work regardless of where the script is located. Of-course, you just made the problem worse for non-Linux systems.

Even better would be to ask where the user wants to drop his files, possibly by using arguments (take a look at argparse).

But why don't you simply drop it instead? A map named Wallpapers will simply be created in the current directory. This would be obvious and expected behaviour.

for images in section:
 imageLink = images.find_all('a')
 for a in imageLink:
 imageUrl = a.get('href')
 linkList.append(imageUrl)
 print('Image found at... ' + imageUrl)

This isn't your bottleneck (not by a long shot), but that second for reeks of bad time complexity for something simple as telling the user where the images are.

One of the major time-consumers appears to be the downloading itself.

I ran cProfile against your code to find the bottleneck(s):

python3 -m cProfile -s tottime CR_133564.py

This automatically sorts the output on the total time taken per function, highest first. Don't forget to import cProfile. Let's take a look at everything with a total time higher than 0.5 seconds:

 8989953 function calls (8984594 primitive calls) in 108.880 seconds
 Ordered by: internal time
 ncalls tottime percall cumtime percall filename:lineno(function)
 20393 59.314 0.003 59.314 0.003 {method 'read' of '_ssl._SSLSocket' objects}
 183 28.853 0.158 28.853 0.158 {method 'do_handshake' of '_ssl._SSLSocket' objects}
 183 9.689 0.053 9.689 0.053 {method 'connect' of '_socket.socket' objects}
 183 1.547 0.008 1.548 0.008 {built-in method _socket.getaddrinfo}
 183 1.005 0.005 1.005 0.005 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
 78129 0.621 0.000 3.383 0.000 parser.py:301(parse_starttag)
 585660 0.580 0.000 0.580 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
 93 0.549 0.006 5.750 0.062 parser.py:134(goahead)

That's responsible for 102.2 seconds out of the 108.9 it takes. If you want to optimize, do it here. The rest is peanuts.

Do you notice anything? You're wasting half a minute on handshakes. That's almost 30 seconds you could be doing something useful instead. Unless you find a trick to carry all files over one handshake, there's not much you can do about it. It also takes almost 60 seconds simply to read (AKA: download) the data. That's only slightly more than I'd expect with my internet connection, so there's not much you can do about that either.

Thanks Mast. This introduces a few concepts I'm not yet familiar with, but now I know what direction to look.

Stack Exchange Network

Simple image scraping

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Simple image scraping

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions