7
\$\begingroup\$

I wrote this code over the last few days and I learned a lot, and it works as I expect. However I suspect it's woefully inefficient:

import requests, bs4, os
os.chdir('../../Desktop')
wallpaperFolder = os.makedirs('Wallpapers', exist_ok=True)
pageCount = 1
GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
linkList = []
while not GalleryUrl.endswith('page=3'):
 # Download the page
 GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
 print('Opening gallary... %s' % (GalleryUrl))
 res = requests.get(GalleryUrl)
 res.raise_for_status()
 soup = bs4.BeautifulSoup(res.text, 'html.parser')
 # find urls to image page
 section = soup.find_all('div', attrs={'class': 'boxgrid'})
 for images in section:
 imageLink = images.find_all('a')
 for a in imageLink:
 imageUrl = a.get('href')
 linkList.append(imageUrl)
 print('Image found at... ' + imageUrl)
 linkCount = 0
# Follow links and download images on page.
 while linkCount != len(linkList):
 print('Downloading image at... %s' % (str(linkList[linkCount])))
 imagePageUrl = ('https://wall.alphacoders.com/' + str(linkList[linkCount]))
 res = requests.get(imagePageUrl)
 wallpaperSoup = bs4.BeautifulSoup(res.text, 'html.parser')
 link = wallpaperSoup.find(id='main_wallpaper')
 imageSrc = link['src']
 # Save image to folder
 res = requests.get(imageSrc)
 imageFile = open(os.path.join(os.getcwd(), 'Wallpapers', os.path.basename(imageSrc)), 'wb')
 for chunk in res.iter_content(12000):
 imageFile.write(chunk)
 imageFile.close()
 linkCount += 1
 # Move to next page and empty link list
 pageCount += 1
 del linkList[:] 
print('Done')

Aside from cutting the print statements and simply letting the program do its thing, how could I optimize this code to run more efficiently?

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jul 1, 2016 at 5:50
\$\endgroup\$

1 Answer 1

6
\$\begingroup\$
os.chdir('../../Desktop')

Halt. Stop right there. This assumes the script is located at some specific location. Desktop happens to be always in the same place, but this assumes the script to be in a place relative to Desktop. On what OS do you intent to run this? If it's Linux-only, you could replace it by:

os.chdir('~/Desktop')

Now it will work regardless of where the script is located. Of-course, you just made the problem worse for non-Linux systems.

Even better would be to ask where the user wants to drop his files, possibly by using arguments (take a look at argparse).

But why don't you simply drop it instead? A map named Wallpapers will simply be created in the current directory. This would be obvious and expected behaviour.

for images in section:
 imageLink = images.find_all('a')
 for a in imageLink:
 imageUrl = a.get('href')
 linkList.append(imageUrl)
 print('Image found at... ' + imageUrl)

This isn't your bottleneck (not by a long shot), but that second for reeks of bad time complexity for something simple as telling the user where the images are.

One of the major time-consumers appears to be the downloading itself.

I ran cProfile against your code to find the bottleneck(s):

python3 -m cProfile -s tottime CR_133564.py

This automatically sorts the output on the total time taken per function, highest first. Don't forget to import cProfile. Let's take a look at everything with a total time higher than 0.5 seconds:

 8989953 function calls (8984594 primitive calls) in 108.880 seconds
 Ordered by: internal time
 ncalls tottime percall cumtime percall filename:lineno(function)
 20393 59.314 0.003 59.314 0.003 {method 'read' of '_ssl._SSLSocket' objects}
 183 28.853 0.158 28.853 0.158 {method 'do_handshake' of '_ssl._SSLSocket' objects}
 183 9.689 0.053 9.689 0.053 {method 'connect' of '_socket.socket' objects}
 183 1.547 0.008 1.548 0.008 {built-in method _socket.getaddrinfo}
 183 1.005 0.005 1.005 0.005 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
 78129 0.621 0.000 3.383 0.000 parser.py:301(parse_starttag)
 585660 0.580 0.000 0.580 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
 93 0.549 0.006 5.750 0.062 parser.py:134(goahead)

That's responsible for 102.2 seconds out of the 108.9 it takes. If you want to optimize, do it here. The rest is peanuts.

Do you notice anything? You're wasting half a minute on handshakes. That's almost 30 seconds you could be doing something useful instead. Unless you find a trick to carry all files over one handshake, there's not much you can do about it. It also takes almost 60 seconds simply to read (AKA: download) the data. That's only slightly more than I'd expect with my internet connection, so there's not much you can do about that either.

answered Jul 14, 2016 at 20:16
\$\endgroup\$
1
  • 1
    \$\begingroup\$ Thanks Mast. This introduces a few concepts I'm not yet familiar with, but now I know what direction to look. \$\endgroup\$ Commented Jul 15, 2016 at 2:09

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.