I wrote this code over the last few days and I learned a lot, and it works as I expect. However I suspect it's woefully inefficient:
import requests, bs4, os
os.chdir('../../Desktop')
wallpaperFolder = os.makedirs('Wallpapers', exist_ok=True)
pageCount = 1
GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
linkList = []
while not GalleryUrl.endswith('page=3'):
# Download the page
GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
print('Opening gallary... %s' % (GalleryUrl))
res = requests.get(GalleryUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# find urls to image page
section = soup.find_all('div', attrs={'class': 'boxgrid'})
for images in section:
imageLink = images.find_all('a')
for a in imageLink:
imageUrl = a.get('href')
linkList.append(imageUrl)
print('Image found at... ' + imageUrl)
linkCount = 0
# Follow links and download images on page.
while linkCount != len(linkList):
print('Downloading image at... %s' % (str(linkList[linkCount])))
imagePageUrl = ('https://wall.alphacoders.com/' + str(linkList[linkCount]))
res = requests.get(imagePageUrl)
wallpaperSoup = bs4.BeautifulSoup(res.text, 'html.parser')
link = wallpaperSoup.find(id='main_wallpaper')
imageSrc = link['src']
# Save image to folder
res = requests.get(imageSrc)
imageFile = open(os.path.join(os.getcwd(), 'Wallpapers', os.path.basename(imageSrc)), 'wb')
for chunk in res.iter_content(12000):
imageFile.write(chunk)
imageFile.close()
linkCount += 1
# Move to next page and empty link list
pageCount += 1
del linkList[:]
print('Done')
Aside from cutting the print statements and simply letting the program do its thing, how could I optimize this code to run more efficiently?
1 Answer 1
os.chdir('../../Desktop')
Halt. Stop right there. This assumes the script is located at some specific location. Desktop
happens to be always in the same place, but this assumes the script to be in a place relative to Desktop
. On what OS do you intent to run this? If it's Linux-only, you could replace it by:
os.chdir('~/Desktop')
Now it will work regardless of where the script is located. Of-course, you just made the problem worse for non-Linux systems.
Even better would be to ask where the user wants to drop his files, possibly by using arguments (take a look at argparse
).
But why don't you simply drop it instead? A map named Wallpapers
will simply be created in the current directory. This would be obvious and expected behaviour.
for images in section:
imageLink = images.find_all('a')
for a in imageLink:
imageUrl = a.get('href')
linkList.append(imageUrl)
print('Image found at... ' + imageUrl)
This isn't your bottleneck (not by a long shot), but that second for
reeks of bad time complexity for something simple as telling the user where the images are.
One of the major time-consumers appears to be the downloading itself.
I ran cProfile
against your code to find the bottleneck(s):
python3 -m cProfile -s tottime CR_133564.py
This automatically sorts the output on the total time taken per function, highest first. Don't forget to import cProfile
. Let's take a look at everything with a total time higher than 0.5 seconds:
8989953 function calls (8984594 primitive calls) in 108.880 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
20393 59.314 0.003 59.314 0.003 {method 'read' of '_ssl._SSLSocket' objects}
183 28.853 0.158 28.853 0.158 {method 'do_handshake' of '_ssl._SSLSocket' objects}
183 9.689 0.053 9.689 0.053 {method 'connect' of '_socket.socket' objects}
183 1.547 0.008 1.548 0.008 {built-in method _socket.getaddrinfo}
183 1.005 0.005 1.005 0.005 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
78129 0.621 0.000 3.383 0.000 parser.py:301(parse_starttag)
585660 0.580 0.000 0.580 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
93 0.549 0.006 5.750 0.062 parser.py:134(goahead)
That's responsible for 102.2 seconds out of the 108.9 it takes. If you want to optimize, do it here. The rest is peanuts.
Do you notice anything? You're wasting half a minute on handshakes. That's almost 30 seconds you could be doing something useful instead. Unless you find a trick to carry all files over one handshake, there's not much you can do about it. It also takes almost 60 seconds simply to read
(AKA: download) the data. That's only slightly more than I'd expect with my internet connection, so there's not much you can do about that either.
-
1\$\begingroup\$ Thanks Mast. This introduces a few concepts I'm not yet familiar with, but now I know what direction to look. \$\endgroup\$Lachy Vass– Lachy Vass2016年07月15日 02:09:04 +00:00Commented Jul 15, 2016 at 2:09
Explore related questions
See similar questions with these tags.