
So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few programs on here that do something similar, but nothing quite like what I need. The one I found most similar is here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images). I tried using this code:

>>> import urllib
>>> image = urllib.URLopener()
>>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg","00000001.jpg")
('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)

I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I can handle the rest. Essentially just use a for loop, split the string at the '00000000'.'jpg', and increment the '00000000' up to the largest number, which I would have to determine somehow. Any recommendations on the best way to do this, or on how to download the file correctly?

Thanks!

EDIT 6/15/10

Here is the completed script; it saves the files to any directory you choose. For some odd reason the files weren't downloading, and then they just did. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site, so I can get just the latest one rather than having the program quit after a certain number of exceptions are raised.

import urllib
import os

comicCounter = len(os.listdir('/file')) + 1  # reads the number of files in the folder to start downloading at the next comic
errorCount = 0

def download_comic(url, comicName):
    """
    download a comic in the form of
    url = http://www.example.com
    comicName = '00000000.jpg'
    """
    image = urllib.URLopener()
    image.retrieve(url, comicName)  # download comicName at URL

while comicCounter <= 1000:  # not the most elegant solution
    os.chdir('/file')  # set where files download to
    try:
        if comicCounter < 10:  # needed to break into 10^n segments because comic names are a set of zeros followed by a number
            comicNumber = '0000000' + str(comicCounter)  # string containing the eight digit comic number
            comicName = comicNumber + ".jpg"  # string containing the file name
            url = "http://www.gunnerkrigg.com//comics/" + comicName  # creates the URL for the comic
            comicCounter += 1  # increments the comic counter to go to the next comic, must be before the download in case the download raises an exception
            download_comic(url, comicName)  # uses the function defined above to download the comic
            print url
        elif 10 <= comicCounter < 100:
            comicNumber = '000000' + str(comicCounter)
            comicName = comicNumber + ".jpg"
            url = "http://www.gunnerkrigg.com//comics/" + comicName
            comicCounter += 1
            download_comic(url, comicName)
            print url
        elif 100 <= comicCounter < 1000:
            comicNumber = '00000' + str(comicCounter)
            comicName = comicNumber + ".jpg"
            url = "http://www.gunnerkrigg.com//comics/" + comicName
            comicCounter += 1
            download_comic(url, comicName)
            print url
        else:  # quit the program if any number outside this range shows up
            break
    except IOError:  # urllib raises an IOError for a 404 error, when the comic doesn't exist
        errorCount += 1  # add one to the error count
        if errorCount > 3:  # if more than three errors occur during downloading, quit the program
            break
        else:
            print "comic " + str(comicCounter) + " does not exist"  # otherwise say that the certain comic number doesn't exist
print "all comics are up to date"  # prints if all comics are downloaded
asked Jun 15, 2010 at 5:35
  • Ok, I got them all to download! Now I'm stuck with a very inelegant solution for determining how many comics are online... I'm basically running the program to a number I know is over the number of comics and then running an exception to come up when a comic doesn't exist, and when the exception comes up more than twice (since I don't think more than two comics will be missing) it quits the program, thinking that there are no more to download. Since I don't have access to the website, is there a best way to determine how many files there are on the website? I'll post my code in a second. Commented Jun 15, 2010 at 17:17
  • creativebe.com/icombiner/merge-jpg.html I used that program to merge all the .jpg files into one PDF. Works awesome, and it's free! Commented Jun 15, 2010 at 18:46
  • Consider posting your solution as an answer, and removing it from the question. Question posts are for asking questions, answer posts for answers :-) Commented Aug 24, 2014 at 8:51
  • why is this tagged with beautifulsoup ? This post shows up in list of top beautifulsoup question Commented Nov 26, 2016 at 6:24
  • @P0W I've removed the discussed tag. Commented Dec 28, 2017 at 0:44

20 Answers


Python 2

Using urllib.urlretrieve

import urllib
urllib.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")

Python 3

Using urllib.request.urlretrieve (part of Python 3's legacy interface, works exactly the same)

import urllib.request
urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
answered Jun 15, 2010 at 5:42

5 Comments

It seems to be cutting off the file extension for me when passed as an argument (the extension is present in the original URL). Any idea why?
@JeffThompson, no. Does the example (in my answer) work for you (it does for me with Python 2.7.8)? Note how it does specify the extension explicitly for the local file.
Yours does, yes. I think I assumed that if no file extension was given, the extension of the file would be appended. It made sense to me at the time, but I think now I understand what's happening.
this doesn't seem to work when I want to download it to my current file...why?
seems if you run this from pycharm's console who knows where the current folder is....

Python 2:

import urllib
f = open('00000001.jpg','wb')
f.write(urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()

Python 3:

import urllib.request
f = open('00000001.jpg','wb')
f.write(urllib.request.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()
answered Jun 15, 2010 at 5:40


Just for the record, using the requests library.

import requests
f = open('00000001.jpg','wb')
f.write(requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg').content)
f.close()

Though you should check for errors from requests.get().
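
For example, a minimal sketch of that check (same comic URL as above; raise_for_status() turns a 404 or 500 into an exception you can handle):

import requests

response = requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg')
response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
with open('00000001.jpg', 'wb') as f:
    f.write(response.content)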

answered Feb 19, 2013 at 16:26

4 Comments

Even if this solution is not using urllib, you may already be using the requests library in your Python script (that was my case while searching for this), so you might want to use it to get your pictures as well.
Thank you for posting this answer on top of the others. I ended up needing custom headers to get my download to work, and the pointer to the requests library shortened the process of getting everything to work for me considerably.
Couldn't even get urllib to work in python3. Requests had no issues and it's already loaded! The much better choice I reckon.
@user3023715 in python3 you need to import request from urllib see here

For Python 3 you will need to import urllib.request:

import urllib.request 
urllib.request.urlretrieve(url, filename)

For more info, check out the link.
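
For example, with the comic URL from the question substituted in for the url and filename placeholders:

import urllib.request

url = "http://www.gunnerkrigg.com//comics/00000001.jpg"
filename = "00000001.jpg"
urllib.request.urlretrieve(url, filename)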

answered Jul 30, 2017 at 14:48


Python 3 version of @DiGMi's answer:

from urllib import request
f = open('00000001.jpg', 'wb')
f.write(request.urlopen("http://www.gunnerkrigg.com/comics/00000001.jpg").read())
f.close()
answered Aug 29, 2013 at 15:40


I found this answer and edited it to be more reliable:

import os
import urllib

def download_photo(self, img_url, filename):
    try:
        image_on_web = urllib.urlopen(img_url)
        if image_on_web.headers.maintype == 'image':
            buf = image_on_web.read()
            path = os.getcwd() + DOWNLOADED_IMAGE_PATH  # DOWNLOADED_IMAGE_PATH is assumed to be defined elsewhere
            file_path = "%s%s" % (path, filename)
            downloaded_image = file(file_path, "wb")
            downloaded_image.write(buf)
            downloaded_image.close()
            image_on_web.close()
        else:
            return False
    except:
        return False
    return True

This way you never save some other resource by mistake, and exceptions raised while downloading are handled.

answered Apr 8, 2013 at 13:25

1 Comment

You should remove the 'self'

It's easiest to just use .read() to read the partial or entire response, then write it into a file you've opened in a known good location.
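
A minimal sketch of that approach (Python 3, reusing the comic URL from the question; the destination name is just an example):

import urllib.request

response = urllib.request.urlopen("http://www.gunnerkrigg.com//comics/00000001.jpg")
with open("00000001.jpg", "wb") as f:  # a file opened in a known good location
    f.write(response.read())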

answered Jun 15, 2010 at 5:38


If you know that the files are located in the same directory dir of the website site and have the following format: filename_01.jpg, ..., filename_10.jpg, then you can download all of them:

import requests

for x in range(1, 11):  # 01 through 10
    str1 = 'filename_%2.2d.jpg' % (x)
    str2 = 'http://site/dir/filename_%2.2d.jpg' % (x)
    f = open(str1, 'wb')
    f.write(requests.get(str2).content)
    f.close()
answered Feb 3, 2016 at 8:13


Maybe you need 'User-Agent':

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36')]
response = opener.open('http://google.com')
htmlData = response.read()
f = open('file.txt','w')
f.write(htmlData)
f.close()
answered May 20, 2014 at 9:30

1 Comment

Maybe page is not available?

Using urllib, you can get this done instantly.

import urllib.request
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(URL, "images/0.jpg")
answered May 11, 2020 at 4:31

1 Comment

This needs to be on top! Adding headers helps with 403 forbidden errors

Aside from suggesting you read the docs for retrieve() carefully (http://docs.python.org/library/urllib.html#urllib.URLopener.retrieve), I would suggest actually calling read() on the content of the response, and then saving it into a file of your choosing rather than leaving it in the temporary file that retrieve creates.
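
To illustrate the difference (a Python 3 sketch, where retrieve() corresponds to urllib.request.urlretrieve): called without a filename it saves to a temporary file, whereas reading the response yourself lets you pick the destination:

import urllib.request

url = "http://www.gunnerkrigg.com//comics/00000001.jpg"

# without a second argument, urlretrieve downloads to a temporary location
tmp_path, headers = urllib.request.urlretrieve(url)
print(tmp_path)

# reading the response yourself puts the bytes exactly where you want them
with open("00000001.jpg", "wb") as f:
    f.write(urllib.request.urlopen(url).read())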

answered Jun 15, 2010 at 5:40


None of the snippets above preserves the original image name, which is sometimes required. This will help save the images to your local drive while preserving the original image name:

 IMAGE = URL.rsplit('/',1)[1]
 urllib.urlretrieve(URL, IMAGE)

Try this for more details.
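
A Python 3 sketch of the same idea (the url here is just the comic from the question; urlparse is an extra step, not in the original snippet, that strips any query string before taking the base name):

import os
import urllib.request
from urllib.parse import urlparse

url = "http://www.gunnerkrigg.com//comics/00000001.jpg"
image_name = os.path.basename(urlparse(url).path)  # keeps the original file name, e.g. "00000001.jpg"
urllib.request.urlretrieve(url, image_name)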

answered Jul 18, 2014 at 4:42


This worked for me using Python 3.

It gets a list of URLs from the CSV file and starts downloading them into a folder. If the content or image does not exist, it catches that exception and keeps going.

import urllib.request
import csv
import os

errorCount = 0
file_list = "/Users/$USER/Desktop/YOUR-FILE-TO-DOWNLOAD-IMAGES/image_{0}.jpg"
# CSV file must be comma-separated
# urls.csv is read from your current working directory; make sure you cd into it or add the corresponding path
with open('urls.csv') as images:
    images = csv.reader(images)
    img_count = 1
    print("Please Wait.. it will take some time")
    for image in images:
        try:
            urllib.request.urlretrieve(image[0],
                                       file_list.format(img_count))
            img_count += 1
        except IOError:
            errorCount += 1
            # Stop in case you reach 100 errors downloading images
            if errorCount > 100:
                break
            else:
                print("File does not exist")
print("Done!")
answered Feb 22, 2018 at 12:12


According to the urllib.request.urlretrieve — Python 3.9.2 documentation, the function is ported from the Python 2 module urllib (as opposed to urllib2) and might become deprecated at some point in the future.

Because of this, it might be better to use requests.get(url, params=None, **kwargs). Here is an MWE.

import requests

url = 'http://example.com/example.jpg'
filename = url.split('/')[-1]  # e.g. "example.jpg"; pick any local name you like
response = requests.get(url)
with open(filename, "wb") as f:
    f.write(response.content)

Refer to Download Google’s WebP Images via Take Screenshots with Selenium WebDriver.

answered Feb 20, 2021 at 14:31


A simpler solution may be (Python 3):

import urllib.request
import os

os.chdir("D:\\comic")  # your path
i = 1
s = "00000000"
while i < 1000:
    try:
        urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/" + s[:8 - len(str(i))] + str(i) + ".jpg", str(i) + ".jpg")
    except:
        print("not possible" + str(i))
    i += 1
answered Feb 1, 2017 at 8:48

1 Comment

Be careful about using a bare except like that, see stackoverflow.com/questions/54948548/….

What about this:

import os
import urllib.request
import urllib.error

def from_url(url, filename=None):
    '''Store the url content to filename'''
    if not filename:
        filename = os.path.basename(os.path.realpath(url))
    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.URLError as e:
        if hasattr(e, 'reason'):
            print('Fail in reaching the server -> ', e.reason)
            return False
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request -> ', e.code)
            return False
    else:
        with open(filename, 'wb') as fo:
            fo.write(response.read())
        print('Url saved as %s' % filename)
        return True

def main():
    test_url = 'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
    from_url(test_url)

if __name__ == '__main__':
    main()
answered Oct 30, 2014 at 1:37


If you need proxy support you can do this:

if needProxy == False:
    returnCode, urlReturnResponse = urllib.urlretrieve(myUrl, fullJpegPathAndName)
else:
    proxy_support = urllib2.ProxyHandler({"https": myHttpProxyAddress})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)
    urlReader = urllib2.urlopen(myUrl).read()
    with open(fullJpegPathAndName, "wb") as f:  # "wb" so the JPEG bytes aren't mangled
        f.write(urlReader)
answered Mar 6, 2018 at 14:58


Another way to do this is via the fastai library. This worked like a charm for me. I was facing an SSL: CERTIFICATE_VERIFY_FAILED error using urlretrieve, so I tried this instead.

import fastai.core  # assumes fastai v1, where download_url lives in fastai.core

url = 'https://www.linkdoesntexist.com/lennon.jpg'
fastai.core.download_url(url, 'image1.jpg', show_progress=False)
answered Jun 15, 2019 at 7:46

1 Comment

I was facing a SSL: CERTIFICATE_VERIFY_FAILED Error stackoverflow.com/questions/27835619/…

Using requests

import os
import shutil

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
currentDir = os.getcwd()
path = os.path.join(currentDir, 'Images')  # saving images to the Images folder

def ImageDl(url):
    attempts = 0
    while attempts < 5:  # retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url, headers=headers, stream=True, timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path, filename), 'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
                print(filename)
                break
            attempts += 1  # non-200 response: count it as a failed attempt rather than retrying forever
        except Exception as e:
            attempts += 1
            print(e)

if __name__ == '__main__':
    ImageDl(url)  # url is a placeholder; pass the image URL you want to download
answered Apr 18, 2020 at 17:02


And if you want to download images mirroring the website's directory structure, you can do this:

import os
from urllib.request import urlretrieve
from bs4 import BeautifulSoup

result_path = './result/'
soup = BeautifulSoup(self.file, 'html.parser')  # self.file and url come from the surrounding class
for image in soup.findAll("img"):
    image["name"] = image["src"].split("/")[-1]
    image['path'] = image["src"].replace(image["name"], '')
    os.makedirs(result_path + image['path'], exist_ok=True)
    if image["src"].lower().startswith("http"):
        urlretrieve(image["src"], result_path + image["src"][1:])
    else:
        urlretrieve(url + image["src"], result_path + image["src"][1:])
answered Aug 20, 2021 at 15:19

