
So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few programs on here that do something similar, but nothing quite like what I need. The one I found most similar is here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images). I tried using this code:

>>> import urllib
>>> image = urllib.URLopener()
>>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg","00000001.jpg")
('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)

I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I can handle the rest. Essentially just use a for loop, split the string at the '00000000'.'jpg', and increment the '00000000' up to the largest number, which I would have to determine somehow. Any recommendations on the best way to do this, or on how to download the file correctly?

Thanks!

EDIT 6/15/10

Here is the completed script; it saves the files to any directory you choose. For some odd reason the files weren't downloading, and then they just did. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site, so I can get just the latest one rather than having the program quit after a certain number of exceptions are raised.

import urllib
import os

comicCounter = len(os.listdir('/file')) + 1  # reads the number of files in the folder to start downloading at the next comic
errorCount = 0

def download_comic(url, comicName):
    """
    download a comic in the form of
    url = http://www.example.com
    comicName = '00000000.jpg'
    """
    image = urllib.URLopener()
    image.retrieve(url, comicName)  # download comicName at URL

while comicCounter <= 1000:  # not the most elegant solution
    os.chdir('/file')  # set where files download to
    try:
        if comicCounter < 10:  # needed to break into 10^n segments because comic names are a set of zeros followed by a number
            comicNumber = '0000000' + str(comicCounter)  # string containing the eight digit comic number
            comicName = comicNumber + ".jpg"  # string containing the file name
            url = "http://www.gunnerkrigg.com//comics/" + comicName  # creates the URL for the comic
            comicCounter += 1  # increments the comic counter to go to the next comic, must be before the download in case the download raises an exception
            download_comic(url, comicName)  # uses the function defined above to download the comic
            print url
        elif 10 <= comicCounter < 100:
            comicNumber = '000000' + str(comicCounter)
            comicName = comicNumber + ".jpg"
            url = "http://www.gunnerkrigg.com//comics/" + comicName
            comicCounter += 1
            download_comic(url, comicName)
            print url
        elif 100 <= comicCounter < 1000:
            comicNumber = '00000' + str(comicCounter)
            comicName = comicNumber + ".jpg"
            url = "http://www.gunnerkrigg.com//comics/" + comicName
            comicCounter += 1
            download_comic(url, comicName)
            print url
        else:  # quit the program if any number outside this range shows up
            break
    except IOError:  # urllib raises an IOError for a 404 error, when the comic doesn't exist
        errorCount += 1  # add one to the error count
        if errorCount > 3:  # if more than three errors occur during downloading, quit the program
            break
        else:
            print "comic " + str(comicCounter) + " does not exist"  # otherwise say that the certain comic number doesn't exist
print "all comics are up to date"  # prints if all comics are downloaded
asked Jun 15, 2010 at 5:35
  • Ok, I got them all to download! Now I'm stuck with a very inelegant solution for determining how many comics are online... I'm basically running the program to a number I know is over the number of comics and then running an exception to come up when a comic doesn't exist, and when the exception comes up more than twice (since I don't think more than two comics will be missing) it quits the program, thinking that there are no more to download. Since I don't have access to the website, is there a best way to determine how many files there are on the website? I'll post my code in a second. Commented Jun 15, 2010 at 17:17
  • creativebe.com/icombiner/merge-jpg.html I used that program to merge all the .jpg files into one PDF. Works awesome, and it's free! Commented Jun 15, 2010 at 18:46
  • Consider posting your solution as an answer, and removing it from the question. Question posts are for asking questions, answer posts for answers :-) Commented Aug 24, 2014 at 8:51
  • why is this tagged with beautifulsoup ? This post shows up in list of top beautifulsoup question Commented Nov 26, 2016 at 6:24
  • @P0W I've removed the discussed tag. Commented Dec 28, 2017 at 0:44

20 Answers


Python 2

Using urllib.urlretrieve

import urllib
urllib.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")

Python 3

Using urllib.request.urlretrieve (part of Python 3's legacy interface, works exactly the same)

import urllib.request
urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
answered Jun 15, 2010 at 5:42

5 Comments

It seems to be cutting off the file extension for me when passed as an argument (the extension is present in the original URL). Any idea why?
@JeffThompson, no. Does the example (in my answer) work for you (it does for me with Python 2.7.8)? Note how it does specify the extension explicitly for the local file.
Yours does, yes. I think I assumed that if no file extension was given, the extension of the file would be appended. It made sense to me at the time, but I think now I understand what's happening.
this doesn't seem to work when I want to download it to my current file...why?
seems if you run this from pycharm's console who knows where the current folder is....

Python 2:

import urllib
f = open('00000001.jpg','wb')
f.write(urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()

Python 3:

import urllib.request
f = open('00000001.jpg','wb')
f.write(urllib.request.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()
answered Jun 15, 2010 at 5:40


Just for the record, using the requests library.

import requests
f = open('00000001.jpg','wb')
f.write(requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg').content)
f.close()

Though you should check for errors from requests.get().
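
For example, a minimal sketch of that check (same comic URL as above; raise_for_status() turns a 404 or 500 into an exception you can handle):

import requests

response = requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg')
response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
with open('00000001.jpg', 'wb') as f:
    f.write(response.content)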

answered Feb 19, 2013 at 16:26

4 Comments

Even if this solution is not using urllib, you may already be using the requests library in your Python script (that was my case while searching for this), so you might want to use it to get your pictures as well.
Thank you for posting this answer on top of the others. I ended up needing custom headers to get my download to work, and the pointer to the requests library shortened the process of getting everything to work for me considerably.
Couldn't even get urllib to work in python3. Requests had no issues and it's already loaded! The much better choice I reckon.
@user3023715 in python3 you need to import request from urllib see here

For Python 3 you will need to import urllib.request:

import urllib.request 
urllib.request.urlretrieve(url, filename)

For more info, check out the link.
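
For example, with the comic URL from the question substituted in for the url and filename placeholders:

import urllib.request

url = "http://www.gunnerkrigg.com//comics/00000001.jpg"
filename = "00000001.jpg"
urllib.request.urlretrieve(url, filename)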

answered Jul 30, 2017 at 14:48


Python 3 version of @DiGMi's answer:

from urllib import request
f = open('00000001.jpg', 'wb')
f.write(request.urlopen("http://www.gunnerkrigg.com/comics/00000001.jpg").read())
f.close()
answered Aug 29, 2013 at 15:40


I found this answer and edited it to be more reliable:

import os
import urllib

def download_photo(self, img_url, filename):
    try:
        image_on_web = urllib.urlopen(img_url)
        if image_on_web.headers.maintype == 'image':
            buf = image_on_web.read()
            path = os.getcwd() + DOWNLOADED_IMAGE_PATH  # DOWNLOADED_IMAGE_PATH is assumed to be defined elsewhere
            file_path = "%s%s" % (path, filename)
            downloaded_image = file(file_path, "wb")
            downloaded_image.write(buf)
            downloaded_image.close()
            image_on_web.close()
        else:
            return False
    except:
        return False
    return True

This way you never save some other resource by mistake, and exceptions raised while downloading are handled.

answered Apr 8, 2013 at 13:25

1 Comment

You should remove the 'self'

It's easiest to just use .read() to read the partial or entire response, then write it into a file you've opened in a known good location.
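
A minimal sketch of that approach (Python 3, reusing the comic URL from the question; the destination name is just an example):

import urllib.request

response = urllib.request.urlopen("http://www.gunnerkrigg.com//comics/00000001.jpg")
with open("00000001.jpg", "wb") as f:  # a file opened in a known good location
    f.write(response.read())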

answered Jun 15, 2010 at 5:38


If you know that the files are located in the same directory dir of the website site and have the following format: filename_01.jpg, ..., filename_10.jpg, then you can download all of them:

import requests

for x in range(1, 11):  # 01 through 10
    str1 = 'filename_%2.2d.jpg' % (x)
    str2 = 'http://site/dir/filename_%2.2d.jpg' % (x)
    f = open(str1, 'wb')
    f.write(requests.get(str2).content)
    f.close()
answered Feb 3, 2016 at 8:13


Maybe you need 'User-Agent':

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36')]
response = opener.open('http://google.com')
htmlData = response.read()
f = open('file.txt','w')
f.write(htmlData)
f.close()
answered May 20, 2014 at 9:30

1 Comment

Maybe page is not available?

Using urllib, you can get this done instantly.

import urllib.request
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(URL, "images/0.jpg")
answered May 11, 2020 at 4:31

1 Comment

This needs to be on top! Adding headers helps with 403 forbidden errors

Aside from suggesting you read the docs for retrieve() carefully (http://docs.python.org/library/urllib.html#urllib.URLopener.retrieve), I would suggest actually calling read() on the content of the response, and then saving it into a file of your choosing rather than leaving it in the temporary file that retrieve creates.
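
To illustrate the difference (a Python 3 sketch, where retrieve() corresponds to urllib.request.urlretrieve): called without a filename it saves to a temporary file, whereas reading the response yourself lets you pick the destination:

import urllib.request

url = "http://www.gunnerkrigg.com//comics/00000001.jpg"

# without a second argument, urlretrieve downloads to a temporary location
tmp_path, headers = urllib.request.urlretrieve(url)
print(tmp_path)

# reading the response yourself puts the bytes exactly where you want them
with open("00000001.jpg", "wb") as f:
    f.write(urllib.request.urlopen(url).read())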

answered Jun 15, 2010 at 5:40


None of the snippets above preserves the original image name, which is sometimes required. This will help save the images to your local drive while preserving the original image name:

 IMAGE = URL.rsplit('/',1)[1]
 urllib.urlretrieve(URL, IMAGE)

Try this for more details.
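
A Python 3 sketch of the same idea (the url here is just the comic from the question; urlparse is an extra step, not in the original snippet, that strips any query string before taking the base name):

import os
import urllib.request
from urllib.parse import urlparse

url = "http://www.gunnerkrigg.com//comics/00000001.jpg"
image_name = os.path.basename(urlparse(url).path)  # keeps the original file name, e.g. "00000001.jpg"
urllib.request.urlretrieve(url, image_name)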

answered Jul 18, 2014 at 4:42


This worked for me using Python 3.

It gets a list of URLs from the CSV file and starts downloading them into a folder. If the content or image does not exist, it catches that exception and keeps going.

import urllib.request
import csv
import os

errorCount = 0
file_list = "/Users/$USER/Desktop/YOUR-FILE-TO-DOWNLOAD-IMAGES/image_{0}.jpg"
# CSV file must be comma-separated
# urls.csv is read from your current working directory; make sure you cd into it or add the corresponding path
with open('urls.csv') as images:
    images = csv.reader(images)
    img_count = 1
    print("Please Wait.. it will take some time")
    for image in images:
        try:
            urllib.request.urlretrieve(image[0],
                                       file_list.format(img_count))
            img_count += 1
        except IOError:
            errorCount += 1
            # Stop in case you reach 100 errors downloading images
            if errorCount > 100:
                break
            else:
                print("File does not exist")
print("Done!")
answered Feb 22, 2018 at 12:12


According to the urllib.request.urlretrieve — Python 3.9.2 documentation, the function is ported from the Python 2 module urllib (as opposed to urllib2) and might become deprecated at some point in the future.

Because of this, it might be better to use requests.get(url, params=None, **kwargs). Here is an MWE.

import requests

url = 'http://example.com/example.jpg'
filename = url.split('/')[-1]  # e.g. "example.jpg"; pick any local name you like
response = requests.get(url)
with open(filename, "wb") as f:
    f.write(response.content)

Refer to Download Google’s WebP Images via Take Screenshots with Selenium WebDriver.

answered Feb 20, 2021 at 14:31


A simpler solution may be (Python 3):

import urllib.request
import os

os.chdir("D:\\comic")  # your path
i = 1
s = "00000000"
while i < 1000:
    try:
        urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/" + s[:8 - len(str(i))] + str(i) + ".jpg", str(i) + ".jpg")
    except:
        print("not possible" + str(i))
    i += 1
answered Feb 1, 2017 at 8:48

1 Comment

Be careful about using a bare except like that, see stackoverflow.com/questions/54948548/….

What about this:

import os
import urllib.request
import urllib.error

def from_url(url, filename=None):
    '''Store the url content to filename'''
    if not filename:
        filename = os.path.basename(os.path.realpath(url))
    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.URLError as e:
        if hasattr(e, 'reason'):
            print('Fail in reaching the server -> ', e.reason)
            return False
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request -> ', e.code)
            return False
    else:
        with open(filename, 'wb') as fo:
            fo.write(response.read())
        print('Url saved as %s' % filename)
        return True

def main():
    test_url = 'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
    from_url(test_url)

if __name__ == '__main__':
    main()
answered Oct 30, 2014 at 1:37


If you need proxy support you can do this:

if needProxy == False:
    returnCode, urlReturnResponse = urllib.urlretrieve(myUrl, fullJpegPathAndName)
else:
    proxy_support = urllib2.ProxyHandler({"https": myHttpProxyAddress})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)
    urlReader = urllib2.urlopen(myUrl).read()
    with open(fullJpegPathAndName, "wb") as f:  # "wb" so the JPEG bytes aren't mangled
        f.write(urlReader)
answered Mar 6, 2018 at 14:58


Another way to do this is via the fastai library. This worked like a charm for me. I was facing an SSL: CERTIFICATE_VERIFY_FAILED error using urlretrieve, so I tried this instead.

import fastai.core  # assumes fastai v1, where download_url lives in fastai.core

url = 'https://www.linkdoesntexist.com/lennon.jpg'
fastai.core.download_url(url, 'image1.jpg', show_progress=False)
answered Jun 15, 2019 at 7:46

1 Comment

I was facing a SSL: CERTIFICATE_VERIFY_FAILED Error stackoverflow.com/questions/27835619/…

Using requests

import os
import shutil

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
currentDir = os.getcwd()
path = os.path.join(currentDir, 'Images')  # saving images to the Images folder

def ImageDl(url):
    attempts = 0
    while attempts < 5:  # retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url, headers=headers, stream=True, timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path, filename), 'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
                print(filename)
                break
            attempts += 1  # non-200 response: count it as a failed attempt rather than retrying forever
        except Exception as e:
            attempts += 1
            print(e)

if __name__ == '__main__':
    ImageDl(url)  # url is a placeholder; pass the image URL you want to download
answered Apr 18, 2020 at 17:02


And if you want to download images mirroring the website's directory structure, you can do this:

import os
from urllib.request import urlretrieve
from bs4 import BeautifulSoup

result_path = './result/'
soup = BeautifulSoup(self.file, 'html.parser')  # self.file and url come from the surrounding class
for image in soup.findAll("img"):
    image["name"] = image["src"].split("/")[-1]
    image['path'] = image["src"].replace(image["name"], '')
    os.makedirs(result_path + image['path'], exist_ok=True)
    if image["src"].lower().startswith("http"):
        urlretrieve(image["src"], result_path + image["src"][1:])
    else:
        urlretrieve(url + image["src"], result_path + image["src"][1:])
answered Aug 20, 2021 at 15:19

