Script to download images from Reddit

Question 1

I wrote this script to download images from Reddit. I would like to hear from others on how can I improve this script.

import requests as _requests
import os as _os
class Redpy:
 def __init__(self, user):
 """Enter a user agent"""
 print("hello")
 self.user = user
 def download(self, subreddit, number=5, sort_option=None):
 """Downloads images from subreddit.
 subreddit="Name of subreddit"
 number=Number of images to be downloaded
 sort_option=new/hot/top
 """
 subreddit.strip('/')
 if sort_option == None:
 sort_option = ''
 self.url = 'https://www.reddit.com/r/' + subreddit + '/'+ sort_option + '.json'
 self.user = {'user-agent':self.user}
 res = _requests.get(self.url, headers=self.user)
 if res.status_code != 200:
 print("Could not download")
 print(res.status_code)
 return
 self._DownloadFiles(res.json(), number)
 def _DownloadFiles(self, jsonfile, number_of_files):
 image_links = self._getImages(jsonfile, number_of_files)
 if not self.createFolder():
 print("Error creating folder")
 return
 index = 0 #used to name the files
 for image_link in image_links:
 image_link = image_link.replace('amp;', '')
 f = _requests.get(image_link)
 if f.status_code==200:
 media_file = open(f'{_os.getcwd()}/red_media/{index}.jpg', 'wb')
 for chunk in f.iter_content(100000):
 media_file.write(chunk)
 media_file.close()
 print("Downloaded")
 index+=1
 print("Download complete")
 global flag
 flag=1
 def _getImages(self, jsonfile, number_of_files):
 images = [] #contains links of images
 for index in range(number_of_files):
 try:
 images.append(jsonfile['data']['children'][index]['data']['preview']['images'][0]['source']['url'])
 except Exception as e:
 print(e)
 return images
 @staticmethod
 def createFolder():
 try:
 if not _os.path.exists(f'{_os.getcwd()}\\red_media'):
 _os.mkdir(f'{_os.getcwd()}\\red_media')
 return True
 return True
 except Exception as e:
 print(e)
 return False

Things I would like to know:

Is there anything that I can do to improve the performance of the code
What can I do to improve the code styling. I have tried to follow the PEP standards as much as I could.
Anything else to improve the code.

Question 2

A few suggestions for general code quality:

If your class only does one thing and it doesn't store any values, it should be a function. Redpy only downloads images from reddit and stores values to achieve exactly this, which you could do in a function. Using a class can have unforeseen consequences.
Choose descriptive names for variables and functions. _getImages does not actually get the images, it returns a list of links of images. In this method, you have images = [] #contains links of images. The comment could have been avoided if you would have chosen image_links as name.
If you split your code up into methods or functions, everything belonging to one task should be inside it. The removal of 'amp;' in every image_link does not belong in _DownloadFiles, it should be in _getImages. download gets unnecessarily separated into _DownloadFiles and _DownloadFiles doesn't generally download files, but it could if some of its functionality got relocated elsewhere.
Clean up your code: there are unnecessary line breaks after _DownloadFiles and a redundant return True in createFolder.
Don't catch general Exceptions, be more specific. In _getImages, you should just look out for KeyErrors. Exceptions in request.get on the other hand are not handled although they possibly should be.
The pattern of looping over a list with a counter (index in your code) in _DownloadFiles can be simplified with enumerate.
When working with files, it is more elegant to use a context manager.

Possible bugs:

subreddit.strip('/') just returns a new string that you would have to assign to a new variable. In your code, the value of subreddit remains unchanged.
self.user gets updated every time download is called. If this happens multiple times, self.user becomes a dict encapsulating a dict encapsulated a dict...
If something goes wrong when extracting links in _getImages, less links than expected get returned.
If your folder already contains images, they will be overwritten.

Concerning PEP8:

A few of your lines are longer than 80 characters. Try to split them up, either by implementing the same logic over multiple lines or by breaking the line up.
In PEP8, functions and methods are in snake_case.

Nit-picky stuff:

You could just use an empty string as default argument for sort_option. Strings are immutable, so you don't have the problem of mutable default arguments.
I don't see why you would import requests as _requests and os as _os
There is no need to construct an absolute file path. f'{_os.getcwd()}/red_media/{index}.jpg' could become f'red_media/{index}.jpg'

Here is my attempt at solving this problem:

import requests
import os
def get_image_links(json, N):
 '''returs a list of the first <N> links to reddit images found in <json>'''
 try:
 children = json['data']['children']
 except KeyError:
 return []
 # append links from children until N are found
 image_links = []
 for child in children:
 try:
 image_link = child['data']['preview']['images'][0]['source']['url']
 except KeyError:
 continue
 image_link = image_link.replace('amp;', '')
 image_links.append(image_link)
 if len(image_links)==N:
 break
 return image_links
def download_files(file_links, folder_name='data', file_extension='jpeg'):
 '''downloads files from <file_links> into subfolder ./<folder_name>/'''
 # create subfolder if it does not exist
 if not os.path.exists(folder_name):
 os.mkdir(folder_name)
 # download files
 for i, file_link in enumerate(file_links):
 try:
 res = requests.get(file_link)
 except requests.exceptions.RequestException:
 print(f"Unable to download {file_link}")
 continue
 if not res.ok:
 print(f"Error {res.status_code} when requesting {file_link}")
 continue
 file_path = os.path.join(folder_name, f'{i}.{file_extension}')
 with open(file_path, 'wb') as file:
 for chunk in res.iter_content(100000):
 file.write(chunk)
def download_reddit_images(user, subreddit, N=5, sort_by='',
 folder_name='red_media'):
 '''
 downloads the first <N> images of <subreddit> sorted by <sort_by>
 (''/'new'/'hot'/'top') into subfolder ./<folder_name>/
 '''
 json_url = ('https://www.reddit.com/r/' + subreddit.strip('/') + '/'
 + sort_by + '.json')
 try:
 res = requests.get(json_url, headers={'user-agent':user})
 except requests.exceptions.RequestException:
 print(f"Unable to get {json_url}")
 return
 if not res.ok:
 print(f"Error {res.status_code} when requesting {json_url}")
 return
 image_links = get_image_links(res.json(), N)
 if not len(image_links):
 print("Unable to find any images")
 return
 download_files(image_links, folder_name=folder_name, file_extension='jpeg')
if __name__=='__main__':
 download_reddit_images('', 'all')

The problem of overwriting existing images persists. A solution would be to use the original filename from reddit that is included in the url.

Question 3

Thanks a lot for your answer. Your answer was the most comprehensive and detailed analysis of the code. I had updated the code after following some PEP 8 guidelines. Here it is github.com/prashantsengar/RedPy ... I will update the code again based on your suggestions. I have a question though, how can using a class have unforeseen consequences. I want to know more about it. And thanks again for your valuable input

Question 4

By unforeseen consequences I meant bugs like the one with self.user, since you wouldn't expect the code execution to be dependant on previous calls. The problem with using a class here is that it makes the code more complex than it has to be, while a function would work just fine. With the class, it also takes more lines to access the functionality of your code, since you have to create a Redpy object and then call download, whereas the only thing you gain is not having to enter user multiple times. For more information see this talk.

Question 5

Thanks again for explaining so nicely @reeetooo

Question 6

I would probably make a download_file function that just downloads a file to a destination and do the iterating over the links directly in download_reddit_images. But otherwise a very nice first review, welcome to Code Review!

reeetooo reeetooo 513 bronze badges · Answer 1 · 2019-05-21 21:46:26Z

A few suggestions for general code quality:

If your class only does one thing and it doesn't store any values, it should be a function. Redpy only downloads images from reddit and stores values to achieve exactly this, which you could do in a function. Using a class can have unforeseen consequences.
Choose descriptive names for variables and functions. _getImages does not actually get the images, it returns a list of links of images. In this method, you have images = [] #contains links of images. The comment could have been avoided if you would have chosen image_links as name.
If you split your code up into methods or functions, everything belonging to one task should be inside it. The removal of 'amp;' in every image_link does not belong in _DownloadFiles, it should be in _getImages. download gets unnecessarily separated into _DownloadFiles and _DownloadFiles doesn't generally download files, but it could if some of its functionality got relocated elsewhere.
Clean up your code: there are unnecessary line breaks after _DownloadFiles and a redundant return True in createFolder.
Don't catch general Exceptions, be more specific. In _getImages, you should just look out for KeyErrors. Exceptions in request.get on the other hand are not handled although they possibly should be.
The pattern of looping over a list with a counter (index in your code) in _DownloadFiles can be simplified with enumerate.
When working with files, it is more elegant to use a context manager.

Possible bugs:

subreddit.strip('/') just returns a new string that you would have to assign to a new variable. In your code, the value of subreddit remains unchanged.
self.user gets updated every time download is called. If this happens multiple times, self.user becomes a dict encapsulating a dict encapsulated a dict...
If something goes wrong when extracting links in _getImages, less links than expected get returned.
If your folder already contains images, they will be overwritten.

Concerning PEP8:

A few of your lines are longer than 80 characters. Try to split them up, either by implementing the same logic over multiple lines or by breaking the line up.
In PEP8, functions and methods are in snake_case.

Nit-picky stuff:

You could just use an empty string as default argument for sort_option. Strings are immutable, so you don't have the problem of mutable default arguments.
I don't see why you would import requests as _requests and os as _os
There is no need to construct an absolute file path. f'{_os.getcwd()}/red_media/{index}.jpg' could become f'red_media/{index}.jpg'

Here is my attempt at solving this problem:

import requests
import os
def get_image_links(json, N):
 '''returs a list of the first <N> links to reddit images found in <json>'''
 try:
 children = json['data']['children']
 except KeyError:
 return []
 # append links from children until N are found
 image_links = []
 for child in children:
 try:
 image_link = child['data']['preview']['images'][0]['source']['url']
 except KeyError:
 continue
 image_link = image_link.replace('amp;', '')
 image_links.append(image_link)
 if len(image_links)==N:
 break
 return image_links
def download_files(file_links, folder_name='data', file_extension='jpeg'):
 '''downloads files from <file_links> into subfolder ./<folder_name>/'''
 # create subfolder if it does not exist
 if not os.path.exists(folder_name):
 os.mkdir(folder_name)
 # download files
 for i, file_link in enumerate(file_links):
 try:
 res = requests.get(file_link)
 except requests.exceptions.RequestException:
 print(f"Unable to download {file_link}")
 continue
 if not res.ok:
 print(f"Error {res.status_code} when requesting {file_link}")
 continue
 file_path = os.path.join(folder_name, f'{i}.{file_extension}')
 with open(file_path, 'wb') as file:
 for chunk in res.iter_content(100000):
 file.write(chunk)
def download_reddit_images(user, subreddit, N=5, sort_by='',
 folder_name='red_media'):
 '''
 downloads the first <N> images of <subreddit> sorted by <sort_by>
 (''/'new'/'hot'/'top') into subfolder ./<folder_name>/
 '''
 json_url = ('https://www.reddit.com/r/' + subreddit.strip('/') + '/'
 + sort_by + '.json')
 try:
 res = requests.get(json_url, headers={'user-agent':user})
 except requests.exceptions.RequestException:
 print(f"Unable to get {json_url}")
 return
 if not res.ok:
 print(f"Error {res.status_code} when requesting {json_url}")
 return
 image_links = get_image_links(res.json(), N)
 if not len(image_links):
 print("Unable to find any images")
 return
 download_files(image_links, folder_name=folder_name, file_extension='jpeg')
if __name__=='__main__':
 download_reddit_images('', 'all')

The problem of overwriting existing images persists. A solution would be to use the original filename from reddit that is included in the url.

Thanks a lot for your answer. Your answer was the most comprehensive and detailed analysis of the code. I had updated the code after following some PEP 8 guidelines. Here it is github.com/prashantsengar/RedPy ... I will update the code again based on your suggestions. I have a question though, how can using a class have unforeseen consequences. I want to know more about it. And thanks again for your valuable input
By unforeseen consequences I meant bugs like the one with self.user, since you wouldn't expect the code execution to be dependant on previous calls. The problem with using a class here is that it makes the code more complex than it has to be, while a function would work just fine. With the class, it also takes more lines to access the functionality of your code, since you have to create a Redpy object and then call download, whereas the only thing you gain is not having to enter user multiple times. For more information see this talk.
I would probably make a download_file function that just downloads a file to a destination and do the iterating over the links directly in download_reddit_images. But otherwise a very nice first review, welcome to Code Review!

Stack Exchange Network

Script to download images from Reddit

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Script to download images from Reddit

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions