\$\begingroup\$
\$\endgroup\$
I wrote this script to download images from Reddit. I would like to hear from others on how can I improve this script.
import requests as _requests
import os as _os
class Redpy:
def __init__(self, user):
"""Enter a user agent"""
print("hello")
self.user = user
def download(self, subreddit, number=5, sort_option=None):
"""Downloads images from subreddit.
subreddit="Name of subreddit"
number=Number of images to be downloaded
sort_option=new/hot/top
"""
subreddit.strip('/')
if sort_option == None:
sort_option = ''
self.url = 'https://www.reddit.com/r/' + subreddit + '/'+ sort_option + '.json'
self.user = {'user-agent':self.user}
res = _requests.get(self.url, headers=self.user)
if res.status_code != 200:
print("Could not download")
print(res.status_code)
return
self._DownloadFiles(res.json(), number)
def _DownloadFiles(self, jsonfile, number_of_files):
image_links = self._getImages(jsonfile, number_of_files)
if not self.createFolder():
print("Error creating folder")
return
index = 0 #used to name the files
for image_link in image_links:
image_link = image_link.replace('amp;', '')
f = _requests.get(image_link)
if f.status_code==200:
media_file = open(f'{_os.getcwd()}/red_media/{index}.jpg', 'wb')
for chunk in f.iter_content(100000):
media_file.write(chunk)
media_file.close()
print("Downloaded")
index+=1
print("Download complete")
global flag
flag=1
def _getImages(self, jsonfile, number_of_files):
images = [] #contains links of images
for index in range(number_of_files):
try:
images.append(jsonfile['data']['children'][index]['data']['preview']['images'][0]['source']['url'])
except Exception as e:
print(e)
return images
@staticmethod
def createFolder():
try:
if not _os.path.exists(f'{_os.getcwd()}\\red_media'):
_os.mkdir(f'{_os.getcwd()}\\red_media')
return True
return True
except Exception as e:
print(e)
return False
Things I would like to know:
- Is there anything that I can do to improve the performance of the code
- What can I do to improve the code styling. I have tried to follow the PEP standards as much as I could.
- Anything else to improve the code.
asked May 21, 2019 at 7:01
1 Answer 1
\$\begingroup\$
\$\endgroup\$
4
A few suggestions for general code quality:
- If your class only does one thing and it doesn't store any values, it should be a function.
Redpy
only downloads images from reddit and stores values to achieve exactly this, which you could do in a function. Using a class can have unforeseen consequences. - Choose descriptive names for variables and functions.
_getImages
does not actually get the images, it returns a list of links of images. In this method, you haveimages = [] #contains links of images
. The comment could have been avoided if you would have chosenimage_links
as name. - If you split your code up into methods or functions, everything belonging to one task should be inside it. The removal of
'amp;'
in everyimage_link
does not belong in_DownloadFiles
, it should be in_getImages
.download
gets unnecessarily separated into_DownloadFiles
and_DownloadFiles
doesn't generally download files, but it could if some of its functionality got relocated elsewhere. - Clean up your code: there are unnecessary line breaks after
_DownloadFiles
and a redundantreturn True
increateFolder
. - Don't catch general Exceptions, be more specific. In
_getImages
, you should just look out forKeyError
s. Exceptions inrequest.get
on the other hand are not handled although they possibly should be. - The pattern of looping over a list with a counter (
index
in your code) in_DownloadFiles
can be simplified with enumerate. - When working with files, it is more elegant to use a context manager.
Possible bugs:
subreddit.strip('/')
just returns a new string that you would have to assign to a new variable. In your code, the value ofsubreddit
remains unchanged.self.user
gets updated every timedownload
is called. If this happens multiple times,self.user
becomes a dict encapsulating a dict encapsulated a dict...- If something goes wrong when extracting links in
_getImages
, less links than expected get returned. - If your folder already contains images, they will be overwritten.
Concerning PEP8:
- A few of your lines are longer than 80 characters. Try to split them up, either by implementing the same logic over multiple lines or by breaking the line up.
- In PEP8, functions and methods are in snake_case.
Nit-picky stuff:
- You could just use an empty string as default argument for
sort_option
. Strings are immutable, so you don't have the problem of mutable default arguments. - I don't see why you would import
requests as _requests
andos as _os
- There is no need to construct an absolute file path.
f'{_os.getcwd()}/red_media/{index}.jpg'
could becomef'red_media/{index}.jpg'
Here is my attempt at solving this problem:
import requests
import os
def get_image_links(json, N):
'''returs a list of the first <N> links to reddit images found in <json>'''
try:
children = json['data']['children']
except KeyError:
return []
# append links from children until N are found
image_links = []
for child in children:
try:
image_link = child['data']['preview']['images'][0]['source']['url']
except KeyError:
continue
image_link = image_link.replace('amp;', '')
image_links.append(image_link)
if len(image_links)==N:
break
return image_links
def download_files(file_links, folder_name='data', file_extension='jpeg'):
'''downloads files from <file_links> into subfolder ./<folder_name>/'''
# create subfolder if it does not exist
if not os.path.exists(folder_name):
os.mkdir(folder_name)
# download files
for i, file_link in enumerate(file_links):
try:
res = requests.get(file_link)
except requests.exceptions.RequestException:
print(f"Unable to download {file_link}")
continue
if not res.ok:
print(f"Error {res.status_code} when requesting {file_link}")
continue
file_path = os.path.join(folder_name, f'{i}.{file_extension}')
with open(file_path, 'wb') as file:
for chunk in res.iter_content(100000):
file.write(chunk)
def download_reddit_images(user, subreddit, N=5, sort_by='',
folder_name='red_media'):
'''
downloads the first <N> images of <subreddit> sorted by <sort_by>
(''/'new'/'hot'/'top') into subfolder ./<folder_name>/
'''
json_url = ('https://www.reddit.com/r/' + subreddit.strip('/') + '/'
+ sort_by + '.json')
try:
res = requests.get(json_url, headers={'user-agent':user})
except requests.exceptions.RequestException:
print(f"Unable to get {json_url}")
return
if not res.ok:
print(f"Error {res.status_code} when requesting {json_url}")
return
image_links = get_image_links(res.json(), N)
if not len(image_links):
print("Unable to find any images")
return
download_files(image_links, folder_name=folder_name, file_extension='jpeg')
if __name__=='__main__':
download_reddit_images('', 'all')
The problem of overwriting existing images persists. A solution would be to use the original filename from reddit that is included in the url.
-
1\$\begingroup\$ Thanks a lot for your answer. Your answer was the most comprehensive and detailed analysis of the code. I had updated the code after following some PEP 8 guidelines. Here it is github.com/prashantsengar/RedPy ... I will update the code again based on your suggestions. I have a question though, how can using a class have unforeseen consequences. I want to know more about it. And thanks again for your valuable input \$\endgroup\$Prashant Sengar– Prashant Sengar2019年05月22日 12:31:01 +00:00Commented May 22, 2019 at 12:31
-
1\$\begingroup\$ By unforeseen consequences I meant bugs like the one with
self.user
, since you wouldn't expect the code execution to be dependant on previous calls. The problem with using a class here is that it makes the code more complex than it has to be, while a function would work just fine. With the class, it also takes more lines to access the functionality of your code, since you have to create aRedpy
object and then calldownload
, whereas the only thing you gain is not having to enteruser
multiple times. For more information see this talk. \$\endgroup\$reeetooo– reeetooo2019年05月22日 16:28:09 +00:00Commented May 22, 2019 at 16:28 -
1\$\begingroup\$ Thanks again for explaining so nicely @reeetooo \$\endgroup\$Prashant Sengar– Prashant Sengar2019年05月22日 17:38:16 +00:00Commented May 22, 2019 at 17:38
-
1\$\begingroup\$ I would probably make a
download_file
function that just downloads a file to a destination and do the iterating over the links directly indownload_reddit_images
. But otherwise a very nice first review, welcome to Code Review! \$\endgroup\$Graipher– Graipher2019年05月25日 10:03:48 +00:00Commented May 25, 2019 at 10:03
lang-py