
I wrote a bot in Python 3, using PRAW, requests, and BeautifulSoup, that downloads a specified number of image posts from a specified subreddit.

One limitation is that it can only extract direct image links from imgur pages. I am not too keen on supporting other hosts, since the majority of image posts on reddit come from imgur and I don't mind skipping a few submissions.

Any suggestions on how I could improve this code? Any glaring issues? I am new to web scraping, so I am not sure how things can break.

import io
import os
import zipfile
import praw
import requests
from bs4 import BeautifulSoup

DOWNLOAD_PATH = '.'


def is_direct_image(url):
    """Checks if a url is a direct image url by checking
    if the url contains a file extension. Only checks
    for png, gif, jpg, and jpeg, as these are the only
    formats we are likely to encounter.
    """
    return any(ex in url.lower() for ex in ['.png', '.gif', '.jpg', '.jpeg'])


def get_image_url(url):
    """Returns direct image url from imgur page."""
    req = requests.get(url)
    req.raise_for_status()
    soup = BeautifulSoup(req.text, 'html.parser')
    img = soup.find('img', class_='post-image-placeholder')
    try:
        return f'http:{img.get("src")}'
    except AttributeError:
        print(f'Encountered unsupported url: {url}')


def download_image(url, path=DOWNLOAD_PATH, chunksize=512):
    """Downloads each image to the specified download path.
    Uses the image ID as the filename.
    """
    req = requests.get(url)
    req.raise_for_status()
    filename = url.rsplit('/', 1)[-1]
    with open(os.path.join(path, filename), 'wb') as file:
        for chunk in req.iter_content(chunksize):
            file.write(chunk)


def download_album(url, path=DOWNLOAD_PATH, max_size=26214400):
    """Downloads an album from imgur as a zip file and extracts it."""
    req = requests.get(f'{url}/zip', stream=True)
    req.raise_for_status()
    filesize = int(req.headers['Content-Length'])
    if filesize > max_size:
        req.close()
        return None
    with zipfile.ZipFile(io.BytesIO(req.content)) as file:
        file.extractall(path)


def download_from_subreddit(sub='wallpapers', sort='hot', lim=10, albums=True,
                            path=DOWNLOAD_PATH):
    """Downloads images from specified subreddit."""
    reddit = praw.Reddit('bot1')
    subreddit = reddit.subreddit(sub)
    subreddit_sort = {
        'hot': subreddit.hot,
        'top': subreddit.top,
        'new': subreddit.new
    }
    for submission in subreddit_sort[sort](limit=lim):
        # skip stickied and self posts
        if submission.stickied or submission.is_self:
            continue
        url = submission.url
        if '/a/' in url and albums:
            download_album(url, path=path)
        else:
            if not is_direct_image(url):
                url = get_image_url(url)
            if url is not None:
                download_image(url, path=path)


if __name__ == '__main__':
    download_from_subreddit()
asked Jun 8, 2017 at 18:01

1 Answer

  • As you're on Python 3.6+, you should definitely use function annotations and type hints to make your code more understandable to others.
  • any(ex in url.lower() for ex in ['.png', '.gif', '.jpg', '.jpeg']) can be replaced with url.lower().endswith(('.png', '.gif', '.jpg', '.jpeg')). Another option is to use the pathlib module and do: Path(url).suffix in {'.png', '.gif', '.jpg', '.jpeg'}.
  • Similarly filename = url.rsplit('/', 1)[-1] can be replaced with filename = Path(url).name.
  • If we replace DOWNLOAD_PATH = '.' with DOWNLOAD_PATH = Path('.') then os.path.join(path, filename) can be changed to path.joinpath(filename).
  • Group from bs4 import BeautifulSoup with the other third-party imports.
  • The variable name req in req = requests.get(f'{url}/zip', stream=True) is misleading, as it is actually a response object; the request object can be obtained via req.request.
  • Instead of closing the response explicitly, you could use a with statement: with requests.get(f'{url}/zip', stream=True) as response:
  • Use a session to re-use connections and improve performance. A sketch applying this and a few of the points above follows this list.
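Applied to two of the helpers from the question, those points look roughly like this (a sketch, not a drop-in replacement; the annotations shown are one reasonable choice, not the only way to type this):

from pathlib import Path

import requests

DOWNLOAD_PATH = Path('.')


def is_direct_image(url: str) -> bool:
    """True if the url ends with a known image extension."""
    return url.lower().endswith(('.png', '.gif', '.jpg', '.jpeg'))


def download_image(url: str, path: Path = DOWNLOAD_PATH,
                   chunksize: int = 512) -> None:
    """Downloads the image at url, using the last path segment as the filename."""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        filename = Path(url).name  # replaces url.rsplit('/', 1)[-1]
        with open(path.joinpath(filename), 'wb') as file:
            for chunk in response.iter_content(chunksize):
                file.write(chunk)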

Don't do too many tightly coupled things in a single function; break them into multiple functions instead, so that each can be isolated and tested or mocked easily. For example, get_image_url currently fetches the HTML content and then parses it. We can break it into two functions: get_url_content and parse_content. get_url_content can then be used anywhere we want the content of a page, and parse_content can accept any content for parsing and return a URL. The same can be done with other functions; for example, building subreddit_sort could be a separate function. A sketch of the split follows.
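A minimal sketch of that split, keeping the original BeautifulSoup selector (the names get_url_content and parse_content come from the suggestion above):

from typing import Optional

import requests
from bs4 import BeautifulSoup


def get_url_content(url: str) -> str:
    """Fetches a page and returns its HTML as text."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def parse_content(content: str) -> Optional[str]:
    """Extracts the direct image url from imgur page HTML, or None."""
    soup = BeautifulSoup(content, 'html.parser')
    img = soup.find('img', class_='post-image-placeholder')
    if img is None:
        return None
    return f'http:{img.get("src")}'


def get_image_url(url: str) -> Optional[str]:
    """Composes the two steps; same interface as the original function."""
    return parse_content(get_url_content(url))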

answered Jun 8, 2017 at 20:44
  • I followed your advice in splitting up logic a bit more, but on using session, what connections am I reusing? Is accessing any page on the same domain reusing a connection? Would it work by making a session object on a module level and passing it to the function as an argument/using it directly? Commented Jun 9, 2017 at 14:11
  • @EpicNice Use a module level session. Yes, connections to the same hosts are re-used. Commented Jun 10, 2017 at 17:38
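A minimal sketch of the module-level session described above, applied to download_image from the question (the SESSION name is illustrative):

import os

import requests

# One session for the whole module; connections to the same host are
# re-used across calls instead of being re-established each time.
SESSION = requests.Session()


def download_image(url: str, path: str = '.', chunksize: int = 512) -> None:
    """Same behaviour as the original, but routed through the shared session."""
    with SESSION.get(url, stream=True) as response:
        response.raise_for_status()
        filename = url.rsplit('/', 1)[-1]
        with open(os.path.join(path, filename), 'wb') as file:
            for chunk in response.iter_content(chunksize):
                file.write(chunk)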
