I wrote a bot in Python 3 using PRAW, requests, and BeautifulSoup that downloads a specified number of image posts from a specified subreddit.
One limitation is that it can only parse the direct image link from imgur pages, but I am not too concerned about handling other hosts, because the majority of image posts on reddit are from imgur and I don't mind skipping a few submissions.
Any suggestions on how I could improve this code? Any glaring issues? I am new to web scraping, so I am not sure how things can break.
import io
import os
import zipfile
import praw
import requests
from bs4 import BeautifulSoup
DOWNLOAD_PATH = '.'


def is_direct_image(url):
    """Checks if a url is a direct image url by checking
    if the url contains a file extension. Only checks
    for png, gif, jpg, and jpeg, as these are the only
    formats we are likely to encounter.
    """
    return any(ex in url.lower() for ex in ['.png', '.gif', '.jpg', '.jpeg'])


def get_image_url(url):
    """Returns direct image url from imgur page."""
    req = requests.get(url)
    req.raise_for_status()
    soup = BeautifulSoup(req.text, 'html.parser')
    img = soup.find('img', class_='post-image-placeholder')
    try:
        return f'http:{img.get("src")}'
    except AttributeError:
        print(f'Encountered unsupported url: {url}')


def download_image(url, path=DOWNLOAD_PATH, chunksize=512):
    """Downloads each image to the specified download path.
    Uses the image ID as the filename.
    """
    req = requests.get(url)
    req.raise_for_status()
    filename = url.rsplit('/', 1)[-1]
    with open(os.path.join(path, filename), 'wb') as file:
        for chunk in req.iter_content(chunksize):
            file.write(chunk)


def download_album(url, path=DOWNLOAD_PATH, max_size=26214400):
    """Downloads an album from imgur as a zip file and extracts it."""
    req = requests.get(f'{url}/zip', stream=True)
    req.raise_for_status()
    filesize = int(req.headers['Content-Length'])
    if filesize > max_size:
        req.close()
        return None
    with zipfile.ZipFile(io.BytesIO(req.content)) as file:
        file.extractall(path)


def download_from_subreddit(sub='wallpapers', sort='hot', lim=10, albums=True,
                            path=DOWNLOAD_PATH):
    """Downloads images from specified subreddit."""
    reddit = praw.Reddit('bot1')
    subreddit = reddit.subreddit(sub)
    subreddit_sort = {
        'hot': subreddit.hot,
        'top': subreddit.top,
        'new': subreddit.new
    }
    for submission in subreddit_sort[sort](limit=lim):
        # skip stickied and self posts
        if submission.stickied or submission.is_self:
            continue
        url = submission.url
        if '/a/' in url and albums:
            download_album(url, path=path)
        else:
            if not is_direct_image(url):
                url = get_image_url(url)
            if url is not None:
                download_image(url, path=path)


if __name__ == '__main__':
    download_from_subreddit()
1 Answer
- As you're on Python 3.6+, you should definitely use function annotations and type hints to make your code more understandable to others.
- any(ex in url.lower() for ex in ['.png', '.gif', '.jpg', '.jpeg']) can be replaced with url.lower().endswith(('.png', '.gif', '.jpg', '.jpeg')). Another option is to use the pathlib module and do: Path(url).suffix in {'.png', '.gif', '.jpg', '.jpeg'}.
- Similarly, filename = url.rsplit('/', 1)[-1] can be replaced with filename = Path(url).name.
- If we replace DOWNLOAD_PATH = '.' with DOWNLOAD_PATH = Path('.'), then os.path.join(path, filename) can be changed to path.joinpath(filename).
- Group from bs4 import BeautifulSoup with the other third-party imports.
- The variable name req in req = requests.get(f'{url}/zip', stream=True) is misleading, as it is actually a response object; the request object can be obtained using req.request.
- Instead of closing the response explicitly, you could use a with statement: with requests.get(f'{url}/zip', stream=True) as response:
- Use a session to re-use connections and improve performance. (A sketch applying these smaller points follows this list.)
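
Taken together, the smaller points might look roughly like this. This is a minimal sketch, assuming Python 3.6+ and a reasonably recent requests; the SESSION and IMAGE_SUFFIXES names, the type hints, and the stream=True download are illustrative choices, not taken from the question or the answer verbatim:

from pathlib import Path

import requests

DOWNLOAD_PATH = Path('.')
IMAGE_SUFFIXES = {'.png', '.gif', '.jpg', '.jpeg'}

# One module-level session: requests pools connections per host,
# so repeated hits on the same imgur hosts can re-use an open connection.
SESSION = requests.Session()


def is_direct_image(url: str) -> bool:
    """Check whether the url points directly at an image file."""
    return Path(url).suffix.lower() in IMAGE_SUFFIXES


def download_image(url: str, path: Path = DOWNLOAD_PATH, chunksize: int = 512) -> None:
    """Stream an image to disk, using the last path component as the filename."""
    with SESSION.get(url, stream=True) as response:  # response is released automatically
        response.raise_for_status()
        with open(path.joinpath(Path(url).name), 'wb') as file:
            for chunk in response.iter_content(chunksize):
                file.write(chunk)

Reading the suffix and filename through pathlib keeps those checks in one obvious place, and the with block guarantees the streamed response is closed even if writing to disk fails.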
Don't do too many things that are tightly coupled to each other in a single function; instead, break them into multiple function calls so that each of them can be isolated and tested or mocked easily. For example, get_image_url currently fetches the HTML content and then parses it. We can break it into two different functions: get_url_content and parse_content. Now get_url_content can be used anywhere we want the content of a page, and parse_content can accept any content for parsing and return a URL. This can be done for other functions as well; for example, getting subreddit_sort could be a separate function.
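
To make that concrete, here is a minimal sketch of the split, using the get_url_content and parse_content names from the answer; the Optional return type and the explicit None check are illustrative additions:

from typing import Optional

import requests
from bs4 import BeautifulSoup


def get_url_content(url: str) -> str:
    """Fetch a page and return its HTML text."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def parse_content(content: str) -> Optional[str]:
    """Extract the direct image url from imgur page HTML, or return None."""
    soup = BeautifulSoup(content, 'html.parser')
    img = soup.find('img', class_='post-image-placeholder')
    if img is None:
        return None
    return f'http:{img.get("src")}'


def get_image_url(url: str) -> Optional[str]:
    """Compose the two steps; each can be tested or mocked separately."""
    return parse_content(get_url_content(url))

With this split, parse_content can be unit-tested against a saved HTML file with no network access, and get_url_content can be mocked when testing its callers.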
- EpicNice (Jun 9, 2017): I followed your advice in splitting up logic a bit more, but on using session, what connections am I reusing? Is accessing any page on the same domain reusing a connection? Would it work by making a session object on a module level and passing it to the function as an argument/using it directly?
- Ashwini Chaudhary (Jun 10, 2017): @EpicNice Use a module level session. Yes, connections to the same hosts are re-used.
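
To illustrate the comment thread: requests pools connections per host inside a Session, so repeated requests to the same host through one session object skip the connection setup. A small sketch of the module-level pattern; the default-argument variant shown here is one of the options the commenter asked about, not something prescribed in the answer:

import requests

# Shared, module-level session: connections to the same host are
# kept alive in the pool and re-used across calls.
SESSION = requests.Session()


def get_url_content(url: str, session: requests.Session = SESSION) -> str:
    """Use the shared session by default; callers and tests may pass their own."""
    response = session.get(url)
    response.raise_for_status()
    return response.text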