I wrote a bot in Python 3 using PRAW, requests, and BeautifulSoup that downloads a specified number of image posts from a specified subreddit.
One limitation is that it can only parse the direct image link from imgur pages, but I am not too concerned about handling other hosts, because the majority of image posts on reddit are from imgur and I don't mind skipping a few submissions.
Any suggestions on how I could improve this code? Any glaring issues? I am new to web scraping, so I am not sure how things can break.
import io
import os
import zipfile
import praw
import requests
from bs4 import BeautifulSoup
DOWNLOAD_PATH = '.'


def is_direct_image(url):
    """Checks if a url is a direct image url by checking
    if the url contains a file extension. Only checks
    for png, gif, jpg, and jpeg, as these are the only
    formats we are likely to encounter.
    """
    return any(ex in url.lower() for ex in ['.png', '.gif', '.jpg', '.jpeg'])


def get_image_url(url):
    """Returns direct image url from imgur page."""
    req = requests.get(url)
    req.raise_for_status()
    soup = BeautifulSoup(req.text, 'html.parser')
    img = soup.find('img', class_='post-image-placeholder')
    try:
        return f'http:{img.get("src")}'
    except AttributeError:
        print(f'Encountered unsupported url: {url}')


def download_image(url, path=DOWNLOAD_PATH, chunksize=512):
    """Downloads each image to the specified download path.
    Uses the image ID as the filename.
    """
    req = requests.get(url)
    req.raise_for_status()
    filename = url.rsplit('/', 1)[-1]
    with open(os.path.join(path, filename), 'wb') as file:
        for chunk in req.iter_content(chunksize):
            file.write(chunk)


def download_album(url, path=DOWNLOAD_PATH, max_size=26214400):
    """Downloads an album from imgur as a zip file and extracts it."""
    req = requests.get(f'{url}/zip', stream=True)
    req.raise_for_status()
    filesize = int(req.headers['Content-Length'])
    if filesize > max_size:
        req.close()
        return None
    with zipfile.ZipFile(io.BytesIO(req.content)) as file:
        file.extractall(path)


def download_from_subreddit(sub='wallpapers', sort='hot', lim=10, albums=True,
                            path=DOWNLOAD_PATH):
    """Downloads images from specified subreddit."""
    reddit = praw.Reddit('bot1')
    subreddit = reddit.subreddit(sub)
    subreddit_sort = {
        'hot': subreddit.hot,
        'top': subreddit.top,
        'new': subreddit.new
    }
    for submission in subreddit_sort[sort](limit=lim):
        # skip stickied and self posts
        if submission.stickied or submission.is_self:
            continue
        url = submission.url
        if '/a/' in url and albums:
            download_album(url, path=path)
        else:
            if not is_direct_image(url):
                url = get_image_url(url)
            if url is not None:
                download_image(url, path=path)


if __name__ == '__main__':
    download_from_subreddit()
1 Answer
- As you're on Python 3.6+, you should definitely use function annotations and type hints to make your code more understandable to others.
- any(ex in url.lower() for ex in ['.png', '.gif', '.jpg', '.jpeg']) can be replaced with url.lower().endswith(('.png', '.gif', '.jpg', '.jpeg')). Another option is to use the pathlib module and do: Path(url).suffix in {'.png', '.gif', '.jpg', '.jpeg'}.
- Similarly, filename = url.rsplit('/', 1)[-1] can be replaced with filename = Path(url).name.
- If we replace DOWNLOAD_PATH = '.' with DOWNLOAD_PATH = Path('.'), then os.path.join(path, filename) can be changed to path.joinpath(filename).
- Group from bs4 import BeautifulSoup with the other third-party imports.
- The variable name req in req = requests.get(f'{url}/zip', stream=True) is misleading, as it is actually a response object; the request object can be obtained using req.request.
- Instead of closing the response explicitly, you could use a with statement: with requests.get(f'{url}/zip', stream=True) as response:
- Use a session to re-use connections and improve performance. (A sketch applying these smaller points follows this list.)
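
Taken together, the smaller points might look roughly like this. This is a minimal sketch, assuming Python 3.6+ and a reasonably recent requests; the SESSION and IMAGE_SUFFIXES names, the type hints, and the stream=True download are illustrative choices, not taken from the question or the answer verbatim:

from pathlib import Path

import requests

DOWNLOAD_PATH = Path('.')
IMAGE_SUFFIXES = {'.png', '.gif', '.jpg', '.jpeg'}

# One module-level session: requests pools connections per host,
# so repeated hits on the same imgur hosts can re-use an open connection.
SESSION = requests.Session()


def is_direct_image(url: str) -> bool:
    """Check whether the url points directly at an image file."""
    return Path(url).suffix.lower() in IMAGE_SUFFIXES


def download_image(url: str, path: Path = DOWNLOAD_PATH, chunksize: int = 512) -> None:
    """Stream an image to disk, using the last path component as the filename."""
    with SESSION.get(url, stream=True) as response:  # response is released automatically
        response.raise_for_status()
        with open(path.joinpath(Path(url).name), 'wb') as file:
            for chunk in response.iter_content(chunksize):
                file.write(chunk)

Reading the suffix and filename through pathlib keeps those checks in one obvious place, and the with block guarantees the streamed response is closed even if writing to disk fails.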
Don't do too many things that are tightly coupled to each other in a single function; instead, break them into multiple function calls so that each of them can be isolated and tested or mocked easily. For example, get_image_url currently fetches the HTML content and then parses it. We can break it into two different functions: get_url_content and parse_content. Now get_url_content can be used anywhere we want the content of a page, and parse_content can accept any content for parsing and return a URL. This can be done for other functions as well; for example, getting subreddit_sort could be a separate function.
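
To make that concrete, here is a minimal sketch of the split, using the get_url_content and parse_content names from the answer; the Optional return type and the explicit None check are illustrative additions:

from typing import Optional

import requests
from bs4 import BeautifulSoup


def get_url_content(url: str) -> str:
    """Fetch a page and return its HTML text."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def parse_content(content: str) -> Optional[str]:
    """Extract the direct image url from imgur page HTML, or return None."""
    soup = BeautifulSoup(content, 'html.parser')
    img = soup.find('img', class_='post-image-placeholder')
    if img is None:
        return None
    return f'http:{img.get("src")}'


def get_image_url(url: str) -> Optional[str]:
    """Compose the two steps; each can be tested or mocked separately."""
    return parse_content(get_url_content(url))

With this split, parse_content can be unit-tested against a saved HTML file with no network access, and get_url_content can be mocked when testing its callers.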
- EpicNice (Jun 9, 2017): I followed your advice in splitting up logic a bit more, but on using session, what connections am I reusing? Is accessing any page on the same domain reusing a connection? Would it work by making a session object on a module level and passing it to the function as an argument/using it directly?
- Ashwini Chaudhary (Jun 10, 2017): @EpicNice Use a module level session. Yes, connections to the same hosts are re-used.
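
To illustrate the comment thread: requests pools connections per host inside a Session, so repeated requests to the same host through one session object skip the connection setup. A small sketch of the module-level pattern; the default-argument variant shown here is one of the options the commenter asked about, not something prescribed in the answer:

import requests

# Shared, module-level session: connections to the same host are
# kept alive in the pool and re-used across calls.
SESSION = requests.Session()


def get_url_content(url: str, session: requests.Session = SESSION) -> str:
    """Use the shared session by default; callers and tests may pass their own."""
    response = session.get(url)
    response.raise_for_status()
    return response.text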