I find myself often encountering a use case where I need to fetch paginated data, where I need all of the data. Currently I use something similar to below to handle these cases:
def fetch_paginated_data(api_base, query, good_citizen_rate):
encoded = urllib.parse.quote(query.encode('utf-8'))
initial = f'{api_base}?q={encoded}'
payload = {
'has_more': True,
'next_page': initial
}
storage = []
while payload['has_more']:
response = requests.get(payload['next_page'])
if response.status_code != 200:
raise ValueError('Status not 200')
payload['has_more'] = False
return
data = response.json()
if (data and data['data']):
storage.extend(data['data'])
payload['has_more'] = data['has_more']
if 'next_page' in data:
payload['next_page'] = data['next_page']
time.sleep(good_citizen_rate)
return storage
My question is how could I better make this convention more reusable? Should I use async
/ await
? yield
pages? Add a callback to transform the data prior to adding it to the collection variable?
2 Answers 2
You should let the requests
module do the urlencoding for you. It can take a dictionary of parameters and do it right.
Instead of having a variable around which you check in your while
condition, I would prefer a while True
and an explicit break
here. The loop is small enough that this makes it more readable IMO.
I would indeed use yield
, or rather yield from
here, in order to flatten the response. This means that the next page is only fetched when you have consumed the previous data. If you need to, you could also make it asynchronously using aiohttp
, have a look at this question of mine which used it or the documentation I linked.
You can also use response.raise_for_status
to raise an exception if the status code is not 2XX. Note that any code after a raise
is not accessible, so your payload['has_more'] = False
and return
will never run (unless you run it in a debugger or something).
Python has an official style-guide, PEP8, which you are following (nice!). One thing it does not recommend but I would nevertheless encourage you to do is not using unnecessary parenthesis in if
statements. Python is not C.
def fetch_paginated_data(url, query, timeout=0):
params = {"q": query}
while True:
response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()
if data and data['data']:
yield from data['data']
if not data['has_more']:
break
url = data['next_page']
time.sleep(timeout)
Here I left the interface as is. If you are free to change it, I would consider taking params
directly as an argument or even collect all keyword arguments (except for the timeout, which I also gave a defualt value) directly into a dictionary with **params
. The latter would give you the very nice interface:
data = fetch_paginated_data("example.com", q="FooBar", timeout=3)
Instead of if data and data['data']:
you could also just use a try..except
block. This will be slightly more performant if the response usually has this key:
try:
yield from data['data']
except (KeyError, TypeError): # for dict, list
pass
You should definitely add a docstring
describing what this function does. Especially your naming the timeout a rate can be confusing for other people reading your code, or yourself in half a year.
Without the other code that calls that function is difficutl to say if you need async operations or not, on the other hand here are some suggestions to make your code more pythonic.
payload = {
'has_more': True,
'next_page': initial
}
Change to
has_more = True
next_page = initial
Your code
while payload['has_more']:
response = requests.get(payload['next_page'])
if response.status_code != 200:
raise ValueError('Status not 200')
payload['has_more'] = False
return
Change to
while has_more:
response = requests.get(payload['next_page'])
if response.status_code != 200:
raise ValueError('Status not 200')
Your code
payload['has_more'] = data['has_more']
Change to
has_more = data['has_more']
Sorry if I forget more changes.
requests.get(payload['next_page'])
if there is no next page? And why do you needtime.sleep(good_citizen_rate)
? \$\endgroup\$has_more
from the apis I often use.time.sleep(good_citizen_rate)
is to follow the providers of these apis guidelines \$\endgroup\$payload['has_more'] = False
has no effect, because it is written afterraise
(also thereturn
). And it seems not only related to 3.6+ (except for the f-string feature you're using, so this is only a side comment). \$\endgroup\$