I wrote a script that queries an API on a schedule (Tuesday through Saturday), downloading everything for the previous day.
import requests
import pandas as pd
from datetime import date, timedelta

# # This is what I'd normally use, but since there would be no data today,
# # I assign specific date myself
# DATE = (date.today() - timedelta(days=1)).strftime("%Y-%m-%d")
DATE = "2020-10-23"
URL = "https://spending.gov.ua/portal-api/v2/api/transactions/page/"


def fetch(session, params):
    next_page, last_page = 0, 0
    while next_page <= last_page:
        params["page"] = next_page
        data = session.get(URL, params=params).json()
        yield pd.json_normalize(data.get("transactions"))\
            .assign(page=params.get("page"))
        next_page, last_page = next_page + 1, data["count"] // data["pageSize"]


def fetch_all():
    with requests.Session() as session:
        params = {"page": 0, "pageSize": 100, "startdate": DATE, "enddate": DATE}
        yield from fetch(session, params)


if __name__ == "__main__":
    data = fetch_all()
    pd.concat(data).to_csv(f"data/{DATE}.csv", index=False)
Here I’m wondering about a couple of things.
Firstly, if I’m using requests.Session correctly.
I read in the documentation that:
The Session object allows you to persist certain parameters across requests. ... So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase.
I'm not sure whether that's the case here, as I didn't notice any change in performance.
Secondly, if splitting code into two functions instead of one was a good idea.
Here I thought that it would be easier to maintain: the underlying function fetch doesn't change, while fetch_all potentially could. For example, I could feed in a range of dates instead of a single date, changing fetch_all to:
def fetch_all(date_range):
    with requests.Session() as session:
        for date in date_range:
            params = {"page": 0, "pageSize": 100, "startdate": date, "enddate": date}
            yield from fetch(session, params)
Also, regarding the yield and yield from: I could have used .append and returned a list instead. I'm not sure which approach is better.
1 Answer
Here I’m wondering about a couple of things. Firstly, if I’m using requests.Session correctly.
Yes, you are. In one of my other reviews, using requests.Session in the same way for iterating over a paginated API almost halved the total execution time.

I did some quick testing by downloading the last 7 pages (pages 1625-1631) for "2020-10-23", and it did marginally better than making requests with requests.get:

- requests.get: 23.2 seconds
- requests.Session: 17.7 seconds
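A minimal sketch of how such a comparison can be timed (illustrative only, not the exact harness used for the numbers above; the page range and parameters just mirror the test described):

import time

import requests

URL = "https://spending.gov.ua/portal-api/v2/api/transactions/page/"
# Pages 1625-1631 for 2020-10-23, matching the quick test above.
PARAMS = {"startdate": "2020-10-23", "enddate": "2020-10-23", "pageSize": 100}
PAGES = range(1625, 1632)


def time_plain_get() -> float:
    # A fresh TCP connection (and TLS handshake) for every request.
    start = time.perf_counter()
    for page in PAGES:
        requests.get(URL, params={**PARAMS, "page": page}).json()
    return time.perf_counter() - start


def time_with_session() -> float:
    # The session keeps the underlying connection alive between requests.
    start = time.perf_counter()
    with requests.Session() as session:
        for page in PAGES:
            session.get(URL, params={**PARAMS, "page": page}).json()
    return time.perf_counter() - start


print(f"requests.get:     {time_plain_get():.1f} s")
print(f"requests.Session: {time_with_session():.1f} s")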
Secondly, if splitting code into two functions instead of one was a good idea.
I think it's fine to have it split into two functions. That said, I do have some comments below about the responsibilities and interface of fetch, and how to better take advantage of your use of yield and yield from.
Overall the code looks clean and is easy to read. Here's how I think it can be improved:
I think all the low-level details of how to issue requests to the API should be abstracted away from the caller of fetch. That is, fetch's function signature should look something like this:

def fetch(
    session: requests.Session,
    start_date: date,
    end_date: date,
    starting_page: int = 0,
    page_size: int = 100,
) -> Iterator[pd.DataFrame]:
    pass
So now creating an appropriate params would be fetch's responsibility, not fetch_all's. Notice also that start_date and end_date are of type datetime.date, not str. Similarly, fetch_all should not have to be concerned with what date string serialization format the API accepts; this is fetch's responsibility.

Within fetch, instead of maintaining the variables next_page and last_page on each request, I think it would be better to calculate the total number of pages (n) only once with the first request (page k), then use a for loop for pages k+1..n-1:

def to_dataframe(json_data: Dict[str, Any], page: int) -> pd.DataFrame:
    return pd.json_normalize(json_data["transactions"]).assign(page=page)


def fetch(
    session: requests.Session,
    start_date: date,
    end_date: date,
    starting_page: int = 0,
    page_size: int = 100,
) -> Iterator[pd.DataFrame]:
    params = {
        "startdate": start_date.isoformat(),
        "enddate": end_date.isoformat(),
        "page": starting_page,
        "pageSize": page_size,
    }
    data = session.get(URL, params=params).json()
    page_count = math.ceil(data["count"] / data["pageSize"])
    last_page = page_count - 1

    if starting_page > last_page:
        return

    print(f"{starting_page} / {last_page}")
    yield to_dataframe(data, starting_page)

    for page in range(starting_page + 1, page_count):
        params["page"] = page
        data = session.get(URL, params=params).json()
        print(f"{page} / {last_page}")
        yield to_dataframe(data, page)
The tradeoff here is that there's a small duplication of code because the first request is handled a little differently, but now we've delegated responsibility of page number iteration to the for loop.
I recommend adding an event hook to the session object so that it always calls raise_for_status() on the response object. This ensures that all requests made with the session raise requests.HTTPError if the server gives us a 4xx or 5xx response, and prevents us from converting an error response's .json() data into a dataframe:

session.hooks["response"].append(
    lambda r, *args, **kwargs: r.raise_for_status()
)
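As a quick illustration of what the hook buys us (using httpbin.org as a stand-in endpoint rather than the spending API), a 4xx response now surfaces as an exception at the call site instead of silently flowing into json():

import requests

with requests.Session() as session:
    # Same hook as above: every response passes through raise_for_status().
    session.hooks["response"].append(
        lambda r, *args, **kwargs: r.raise_for_status()
    )
    try:
        session.get("https://httpbin.org/status/404")
    except requests.HTTPError as exc:
        print(f"request failed: {exc}")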
Currently the program is combining all dataframes in memory before exporting them to a CSV file. To take advantage of fetch_all being an Iterator[pd.DataFrame], I think it would be better to write each dataframe to the CSV immediately, so we don't need to hold it in memory any longer than necessary:

output_path = Path(f"data/{DATE}.csv")
output_path.unlink(missing_ok=True)

data = fetch_all()
for i, dataframe in enumerate(data):
    write_header = True if i == 0 else False
    dataframe.to_csv(
        output_path, header=write_header, index=False, mode="a"
    )
Refactored version:
#!/usr/bin/env python3

import math
from datetime import date, timedelta
from pathlib import Path
from typing import Any, Dict, Iterator

import pandas as pd  # type: ignore
import requests

# # This is what I'd normally use, but since there would be no data today,
# # I assign specific date myself
# DATE = date.today() - timedelta(days=1)
DATE = date.fromisoformat("2020-10-23")
URL = "https://spending.gov.ua/portal-api/v2/api/transactions/page/"


def to_dataframe(json_data: Dict[str, Any], page: int) -> pd.DataFrame:
    return pd.json_normalize(json_data["transactions"]).assign(page=page)


def fetch(
    session: requests.Session,
    start_date: date,
    end_date: date,
    starting_page: int = 0,
    page_size: int = 100,
) -> Iterator[pd.DataFrame]:
    params = {
        "startdate": start_date.isoformat(),
        "enddate": end_date.isoformat(),
        "page": starting_page,
        "pageSize": page_size,
    }
    data = session.get(URL, params=params).json()
    page_count = math.ceil(data["count"] / data["pageSize"])
    last_page = page_count - 1

    if starting_page > last_page:
        return

    print(f"{starting_page} / {last_page}")
    yield to_dataframe(data, starting_page)

    for page in range(starting_page + 1, page_count):
        params["page"] = page
        data = session.get(URL, params=params).json()
        print(f"{page} / {last_page}")
        yield to_dataframe(data, page)


def fetch_all() -> Iterator[pd.DataFrame]:
    with requests.Session() as session:
        session.hooks["response"].append(
            lambda r, *args, **kwargs: r.raise_for_status()
        )
        yield from fetch(session, start_date=DATE, end_date=DATE)


if __name__ == "__main__":
    output_path = Path(f"data/{DATE}.csv")
    output_path.unlink(missing_ok=True)

    data = fetch_all()
    for i, dataframe in enumerate(data):
        write_header = True if i == 0 else False
        dataframe.to_csv(
            output_path, header=write_header, index=False, mode="a"
        )
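For the date-range use case mentioned in the question, one possible generalization of fetch_all could look like the following sketch (my own illustration reusing the refactored fetch above; fetch_range is a hypothetical name, not part of the refactor):

from typing import Iterable


def fetch_range(dates: Iterable[date]) -> Iterator[pd.DataFrame]:
    # One session shared across all days; each day is fetched page by page.
    with requests.Session() as session:
        session.hooks["response"].append(
            lambda r, *args, **kwargs: r.raise_for_status()
        )
        for day in dates:
            yield from fetch(session, start_date=day, end_date=day)


# Example usage: the five days leading up to and including DATE.
# days = [DATE - timedelta(days=i) for i in range(5)]
# for dataframe in fetch_range(days):
#     ...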
- Thank you for such a detailed answer. I learnt a lot! – Hryhorii Pavlenko, Oct 27, 2020 at 6:18
- Though I intentionally used while loop to avoid code duplication in fetch function. But I agree that your approach is more explicit, so I’ll use it instead. Thanks once again. – Hryhorii Pavlenko, Oct 27, 2020 at 9:02