I wrote a script that connects to the World Bank API, gathers some data (indicators) for a number of countries, creates a pd.DataFrame, and saves the result as an xlsx file.
What I had in mind when I was writing the code:
- I wanted to avoid constructing pd.DataFrame within the function or using the append method to store data in-memory (not sure if it helps in this case; is it a good practice in general?).
- I thought it was a good idea to split the code into separate functions (one that iterates over data, one that makes requests for every page that is related to the indicator, and finally, the one that iterates over all indicators and calls other functions).
- It was the first time I used yield from and I'm not sure if it's the correct way of using the construct.
- Not sure about typings.

import os
import requests
import pandas as pd
from time import sleep
from itertools import chain
from datetime import datetime
from typing import List, Iterator

COUNTRIES = {
    "Baltic States": ["EST", "LVA", "LTU"],
    "Central Asia": ["KAZ", "KGZ", "TJK", "TKM", "UZB"],
    "Eastern Europe": ["BLR", "MDA", "UKR"],
    "Eurasia": ["RUS"],
    "Transcaucasia": ["ARM", "AZE", "GEO"]
}
INDICATORS = {
    "Infrastructure": ["EG.ELC.ACCS.RU.ZS", "IT.CEL.SETS"],
    "Medicine": ["SH.DYN.MORT", "SH.DYN.AIDS.ZS"],
    "Public Sector": ["GC.TAX.TOTL.GD.ZS", "MS.MIL.TOTL.TF.ZS"]
}
URL = "http://api.worldbank.org/v2/country/"
DATE = datetime.today().strftime("%Y-%m-%d %H_%M")
CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
ISO3_PARAM = ';'.join(chain(*COUNTRIES.values()))
INDICATOR_PARAM = list(chain(*INDICATORS.values()))


def mapping(s: pd.Series, metadict: dict) -> str:
    """ Assign categories to Series' values according to dict's keys. """
    return ' '.join(k for k, v in metadict.items() if s in v)


def item_level(data: List) -> Iterator:
    """ Iterate over data and yield dict containing specific fields. """
    for item in data:
        yield {
            "iso3": item["countryiso3code"],
            "indicator": item["indicator"]["value"],
            "id": item["indicator"]["id"],
            "year": item["date"],
            "value": item["value"]
        }


def page_level(indicator: str) -> Iterator:
    """ Iterate over all pages related to specific indicator. """
    base_url = f"{URL}{ISO3_PARAM}/indicator/{indicator}?format=json"
    next_page = 1
    while True:
        meta, data = requests.get(f"{base_url}&page={next_page}").json()
        yield from item_level(data)
        num_pages, next_page = meta["pages"], next_page + 1
        if next_page > num_pages:
            break
        sleep(1)


def indicator_level(indicators: List) -> Iterator:
    """ Iterate over all indicators. """
    for indicator in indicators:
        yield from page_level(indicator)


def main():
    """ Create pd.DataFrame from generator and save it. """
    df = pd.DataFrame(indicator_level(INDICATOR_PARAM))
    df["year"] = df["year"].astype(int)
    df["region"] = df["iso3"].apply(mapping, metadict=COUNTRIES)
    df["category"] = df["id"].apply(mapping, metadict=INDICATORS)
    df.loc[df["year"] >= 1991].to_excel(f"{CURRENT_DIR}/../../data/raw/dataset_{DATE}.xlsx", index=False)


if __name__ == "__main__":
    main()

1 Answer
I would invert your dictionaries for countries and indicators, so that each country code and indicator ID is a key and its region or category is the value. Then you can simply do:
COUNTRIES = {"EST": "Baltic States", "LVA": "Baltic States", "LTU": "Baltic States", ...}
df["region"] = df["iso3"].map(COUNTRIES)
df["category"] = df["id"].map(INDICATORS)
If you are too lazy to manually invert the dictionary, or like the current structure more but still want to have the easier usage, just use this one-line function:
def invert(d):
    return {v: k for k, values in d.items() for v in values}
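
As a quick illustration, here is invert applied to the COUNTRIES dictionary from the question. This is only a minimal sketch: the name REGION_BY_ISO3 is made up for the example, and the lookups at the end are there just to show the shape of the result.

```python
def invert(d):
    # Map each member of every value list back to the key of its group.
    return {v: k for k, values in d.items() for v in values}


COUNTRIES = {
    "Baltic States": ["EST", "LVA", "LTU"],
    "Central Asia": ["KAZ", "KGZ", "TJK", "TKM", "UZB"],
    "Eastern Europe": ["BLR", "MDA", "UKR"],
    "Eurasia": ["RUS"],
    "Transcaucasia": ["ARM", "AZE", "GEO"],
}

# Hypothetical name, chosen only for this example.
REGION_BY_ISO3 = invert(COUNTRIES)

print(REGION_BY_ISO3["EST"])   # Baltic States
print(len(REGION_BY_ISO3))     # 15
```

Note that if a code appeared in more than one group, the last group listed would win, which is not an issue for the dictionaries in the question.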
Instead of
ISO3_PARAM = ';'.join(chain(*COUNTRIES.values()))
Use
ISO3_PARAM = ';'.join(chain.from_iterable(COUNTRIES.values()))
This avoids unpacking all the values into one argument tuple up front, because chain.from_iterable consumes its argument lazily (not that memory is a problem here, but it is good practice in general).
And if you follow the previous recommendation, replace values() with keys().
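
To make the difference concrete, here is a small sketch. The shortened COUNTRIES dictionary and the endless_groups generator are assumptions made up for the demonstration, not part of the original script.

```python
from itertools import chain, count, islice

COUNTRIES = {
    "Baltic States": ["EST", "LVA", "LTU"],
    "Eurasia": ["RUS"],
}

# Both spellings produce the same flattened sequence.
unpacked = ";".join(chain(*COUNTRIES.values()))
lazy = ";".join(chain.from_iterable(COUNTRIES.values()))
print(unpacked == lazy)


def endless_groups():
    # Hypothetical generator that never stops yielding groups.
    for n in count():
        yield [n, n]


# chain.from_iterable pulls one group at a time, so it copes with an
# endless source; chain(*endless_groups()) would never return because
# the star has to exhaust the generator before chain is even called.
first_five = list(islice(chain.from_iterable(endless_groups()), 5))
print(first_five)
```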
I would slightly simplify your page_level function by inlining meta["pages"] and using the params argument of requests.get to pass the query parameters:
def page_level(indicator: str) -> Iterator:
    """ Iterate over all pages related to specific indicator. """
    url = f"{URL}{ISO3_PARAM}/indicator/{indicator}"
    page = 1
    while True:
        meta, data = requests.get(url, params={"format": "json", "page": page}).json()
        yield from item_level(data)
        page += 1
        if page > meta["pages"]:
            break
        sleep(1)
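
If you want to check the pagination logic without hitting the network, one option is to pass the page fetcher in as an argument. The following is only a sketch under that assumption: iter_pages, fetch_page and fake_fetch are hypothetical names, the fake data is invented, and the sleep between requests is left out for brevity.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical stand-in for requests.get(url, params=...).json():
# takes a page number and returns the (meta, data) pair of one response.
FetchPage = Callable[[int], Tuple[dict, List[dict]]]


def iter_pages(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield the items of every page, using the same loop as page_level."""
    page = 1
    while True:
        meta, data = fetch_page(page)
        yield from data
        page += 1
        if page > meta["pages"]:
            break


def fake_fetch(page: int) -> Tuple[dict, List[dict]]:
    # Invented two-page response, used only to exercise the loop.
    pages = {1: [{"value": 1}, {"value": 2}], 2: [{"value": 3}]}
    return {"pages": len(pages)}, pages[page]


items = list(iter_pages(fake_fetch))
print(items)
```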
I must say that while I like that you separated things into their own functions, I am not so sure about their names (I know, names are hard!). They convey only at which level a function operates, but not what it does.
My suggestions for names:
- indicator_level -> get_dataframe (and directly return the dataframe) or get_data
- page_level -> get_indicator_data
- item_level -> extract_items
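
With these names, the call chain might read as follows. This is a sketch only: get_indicator_data is stubbed with a single hard-coded page instead of the real paginated request, and the records are trimmed to three fields.

```python
from typing import Iterator, List


def extract_items(data: List[dict]) -> Iterator[dict]:
    """Yield only the fields of interest from each raw record."""
    for item in data:
        yield {
            "id": item["indicator"]["id"],
            "year": item["date"],
            "value": item["value"],
        }


def get_indicator_data(indicator: str) -> Iterator[dict]:
    """Stub: one invented page stands in for the paginated API call."""
    page = [{"indicator": {"id": indicator}, "date": "2000", "value": 1.0}]
    yield from extract_items(page)


def get_data(indicators: List[str]) -> Iterator[dict]:
    """Yield the extracted items of every indicator in turn."""
    for indicator in indicators:
        yield from get_indicator_data(indicator)


rows = list(get_data(["SH.DYN.MORT", "IT.CEL.SETS"]))
print(rows)
```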
- Oh, this is great! I used that dict structure because I felt it was easier to create, but your example surely makes the mapping much better. I agree with the rest as well, thank you! In general, I wasn't sure whether it was a good idea to create three functions instead of one or two, and I wasn't sure if I used yield from correctly. – Hryhorii Pavlenko, May 18, 2020 at 17:36
- @politicalscientist: Your use of yield from is perfectly fine. I was also contemplating whether to put the stuff from indicator_level into page_level, but in the end decided that it was too much clutter and nicer in its own function. – Graipher, May 18, 2020 at 18:28