Accessing time series development data from World Bank API

Question 1

I wrote a script that connects to the World Bank API, gathers some data (indicators) for a number of countries, creates a pd.DataFrame, and saves the result as an xlsx file.

What I had in mind when I was writing the code:

I wanted to avoid constructing the pd.DataFrame within the function or using the append method to store data in memory (I'm not sure whether it helps in this case; is it good practice in general?).
I thought it was a good idea to split the code into separate functions: one that iterates over the data, one that makes a request for every page related to an indicator, and one that iterates over all indicators and calls the other functions.
It was the first time I used yield from, and I'm not sure whether I'm using the construct correctly.
I'm not sure about the type hints.

import os
import requests
import pandas as pd
from time import sleep
from itertools import chain
from datetime import datetime
from typing import List, Iterator

COUNTRIES = {
    "Baltic States": ["EST", "LVA", "LTU"],
    "Central Asia": ["KAZ", "KGZ", "TJK", "TKM", "UZB"],
    "Eastern Europe": ["BLR", "MDA", "UKR"],
    "Eurasia": ["RUS"],
    "Transcaucasia": ["ARM", "AZE", "GEO"]
}
INDICATORS = {
    "Infrastructure": ["EG.ELC.ACCS.RU.ZS", "IT.CEL.SETS"],
    "Medicine": ["SH.DYN.MORT", "SH.DYN.AIDS.ZS"],
    "Public Sector": ["GC.TAX.TOTL.GD.ZS", "MS.MIL.TOTL.TF.ZS"]
}
URL = "http://api.worldbank.org/v2/country/"
DATE = datetime.today().strftime("%Y-%m-%d %H_%M")
CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
ISO3_PARAM = ';'.join(chain(*COUNTRIES.values()))
INDICATOR_PARAM = list(chain(*INDICATORS.values()))


def mapping(s: pd.Series, metadict: dict) -> str:
    """ Assign categories to Series' values according to dict's keys. """
    return ' '.join((k) for k,v in metadict.items() if s in v)


def item_level(data: List) -> Iterator:
    """ Iterate over data and yield dict containing specific fields. """
    for item in data:
        yield {
            "iso3": item["countryiso3code"],
            "indicator": item["indicator"]["value"],
            "id": item["indicator"]["id"],
            "year": item["date"],
            "value": item["value"]
        }


def page_level(indicator: str) -> Iterator:
    """ Iterate over all pages related to specific indicator. """
    base_url = f"{URL}{ISO3_PARAM}/indicator/{indicator}?format=json"
    next_page = 1
    while True:
        meta, data = requests.get(f"{base_url}&page={next_page}").json()
        yield from item_level(data)
        num_pages, next_page = meta["pages"], next_page + 1
        if next_page > num_pages:
            break
        sleep(1)


def indicator_level(indicators: List) -> Iterator:
    """ Iterate over all indicators. """
    for indicator in indicators:
        yield from page_level(indicator)


def main():
    """ Create pd.DataFrame from generator and save it. """
    df = pd.DataFrame(indicator_level(INDICATOR_PARAM))
    df["year"] = df["year"].astype(int)
    df["region"] = df["iso3"].apply(mapping, metadict=COUNTRIES)
    df["category"] = df["id"].apply(mapping, metadict=INDICATORS)
    df.loc[df["year"] >= 1991].to_excel(f"{CURRENT_DIR}/../../data/raw/dataset_{DATE}.xlsx", index=False)


if __name__ == "__main__":
    main()

Answer 1

I would invert your country and indicator mappings, so that each ISO code or indicator id maps directly to its group. Then you can simply do:

COUNTRIES = {"EST": "Baltic States", "LVA": "Baltic States", "LTU": "Baltic States", ...}
df["region"] = df["iso3"].map(COUNTRIES)
df["category"] = df["id"].map(INDICATORS)

If you are too lazy to invert the dictionary manually, or prefer the current structure but still want the easier usage, just use this one-line function:

def invert(d):
    return {v: k for k, values in d.items() for v in values}
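
As a small illustration of what invert produces, here is a sketch using a two-region subset of the COUNTRIES dictionary from the question. The name COUNTRY_TO_REGION is only chosen for this example; it is not part of the original code.

```python
def invert(d):
    """Turn {group: [members]} into {member: group}."""
    return {v: k for k, values in d.items() for v in values}


# Subset of the COUNTRIES mapping from the question.
COUNTRIES = {
    "Baltic States": ["EST", "LVA", "LTU"],
    "Eurasia": ["RUS"],
}

# Hypothetical name for the inverted mapping.
COUNTRY_TO_REGION = invert(COUNTRIES)
print(COUNTRY_TO_REGION)
# {'EST': 'Baltic States', 'LVA': 'Baltic States', 'LTU': 'Baltic States', 'RUS': 'Eurasia'}
```

Note that if a code were listed under two groups, the later group would silently win, so this only suits mappings where each member belongs to exactly one group.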

Instead of

ISO3_PARAM = ';'.join(chain(*COUNTRIES.values()))

Use

ISO3_PARAM = ';'.join(chain.from_iterable(COUNTRIES.values()))

This avoids unpacking every sub-list into function arguments up front, because chain.from_iterable consumes the outer iterable lazily (not that memory is a problem here, but it is good practice in general).
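
A quick sketch of the difference, again on a subset of the question's COUNTRIES dictionary. The groups generator is a made-up helper, included only to show that from_iterable can take a lazy source as well.

```python
from itertools import chain

# Subset of the COUNTRIES mapping from the question.
COUNTRIES = {
    "Baltic States": ["EST", "LVA", "LTU"],
    "Eurasia": ["RUS"],
}

# Both spellings produce the same string for a small dict.
unpacked = ";".join(chain(*COUNTRIES.values()))
lazy = ";".join(chain.from_iterable(COUNTRIES.values()))
assert unpacked == lazy
print(lazy)  # EST;LVA;LTU;RUS


def groups():
    """Hypothetical lazy source of sub-lists."""
    yield ["ARM", "AZE"]
    yield ["GEO"]


# from_iterable pulls one sub-list at a time from the generator.
flattened = list(chain.from_iterable(groups()))
print(flattened)  # ['ARM', 'AZE', 'GEO']
```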

And if you follow the previous recommendation and invert the dictionaries, replace values() with keys().

I would slightly simplify your page_level function by inlining meta["pages"] and using the params argument of requests.get to pass the parameters:

def page_level(indicator: str) -> Iterator:
    """ Iterate over all pages related to specific indicator. """
    url = f"{URL}{ISO3_PARAM}/indicator/{indicator}"
    page = 1
    while True:
        meta, data = requests.get(url, params={"format": "json", "page": page}).json()
        yield from item_level(data)
        page += 1
        if page > meta["pages"]:
            break
        sleep(1)
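
One way to check the paging logic without touching the network is to pass the page fetcher in as an argument. This is only a sketch: iter_items, fetch_page, and FAKE_PAGES are hypothetical names, the fake pages contain made-up data, and the sleep between requests is left out for brevity.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# (meta, data), matching how the loop above unpacks the JSON response.
Page = Tuple[Dict, List[Dict]]


def iter_items(fetch_page: Callable[[int], Page]) -> Iterator[Dict]:
    """Yield every item from every page returned by fetch_page."""
    page = 1
    while True:
        meta, data = fetch_page(page)
        yield from data
        page += 1
        if page > meta["pages"]:
            break


# Fake two-page source standing in for the HTTP call (illustrative data only).
FAKE_PAGES = {
    1: ({"pages": 2}, [{"year": "2001"}, {"year": "2000"}]),
    2: ({"pages": 2}, [{"year": "1999"}]),
}

items = list(iter_items(FAKE_PAGES.__getitem__))
print([item["year"] for item in items])  # ['2001', '2000', '1999']
```

In the real script, fetch_page could be something like lambda page: requests.get(url, params={"format": "json", "page": page}).json(), which is the same call as in the function above.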

I must say that while I like that you separated things into their own functions, I am not so sure about their names (I know, naming is hard!). They convey only the level at which a function operates, not what it does.

My suggestions for names:

indicator_level -> get_dataframe (and directly return the dataframe) or get_data
page_level -> get_indicator_data
item_level -> extract_items
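
To show how the last rename reads in practice, here is item_level from the question under the suggested name extract_items, run on a hand-written record. The sample is shaped like the fields the question's code reads; its values are made up for illustration.

```python
from typing import Dict, Iterator, List


def extract_items(data: List[Dict]) -> Iterator[Dict]:
    """Yield one flat record per raw item (item_level in the question)."""
    for item in data:
        yield {
            "iso3": item["countryiso3code"],
            "indicator": item["indicator"]["value"],
            "id": item["indicator"]["id"],
            "year": item["date"],
            "value": item["value"],
        }


# Hypothetical sample record; the values are not real data.
sample = [
    {
        "countryiso3code": "EST",
        "indicator": {"id": "IT.CEL.SETS", "value": "Example indicator name"},
        "date": "2000",
        "value": 1.0,
    }
]

records = list(extract_items(sample))
print(records[0]["iso3"], records[0]["year"])  # EST 2000
```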

Comment 1

Oh, this is great! I used that dict structure because I felt it was easier to create; your example certainly makes the mapping much better. I agree with the rest as well, thank you! In general, I wasn't sure whether it was a good idea to create three functions instead of one or two. And I wasn't sure if I used yield from correctly.

Comment 2

@politicalscientist: Your use of yield from is perfectly fine. I was also contemplating whether to put the stuff from indicator_level into the page_level, but in the end decided that it was too much clutter and nicer in its own function.
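
For plain iteration like this, yield from is equivalent to looping over the inner generator and re-yielding each value. A minimal sketch with made-up function names:

```python
def inner(n):
    """Yield 0 .. n-1."""
    for i in range(n):
        yield i


def with_loop(sizes):
    # Explicit nested loop that re-yields each value.
    for n in sizes:
        for value in inner(n):
            yield value


def with_yield_from(sizes):
    # Same output, delegating to the inner generator.
    for n in sizes:
        yield from inner(n)


result = list(with_yield_from([2, 3]))
assert list(with_loop([2, 3])) == result
print(result)  # [0, 1, 0, 1, 2]
```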

Graipher (41.6k reputation; 7 gold, 70 silver, 134 bronze badges) · Accepted Answer · 2020-05-18 16:37:23Z

