I have Python 3 code that fetches data from the GitHub API, and I would like some advice on the architecture of this code.
import urllib.request, urllib.parse, urllib.error
import json
import sys
from collections import Counter
from datetime import datetime, timedelta, date
from tabulate import tabulate
from LanguagesRepoCounter import LanguagesCounter
from RepoListLanguages import RepoListLanguages
# getting the date of one month ago from today
One_Mouth_ago_date = datetime.date(datetime.now()) - timedelta(days=30)
url = 'https://api.github.com/search/repositories?q=created:%3E'+One_Mouth_ago_date.isoformat()+'&sort=stars&order=desc&page=1&per_page=100'
data_connection_request = urllib.request.urlopen(url)
unicoded_data = data_connection_request.read().decode()
try:
    json_data = json.loads(unicoded_data)
except:
    json_data = None
# testing retrieved data
if not json_data or json_data["total_count"] < 1:
    print(json_data)
    sys.exit('######## Failure To Retrieve ########')
# Counting the number of repos using every language
Lang_Dic_Count = LanguagesCounter(json_data)
# Getting the list of repos using every language
DicDataList = RepoListLanguages(json_data, Lang_Dic_Count)
# Display the repo URLs for every language
for key in DicDataList:
    print(key, ' language used in')
    for i in range(len(DicDataList[key])):
        print(" ", DicDataList[key][i])
# Display the number of repos using every language in a table
table = []
for key in Lang_Dic_Count:
    under_table = []
    under_table.append(key)
    under_table.append(Lang_Dic_Count[key])
    table.append(under_table)
print(tabulate(table, headers=('Language', 'number of Repo using it'), tablefmt="grid"))
The first function is:
def LanguagesCounter(Data):
    '''Count how many times each programming language is used in the 100 most
    starred repos created in the last 30 days (passed as a JSON object); return
    a dict with languages as keys and the number of repos using each language
    as values.
    '''
    Languages_liste = []
    for i in range(len(Data["items"])):
        Languages_liste.append(Data["items"][i]["language"])
    Lang_Dic_Count = {i: Languages_liste.count(i) for i in Languages_liste}
    return Lang_Dic_Count
The third function is:
def RepoListLanguages(Data, list):
    '''Return the repositories from the Data parameter for every programming
    language in the list parameter, as a dictionary with the programming
    language as key and a list of repository URLs as value.
    '''
    languagesRepo = {}
    for key in list:
        languagesRepo[key] = []
        for repo in range(len(Data["items"])):
            if key == Data["items"][repo]["language"]:
                languagesRepo[key].append(Data["items"][repo]["html_url"])
    return languagesRepo
The code fetches the most starred repos created in the last 30 days from the GitHub API, counts how many repos use each programming language, and lists the repo links for each language.
2 Answers
The main improvements can be made in your two helper functions. Your first function could use the built-in collections.Counter:
from collections import Counter

def languages_counter(data):
    '''Count how many times every programming language is used in the given
    JSON data; return a dict-like Counter with languages as keys and the
    number of repos using each language as values.
    '''
    return Counter(repo["language"] for repo in data["items"])
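For illustration, here is how the Counter behaves on a small hand-made stand-in for the search payload (the items below are invented for the example; note that repos with no detected language contribute a None key, since GitHub reports null there):

```python
from collections import Counter

def languages_counter(data):
    # Count how often each language occurs among the returned repos
    return Counter(repo["language"] for repo in data["items"])

# Hand-made stand-in for the GitHub search payload (values are made up)
sample = {"items": [
    {"language": "Python"},
    {"language": "Python"},
    {"language": "C"},
    {"language": None},  # GitHub reports null when no language was detected
]}

counts = languages_counter(sample)
print(counts["Python"])       # 2
print(counts.most_common(1))  # [('Python', 2)]
```

Since Counter is a dict subclass, the existing table-building loop in the main code keeps working unchanged.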
And the second one could use collections.defaultdict:
from collections import defaultdict

def groupby_language(data):
    '''Return the repositories from the data parameter grouped by programming
    language, as a dictionary with the language as key and a list of
    repository URLs as value.
    '''
    repo_urls = defaultdict(list)
    for repo in data["items"]:
        repo_urls[repo["language"]].append(repo["html_url"])
    return repo_urls
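To see the defaultdict at work, here is the same function run on a tiny hand-made payload (the URLs are invented for the example); accessing a language that never occurred simply yields a fresh empty list:

```python
from collections import defaultdict

def groupby_language(data):
    # Group repository URLs under their language; missing keys auto-create []
    repo_urls = defaultdict(list)
    for repo in data["items"]:
        repo_urls[repo["language"]].append(repo["html_url"])
    return repo_urls

# Hand-made payload; the URLs are invented for illustration
sample = {"items": [
    {"language": "Python", "html_url": "https://github.com/a/one"},
    {"language": "Python", "html_url": "https://github.com/b/two"},
    {"language": "C", "html_url": "https://github.com/c/three"},
]}

grouped = groupby_language(sample)
print(grouped["Python"])  # ['https://github.com/a/one', 'https://github.com/b/two']
```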
Note that I followed Python's official style guide, PEP 8, which recommends lower_case names for functions and variables.
Alternatively, you could group the repos by language first, and use that afterwards:
def groupby(values, key):
    grouped = defaultdict(list)
    for x in values:
        grouped[x[key]].append(x)
    return grouped

grouped_repos = groupby(get_data(), "language")
counts = {language: len(repos)
          for language, repos in grouped_repos.items()}
urls = {language: [repo["html_url"] for repo in repos]
        for language, repos in grouped_repos.items()}
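Substituting a hand-made list of items for get_data() (the entries below are invented for illustration), the grouping variant behaves like this:

```python
from collections import defaultdict

def groupby(values, key):
    # Generic grouping: collect each dict under the value of its `key` field
    grouped = defaultdict(list)
    for x in values:
        grouped[x[key]].append(x)
    return grouped

# Invented items, standing in for get_data()
items = [
    {"language": "Python", "html_url": "https://github.com/a/one"},
    {"language": "C", "html_url": "https://github.com/c/three"},
    {"language": "Python", "html_url": "https://github.com/b/two"},
]

grouped_repos = groupby(items, "language")
counts = {language: len(repos)
          for language, repos in grouped_repos.items()}
urls = {language: [repo["html_url"] for repo in repos]
        for language, repos in grouped_repos.items()}
print(counts)  # {'Python': 2, 'C': 1}
```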
In your main code, you should put fetching the data into its own function as well; it then becomes easier to change.
I would also use the requests module here instead of urllib. It is a bit more user-friendly and has advanced options like re-using a session for multiple requests, automatic URL encoding and decoding, directly getting a JSON dictionary, and checking the status code and raising an exception if it is not OK.
import requests
from datetime import datetime, timedelta

def get_data():
    one_month_ago = datetime.date(datetime.now()) - timedelta(days=30)
    url = "https://api.github.com/search/repositories"
    params = {"q": f"created:>{one_month_ago.isoformat()}",
              "sort": "stars",
              "order": "desc",
              "page": 1,
              "per_page": 100}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()["items"]
The two functions you broke out (LanguagesCounter and RepoListLanguages) could both be simplified significantly using comprehensions. You seem to be thinking in terms of loops and appending to lists instead of generating them in one go, which is more Pythonic. Add in Counter from collections and RepoListLanguages becomes a relatively trivial one-liner.
Those are both at the top of the rewrite below so you can see them in action.
Beyond that, the only real changes I made were using snake_case per PEP 8 guidelines, changing a few variable names for clarity, and pulling the messy code at the bottom into functions. Functions and clear structure make the code much easier to read for anyone who needs to, even if they don't change how it works.
import urllib.request
import json
import sys
from collections import Counter
from datetime import datetime, timedelta
from tabulate import tabulate
def language_counter(data):
    return Counter(repo["language"] for repo in data["items"])

def repo_languages(data, languages):
    return {
        key: [
            repo["html_url"] for repo in data["items"]
            if key == repo["language"]
        ]
        for key in languages
    }
def new_url():
    last_month = datetime.date(datetime.now()) - timedelta(days=30)
    return ('https://api.github.com/search/repositories?q=created:%3E'
            + last_month.isoformat()
            + '&sort=stars&order=desc&page=1&per_page=100')

def get_data(url):
    data_connection_request = urllib.request.urlopen(url)
    unicoded_data = data_connection_request.read().decode()
    try:
        json_data = json.loads(unicoded_data)
    except ValueError:
        json_data = None
    return json_data
def validate_or_quit(json_data):
    if not json_data or json_data["total_count"] < 1:
        print(json_data)
        sys.exit('######## Failure To Retrieve ########')

def pretty_print_list(data):
    for key in data:
        print(key, ' language used in')
        for item in data[key]:
            print(" " * 27, item)

def pretty_print_dict(data):
    data_list = [[key, value] for key, value in data.items()]
    print(
        tabulate(
            data_list,
            headers=('Language', 'number of Repo using it'),
            tablefmt="grid")
    )
if __name__ == "__main__":
    json_data = get_data(new_url())
    validate_or_quit(json_data)
    lang_counts = language_counter(json_data)
    repo_langs = repo_languages(json_data, lang_counts)
    pretty_print_list(repo_langs)
    pretty_print_dict(lang_counts)
Comment (E.Mohammed, Nov 24, 2020): Thank you for the advice. Is it good practice to define all the functions in the same file, as you did in your recommendation?

Comment (Coupcoup, Nov 24, 2020): Depends on how large/complex your code is and whether you expect to reuse portions. For a small project like this seems to be, you can probably leave it in one file.
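On the question in the comments: once a script grows, a common practice is to split it along responsibilities. One possible layout for this project (every file name below is illustrative, not from the original code):

```
fetch.py     -> new_url(), get_data()
analyze.py   -> language_counter(), repo_languages()
display.py   -> pretty_print_list(), pretty_print_dict()
main.py      -> the entry point, which would start with imports such as
                from fetch import new_url, get_data
                from analyze import language_counter, repo_languages
                from display import pretty_print_list, pretty_print_dict
```

For a script of this size a single file is perfectly fine; splitting pays off once functions are reused elsewhere or the file becomes hard to scan.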