I have Python 3 code that fetches data from the GitHub API, and I would like some advice on the architecture of this code.
import urllib.request, urllib.parse, urllib.error
import json
import sys
from collections import Counter
from datetime import datetime, timedelta, date
from tabulate import tabulate
from LanguagesRepoCounter import LanguagesCounter
from RepoListLanguages import RepoListLanguages
# getting the date of one month ago from today
One_Mouth_ago_date = datetime.date(datetime.now()) - timedelta(days=30)
url = 'https://api.github.com/search/repositories?q=created:%3E'+One_Mouth_ago_date.isoformat()+'&sort=stars&order=desc&page=1&per_page=100'
data_connection_request = urllib.request.urlopen(url)
unicoded_data = data_connection_request.read().decode()
try:
    json_data = json.loads(unicoded_data)
except:
    json_data = None
# testing retrieved data
if not json_data or json_data["total_count"] < 1:
    print(json_data)
    sys.exit('######## Failure To Retrieve ########')
# Counting the number of repos using every language
Lang_Dic_Count = LanguagesCounter(json_data)
# Getting the list of repos using every language
DicDataList = RepoListLanguages(json_data, Lang_Dic_Count)
# Display the repo URLs for every language
for key in DicDataList:
    print(key, ' language used in')
    for i in range(len(DicDataList[key])):
        print(" ", DicDataList[key][i])
# Display the number of repos using every language in a table
table = []
for key in Lang_Dic_Count:
    under_table = []
    under_table.append(key)
    under_table.append(Lang_Dic_Count[key])
    table.append(under_table)
print(tabulate(table, headers=('Language', 'number of Repo using it'), tablefmt="grid"))
The first function is:
def LanguagesCounter(Data):
    '''Count how many times each programming language is used in the 100 most
    starred repos created in the last 30 days (passed as a JSON object); return
    a dict with languages as keys and the number of repos using each language
    as values.
    '''
    Languages_liste = []
    for i in range(len(Data["items"])):
        Languages_liste.append(Data["items"][i]["language"])
    Lang_Dic_Count = {i: Languages_liste.count(i) for i in Languages_liste}
    return Lang_Dic_Count
The third function is:
def RepoListLanguages(Data, list):
    '''Return the repositories from the Data parameter for every programming
    language in the list parameter, as a dictionary with the programming
    language as key and a list of repository URLs as value.
    '''
    languagesRepo = {}
    for key in list:
        languagesRepo[key] = []
        for repo in range(len(Data["items"])):
            if key == Data["items"][repo]["language"]:
                languagesRepo[key].append(Data["items"][repo]["html_url"])
    return languagesRepo
The code fetches the most starred repos created in the last 30 days from the GitHub API, counts how many repos use each programming language, and lists the repo links for each language.
2 Answers
The main improvements can be made in your two helper functions. Your first function could use the built-in collections.Counter:
from collections import Counter

def languages_counter(data):
    '''Count how many times every programming language is used in the given
    JSON data; return a dict-like Counter with languages as keys and the
    number of repos using each language as values.
    '''
    return Counter(repo["language"] for repo in data["items"])
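For illustration, here is how the Counter behaves on a small hand-made stand-in for the search payload (the items below are invented for the example; note that repos with no detected language contribute a None key, since GitHub reports null there):

```python
from collections import Counter

def languages_counter(data):
    # Count how often each language occurs among the returned repos
    return Counter(repo["language"] for repo in data["items"])

# Hand-made stand-in for the GitHub search payload (values are made up)
sample = {"items": [
    {"language": "Python"},
    {"language": "Python"},
    {"language": "C"},
    {"language": None},  # GitHub reports null when no language was detected
]}

counts = languages_counter(sample)
print(counts["Python"])       # 2
print(counts.most_common(1))  # [('Python', 2)]
```

Since Counter is a dict subclass, the existing table-building loop in the main code keeps working unchanged.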
And the second one could use collections.defaultdict:
from collections import defaultdict

def groupby_language(data):
    '''Return the repositories from the data parameter grouped by programming
    language, as a dictionary with the language as key and a list of
    repository URLs as value.
    '''
    repo_urls = defaultdict(list)
    for repo in data["items"]:
        repo_urls[repo["language"]].append(repo["html_url"])
    return repo_urls
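To see the defaultdict at work, here is the same function run on a tiny hand-made payload (the URLs are invented for the example); accessing a language that never occurred simply yields a fresh empty list:

```python
from collections import defaultdict

def groupby_language(data):
    # Group repository URLs under their language; missing keys auto-create []
    repo_urls = defaultdict(list)
    for repo in data["items"]:
        repo_urls[repo["language"]].append(repo["html_url"])
    return repo_urls

# Hand-made payload; the URLs are invented for illustration
sample = {"items": [
    {"language": "Python", "html_url": "https://github.com/a/one"},
    {"language": "Python", "html_url": "https://github.com/b/two"},
    {"language": "C", "html_url": "https://github.com/c/three"},
]}

grouped = groupby_language(sample)
print(grouped["Python"])  # ['https://github.com/a/one', 'https://github.com/b/two']
```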
Note that I followed Python's official style guide, PEP 8, which recommends lower_case names for functions and variables.
Alternatively, you could group the repos by language first, and use that afterwards:
def groupby(values, key):
    grouped = defaultdict(list)
    for x in values:
        grouped[x[key]].append(x)
    return grouped

grouped_repos = groupby(get_data(), "language")
counts = {language: len(repos)
          for language, repos in grouped_repos.items()}
urls = {language: [repo["html_url"] for repo in repos]
        for language, repos in grouped_repos.items()}
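Substituting a hand-made list of items for get_data() (the entries below are invented for illustration), the grouping variant behaves like this:

```python
from collections import defaultdict

def groupby(values, key):
    # Generic grouping: collect each dict under the value of its `key` field
    grouped = defaultdict(list)
    for x in values:
        grouped[x[key]].append(x)
    return grouped

# Invented items, standing in for get_data()
items = [
    {"language": "Python", "html_url": "https://github.com/a/one"},
    {"language": "C", "html_url": "https://github.com/c/three"},
    {"language": "Python", "html_url": "https://github.com/b/two"},
]

grouped_repos = groupby(items, "language")
counts = {language: len(repos)
          for language, repos in grouped_repos.items()}
urls = {language: [repo["html_url"] for repo in repos]
        for language, repos in grouped_repos.items()}
print(counts)  # {'Python': 2, 'C': 1}
```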
In your main code, you should put fetching the data into its own function as well; it then becomes easier to change.
I would also use the requests module here instead of urllib. It is a bit more user-friendly and has advanced options like re-using a session for multiple requests, automatic URL encoding and decoding, directly getting a JSON dictionary, and checking the status code and raising an exception if it is not OK.
import requests
from datetime import datetime, timedelta

def get_data():
    one_month_ago = datetime.date(datetime.now()) - timedelta(days=30)
    url = "https://api.github.com/search/repositories"
    params = {"q": f"created:>{one_month_ago.isoformat()}",
              "sort": "stars",
              "order": "desc",
              "page": 1,
              "per_page": 100}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()["items"]
The two functions you broke out (LanguagesCounter and RepoListLanguages) could both be simplified significantly using comprehensions. You seem to be thinking in terms of loops and appending to lists instead of generating them in one go, which is more Pythonic. Add in Counter from collections and RepoListLanguages becomes a relatively trivial one-liner.
Those are both at the top of the rewrite below so you can see them in action.
Beyond that, the only real changes I made were using snake_case per PEP 8 guidelines, changing a few variable names for clarity, and pulling the messy code at the bottom into functions. Functions and clear structure make the code much easier to read for anyone who needs to, even if they don't change how it works.
import urllib.request
import json
import sys
from collections import Counter
from datetime import datetime, timedelta
from tabulate import tabulate
def language_counter(data):
    return Counter(repo["language"] for repo in data["items"])

def repo_languages(data, languages):
    return {
        key: [
            repo["html_url"] for repo in data["items"]
            if key == repo["language"]
        ]
        for key in languages
    }
def new_url():
    last_month = datetime.date(datetime.now()) - timedelta(days=30)
    return ('https://api.github.com/search/repositories?q=created:%3E'
            + last_month.isoformat()
            + '&sort=stars&order=desc&page=1&per_page=100')

def get_data(url):
    data_connection_request = urllib.request.urlopen(url)
    unicoded_data = data_connection_request.read().decode()
    try:
        json_data = json.loads(unicoded_data)
    except ValueError:
        json_data = None
    return json_data
def validate_or_quit(json_data):
    if not json_data or json_data["total_count"] < 1:
        print(json_data)
        sys.exit('######## Failure To Retrieve ########')

def pretty_print_list(data):
    for key in data:
        print(key, ' language used in')
        for item in data[key]:
            print(" " * 27, item)

def pretty_print_dict(data):
    data_list = [[key, value] for key, value in data.items()]
    print(
        tabulate(
            data_list,
            headers=('Language', 'number of Repo using it'),
            tablefmt="grid")
    )
if __name__ == "__main__":
    json_data = get_data(new_url())
    validate_or_quit(json_data)
    lang_counts = language_counter(json_data)
    repo_langs = repo_languages(json_data, lang_counts)
    pretty_print_list(repo_langs)
    pretty_print_dict(lang_counts)
Comment (E.Mohammed, Nov 24, 2020): Thank you for the advice. Is it good practice to define all the functions in the same file, as you did in your recommendation?

Comment (Coupcoup, Nov 24, 2020): Depends on how large/complex your code is and whether you expect to reuse portions. For a small project like this seems to be, you can probably leave it in one file.
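On the question in the comments: once a script grows, a common practice is to split it along responsibilities. One possible layout for this project (every file name below is illustrative, not from the original code):

```
fetch.py     -> new_url(), get_data()
analyze.py   -> language_counter(), repo_languages()
display.py   -> pretty_print_list(), pretty_print_dict()
main.py      -> the entry point, which would start with imports such as
                from fetch import new_url, get_data
                from analyze import language_counter, repo_languages
                from display import pretty_print_list, pretty_print_dict
```

For a script of this size a single file is perfectly fine; splitting pays off once functions are reused elsewhere or the file becomes hard to scan.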