I haven't done much Python programming, but I really like the language so I've been using it for side projects that I work on. The problem with this is that I don't have the opportunity to have my peers review my code and give suggestions.
I've been working on a project that will be scraping GitHub looking for security vulnerabilities. I've created a separate file in the project that contains all functions that interact with GitHub's API.
import requests
import re
import base64
import os
def getRepos(since=0):
url = 'http://api.github.com/repositories'
data = """{
since: %s
}""" % since
response = requests.get(url, data=data)
if response.status_code == 403:
print "Problem making request!", response.status_code
print response.headers
matches = re.match(r'<.+?>', response.headers['Link'])
next = matches.group(0)[1:-1]
return response.json(), next
def getRepo(url):
response = requests.get(url)
return response.json()
def getReadMe(url):
url = url + "/readme"
response = requests.get(url)
return response.json()
# todo: return array of all commits so we can examine each one
def getRepoSHA(url):
# /repos/:owner/:repo/commits
commits = requests.get(url + "/commits").json()
return commits[0]['sha']
def getFileContent(item):
ignoreExtensions = ['jpg']
filename, extension = os.path.splitext(item['path'])
if extension in ignoreExtensions:
return []
content = requests.get(item['url']).json()
lines = content['content'].split('\n')
lines = map(base64.b64decode, lines)
print 'path', item['path']
print 'lines', "".join(lines[:5])
return "".join(lines)
def getRepoContents(url, sha):
# /repos/:owner/:repo/git/trees/:sha?recursive=1
url = url + ('/git/trees/%s?recursive=1' % sha)
# print 'url', url
response = requests.get(url)
return response.json()
The code is run from here:
import github
import json
def processRepoContents(repoContents):
# for each entry in the repo
for tree in repoContents['tree']:
contentType = tree['type']
print 'contentType --- ', contentType
# if type is "blob" get the content
if contentType == 'blob':
github.getFileContent(tree)
print '***blob***'
elif contentType == 'tree':
print '***tree***'
# if type is "tree" get the subtree
if __name__ == '__main__':
repos, next = github.getRepos()
for repo in repos[0:10]:
# repoJson = github.getRepo(repo['url'])
sha = github.getRepoSHA(repo['url'])
repoJson = github.getRepoContents(repo['url'], sha)
processRepoContents(repoJson)
I was hoping to get some feedback as to whether or not I am doing anything that would be considered not a best practice.
Also - I have all these functions in a file called github.py
and I then include it by using import github
wherever I need it. I'm assuming that it would not make sense to create a class to wrap these functions, since there is never any state for the class to keep track of that the functions would require. Does this reasoning make sense or should I wrap these functions in a class?
If anyone is interested they can see all of the code in the repo here - I would love to get feed back on the rest of the code (there isn't much more), but didn't think it would all fit in this question.
1 Answer 1
PEP8 - click, read, apply
PEP8 gives coding conventions for the Python code comprising the standard library in the main Python distribution
- you should use 4 spaces per indentation level
- function names should be lowercase, with words separated by underscores as necessary to improve readability. mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility
- variable names should also stick to the above rule
- there should be two newlines between each method
- use augmented assigments where possible (
url = url + ('/git/trees/%s?recursive=1' % sha)
->url += '/git/trees/%s?recursive=1' % sha
) - you're sometimes using double-quotes around strings, and sometimes single-quotes. Choose one and stick to that.
- use
format()
instead of the old%
- I'd rather suggest you use the
print()
function even if you're using python 2.7. It will make you code easily portable. - you're lacking docstrings. Write docstrings for all public modules, functions, classes, and methods. They are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the
def
line.)
Code:
import base64
import os
import re
import requests
def get_repos(since=0):
"""
Left as an exercise for OP
"""
url = 'http://api.github.com/repositories'
data = '{{since: {}}}'.format(since)
response = requests.get(url, data=data)
if response.status_code == 403:
print('Problem making request! {}'.format(response.status_code))
print(response.headers)
matches = re.match(r'<.+?>', response.headers['Link'])
next = matches.group(0)[1:-1]
return response.json(), next
def get_repo(url):
"""
Left as an exercise for OP
"""
return requests.get(url).json()
def get_readme(url):
"""
Left as an exercise for OP
"""
url += '/readme'
return requests.get(url).json()
# todo: return array of all commits so we can examine each one
def get_repo_sha(url):
"""
Left as an exercise for OP
"""
commits = requests.get(url + '/commits').json()
return commits[0]['sha']
def get_file_content(item):
"""
Left as an exercise for OP
"""
ignore_extensions = ['jpg']
filename, extension = os.path.splitext(item['path'])
if extension in ignore_extensions:
return []
content = requests.get(item['url']).json()
lines = content['content'].split('\n')
lines = map(base64.b64decode, lines)
print('Path: '.format(item['path']))
print('Lines: '.format(''.join(lines[:5])))
return ''.join(lines)
def get_repo_contents(url, sha):
"""
Left as an exercise for OP
"""
url += '/git/trees/{}?recursive=1'.format(sha)
return requests.get(url).json()
For the second .py
file:
- the above rules apply
- don't import modules you're not using (
import json
)
Code
import github
def process_repo_contents(repo_contents):
"""
Left as an exercise for OP
"""
for tree in repo_contents['tree']:
content_type = tree['type']
print('content_type --- {}'.format(content_type))
if content_type == 'blob':
github.get_file_content(tree)
print('***blob***')
elif content_type == 'tree':
print('***tree***')
if __name__ == '__main__':
repos, next = github.get_repos()
for repo in repos[0:10]:
sha = github.get_repo_sha(repo['url'])
repo_json = github.get_repo_contents(repo['url'], sha)
process_repo_contents(repo_json)
Pay more attention to variable names and data types
next
is a builtin keyword so I'd recommend you to change it to something elseignore_extensions
is a list in your case. And it only keeps one string. Don't you think it would be more appropriate to make it a string from the beginning ?
-
\$\begingroup\$ I assume it's just because they're placeholders, but your docstrings aren't compliant with PEP-257. \$\endgroup\$jonrsharpe– jonrsharpe2017年01月02日 08:59:31 +00:00Commented Jan 2, 2017 at 8:59
-
\$\begingroup\$ @jonrsharpe they're just placeholders, indeed \$\endgroup\$Grajdeanu Alex– Grajdeanu Alex2017年01月02日 09:01:02 +00:00Commented Jan 2, 2017 at 9:01
-
\$\begingroup\$ Great answer - thanks for the info @Dex'ter. Do you have any input on my question regarding class usage? Safe to assume my reasoning was correct since you didn't mention it? \$\endgroup\$Abe Miessler– Abe Miessler2017年01月03日 05:45:56 +00:00Commented Jan 3, 2017 at 5:45
-
\$\begingroup\$ Personally, I'm not a fan of OOP-abuse style. So no, in you case, encapsulating everything inside a class wouldn't do much sense for me. OO is about encapsulating mutable state and is therefore more appropriate for interactive applications, GUI's, and API's exposing mutable state. It is no coincidence that OO was developed in parallel with the first GUI's. \$\endgroup\$Grajdeanu Alex– Grajdeanu Alex2017年01月03日 06:40:41 +00:00Commented Jan 3, 2017 at 6:40
prowler.py
from your github. I find it relevant for this review. \$\endgroup\$