Scraping GitHub for security vulnerabilities

Question 1

I haven't done much Python programming, but I really like the language so I've been using it for side projects that I work on. The problem with this is that I don't have the opportunity to have my peers review my code and give suggestions.

I've been working on a project that will be scraping GitHub looking for security vulnerabilities. I've created a separate file in the project that contains all functions that interact with GitHub's API.

import requests
import re
import base64
import os
def getRepos(since=0):
 url = 'http://api.github.com/repositories'
 data = """{
 since: %s
 }""" % since
 response = requests.get(url, data=data)
 if response.status_code == 403:
 print "Problem making request!", response.status_code
 print response.headers
 matches = re.match(r'<.+?>', response.headers['Link'])
 next = matches.group(0)[1:-1]
 return response.json(), next
def getRepo(url):
 response = requests.get(url)
 return response.json()
def getReadMe(url):
 url = url + "/readme"
 response = requests.get(url)
 return response.json()
# todo: return array of all commits so we can examine each one 
def getRepoSHA(url):
 # /repos/:owner/:repo/commits
 commits = requests.get(url + "/commits").json()
 return commits[0]['sha']
def getFileContent(item):
 ignoreExtensions = ['jpg']
 filename, extension = os.path.splitext(item['path'])
 if extension in ignoreExtensions:
 return []
 content = requests.get(item['url']).json()
 lines = content['content'].split('\n')
 lines = map(base64.b64decode, lines)
 print 'path', item['path']
 print 'lines', "".join(lines[:5])
 return "".join(lines)
def getRepoContents(url, sha):
 # /repos/:owner/:repo/git/trees/:sha?recursive=1
 url = url + ('/git/trees/%s?recursive=1' % sha)
 # print 'url', url
 response = requests.get(url)
 return response.json()

The code is run from here:

import github
import json
def processRepoContents(repoContents):
 # for each entry in the repo
 for tree in repoContents['tree']:
 contentType = tree['type']
 print 'contentType --- ', contentType
 # if type is "blob" get the content
 if contentType == 'blob':
 github.getFileContent(tree)
 print '***blob***'
 elif contentType == 'tree':
 print '***tree***'
 # if type is "tree" get the subtree
if __name__ == '__main__':
 repos, next = github.getRepos()
 for repo in repos[0:10]:
 # repoJson = github.getRepo(repo['url'])
 sha = github.getRepoSHA(repo['url'])
 repoJson = github.getRepoContents(repo['url'], sha)
 processRepoContents(repoJson)

I was hoping to get some feedback as to whether or not I am doing anything that would be considered not a best practice.

Also - I have all these functions in a file called github.py and I then include it by using import github wherever I need it. I'm assuming that it would not make sense to create a class to wrap these functions, since there is never any state for the class to keep track of that the functions would require. Does this reasoning make sense or should I wrap these functions in a class?

If anyone is interested they can see all of the code in the repo here - I would love to get feed back on the rest of the code (there isn't much more), but didn't think it would all fit in this question.

Question 2

I think you should also add prowler.py from your github. I find it relevant for this review.

Question 3

Sure thing @Dex'ter - added

Question 4

PEP8 - click, read, apply

PEP8 gives coding conventions for the Python code comprising the standard library in the main Python distribution

you should use 4 spaces per indentation level
function names should be lowercase, with words separated by underscores as necessary to improve readability. mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility
variable names should also stick to the above rule
there should be two newlines between each method
use augmented assigments where possible (url = url + ('/git/trees/%s?recursive=1' % sha) -> url += '/git/trees/%s?recursive=1' % sha)
you're sometimes using double-quotes around strings, and sometimes single-quotes. Choose one and stick to that.
use format() instead of the old %
I'd rather suggest you use the print() function even if you're using python 2.7. It will make you code easily portable.
you're lacking docstrings. Write docstrings for all public modules, functions, classes, and methods. They are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.)

Code:

import base64
import os
import re
import requests
def get_repos(since=0):
 """
 Left as an exercise for OP
 """
 url = 'http://api.github.com/repositories'
 data = '{{since: {}}}'.format(since)
 response = requests.get(url, data=data)
 if response.status_code == 403:
 print('Problem making request! {}'.format(response.status_code))
 print(response.headers)
 matches = re.match(r'<.+?>', response.headers['Link'])
 next = matches.group(0)[1:-1]
 return response.json(), next
def get_repo(url):
 """
 Left as an exercise for OP
 """
 return requests.get(url).json()
def get_readme(url):
 """
 Left as an exercise for OP
 """
 url += '/readme'
 return requests.get(url).json()
# todo: return array of all commits so we can examine each one
def get_repo_sha(url):
 """
 Left as an exercise for OP
 """
 commits = requests.get(url + '/commits').json()
 return commits[0]['sha']
def get_file_content(item):
 """
 Left as an exercise for OP
 """
 ignore_extensions = ['jpg']
 filename, extension = os.path.splitext(item['path'])
 if extension in ignore_extensions:
 return []
 content = requests.get(item['url']).json()
 lines = content['content'].split('\n')
 lines = map(base64.b64decode, lines)
 print('Path: '.format(item['path']))
 print('Lines: '.format(''.join(lines[:5])))
 return ''.join(lines)
def get_repo_contents(url, sha):
 """
 Left as an exercise for OP
 """
 url += '/git/trees/{}?recursive=1'.format(sha)
 return requests.get(url).json()

For the second .py file:

the above rules apply
don't import modules you're not using (import json)

Code

import github
def process_repo_contents(repo_contents):
 """
 Left as an exercise for OP
 """
 for tree in repo_contents['tree']:
 content_type = tree['type']
 print('content_type --- {}'.format(content_type))
 if content_type == 'blob':
 github.get_file_content(tree)
 print('***blob***')
 elif content_type == 'tree':
 print('***tree***')
if __name__ == '__main__':
 repos, next = github.get_repos()
 for repo in repos[0:10]:
 sha = github.get_repo_sha(repo['url'])
 repo_json = github.get_repo_contents(repo['url'], sha)
 process_repo_contents(repo_json)

Pay more attention to variable names and data types

next is a builtin keyword so I'd recommend you to change it to something else
ignore_extensions is a list in your case. And it only keeps one string. Don't you think it would be more appropriate to make it a string from the beginning ?

Question 5

I assume it's just because they're placeholders, but your docstrings aren't compliant with PEP-257.

Question 6

@jonrsharpe they're just placeholders, indeed

Question 7

Great answer - thanks for the info @Dex'ter. Do you have any input on my question regarding class usage? Safe to assume my reasoning was correct since you didn't mention it?

Question 8

Personally, I'm not a fan of OOP-abuse style. So no, in you case, encapsulating everything inside a class wouldn't do much sense for me. OO is about encapsulating mutable state and is therefore more appropriate for interactive applications, GUI's, and API's exposing mutable state. It is no coincidence that OO was developed in parallel with the first GUI's.

score 7 · Accepted Answer · 2017-01-02 07:49:54Z

PEP8 - click, read, apply

PEP8 gives coding conventions for the Python code comprising the standard library in the main Python distribution

you should use 4 spaces per indentation level
function names should be lowercase, with words separated by underscores as necessary to improve readability. mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility
variable names should also stick to the above rule
there should be two newlines between each method
use augmented assigments where possible (url = url + ('/git/trees/%s?recursive=1' % sha) -> url += '/git/trees/%s?recursive=1' % sha)
you're sometimes using double-quotes around strings, and sometimes single-quotes. Choose one and stick to that.
use format() instead of the old %
I'd rather suggest you use the print() function even if you're using python 2.7. It will make you code easily portable.
you're lacking docstrings. Write docstrings for all public modules, functions, classes, and methods. They are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.)

Code:

import base64
import os
import re
import requests
def get_repos(since=0):
 """
 Left as an exercise for OP
 """
 url = 'http://api.github.com/repositories'
 data = '{{since: {}}}'.format(since)
 response = requests.get(url, data=data)
 if response.status_code == 403:
 print('Problem making request! {}'.format(response.status_code))
 print(response.headers)
 matches = re.match(r'<.+?>', response.headers['Link'])
 next = matches.group(0)[1:-1]
 return response.json(), next
def get_repo(url):
 """
 Left as an exercise for OP
 """
 return requests.get(url).json()
def get_readme(url):
 """
 Left as an exercise for OP
 """
 url += '/readme'
 return requests.get(url).json()
# todo: return array of all commits so we can examine each one
def get_repo_sha(url):
 """
 Left as an exercise for OP
 """
 commits = requests.get(url + '/commits').json()
 return commits[0]['sha']
def get_file_content(item):
 """
 Left as an exercise for OP
 """
 ignore_extensions = ['jpg']
 filename, extension = os.path.splitext(item['path'])
 if extension in ignore_extensions:
 return []
 content = requests.get(item['url']).json()
 lines = content['content'].split('\n')
 lines = map(base64.b64decode, lines)
 print('Path: '.format(item['path']))
 print('Lines: '.format(''.join(lines[:5])))
 return ''.join(lines)
def get_repo_contents(url, sha):
 """
 Left as an exercise for OP
 """
 url += '/git/trees/{}?recursive=1'.format(sha)
 return requests.get(url).json()

For the second .py file:

the above rules apply
don't import modules you're not using (import json)

Code

import github
def process_repo_contents(repo_contents):
 """
 Left as an exercise for OP
 """
 for tree in repo_contents['tree']:
 content_type = tree['type']
 print('content_type --- {}'.format(content_type))
 if content_type == 'blob':
 github.get_file_content(tree)
 print('***blob***')
 elif content_type == 'tree':
 print('***tree***')
if __name__ == '__main__':
 repos, next = github.get_repos()
 for repo in repos[0:10]:
 sha = github.get_repo_sha(repo['url'])
 repo_json = github.get_repo_contents(repo['url'], sha)
 process_repo_contents(repo_json)

Pay more attention to variable names and data types

next is a builtin keyword so I'd recommend you to change it to something else
ignore_extensions is a list in your case. And it only keeps one string. Don't you think it would be more appropriate to make it a string from the beginning ?

I assume it's just because they're placeholders, but your docstrings aren't compliant with PEP-257.
Great answer - thanks for the info @Dex'ter. Do you have any input on my question regarding class usage? Safe to assume my reasoning was correct since you didn't mention it?
Personally, I'm not a fan of OOP-abuse style. So no, in you case, encapsulating everything inside a class wouldn't do much sense for me. OO is about encapsulating mutable state and is therefore more appropriate for interactive applications, GUI's, and API's exposing mutable state. It is no coincidence that OO was developed in parallel with the first GUI's.

Stack Exchange Network

Scraping GitHub for security vulnerabilities

1 Answer 1

PEP8 - click, read, apply

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related