I wanted to do some statistics over the history of a Git repository. At first I tried using GitPython, but it wasn't as trivial as I imagined it to be. In the end, I just call the necessary git commands with the submodule module.
Can this code be considered clean and readable or does it have some style issues?
import argparse
import os
import subprocess
def main(args):
if not os.path.isdir(args.path):
print "not a directory"
return
if ".git" not in os.listdir(args.path):
print "not a repo"
return
os.chdir(args.path)
commitIDs = []
start = 0 #load commits in batches, to avoid too long hangs
max_count = 100
while True:
text = subprocess.check_output(["git", "rev-list", args.branch, "--max-count=%s" % max_count, "--skip=%s" % start])
start += max_count
for line in text.splitlines():
commitIDs.append(line)
#print "loaded", len(commits), "commits"
if len(text.splitlines()) != max_count:
break
for commitID in commitIDs[::-args.skip]: #start with the oldest commit
print commitID
#do something with the commit
#for example:
#devnull = open(os.devnull, 'w')
#subprocess.call(["git", "checkout", commitID], stdout=devnull)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('path', nargs="?", default=".")
parser.add_argument('--branch', '-b', default="master")
parser.add_argument('--skip', '-s', type=int, default=1, help='use every n-th commit')
args = parser.parse_args()
main(args)
Update:
import argparse
import os
import subprocess
import sys
import git
def main(args):
try:
repo = git.Repo(args.path)
except:
sys.exit("no such repo")
try:
text = repo.git.rev_list(args.branch).splitlines()
except:
sys.exit("no such branch")
commit_ids = text[::args.skip]
print "loaded %s commits" % len(commit_ids)
for commit_id in commit_ids[::-1]: #start with the oldest commit
print commit_id
#do something with the commit
#for example:
#repo.git.checkout(commit_id)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('path', nargs="?", default=".")
parser.add_argument('--branch', '-b', default="master")
parser.add_argument('--skip', '-s', type=int, default=1, help='use every n-th commit')
args = parser.parse_args()
main(args)
sys.exit(0)
3 Answers 3
Great, glad you got rid of that loop: some more things I'd like to point out. As this is essentially another question now I'll add another answer
I don't think a function should call
sys.exit()
. I'd rather you raised an exception instead and callsys.exit()
in__main__
I.e. instead of:
try: repo = git.Repo(args.path) except: sys.exit("no such repo")
I would prefer:
from git import Repo, InvalidGitRepositoryError #... try: repo = git.Repo(args.path) except InvalidGitRepositoryError, e: raise InvalidGitRepositoryError("%s is not a git repository" % (str(e)))
This enables other Python classes to use your
main
function (which is a terrible name) in a more predictable way (Python doesn't close calling your function).I would also rather you check for the branch than assume if
rev-list
gives an error that the branch doesn't exist. There may be some other case that causes it to throw an exception.class InvalidBranchError(Exception): pass if args.branch not in repo.branches: raise InvalidBranchError("Branch does not exist: %s" % (args.branch))
Instead of
main
taking an arguments parser, I would prefer you used regular arguments (again so other python modules can use your code more readily):def main(path=".", branch="master"): pass main(path=args.path, branch=args.branch)
-
\$\begingroup\$ Thanks, I will work on it tomorrow. On the note of the main function, I just thought that passing args object to the function isn't a good idea too as it makes it harder to call it manually. \$\endgroup\$Adrian17– Adrian172014年06月01日 22:00:27 +00:00Commented Jun 1, 2014 at 22:00
-
\$\begingroup\$ And about the first point, I'm not sure if I get it, your suggested code raises pretty much the same error, so what's the difference between not handling it at all and seeing
git.exc.InvalidGitRepositoryError: C:\
and handling to seegit.exc.InvalidGitRepositoryError: C:\ is not a git repository
? \$\endgroup\$Adrian17– Adrian172014年06月01日 22:09:28 +00:00Commented Jun 1, 2014 at 22:09 -
\$\begingroup\$ I figured in
__main__
you would catch exceptions and dosys.exit( str(exception) )
so adding some details may be useful if you decided to use that approach \$\endgroup\$megawac– megawac2014年06月01日 23:15:52 +00:00Commented Jun 1, 2014 at 23:15
You actually can access the commit_id with GitPython methods via the hexsha
attribute. Short example:
import git
def main():
repo = git.Repo('.')
for commit in repo.iter_commits('master'):
# do something with the commit, for example:
repo.git.diff(commit.parents[0].hexsha, commit.hexsha)
# commit.committed_datetime.strftime("%a %d. %b %Y")
# commit.message.rstrip()
# see http://gitpython.readthedocs.io/en/stable/tutorial.html#the-commit-object for more information on the available attributes/methods of the commit object
if __name__ == "__main__":
main()
-
2\$\begingroup\$ The goal of the site is to provide value by reviewing the code in the question. Can you explain how your answer improves upon the original question? \$\endgroup\$forsvarir– forsvarir2016年07月20日 10:24:19 +00:00Commented Jul 20, 2016 at 10:24
-
\$\begingroup\$ Thanks for clarification, I wasn't aware that. Basically this is a very short answer to following sentence form the original question: 'At first I tried using GitPython, but it wasn't as trivial as I imagined it to be.' I didn't have the time to fully rewrite the code, but thought might profit if I share my 'solution'. \$\endgroup\$Andi– Andi2016年07月26日 14:00:37 +00:00Commented Jul 26, 2016 at 14:00
A few small comments:
- I would suggest to use exit codes (
sys.exit()
), it will help you if you are planning to you your script together with other scripts (chaining it, or use from shell scripts) - I would consider that
subprocess.check_output(["git", "rev-list...
may return not a list of commits (but error for example). In this case you may go to infinite loop. - Usually you can execute git commands from any subfolder of a git repo. In your case you are limiting it to be a root folder by using this condition
if ".git" not in os.listdir(args.path)
- Instead of
for line in text.splitlines(): commitIDs.append(line)
, docommitIDs.extend(text.splitlines())
- You execute
text.splitlines()
twice, consider creating a temporary variable for it - You are constructing a huge list first and use
args.skip
on it. Instead you can applyargs.skip
inside yourwhile
loop. It will limit amound of memory you need. commitIDs
=>commit_ids
-
\$\begingroup\$ Your advice about commitIDs.extend() and moving args.skip inside the white loop seem to conflict with each other. The only way I think could implement this is by a non trivial list comprehension inside extend(). Any hints for this one? \$\endgroup\$Adrian17– Adrian172014年06月01日 17:47:40 +00:00Commented Jun 1, 2014 at 17:47
-
\$\begingroup\$ they don't necessary conflict
commits = text.splitlines()[::args.skip]
,commitIDs.extend(commits)
\$\endgroup\$RomanI– RomanI2014年06月01日 17:59:42 +00:00Commented Jun 1, 2014 at 17:59 -
\$\begingroup\$ The problem with this approach is that it doesn't keep skip state between batches if batch size is not divisible by skip - if you have two batches of size 4: [0, 1, 2, 3] and [4, 5, 6, 7] and skip 3 in the end you should get [0, 3, 6], while your approach would give [0, 3, 4, 7]. \$\endgroup\$Adrian17– Adrian172014年06月01日 18:11:49 +00:00Commented Jun 1, 2014 at 18:11
-
\$\begingroup\$ Actually, @megawac is right, loading everything in one go is not a problem at all as it turns out. I was being too cautious. \$\endgroup\$Adrian17– Adrian172014年06月01日 18:30:59 +00:00Commented Jun 1, 2014 at 18:30
-
\$\begingroup\$ Ya, I would be tempted to file a git bug if you ran into a case where your loop method was faster than calling revlist once \$\endgroup\$megawac– megawac2014年06月01日 18:45:18 +00:00Commented Jun 1, 2014 at 18:45