Counting the number of days worked for all committers to a git repo

Question 1

I wrote this script to collect evidence of the number of days worked for the purpose of claiming some government tax credits.

I'm looking for ways to clean it up. In particular, is there a cleaner way to remove duplicates from a list than list(set(my_list)), and is there a better way to write this:

d = dict(zip(commiters, [0 for x in xrange(len(commiters))]))

import os
from pprint import pprint
lines = os.popen('git log --all').read().split('\n')
author_lines = filter(lambda str: str.startswith('Author'), lines)
date_lines = filter(lambda str: str.startswith('Date'), lines)
author_lines = map(lambda str: str[8:], author_lines)
date_lines = map(lambda str: str[8:18].strip(), date_lines)
lines = zip(author_lines, date_lines)
lines = sorted(list(set(lines)), key = lambda tup: tup[0])
commiters = list(set(map(lambda tup: tup[0], lines)))
d = dict(zip(commiters, [0 for x in xrange(len(commiters))]))
for item in lines:
 d[item[0]] += 1
pprint(d)

weronika · Accepted Answer · 2012-07-04 06:08:29Z

For this part:

author_lines = filter(lambda str: str.startswith('Author'), lines)
date_lines = filter(lambda str: str.startswith('Date'), lines)
author_lines = map(lambda str: str[8:], author_lines)
date_lines = map(lambda str: str[8:18].strip(), date_lines)

This might be clearer. I have nothing against map or filter, but a list comprehension combines them nicely when you need both:

author_lines = [line[8:] for line in lines if line.startswith('Author')]
date_lines = [line[8:18].strip() for line in lines if line.startswith('Date')]
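As a quick check of the two comprehensions, here is a minimal sketch run against a made-up fragment of default git log output. The commit hashes, names, and dates are invented for illustration, and the code is written for Python 3, whereas the script in the question targets Python 2.

```python
# Hypothetical sample of default `git log` output (all values invented).
sample_log = """commit 1111111
Author: Alice <alice@example.com>
Date:   Mon Jul 2 10:00:00 2012 -0700

    First change

commit 2222222
Author: Bob <bob@example.com>
Date:   Tue Jul 3 09:30:00 2012 -0700

    Second change
"""

lines = sample_log.split('\n')

# Filter and transform in one pass: keep matching lines, drop the 8-character label.
author_lines = [line[8:] for line in lines if line.startswith('Author')]
# Characters 8..17 hold weekday, month, and day; strip() removes trailing padding.
date_lines = [line[8:18].strip() for line in lines if line.startswith('Date')]

# The map/filter spelling gives the same result (list() is needed on Python 3).
via_map_filter = list(map(lambda s: s[8:],
                          filter(lambda s: s.startswith('Author'), lines)))

print(author_lines)
print(date_lines)
```

Commit message lines are indented in this output format, so a message that happens to mention "Author" or "Date" is not picked up by startswith.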

This:

lines = sorted(list(set(lines)), key = lambda tup: tup[0])

Can become:

lines = sorted(set(lines), key = lambda tup: tup[0])

for slightly less repetition (sorted always returns a list, whatever iterable it is given).
And are you sure the key is even necessary? Tuples sort element by element just fine. The only reason to sort by the first element alone is to preserve the original order of lines with the same author instead of ordering them by the date line.
... Actually, why are you even sorting this at all? I don't see anything in the rest of the code that behaves differently depending on whether it's sorted.
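To illustrate the point about tuple ordering, here is a small sketch with invented (author, day) pairs, written for Python 3. It only demonstrates how sorted behaves with and without the key; it is not part of the original script.

```python
# Invented (author, day) pairs, stored in a set as in the original script.
pairs = {("Bob", "Tue Jul 3"), ("Alice", "Wed Jul 4"), ("Alice", "Mon Jul 2")}

# Without a key, tuples compare element by element: author first, then the
# day string (plain string comparison, not chronological order).
by_tuple = sorted(pairs)

# With the key, only the author is compared; pairs that share an author keep
# whatever order the set happened to yield them in.
by_author = sorted(pairs, key=lambda tup: tup[0])

# sorted() returns a list even though the input was a set.
print(type(by_tuple).__name__)
print(by_tuple)
print([author for author, _ in by_author])
```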

For this:

commiters = list(set(map(lambda tup: tup[0], lines)))

Why are you zipping author_lines and date_lines and then unzipping again? Just do:

commiters = set(author_lines)

or am I missing something?

And this:

d = dict(zip(commiters, [0 for x in xrange(len(commiters))]))
for item in lines:
 d[item[0]] += 1

You're just getting commit counts, right? Use Counter:

import collections
d = collections.Counter([author_line for author_line,_ in lines])

Or, if your Python version doesn't have collections.Counter (it was added in Python 2.7):

import collections
d = collections.defaultdict(lambda: 0)
for author_line, _ in lines:
    d[author_line] += 1
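For comparison, a sketch of three ways to build the same tally, using invented data and Python 3. The dict.fromkeys variant is not from the answer above; it is shown only as a possible shorter spelling of the dict(zip(...)) initialisation the question asks about.

```python
import collections

# Invented (author, day) pairs standing in for `lines`.
lines = [("Alice", "Mon Jul 2"), ("Bob", "Tue Jul 3"), ("Alice", "Wed Jul 4")]

# 1. Counter tallies directly from an iterable of keys.
counted = collections.Counter(author for author, _ in lines)

# 2. defaultdict(int) supplies 0 for missing keys, like defaultdict(lambda: 0).
tallied = collections.defaultdict(int)
for author, _ in lines:
    tallied[author] += 1

# 3. dict.fromkeys builds the zero-initialised dict without zip or xrange.
#    This is safe here because the shared default, 0, is immutable.
commiters = set(author for author, _ in lines)
zeroed = dict.fromkeys(commiters, 0)
for author, _ in lines:
    zeroed[author] += 1

print(dict(counted), dict(tallied), zeroed)
```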

... Wait, are you even using date_lines anywhere? If not, what are they there for?

date_lines is there so that list(set(lines)) leaves behind only the lines with a unique combination of author and day of commit. I'm trying to count the number of days on which a particular author made a commit. It's true that they don't need to be sorted; that was left over from when I changed my algorithm slightly. I also like your version of collecting the author_lines and date_lines. I'll look into the collections library more. I'm not sure if I like Counter, but I definitely like the for author_line,_ in lines syntax.
@Drew - Oh, right! That makes sense. In that case I'd probably do the list(set(lines)) as you did and then use a list comprehension to discard all the date lines, since you no longer need them and they only seem to get in the way later. (I'd also rename some of your variables, like lines, to make it more obvious what they hold, e.g. author_day_pairs.)
Oooh, getting rid of the date_lines afterwards is a great idea. Thanks!
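Putting the suggestions from this thread together, one possible sketch of the whole computation follows. It is not the asker's final script: the function name count_days_worked and the sample log are hypothetical, the code targets Python 3, and it assumes the default git log layout in which every commit has exactly one Author line and one Date line.

```python
import collections


def count_days_worked(log_text):
    """Count, per author, the distinct days with at least one commit.

    Hypothetical helper: expects the text of default-format `git log` output.
    """
    lines = log_text.split('\n')
    author_lines = [line[8:] for line in lines if line.startswith('Author')]
    date_lines = [line[8:18].strip() for line in lines if line.startswith('Date')]
    # A set of (author, day) pairs collapses several commits on one day into one.
    author_day_pairs = set(zip(author_lines, date_lines))
    # The day is no longer needed once the pairs are unique; count the authors.
    return collections.Counter(author for author, _ in author_day_pairs)


# Invented sample: Alice commits twice on one day and once on the next.
sample_log = """commit 1111111
Author: Alice <alice@example.com>
Date:   Mon Jul 2 10:00:00 2012 -0700

    First change

commit 2222222
Author: Alice <alice@example.com>
Date:   Mon Jul 2 15:45:00 2012 -0700

    Same day, second commit

commit 3333333
Author: Alice <alice@example.com>
Date:   Tue Jul 3 09:00:00 2012 -0700

    Next day

commit 4444444
Author: Bob <bob@example.com>
Date:   Tue Jul 3 11:20:00 2012 -0700

    Bob's only commit
"""

days_worked = count_days_worked(sample_log)
for author, days in sorted(days_worked.items()):
    print(author, days)
```

On a real repository the text would come from running git log --all, as in the question. Note that the [8:18] slice keeps weekday, month, and day but not the year, so two commits on the same calendar date in different years could be counted as one day if the weekday also matches.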
