I have scripts to display commit statistics and merge statistics for my repos, and they work. I wrote them for my personal use, because I was interested in finding trends in my git repos.
This script reports statistics about commits (number, average length in words, etc.). Relevant commits can be selected using `git-rev-list` options.
Features and times

- `count`: report number of commits (not a performance issue)
- `len`: report length in words of commit message and commit hash (~20s for 1561 commits)
- `len min`, `len max`, and `len avg`: report minimum, maximum, or average commit message length in words and commit hash (~10-15s for the same)

Benchmarks were run with bash's `time` on my dotfiles repo. A previous implementation using for-loops had similar performance. Obviously, the algorithms are O(n); they are still too slow for every-day usage.
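For concreteness, the timings were gathered along these lines (the name `git-commit-stats` is a stand-in for however the script is invoked):

time git-commit-stats len        # ~20s for 1561 commits
time git-commit-stats len avg    # ~10-15s on the same history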
#! /usr/bin/env bash
set -euo pipefail

USAGE='[-h] (count | len [min|max|avg]) [rev-list options]
Display commit statistics
Filter commits based on [rev-list options]'
SUBDIRECTORY_OK=true
# source git-sh-setup for some helpers
set +u
source "$(git --exec-path)"/git-sh-setup
set -u

SIZER=(
    wc
    # count words
    -w
)

# word count of a single commit's message
size() {
    local commit="1ドル"
    git log "$commit" -1 --format=%B | "${SIZER[@]}" | tr -d ' '
}

commits_list() {
    command=(
        git
        rev-list
        # start somewhere
        --all
    )
    if (($# > 0)) ; then
        command+=("$@")
    fi
    "${command[@]}" 2>/dev/null
}

commit_count() {
    commits_list "$@" | wc -l | tr -d ' '
}

# one "<size> <hash>" line per commit
commit_len() {
    commits_list "$@" |
        while read c ; do
            size "$c" | tr -d '[:space:]'
            printf ' %s\n' "$c"
        done
}

commit_len_min() {
    commit_len "$@" |
        sort -n |
        head -n 1
}

commit_len_max() {
    commit_len "$@" |
        sort -rn |
        head -n 1
}

# builds a dc program: 5k sets the precision, the sizes are summed
# with +, and /p divides by the count and prints the result
commit_len_avg() {
    local num=0
    {
        printf '%s\n' '5k'
        while read c ; do
            ((++num))
            size "$c"
            ((num >= 2)) && printf '%s\n' '+'
        done < <(commits_list "$@")
        printf '%s\n' "$num" '/p'
    } | dc
}

main() {
    (($# >= 1)) || usage
    case "1ドル" in
        count) commit_count "${@:2}" ;;
        len)
            if (($# >= 2)); then
                case "2ドル" in
                    max|min|avg) commit_len_"2ドル" "${@:3}" ;;
                    *) commit_len "${@:2}" ;;
                esac
            else
                commit_len "${@:2}"
            fi
            ;;
        *) usage ;;
    esac
}

main "$@"
Since shell scripts are hard to profile, I've been unable to identify the bottleneck (though `commit_len` seems like a good place to start). I run `shellcheck` regularly.
2 Answers
Some suggestions:

1. `shellcheck` should give you a few suggestions. I won't mention things I expect it to find.
2. Uppercase names are by convention only used for exported variables.
3. `SUBDIRECTORY_OK` is unused. If it's a magic variable this probably should be mentioned.
4. `SIZER` is only used once, so it should be inlined.
5. `wc -w` (8.30 from GNU coreutils, at least) does not output any spaces, so `tr -d ' '` might be unnecessary.
6. `(($# > 0))` would usually be written `[[ "$#" -gt 0 ]]`.
7. Throwing away standard error means the script will be harder to debug. If there's specific output there you want to hide you can use `cmd 2> >(grep -v ... >&2)`.
8. ~~`commit_len` is slow because for each commit you run a `git` command and more to count the number of commits before it, which means you traverse the Git history N times. I think you'll get the same result by running `size "1ドル"`.~~
9. You can use `shift` to simplify things like `"${@:2}"` to just `"$@"` (a sketch follows this list).
10. `dc` is not a tool I'm familiar with, but it will certainly be faster to count using something like `awk` to gobble the whole stream in one command. `while read` is actually surprisingly slow.
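For suggestion 9, here is a minimal sketch of `main` using `shift`, assuming the rest of the script is unchanged; the behavior is the same as the original:

main() {
    (($# >= 1)) || usage
    local subcommand="1ドル"
    shift
    case "$subcommand" in
        count) commit_count "$@" ;;
        len)
            case "${1:-}" in
                max|min|avg)
                    local stat="1ドル"
                    shift
                    commit_len_"$stat" "$@"
                    ;;
                *) commit_len "$@" ;;
            esac
            ;;
        *) usage ;;
    esac
}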
- 3) See `man git-sh-setup`. 5) `wc -w somefile` gives `  92 Desktop/cs.txt` on my machine (note the leading spaces). 6) But `>` is more readable than `-gt`. 8) `size` only counts the size of a single commit (`log -1`); if there were a way to log all commits with their message sizes, that would be the fastest. 10) `dc` is a stack-based calculator. That said, if I could use `size` in an awk invocation, that would be a considerable improvement (I could even fix `commit_len` then). – D. Ben Knoble, Aug 14, 2019 at 14:31
- I did manage to convert `commits_list` to awk, which helped. `commit_len_avg` was also easy. The result is incredible. – D. Ben Knoble, Aug 14, 2019 at 20:28
- 5) Try it with standard input instead of a file. Good to hear `awk` helped! – l0b0, Aug 14, 2019 at 20:41
I've managed to drastically improve the performance by combining awk with some creative formatting: now that everything is awk, the script finishes in under 0.3s even for my ~1600 commits.
Result
#! /usr/bin/env bash
set -euo pipefail

USAGE='[-h] (count | len [min|max|avg]) [rev-list options]
Display commit statistics
Filter commits based on [rev-list options]'
SUBDIRECTORY_OK=true
# source git-sh-setup for some helpers
set +u
source "$(git --exec-path)"/git-sh-setup
set -u

commits_list() {
    command=(
        git
        log
        # \a marks the start of each commit; the hash, subject, and
        # body follow, with the body possibly spanning several lines
        --pretty=format:$'\a%n%H\t%s %b'
        # start somewhere
        --all
    )
    if (($# > 0)) ; then
        command+=("$@")
    fi
    "${command[@]}" |
        awk '
            # a \a marker (other than the very first) ends the
            # previous commit; otherwise glue message lines together
            # so each commit becomes a single line
            /'$'\a''/ && NR != 1 { printf "\n"; next }
            { printf "%s ", 0ドル }
            END { printf "\n" }
        '
}

commit_count() {
    git rev-list --all --count "$@"
}

# field 1 is the hash, field 2 the message; split() returns the
# number of words in the message
commit_len() {
    commits_list "$@" |
        awk -F$'\t' '{ print split(2,ドル_," "), 1ドル }'
}

commit_len_min() {
    commit_len "$@" |
        sort -n |
        head -n 1
}

commit_len_max() {
    commit_len "$@" |
        sort -rn |
        head -n 1
}

commit_len_avg() {
    commit_len "$@" |
        awk '
            { sum += 1ドル }
            END { print sum/NR }
        '
}

main() {
    (($# >= 1)) || usage
    case "1ドル" in
        count) commit_count "${@:2}" ;;
        len)
            if (($# >= 2)); then
                case "2ドル" in
                    max|min|avg) commit_len_"2ドル" "${@:3}" ;;
                    *) commit_len "${@:2}" ;;
                esac
            else
                commit_len "${@:2}"
            fi
            ;;
        *) usage ;;
    esac
}

main "$@"
- ... `PS4='\t'` or similar. That can identify commands that take over a second.
- ... `size`-ing about 61 commits/s, so at ~1600 commits this takes 26s!
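A minimal sketch of that `PS4` profiling tip (the script name is again a placeholder): bash expands `PS4` like `PS1`, so `\t` becomes an HH:MM:SS timestamp on each traced command, and large jumps between consecutive timestamps point at the slow commands.

PS4='\t ' bash -x ./git-commit-stats len 2>trace.log
# then scan trace.log for timestamps that jump by a second or more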