Git Insert Delete Graph

Question 1

I attempted to recreate the GitHub code frequency graph (example) with a daily granularity using Perl and git log. How did I do, and what improvements can I make? I know that I should try to reduce the frequency of keys on the x-axis, but I have no idea how to do that and maintain scale.

Example of the output

It should be noted that the default application for open SVG files should be set to a browser of some sort.

#!/usr/bin/env perl
# Henry J Schmale
# November 4, 2015
#
# Creates an insertion and delation graph per day graph for a git repo. It
# outputs an svg of the graph on standard output.
#
# This script can take the name of a directory to produce the graph for
# that directory if no param is given, then it does it in the current
# directory. 
#
# Requires SVG::TT:Graph::Line
use strict;
use warnings;
# Get the path to the stylesheet first
use File::Spec;
use File::Basename;
my $graphsty = dirname(File::Spec->rel2abs(__FILE__)) . '/svg-graph-ss.css';
# CD into the directory specified if specified 
if(-e $ARGV[0] and -d $ARGV[0]){
 chdir $ARGV[0];
}
# Indexed by date
my %commits;
# get the git log and preprocess it
my $gitlogOutput = qx(git log --numstat --pretty="%H %aI" | grep -v '^\$');
my @lines = split /\n/, $gitlogOutput;
my $date;
my $hash;
foreach (@lines) {
 chomp;
 my @fields = split /\s+/;
 # Length of sha1sum
 if(length($fields[0]) > 39){
 $hash = $fields[0];
 $date = substr($fields[1], 0, 10);
 }else{
 $commits{$date}->{ins} += $fields[0];
 $commits{$date}->{del} += $fields[1];
 }
}
use DateTime;
use Date::Parse;
use Data::Dumper;
my $firstDate = getDateTime(((sort keys %commits)[0]));
my $lastDate = getDateTime(((sort keys %commits)[-1]));
# print "$firstDate\t".((sort keys %commits)[0])."\n";
# print "$lastDate\t".((sort keys %commits)[-1])."\n";
# print (scalar keys %commits)."\n";
while($firstDate->add(days => 1) < $lastDate){
 my $key = $firstDate->ymd('-');
 if(!defined $commits{$key}){
 $commits{$key}->{ins} = 0;
 $commits{$key}->{del} = 0;
 }
}
# print scalar keys %commits;
# print "\n";
# Prepare data for graphing by converting them to arrays
my (@key, @ins, @del, @net);
my $i = 0;
foreach (sort keys %commits){
 $key[$i] = $_;
 $ins[$i] = $commits{$_}{ins};
 # del must be negitive in order for graph to look right, with delations
 # being below the x axis.
 $del[$i] = -$commits{$_}{del};
 $net[$i] = $commits{$_}{ins} - $commits{$_}{del};
 # printf("%s,%s,%s,%s\n", $_, $ins[$i], $del[$i], $net[$i]);
 $i++;
}
# Get the max value in the data sets
my @allpoints;
my ($min, $max);
push @allpoints, @ins;
push @allpoints, @del;
push @allpoints, @net;
for(@allpoints){
 $min = $_ if !$min || $_ < $min;
 $max = $_ if !$max || $_ > $max;
}
# Graph it
use SVG::TT::Graph::Line;
my $graph = SVG::TT::Graph::Line->new({
 width => 1200,
 height => 800,
 fields => \@key,
 scale_integers => 1,
 rotate_x_labels => 1,
 show_data_values => 0,
 show_data_points => 0,
 min_scale_value => $min,
 max_scale_value => $max,
 style_sheet => $graphsty,
 });
# Add the data
$graph->add_data({
 'data' => \@ins,
 title => 'Inserts Per Day'
 });
$graph->add_data({
 'data' => \@del,
 title => 'Deletions Per Day'
 });
$graph->add_data({
 'data' => \@net,
 title => 'Net Insert/Del Per Day'
 });
# Print file 
my $filepath = '/tmp/gitGraph.svg';
open my $FD,'>',$filepath or die $!;
print $FD $graph->burn();
close $FD;
# Open it up in the browser or prefered method for opening the file
$filepath = 'file://'.$filepath;
qx(xdg-open $filepath);
sub getDateTime {
 my ($ss,$mm,$hh,$day,$month,$year,$zone) = strptime($_[0]);
 # print "\t$_[0]\n";
 return DateTime->new(
 day => $day,
 month => $month + 1,
 # We have to add 1900 here inorder to make the date format work
 year => ($year + 1900)
 );
}

Question 2

Empty string for every odd element on X axis?

Question 3

@mpapec It only happens on the first run for a repository, or if you have dozens of branches.

Question 4

Just a style thing, but the use SVG::TT::Graph::Line; should be at the top. So should the other use statements that are in the middle. They will all be loaded at compile time anyway.

Question 5

A simple thing to add would be a Logarithmic Y-scaling function, so that significant bursts in either losses or gains don't get over-exaggerated perceptually. Essentially you want 100-commit days to only take up at most, 10x as much space as 1 commit days, so "low activity" wavering is still visible.

I'd also suggest breaking windows of about 7 days into single data points, by aggregating them and representing them as 5 lines each:

Max number of commits per day in 7-day-range,
Upper Quartile number of commits per day in 7-day-range
Median number of commits per day ...
Lower Quartile ....
Min...

And then render it with some shades like this ( half-assed mockup )

enter image description here

With this, you could afford to narrow the sample window to "per hour" say, but still represent the data accumulated in terms of weeks, so you'd have "max commits/hour this week", "min commits/hour this week", etc, all represented in a single vertical column.

Question 6

IMO a CR post almost always should include some code (or at least a reference to original code) otherwise it's more about UX (which is good and desiderable but maybe off-topic)

Question 7

I mostly agree. But here it's not entirely obvious what the author is asking for, seems like they're wanting both. The code ... is ... well, more scary than I can begin to start on :/

Question 8

Overview

It is great that you:

Used strict and warnings
Leveraged other people's code by using the CPAN modules
Used comments at the top of the file to describe the code

Here are some adjustments for you to consider, mainly for coding style.

Warning

I get a warning if I don't specify a directory on the command line. The -e file test isn't doing what you expect. It does not check if the argument was passed in. This is one way to do it:

# If a directory is specified on the command line, change into it
chdir $ARGV[0] if $ARGV[0] and -d $ARGV[0];

Modules

It is customary to place all the use statements together at the top of the file:

Since this is not used, just delete the line:

use Data::Dumper;

Namespace

It is best to import only what is needed to avoid namespace pollution. For example, change:

use File::Basename;

to:

use File::Basename qw(dirname);

These are all the use statements:

use strict;
use warnings;
use File::Spec qw();
use File::Basename qw(dirname);
use DateTime qw();
use Date::Parse qw(strptime);
use SVG::TT::Graph::Line qw();

Provide more debugging information to the user (the file name) if open fails. Change:

open my $FD,'>',$filepath or die $!;

to:

open my $FD, '>', $filepath or die "$filepath $!";

There is no need for the following scalar variable and for split:

my $gitlogOutput = qx(git log --numstat --pretty="%H %aI" | grep -v '^\$');
my @lines = split /\n/, $gitlogOutput;

Simplified as:

my @lines = qx(git log --numstat --pretty="%H %aI" | grep -v '^\$');

The $hash variable is declared and set, but it is never read. It can be deleted.

Since you don't use the @allpoints array anywhere else, you can get rid of the variable and simplify all these lines:

my @allpoints;
push @allpoints, @ins;
push @allpoints, @del;
push @allpoints, @net;
for(@allpoints){

as:

for (@ins, @del, @net) {

Since the "quote executable" operator interpolates variables, you can simplify these 2 lines:

$filepath = 'file://'.$filepath;
qx(xdg-open $filepath);

as:

qx(xdg-open file://$filepath);

There is no need for the single quotes around the data key:

$graph->add_data({
 'data' => \@ins,
 title => 'Inserts Per Day'

It would be more consistent to omit them as you did for the title key:

$graph->add_data({
 data => \@ins,
 title => 'Inserts Per Day'

Since foreach is identical to for, and you already use for elsewhere in the code, you should just use for everywhere.

Kent Fredric Kent Fredric 3862 silver badges7 bronze badges · Answer 1 · 2016-04-29 11:18:17Z

A simple thing to add would be a Logarithmic Y-scaling function, so that significant bursts in either losses or gains don't get over-exaggerated perceptually. Essentially you want 100-commit days to only take up at most, 10x as much space as 1 commit days, so "low activity" wavering is still visible.

I'd also suggest breaking windows of about 7 days into single data points, by aggregating them and representing them as 5 lines each:

Max number of commits per day in 7-day-range,
Upper Quartile number of commits per day in 7-day-range
Median number of commits per day ...
Lower Quartile ....
Min...

And then render it with some shades like this ( half-assed mockup )

enter image description here

With this, you could afford to narrow the sample window to "per hour" say, but still represent the data accumulated in terms of weeks, so you'd have "max commits/hour this week", "min commits/hour this week", etc, all represented in a single vertical column.

IMO a CR post almost always should include some code (or at least a reference to original code) otherwise it's more about UX (which is good and desiderable but maybe off-topic)
I mostly agree. But here it's not entirely obvious what the author is asking for, seems like they're wanting both. The code ... is ... well, more scary than I can begin to start on :/

toolic toolic 14.5k5 gold badges29 silver badges203 bronze badges · Answer 2 · 2024-03-26 20:24:19Z

Overview

It is great that you:

Used strict and warnings
Leveraged other people's code by using the CPAN modules
Used comments at the top of the file to describe the code

Here are some adjustments for you to consider, mainly for coding style.

Warning

I get a warning if I don't specify a directory on the command line. The -e file test isn't doing what you expect. It does not check if the argument was passed in. This is one way to do it:

# If a directory is specified on the command line, change into it
chdir $ARGV[0] if $ARGV[0] and -d $ARGV[0];

Modules

It is customary to place all the use statements together at the top of the file:

Since this is not used, just delete the line:

use Data::Dumper;

Namespace

It is best to import only what is needed to avoid namespace pollution. For example, change:

use File::Basename;

to:

use File::Basename qw(dirname);

These are all the use statements:

use strict;
use warnings;
use File::Spec qw();
use File::Basename qw(dirname);
use DateTime qw();
use Date::Parse qw(strptime);
use SVG::TT::Graph::Line qw();

Provide more debugging information to the user (the file name) if open fails. Change:

open my $FD,'>',$filepath or die $!;

to:

open my $FD, '>', $filepath or die "$filepath $!";

There is no need for the following scalar variable and for split:

my $gitlogOutput = qx(git log --numstat --pretty="%H %aI" | grep -v '^\$');
my @lines = split /\n/, $gitlogOutput;

Simplified as:

my @lines = qx(git log --numstat --pretty="%H %aI" | grep -v '^\$');

The $hash variable is declared and set, but it is never read. It can be deleted.

Since you don't use the @allpoints array anywhere else, you can get rid of the variable and simplify all these lines:

my @allpoints;
push @allpoints, @ins;
push @allpoints, @del;
push @allpoints, @net;
for(@allpoints){

as:

for (@ins, @del, @net) {

Since the "quote executable" operator interpolates variables, you can simplify these 2 lines:

$filepath = 'file://'.$filepath;
qx(xdg-open $filepath);

as:

qx(xdg-open file://$filepath);

There is no need for the single quotes around the data key:

$graph->add_data({
 'data' => \@ins,
 title => 'Inserts Per Day'

It would be more consistent to omit them as you did for the title key:

$graph->add_data({
 data => \@ins,
 title => 'Inserts Per Day'

Since foreach is identical to for, and you already use for elsewhere in the code, you should just use for everywhere.

Stack Exchange Network

Git Insert Delete Graph

2 Answers 2

Overview

Warning

Modules

Namespace

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Git Insert Delete Graph

2 Answers 2

Overview

Warning

Modules

Namespace

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions