I made a script for a project I'm working on to create a tree object from a filesystem directory, without any working tree operations (can be run in a bare repository). The tree can be inserted into (a) commit(s) by using an index filter that invokes git read-tree --prefix=</...>
.
#!/bin/bash
# createSubtreeRec <path>
# Creates a subtree with all the contents of path
# does not need to be in a work tree.
# Tree and related objects are added to the repository (unreachable)
# recursive.
# Prints the line that should be used to include the built tree in a parent
# in the form used by git mktree on stdout. This might be a blob or tree.
# <mode> <type> <hash> <name>
# ^space ^space ^tab
function createSubtreeRec () {
local path="1ドル"
local hash=""
local type=""
local mode=""
local treeData=""
#echo "" >&2
if [ -d "$path" ]; then
for fullpath in $path/*; do
# edited from my original post to simplify the code
treeData="$(printf '%s\n%s' "$treeData" "$(createSubtreeRec "$fullpath")")"
done
# ignore empty dirs
if [ "$treeData" ]; then
# remove leading \n
treeData="${treeData:1}"
#echo "$treeData" >&2
hash="$(echo "$treeData" | git mktree)"
printf '040000 tree %s\t%s' $hash "$(basename "$path")"
fi
elif [ -f "$path" ]; then
hash=$(git hash-object "$path" -w)
type=blob
mode=100644
if [ -x "$path" ]; then
mode=100755
fi
printf '%s blob %s\t%s' $mode $hash "$(basename "$path")"
fi
#echo "" >&2
}
# createSubtree <workdir> <path>
# Creates a subtree starting in <workdir> with the names in <path> as recursive trees,
# Use to get a tree that does not contain <workdir> but contains all of <path>
# and all contents beneath it.
# Prints the top tree's hash on standard output.
function createSubtree () {
local workdir="1ドル"
local path="2ドル"
# Get tree from
if [ "$path" != "" ]; then
local entry=$(createSubtreeRec "$workdir/$path")
else
local entry=$(createSubtreeRec "$workdir")
fi
if [ "$entry" ]; then
while [ "$path" ] && [ "$path" != "." ] ; do
local hash=$(echo "$entry" | git mktree)
path="$(dirname "$path")"
local entry="$(printf '%s %s %s\t%s' '040000' tree $hash $(basename "$path"))"
done
printf '%s\n' $hash
fi
}
createSubtreeRec
outputs the full tree data line because it can sometimes output blobs.
This code is a significant part of my runtime, so any performance improving observations would be appreciated.
Part of me just wants this available online as it would have helped me out, and I can't find any git builtin commands for this.
2 Answers 2
Consider portable shell
Do we really need Bash for this? If we can manage without local
variables and string indexing, then we could use a leaner, more portable shell instead. Certainly, there's no benefit using the non-standard function
keyword - just omit it.
We probably do want local
in the recursive function, but it's good to be clear that we've made the choice for a good reason.
Add some error checking options
set -eu -o pipefail
Use more quoting
We don't want $path
or $fullpath
to be split:
for fullpath in "$path"/*; do
f=$(basename "$fullpath")
There's some more needed later:
treeData=$(printf '%s\n%s %s %s\t%s' "$treeData" $mode $type $hash "$f")
elif [ -d "$path/$f" ]; then
# recurse
treeData=$(printf '%s\n%s' "$treeData" "$(createSubtreeRec "$fullpath")")
Shellcheck can't know that $hash
will always be a single token, so it recommends more quoting than is actually necessary.
-
\$\begingroup\$ Are local variables slower? This is part of a significantly larger script that uses variables like
hash
andpath
in calling functions of this code, so I thought it'd be better to make everything local. \$\endgroup\$Emma Talbert– Emma Talbert2019年07月03日 17:26:01 +00:00Commented Jul 3, 2019 at 17:26 -
\$\begingroup\$ I wouldn't expect a speed difference, it's just less portable. In this case, avoiding them would likely make the code much more complex, so I'd leave them be. I didn't realise you wanted a performance review - I've now added a tag for you. \$\endgroup\$Toby Speight– Toby Speight2019年07月03日 17:28:56 +00:00Commented Jul 3, 2019 at 17:28
This code is a significant part of my runtime, so any performance improving observations would be appreciated.
I'm not sure the following will bring significant improvements, but if you want to squeeze out every bit of performance, I have a few ideas.
(optimizing for perf) Replace printf
with interpolated strings
The main reasons to use printf
:
- Print text within trailing newline (better than non-portable
echo -n ...
) - Use advanced formatting features as in the C-function with similar name
In many places in the code these features are not needed, either because the strings don't use special formatting features, or because the strings are only used in $(...)
context, where the trailing newlines will be stripped anyway.
So I suggest to replace these:
treeData="$(printf '%s\n%s' "$treeData" "$(createSubtreeRec "$fullpath")")" ... printf '040000 tree %s\t%s' $hash "$(basename "$path")" ... printf '%s blob %s\t%s' $mode $hash "$(basename "$path")" ... local entry="$(printf '%s %s %s\t%s' '040000' tree $hash $(basename "$path"))"
With:
treeData=$treeData$'\n'$(createSubtreeRec "$fullpath")
...
echo "040000 tree $hash"$'\t'"$(basename "$path")"
...
echo "$mode blob $hash"$'\t'"$(basename "$path")"
...
local entry="040000 tree $hash"$'\t'$(basename "$path")
You could embed tab characters in a string, it would make those lines easier to read:
echo "040000 tree $hash $(basename "$path")"
# ^
# tab
But I recommend to use $'\t'
as above, because I don't like invisible characters :-)
(optimizing for perf) Replace basename "..."
with substitution
Instead of basename "$path"
, you could write ${path##*/}
.
This might be a bit faster because substitution is native Bash, while basename
is a program, with the overhead of executing another process.
And I think the result should be equivalent, because the path separator /
is not allowed in file and directory names.
It may be tempting to replace dirname "$path"
with ${path%/*}
. That requires a bit more careful thought, because the return value won't always be the same.
Use here-strings
Instead of echo ... | cmd
, use cmd <<< "..."
.
(optimizing for perf) Rearrange operations to avoid redundant conditions
Consider this piece of code from your post, with minor optimizations to remove some noise:
if [ "$path" ]; then entry=$(createSubtreeRec "$workdir/$path") else entry=$(createSubtreeRec "$workdir") fi if [ "$entry" ]; then while [ "$path" ] && [ "$path" != "." ] ; do hash=$(git mktree <<< "$entry") path=$(dirname "$path") entry="040000 tree $hash"$'\t'"${path##*/}" done echo "$hash" fi
Can you spot some odd execution paths?
$path
is either empty to begin with, or it will never be empty- In the last iteration of the
while
loop,entry
is computed and it will not be used - If
$path
is.
to begin with, thencreateSubtreeRec "$workdir/."
will get executed, and I doubt that's what you want.
To eliminate some redundant paths, sometimes it helps to first increase the redundancy. This is logically equivalent to the above code:
if [ "$path" ]; then
entry=$(createSubtreeRec "$workdir/$path")
if [ "$entry" ]; then
while [ "$path" ] && [ "$path" != "." ] ; do
hash=$(git mktree <<< "$entry")
path=$(dirname "$path")
entry="040000 tree $hash"$'\t'"${path##*/}"
done
echo "$hash"
fi
else
entry=$(createSubtreeRec "$workdir")
if [ "$entry" ]; then
while [ "$path" ] && [ "$path" != "." ] ; do
hash=$(git mktree <<< "$entry")
path=$(dirname "$path")
entry="040000 tree $hash"$'\t'"${path##*/}"
done
echo "$hash"
fi
fi
Now let's try to simplify it. In the else
branch we know that $path
is empty, and it cannot possibly change, so we can drop the while
loop:
else
entry=$(createSubtreeRec "$workdir")
if [ "$entry" ]; then
echo "$hash"
fi
fi
And since $hash
is only ever set in the while
loop we dropped, we can drop that too:
else
entry=$(createSubtreeRec "$workdir")
if [ "$entry" ]; then
echo
fi
fi
At this point I wonder if you really wanted to print a blank line in this case, or if it's a bug. (I let you decide!)
Back to the if
branch, we have this code now:
entry=$(createSubtreeRec "$workdir/$path")
if [ "$entry" ]; then
while [ "$path" ] && [ "$path" != "." ] ; do
hash=$(git mktree <<< "$entry")
path=$(dirname "$path")
entry="040000 tree $hash"$'\t'"${path##*/}"
done
echo "$hash"
fi
Since at this point $path
is not empty and will never be, we can simplify the while
statement, eliminating one unnecessary evaluation per iteration:
while [ "$path" != "." ] ; do
This is still not ideal, since this condition will always be true for the first time, and the last computation of entry
will not be used. We can rearrange the loop to improve on both of these points:
while true; do
hash=$(git mktree <<< "$entry")
path=$(dirname "$path")
[ "$path" != "." ] || break
entry="040000 tree $hash"$'\t'"${path##*/}"
done
echo "$hash"
I doubt any of this would have a measurable performance difference except in extremely deep paths, I thought it was an interesting exercise to share.
Missed one important double-quoting
There is one spot that missed an important double-quoting:
local entry="$(printf '%s %s %s\t%s' '040000' tree $hash $(basename "$path"))"
If the basename of $path
contains spaces, then the result of $(basename "$path")
will be subject to word-splitting. Correctly quoted, the above line would be:
local entry=$(printf '%s %s %s\t%s' '040000' tree "$hash" "$(basename "$path")")
Note:
It's true that in this example we know that
$hash
is safe even without double-quoting. You never know how the script might evolve, and somebody might assign something unsafe to this variable. Also, when you assign something to a variable, you don't want to verify if all uses of the variable are safe enough. If you develop the habit to always double-quote variables used in command arguments, you can rest assured that your code is safe.I dropped the double-quotes around the right-hand side of the assignment, because it's already safe like that. If the value contained literal spaces, that would have to be quoted though. If you feel safer to always write
v="..."
, that's fine.
No need for input validation?
From the posted code it's not clear if createSubtree
is guaranteed to get safe input. It seems to me that path
is expected to be valid relative path from worktree
, but I don't see how that is actually enforced.
You might want to add some sanity checks and fail fast (and with a big bang) when receiving something unexpected.
Minor things
I'm not a fan of assigning a default value and overwriting it like this:
mode=100644 if [ -x "$path" ]; then mode=100755 fi
I would rather spell out an else
.
Instead of duplicating the pattern %(mode) %(type) %(hash)\t%(name)
in multiple places, it might be a good idea to create a helper function for it.
The type
variable is assigned once and never used.