3
\$\begingroup\$

I made a script for a project I'm working on to create a tree object from a filesystem directory, without any working tree operations (can be run in a bare repository). The tree can be inserted into (a) commit(s) by using an index filter that invokes git read-tree --prefix=</...>.

#!/bin/bash
# createSubtreeRec <path>
# Creates a subtree with all the contents of path
# does not need to be in a work tree.
# Tree and related objects are added to the repository (unreachable)
# recursive.
# Prints the line that should be used to include the built tree in a parent
# in the form used by git mktree on stdout. This might be a blob or tree.
# <mode> <type> <hash> <name>
# ^space ^space ^tab
function createSubtreeRec () {
 local path="1ドル"
 local hash=""
 local type=""
 local mode=""
 local treeData=""
 #echo "" >&2
 if [ -d "$path" ]; then
 for fullpath in $path/*; do
 # edited from my original post to simplify the code
 treeData="$(printf '%s\n%s' "$treeData" "$(createSubtreeRec "$fullpath")")"
 done
 # ignore empty dirs
 if [ "$treeData" ]; then
 # remove leading \n
 treeData="${treeData:1}"
 #echo "$treeData" >&2
 hash="$(echo "$treeData" | git mktree)"
 printf '040000 tree %s\t%s' $hash "$(basename "$path")"
 fi
 elif [ -f "$path" ]; then
 hash=$(git hash-object "$path" -w)
 type=blob
 mode=100644
 if [ -x "$path" ]; then
 mode=100755
 fi
 printf '%s blob %s\t%s' $mode $hash "$(basename "$path")"
 fi
 #echo "" >&2
}
# createSubtree <workdir> <path>
# Creates a subtree starting in <workdir> with the names in <path> as recursive trees,
# Use to get a tree that does not contain <workdir> but contains all of <path> 
# and all contents beneath it.
# Prints the top tree's hash on standard output.
function createSubtree () {
 local workdir="1ドル"
 local path="2ドル"
 # Get tree from
 if [ "$path" != "" ]; then
 local entry=$(createSubtreeRec "$workdir/$path")
 else
 local entry=$(createSubtreeRec "$workdir")
 fi
 if [ "$entry" ]; then
 while [ "$path" ] && [ "$path" != "." ] ; do
 local hash=$(echo "$entry" | git mktree)
 path="$(dirname "$path")"
 local entry="$(printf '%s %s %s\t%s' '040000' tree $hash $(basename "$path"))"
 done
 printf '%s\n' $hash
 fi
}

createSubtreeRec outputs the full tree data line because it can sometimes output blobs.

This code is a significant part of my runtime, so any performance improving observations would be appreciated.

Part of me just wants this available online as it would have helped me out, and I can't find any git builtin commands for this.

asked Jul 3, 2019 at 15:48
\$\endgroup\$

2 Answers 2

2
\$\begingroup\$

Consider portable shell

Do we really need Bash for this? If we can manage without local variables and string indexing, then we could use a leaner, more portable shell instead. Certainly, there's no benefit using the non-standard function keyword - just omit it.

We probably do want local in the recursive function, but it's good to be clear that we've made the choice for a good reason.

Add some error checking options

set -eu -o pipefail

Use more quoting

We don't want $path or $fullpath to be split:

 for fullpath in "$path"/*; do
 f=$(basename "$fullpath")

There's some more needed later:

 treeData=$(printf '%s\n%s %s %s\t%s' "$treeData" $mode $type $hash "$f")
 elif [ -d "$path/$f" ]; then
 # recurse
 treeData=$(printf '%s\n%s' "$treeData" "$(createSubtreeRec "$fullpath")")

Shellcheck can't know that $hash will always be a single token, so it recommends more quoting than is actually necessary.

answered Jul 3, 2019 at 17:02
\$\endgroup\$
2
  • \$\begingroup\$ Are local variables slower? This is part of a significantly larger script that uses variables like hash and path in calling functions of this code, so I thought it'd be better to make everything local. \$\endgroup\$ Commented Jul 3, 2019 at 17:26
  • \$\begingroup\$ I wouldn't expect a speed difference, it's just less portable. In this case, avoiding them would likely make the code much more complex, so I'd leave them be. I didn't realise you wanted a performance review - I've now added a tag for you. \$\endgroup\$ Commented Jul 3, 2019 at 17:28
1
\$\begingroup\$

This code is a significant part of my runtime, so any performance improving observations would be appreciated.

I'm not sure the following will bring significant improvements, but if you want to squeeze out every bit of performance, I have a few ideas.

(optimizing for perf) Replace printf with interpolated strings

The main reasons to use printf:

  • Print text within trailing newline (better than non-portable echo -n ...)
  • Use advanced formatting features as in the C-function with similar name

In many places in the code these features are not needed, either because the strings don't use special formatting features, or because the strings are only used in $(...) context, where the trailing newlines will be stripped anyway. So I suggest to replace these:

treeData="$(printf '%s\n%s' "$treeData" "$(createSubtreeRec "$fullpath")")"
...
printf '040000 tree %s\t%s' $hash "$(basename "$path")"
...
printf '%s blob %s\t%s' $mode $hash "$(basename "$path")"
...
local entry="$(printf '%s %s %s\t%s' '040000' tree $hash $(basename "$path"))"

With:

treeData=$treeData$'\n'$(createSubtreeRec "$fullpath")
...
echo "040000 tree $hash"$'\t'"$(basename "$path")"
...
echo "$mode blob $hash"$'\t'"$(basename "$path")"
...
local entry="040000 tree $hash"$'\t'$(basename "$path")

You could embed tab characters in a string, it would make those lines easier to read:

echo "040000 tree $hash $(basename "$path")"
# ^
# tab

But I recommend to use $'\t' as above, because I don't like invisible characters :-)

(optimizing for perf) Replace basename "..." with substitution

Instead of basename "$path", you could write ${path##*/}. This might be a bit faster because substitution is native Bash, while basename is a program, with the overhead of executing another process. And I think the result should be equivalent, because the path separator / is not allowed in file and directory names.

It may be tempting to replace dirname "$path" with ${path%/*}. That requires a bit more careful thought, because the return value won't always be the same.

Use here-strings

Instead of echo ... | cmd, use cmd <<< "...".

(optimizing for perf) Rearrange operations to avoid redundant conditions

Consider this piece of code from your post, with minor optimizations to remove some noise:

if [ "$path" ]; then
 entry=$(createSubtreeRec "$workdir/$path")
else
 entry=$(createSubtreeRec "$workdir")
fi
if [ "$entry" ]; then
 while [ "$path" ] && [ "$path" != "." ] ; do
 hash=$(git mktree <<< "$entry")
 path=$(dirname "$path")
 entry="040000 tree $hash"$'\t'"${path##*/}"
 done
 echo "$hash"
fi

Can you spot some odd execution paths?

  • $path is either empty to begin with, or it will never be empty
  • In the last iteration of the while loop, entry is computed and it will not be used
  • If $path is . to begin with, then createSubtreeRec "$workdir/." will get executed, and I doubt that's what you want.

To eliminate some redundant paths, sometimes it helps to first increase the redundancy. This is logically equivalent to the above code:

if [ "$path" ]; then
 entry=$(createSubtreeRec "$workdir/$path")
 if [ "$entry" ]; then
 while [ "$path" ] && [ "$path" != "." ] ; do
 hash=$(git mktree <<< "$entry")
 path=$(dirname "$path")
 entry="040000 tree $hash"$'\t'"${path##*/}"
 done
 echo "$hash"
 fi
else
 entry=$(createSubtreeRec "$workdir")
 if [ "$entry" ]; then
 while [ "$path" ] && [ "$path" != "." ] ; do
 hash=$(git mktree <<< "$entry")
 path=$(dirname "$path")
 entry="040000 tree $hash"$'\t'"${path##*/}"
 done
 echo "$hash"
 fi
fi

Now let's try to simplify it. In the else branch we know that $path is empty, and it cannot possibly change, so we can drop the while loop:

else
 entry=$(createSubtreeRec "$workdir")
 if [ "$entry" ]; then
 echo "$hash"
 fi
fi

And since $hash is only ever set in the while loop we dropped, we can drop that too:

else
 entry=$(createSubtreeRec "$workdir")
 if [ "$entry" ]; then
 echo
 fi
fi

At this point I wonder if you really wanted to print a blank line in this case, or if it's a bug. (I let you decide!)

Back to the if branch, we have this code now:

 entry=$(createSubtreeRec "$workdir/$path")
 if [ "$entry" ]; then
 while [ "$path" ] && [ "$path" != "." ] ; do
 hash=$(git mktree <<< "$entry")
 path=$(dirname "$path")
 entry="040000 tree $hash"$'\t'"${path##*/}"
 done
 echo "$hash"
 fi

Since at this point $path is not empty and will never be, we can simplify the while statement, eliminating one unnecessary evaluation per iteration:

 while [ "$path" != "." ] ; do

This is still not ideal, since this condition will always be true for the first time, and the last computation of entry will not be used. We can rearrange the loop to improve on both of these points:

 while true; do
 hash=$(git mktree <<< "$entry")
 path=$(dirname "$path")
 [ "$path" != "." ] || break
 entry="040000 tree $hash"$'\t'"${path##*/}"
 done
 echo "$hash"

I doubt any of this would have a measurable performance difference except in extremely deep paths, I thought it was an interesting exercise to share.

Missed one important double-quoting

There is one spot that missed an important double-quoting:

local entry="$(printf '%s %s %s\t%s' '040000' tree $hash $(basename "$path"))"

If the basename of $path contains spaces, then the result of $(basename "$path") will be subject to word-splitting. Correctly quoted, the above line would be:

local entry=$(printf '%s %s %s\t%s' '040000' tree "$hash" "$(basename "$path")")

Note:

  • It's true that in this example we know that $hash is safe even without double-quoting. You never know how the script might evolve, and somebody might assign something unsafe to this variable. Also, when you assign something to a variable, you don't want to verify if all uses of the variable are safe enough. If you develop the habit to always double-quote variables used in command arguments, you can rest assured that your code is safe.

  • I dropped the double-quotes around the right-hand side of the assignment, because it's already safe like that. If the value contained literal spaces, that would have to be quoted though. If you feel safer to always write v="...", that's fine.

No need for input validation?

From the posted code it's not clear if createSubtree is guaranteed to get safe input. It seems to me that path is expected to be valid relative path from worktree, but I don't see how that is actually enforced. You might want to add some sanity checks and fail fast (and with a big bang) when receiving something unexpected.

Minor things

I'm not a fan of assigning a default value and overwriting it like this:

mode=100644
if [ -x "$path" ]; then
 mode=100755
fi

I would rather spell out an else.


Instead of duplicating the pattern %(mode) %(type) %(hash)\t%(name) in multiple places, it might be a good idea to create a helper function for it.


The type variable is assigned once and never used.

answered Aug 3, 2019 at 16:00
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.