I'm a HPC guy so I'm all about "make it work, then make it fast." I have a little bash script which gets a Physical Source Lines Of Code (PSLOC) and Logical SLOC (currently just of Java code, but it would work for C and C++ too). It works by finding all *.java
files in a directory and then goes through each file found and uses sed to delete stuff I don't want. It first does PSLOC by deleting all comments, whitespace, and blank lines. It then gets LSLOC by taking PSLOC and deleting all lines starting with #
or @
, replacing {}()
's with a blank space, deleting whitespace, and finally deleting blank lines. I get the line count for PSLOC and LSLOC with wc -l
.
This works, but I'm a scripting noob and haven't done anything like this before. What I'm wondering is how can I make this script faster and easier to read/maintain:
#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"
find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
PSLOC="$FILE.psloc"
LSLOC="$FILE.lsloc"
# Get PSLOC
sed -e "s/\/\/.*$//" $FILE > $PSLOC # deletes all /* to */ lines
sed -i -e 's|/\*|\n&|g;s|*/|&\n|g' $PSLOC # ^
sed -i -e '/\/\*/,/*\//d' $PSLOC # deletes all // lines
sed -i -e 's/^[ \t]*//' -e 's/[ \t]*$//' $PSLOC # deletes all whitespace
sed -i -e '/^$/d' $PSLOC # deletes all blank lines
wc -l $PSLOC
# Get LSLOC
sed -e 's/#.*//' $PSLOC > $LSLOC # deletes all # lines
sed -i -e 's/@.*//' $LSLOC # deletes all # lines
sed -i -e 's/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g' $LSLOC # replaces {}() with spaces
sed -i -e 's/^[ \t]*//' -e 's/[ \t]*$//' $LSLOC # deletes all whitespace
sed -i -e '/^$/d' $LSLOC # deletes all blank lines
wc -l $LSLOC
done < filelist
find "$DIR" -name "*.*sloc" -type f -delete
This gets PSLOC and LSLOC counts of 531 java files (~80524 lines) in about 4 minutes.
Edit: per @choroba's suggestion, I changed the file to
#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"
find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
PSLOC="$FILE.psloc"
LSLOC="$FILE.lsloc"
sed -e "s/\/\/.*$//" -e 's|/\*|\n&|g;s|*/|&\n|g' -e '/\/\*/,/*\//d' -e 's/^[ \t]*//' -e 's/[ \t]*$//' -e '/^$/d' $FILE > $PSLOC # deletes all /* to */ lines
wc -l $PSLOC
sed -e 's/#.*//' -e 's/@.*//' -e 's/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g' -e 's/^[ \t]*//' -e 's/[ \t]*$//' -e '/^$/d' $PSLOC > $LSLOC # deletes all # lines
wc -l $LSLOC
done < filelist
find "$DIR" -name "*.*sloc" -type f -delete
and the runtime went down to about 80 seconds. Awesome! Any other ideas?
Edit 2: So now it's even more succinct without writing the changes to files, just spitting out the linecount. How would I go about spitting the two numbers resulting from sed ... | wc -l
on the same line separated by a space without doing some assignment like a=($(sed ... | wc -l))
and using echo
?
#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"
wc -l `find $DIR -type f -name '*.java'` > OSLOC
sed -i -e "s/^[ \t]*//" -e "s/[ \t]*$//" OSLOC
find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
# Get PSLOC
sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
# Get LSLOC
sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/#.*//" -e "s/@.*//" -e "s/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
done < filelist
find "$DIR" -name "*.*sloc" -type f -delete
1 Answer 1
Starting from the top (of the final program in the question):
#!/bin/bash DIR="C:/dev/somejavafolder" LANGUAGE="java"
It's better to use lower-case names for shell variables, to reduce the chance of accidentally colliding with environment variables (e.g. LANG
) which change the behaviour of child processes.
wc -l `find $DIR -type f -name '*.java'` > OSLOC
Prefer to use modern $(...)
notation for process substitution rather than old-fashioned `...`
. It's nestable, and easier to visually pair up the delimiters.
You forgot to quote "$DIR"
to prevent word splitting after substitution.
I'm not sure why you use *.java
as the search pattern here, rather than *.$LANGUAGE
as used later in the script. I would have expected these to be consistent.
Note that we are counting the number of lines in the output, not the number of files. So the output will be wrong if any of our filenames contain newline characters. We might be better creating an array variable of all the matching filenames:
mapfile -d '' -t files < <(find "$DIR" -type f -name "*.$LANGUAGE")
Then we can count the entries simply as ${#files[@]}
.
sed -i -e "s/^[ \t]*//" -e "s/[ \t]*$//" OSLOC
Here, we open the file we just wrote to change it in-place. It would have been more efficient to insert this sed
command into the pipe that writes to OSLOC
in the first place. I.e.
wc -l `...` | sed ... >OSLOC
However, this filtering shouldn't be necessary, as POSIX-conformant wc -l
doesn't produce leading or trailing whitespace.
find "$DIR" -type f -name "*.$LANGUAGE" > filelist while read FILE; do ⋮ done < filelist
Again here we're writing a file only to re-open it for reading as soon as we've finished. This serialises the operations (the read
loop won't start until find
exits), increasing the latency of our script.
For this kind of input, we really should be using read -r
so as not to interpret \
as an escape sequence introducer.
If we have the filenames in an array variable as above, we don't need to use find
a second time, but can simply use the array:
for file in "${files[@]}"
do
⋮
done
# Get PSLOC sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l # Get LSLOC sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/#.*//" -e "s/@.*//" -e "s/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
These two sed programs are probably long enough to be worth writing as scripts themselves, where we can include comments and simplify the commands:
#!/usr/bin/sed
s,//.*,, # remove single-line comments // ...
s|/\*|\n&|g # remove C-style comments /* ... */
s|*/|&\n|g
/\/\*/,/\*\//d
/^[ \t]*$/d # remove blank lines
#!/usr/bin/sed
s,//.*,, # remove single-line comments // ...
s|/\*|\n&|g # remove C-style comments /* ... */
s|*/|&\n|g
/\/\*/,/\*\//d
s/[@#].*// # remove # and @ decorations
/^[ \t{}()]*$/d # remove blank and bracket-only lines
Even if written in-line, we should consider using multiple lines for improved readability:
# Get PSLOC
sed -e 's,//.*,,' \
-e 's|/\*|\n&|g' \
-e 's|*/|&\n|g' \
-e '/\/\*/,/*\//d' \
-e '/^[ \t]*$/d' \
"$FILE" | wc -l
# Get LSLOC
sed -e 's,//.*,,' \
-e 's|/\*|\n&|g' \
-e 's|*/|&\n|g' \
-e '/\/\*/,/*\//d' \
-e 's/[@#].*//' \
-e '/^[ \t{}()]*$/d' \
"$FILE" | wc -l
Given that the second invocation begins with the same commands as the first, we could avoid repeating them by using a tee in the pipe:
exec 3>&1
# Get PSLOC
sed -e 's,//.*,,' \
-e 's|/\*|\n&|g' \
-e 's|*/|&\n|g' \
-e '\!/\*!,\!\*/!d' \
-e '/^[ \t]*$/d' \
"$file" | tee >(wc -l >&3) | # PSLOC
sed \
-e 's/[@#].*//' \
-e '/^[ \t{}()]*$/d' | wc -l # LSLOC
I also made the above a bit more portable, by eliminating the GNU ;
as line separator, and I properly quoted the filename.
When we have mixed //
and /*
comments, I think the counting could go catastrophically wrong if we have something like
/* This does integer division like Python '//' */
It might be better to strip out C-style comments first, then C++-style ones. However, neither approach is totally foolproof.
find "$DIR" -name "*.*sloc" -type f -delete
It's not clear why removing these files is a part of this script. Really we shouldn't be writing or removing any files when counting - it should be a read-only operation. Perhaps the intent was to delete the temporary filelist
we created? It's better to remove the need for temporary files (or if unavoidable, to use mktemp
, which respect's the user's TMPDIR
and avoids clobbering existing files).
Modified code
#!/bin/bash
set -eu -o pipefail
dir=C:/dev/somejavafolder
language=java
mapfile -d '' -t files < <(find "$dir" -type f -name "*.$language")
echo "${#files[@]}" >OSLOC
exec 3>&1
for file in "${files[@]}"
do
sed -e 's,//.*,,' \
-e 's|/\*|\n&|g' \
-e 's|*/|&\n|g' \
-e '\!/\*!,\!\*/!d' \
-e '/^[ \t]*$/d' \
"$file" | tee >(wc -l >&3) | # PSLOC
sed \
-e 's/[@#].*//' \
-e '/^[ \t{}()]*$/d' | wc -l # LSLOC
done
-e
's of one sed call? \$\endgroup\$wc -l
with the single line sed calls so that it just counts what the file would be without actually writing it? \$\endgroup\$> file
do| wc -l
. \$\endgroup\$#
or@
. \$\endgroup\$