SLOC Counter Script Optimization

Question 1

I'm a HPC guy so I'm all about "make it work, then make it fast." I have a little bash script which gets a Physical Source Lines Of Code (PSLOC) and Logical SLOC (currently just of Java code, but it would work for C and C++ too). It works by finding all *.java files in a directory and then goes through each file found and uses sed to delete stuff I don't want. It first does PSLOC by deleting all comments, whitespace, and blank lines. It then gets LSLOC by taking PSLOC and deleting all lines starting with # or @, replacing {}()'s with a blank space, deleting whitespace, and finally deleting blank lines. I get the line count for PSLOC and LSLOC with wc -l.

This works, but I'm a scripting noob and haven't done anything like this before. What I'm wondering is how can I make this script faster and easier to read/maintain:

#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"
find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
 PSLOC="$FILE.psloc"
 LSLOC="$FILE.lsloc"
 # Get PSLOC
 sed -e "s/\/\/.*$//" $FILE > $PSLOC # deletes all /* to */ lines
 sed -i -e 's|/\*|\n&|g;s|*/|&\n|g' $PSLOC # ^
 sed -i -e '/\/\*/,/*\//d' $PSLOC # deletes all // lines
 sed -i -e 's/^[ \t]*//' -e 's/[ \t]*$//' $PSLOC # deletes all whitespace
 sed -i -e '/^$/d' $PSLOC # deletes all blank lines
 wc -l $PSLOC
 # Get LSLOC
 sed -e 's/#.*//' $PSLOC > $LSLOC # deletes all # lines
 sed -i -e 's/@.*//' $LSLOC # deletes all # lines
 sed -i -e 's/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g' $LSLOC # replaces {}() with spaces
 sed -i -e 's/^[ \t]*//' -e 's/[ \t]*$//' $LSLOC # deletes all whitespace
 sed -i -e '/^$/d' $LSLOC # deletes all blank lines
 wc -l $LSLOC
done < filelist
find "$DIR" -name "*.*sloc" -type f -delete

This gets PSLOC and LSLOC counts of 531 java files (~80524 lines) in about 4 minutes.

Edit: per @choroba's suggestion, I changed the file to

#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"
find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do 
 PSLOC="$FILE.psloc"
 LSLOC="$FILE.lsloc"
 sed -e "s/\/\/.*$//" -e 's|/\*|\n&|g;s|*/|&\n|g' -e '/\/\*/,/*\//d' -e 's/^[ \t]*//' -e 's/[ \t]*$//' -e '/^$/d' $FILE > $PSLOC # deletes all /* to */ lines
 wc -l $PSLOC
 sed -e 's/#.*//' -e 's/@.*//' -e 's/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g' -e 's/^[ \t]*//' -e 's/[ \t]*$//' -e '/^$/d' $PSLOC > $LSLOC # deletes all # lines
 wc -l $LSLOC
done < filelist 
find "$DIR" -name "*.*sloc" -type f -delete

and the runtime went down to about 80 seconds. Awesome! Any other ideas?

Edit 2: So now it's even more succinct without writing the changes to files, just spitting out the linecount. How would I go about spitting the two numbers resulting from sed ... | wc -l on the same line separated by a space without doing some assignment like a=($(sed ... | wc -l)) and using echo?

#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"
wc -l `find $DIR -type f -name '*.java'` > OSLOC
sed -i -e "s/^[ \t]*//" -e "s/[ \t]*$//" OSLOC
find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
 # Get PSLOC 
 sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
 # Get LSLOC
 sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/#.*//" -e "s/@.*//" -e "s/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
done < filelist
find "$DIR" -name "*.*sloc" -type f -delete

Question 2

Can't you stack all the sed expressions into several -e's of one sed call?

Question 3

@choroba, would it be possible to combine the wc -l with the single line sed calls so that it just counts what the file would be without actually writing it?

Question 4

Sure, instead of > file do | wc -l.

Question 5

@choroba, see my second edit for another question about formatting the output

Question 6

BTW Java code comments don't start with # or @.

Question 7

Starting from the top (of the final program in the question):

#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"

It's better to use lower-case names for shell variables, to reduce the chance of accidentally colliding with environment variables (e.g. LANG) which change the behaviour of child processes.

wc -l `find $DIR -type f -name '*.java'` > OSLOC

Prefer to use modern $(...) notation for process substitution rather than old-fashioned `...`. It's nestable, and easier to visually pair up the delimiters.

You forgot to quote "$DIR" to prevent word splitting after substitution.

I'm not sure why you use *.java as the search pattern here, rather than *.$LANGUAGE as used later in the script. I would have expected these to be consistent.

Note that we are counting the number of lines in the output, not the number of files. So the output will be wrong if any of our filenames contain newline characters. We might be better creating an array variable of all the matching filenames:

mapfile -d '' -t files < <(find "$DIR" -type f -name "*.$LANGUAGE")

Then we can count the entries simply as ${#files[@]}.

sed -i -e "s/^[ \t]*//" -e "s/[ \t]*$//" OSLOC

Here, we open the file we just wrote to change it in-place. It would have been more efficient to insert this sed command into the pipe that writes to OSLOC in the first place. I.e.

wc -l `...` | sed ... >OSLOC

However, this filtering shouldn't be necessary, as POSIX-conformant wc -l doesn't produce leading or trailing whitespace.

find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
⋮
done < filelist

Again here we're writing a file only to re-open it for reading as soon as we've finished. This serialises the operations (the read loop won't start until find exits), increasing the latency of our script.

For this kind of input, we really should be using read -r so as not to interpret \ as an escape sequence introducer.

If we have the filenames in an array variable as above, we don't need to use find a second time, but can simply use the array:

for file in "${files[@]}"
do
⋮
done

 # Get PSLOC 
 sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
 # Get LSLOC
 sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/#.*//" -e "s/@.*//" -e "s/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l

These two sed programs are probably long enough to be worth writing as scripts themselves, where we can include comments and simplify the commands:

#!/usr/bin/sed
s,//.*,, # remove single-line comments // ...
s|/\*|\n&|g # remove C-style comments /* ... */
s|*/|&\n|g
/\/\*/,/\*\//d
/^[ \t]*$/d # remove blank lines

#!/usr/bin/sed
s,//.*,, # remove single-line comments // ...
s|/\*|\n&|g # remove C-style comments /* ... */
s|*/|&\n|g
/\/\*/,/\*\//d
s/[@#].*// # remove # and @ decorations
/^[ \t{}()]*$/d # remove blank and bracket-only lines

Even if written in-line, we should consider using multiple lines for improved readability:

 # Get PSLOC 
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '/\/\*/,/*\//d' \
 -e '/^[ \t]*$/d' \
 "$FILE" | wc -l
 # Get LSLOC 
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '/\/\*/,/*\//d' \
 -e 's/[@#].*//' \
 -e '/^[ \t{}()]*$/d' \
 "$FILE" | wc -l

Given that the second invocation begins with the same commands as the first, we could avoid repeating them by using a tee in the pipe:

exec 3>&1
 # Get PSLOC 
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '\!/\*!,\!\*/!d' \
 -e '/^[ \t]*$/d' \
 "$file" | tee >(wc -l >&3) | # PSLOC
 sed \
 -e 's/[@#].*//' \
 -e '/^[ \t{}()]*$/d' | wc -l # LSLOC

I also made the above a bit more portable, by eliminating the GNU ; as line separator, and I properly quoted the filename.

When we have mixed // and /* comments, I think the counting could go catastrophically wrong if we have something like

/* This does integer division like Python '//' */

It might be better to strip out C-style comments first, then C++-style ones. However, neither approach is totally foolproof.

find "$DIR" -name "*.*sloc" -type f -delete

It's not clear why removing these files is a part of this script. Really we shouldn't be writing or removing any files when counting - it should be a read-only operation. Perhaps the intent was to delete the temporary filelist we created? It's better to remove the need for temporary files (or if unavoidable, to use mktemp, which respect's the user's TMPDIR and avoids clobbering existing files).

Modified code

#!/bin/bash
set -eu -o pipefail
dir=C:/dev/somejavafolder
language=java
mapfile -d '' -t files < <(find "$dir" -type f -name "*.$language")
echo "${#files[@]}" >OSLOC
exec 3>&1
for file in "${files[@]}"
do
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '\!/\*!,\!\*/!d' \
 -e '/^[ \t]*$/d' \
 "$file" | tee >(wc -l >&3) | # PSLOC
 sed \
 -e 's/[@#].*//' \
 -e '/^[ \t{}()]*$/d' | wc -l # LSLOC
done

Toby Speight Toby Speight 87.9k14 gold badges104 silver badges325 bronze badges · Answer 1 · 2022-12-06 09:42:48Z

Starting from the top (of the final program in the question):

#!/bin/bash
DIR="C:/dev/somejavafolder"
LANGUAGE="java"

It's better to use lower-case names for shell variables, to reduce the chance of accidentally colliding with environment variables (e.g. LANG) which change the behaviour of child processes.

wc -l `find $DIR -type f -name '*.java'` > OSLOC

Prefer to use modern $(...) notation for process substitution rather than old-fashioned `...`. It's nestable, and easier to visually pair up the delimiters.

You forgot to quote "$DIR" to prevent word splitting after substitution.

I'm not sure why you use *.java as the search pattern here, rather than *.$LANGUAGE as used later in the script. I would have expected these to be consistent.

Note that we are counting the number of lines in the output, not the number of files. So the output will be wrong if any of our filenames contain newline characters. We might be better creating an array variable of all the matching filenames:

mapfile -d '' -t files < <(find "$DIR" -type f -name "*.$LANGUAGE")

Then we can count the entries simply as ${#files[@]}.

sed -i -e "s/^[ \t]*//" -e "s/[ \t]*$//" OSLOC

Here, we open the file we just wrote to change it in-place. It would have been more efficient to insert this sed command into the pipe that writes to OSLOC in the first place. I.e.

wc -l `...` | sed ... >OSLOC

However, this filtering shouldn't be necessary, as POSIX-conformant wc -l doesn't produce leading or trailing whitespace.

find "$DIR" -type f -name "*.$LANGUAGE" > filelist
while read FILE; do
⋮
done < filelist

Again here we're writing a file only to re-open it for reading as soon as we've finished. This serialises the operations (the read loop won't start until find exits), increasing the latency of our script.

For this kind of input, we really should be using read -r so as not to interpret \ as an escape sequence introducer.

If we have the filenames in an array variable as above, we don't need to use find a second time, but can simply use the array:

for file in "${files[@]}"
do
⋮
done

 # Get PSLOC 
 sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l
 # Get LSLOC
 sed -e "s/\/\/.*$//" -e "s|/\*|\n&|g;s|*/|&\n|g" -e "/\/\*/,/*\//d" -e "s/#.*//" -e "s/@.*//" -e "s/{/ /g;s/}/ /g;s/)/ /g;s/(/ /g" -e "s/^[ \t]*//" -e "s/[ \t]*$//" -e "/^$/d" $FILE | wc -l

These two sed programs are probably long enough to be worth writing as scripts themselves, where we can include comments and simplify the commands:

#!/usr/bin/sed
s,//.*,, # remove single-line comments // ...
s|/\*|\n&|g # remove C-style comments /* ... */
s|*/|&\n|g
/\/\*/,/\*\//d
/^[ \t]*$/d # remove blank lines

#!/usr/bin/sed
s,//.*,, # remove single-line comments // ...
s|/\*|\n&|g # remove C-style comments /* ... */
s|*/|&\n|g
/\/\*/,/\*\//d
s/[@#].*// # remove # and @ decorations
/^[ \t{}()]*$/d # remove blank and bracket-only lines

Even if written in-line, we should consider using multiple lines for improved readability:

 # Get PSLOC 
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '/\/\*/,/*\//d' \
 -e '/^[ \t]*$/d' \
 "$FILE" | wc -l
 # Get LSLOC 
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '/\/\*/,/*\//d' \
 -e 's/[@#].*//' \
 -e '/^[ \t{}()]*$/d' \
 "$FILE" | wc -l

Given that the second invocation begins with the same commands as the first, we could avoid repeating them by using a tee in the pipe:

exec 3>&1
 # Get PSLOC 
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '\!/\*!,\!\*/!d' \
 -e '/^[ \t]*$/d' \
 "$file" | tee >(wc -l >&3) | # PSLOC
 sed \
 -e 's/[@#].*//' \
 -e '/^[ \t{}()]*$/d' | wc -l # LSLOC

I also made the above a bit more portable, by eliminating the GNU ; as line separator, and I properly quoted the filename.

When we have mixed // and /* comments, I think the counting could go catastrophically wrong if we have something like

/* This does integer division like Python '//' */

It might be better to strip out C-style comments first, then C++-style ones. However, neither approach is totally foolproof.

find "$DIR" -name "*.*sloc" -type f -delete

It's not clear why removing these files is a part of this script. Really we shouldn't be writing or removing any files when counting - it should be a read-only operation. Perhaps the intent was to delete the temporary filelist we created? It's better to remove the need for temporary files (or if unavoidable, to use mktemp, which respect's the user's TMPDIR and avoids clobbering existing files).

Modified code

#!/bin/bash
set -eu -o pipefail
dir=C:/dev/somejavafolder
language=java
mapfile -d '' -t files < <(find "$dir" -type f -name "*.$language")
echo "${#files[@]}" >OSLOC
exec 3>&1
for file in "${files[@]}"
do
 sed -e 's,//.*,,' \
 -e 's|/\*|\n&|g' \
 -e 's|*/|&\n|g' \
 -e '\!/\*!,\!\*/!d' \
 -e '/^[ \t]*$/d' \
 "$file" | tee >(wc -l >&3) | # PSLOC
 sed \
 -e 's/[@#].*//' \
 -e '/^[ \t{}()]*$/d' | wc -l # LSLOC
done

Stack Exchange Network

SLOC Counter Script Optimization

1 Answer 1

Modified code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

SLOC Counter Script Optimization

1 Answer 1

Modified code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions