I have the following script, which takes minutes to produce any output.
printf "\nDuplicate JS Filenames...\n"
(
find . -name '*.js' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 ";
echo "$(find . -type f -name '*.js' | wc -l) JS files in search directory";
echo "$(find . -name '*.js' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found";
)
printf "\nDuplicate Java Filenames...\n"
(
find . -name '*.java' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 ";
echo "$(find . -type f -name '*.java' | wc -l) Java files in search directory";
echo "$(find . -name '*.java' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found";
)
I know that I run the same command, or similar ones, a couple of times.
How could I optimize this, and maybe the base command itself? I'm surprised that find takes so long; or is it due to sort, uniq, and grep?
1 Answer
Aside from running essentially the same find command three times, the main issue is that you run a separate basename instance for every single found file.
If you are using GNU find (verify with find --version), you can get find to print the basenames directly:
find . -name '*.js' -type f -printf '%f\n'
On my system this is about 900 times faster than calling basename when run on a directory with about 200,000 files in it.
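If you want to reproduce the comparison, time(1) makes it easy (a rough sketch; the absolute numbers will depend on your filesystem and cache state):
time find . -name '*.js' -type f -exec basename {} \; > /dev/null
time find . -name '*.js' -type f -printf '%f\n' > /dev/null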
If your system does not come with GNU find (e.g. macOS, OpenBSD, FreeBSD) and you do not want to install it (the package is usually called findutils), you can use sed to do the same as basename, but for all found files at once:
find . -name '*.js' -type f | sed 's@.*/@@'
On my system this is only slightly slower than using -printf.
If you want to reduce the number of times you run find, you can simply save the output in a variable:
filelist="$(find . -name '*.js' -type f -printf '%f\n' | sort)"
echo "$filelist" | uniq -c | grep -v "^[ \t]*1 ";
echo "$(echo "$filelist" | wc -l) JS files in search directory";
echo "$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found"
Note that in bash you need to put double quotes around $filelist so that the newlines are not squashed.
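Since the JS and Java reports are identical apart from the extension, the whole thing could also be factored into a small function (a sketch, assuming GNU find; the function name report_duplicates and its parameters are my own invention):
report_duplicates() {
    # 1ドル = file extension (e.g. js), 2ドル = display label (e.g. JS)
    printf '\nDuplicate %s Filenames...\n' "2ドル"
    local filelist
    filelist="$(find . -name "*.1ドル" -type f -printf '%f\n' | sort)"
    echo "$filelist" | uniq -c | grep -v "^[ \t]*1 "
    echo "$(echo "$filelist" | wc -l) 2ドル files in search directory"
    echo "$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found"
}

report_duplicates js JS
report_duplicates java Java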
- Alternatively, check whether your basename accepts --multiple arguments (see the first sketch after these comments). – Toby Speight, Mar 22, 2018 at 15:02
- On my system, the answer is now given in 1.2 sec instead of 134.9 sec. Thanks a lot! And thanks for the explanations given, which allow me to learn at the same time... – user3341592, Mar 22, 2018 at 15:27
- A question: let's say I'm never interested in seeing some duplicate files whose names would be hard-coded, such as package-info.java, AllTests.java and Constants.java. How could I remove those lines from the output? I guess chaining grep -v commands one after the other is not the right solution... (see the second sketch after these comments) – user3341592, Mar 22, 2018 at 15:29
- Question posted as codereview.stackexchange.com/questions/190215/… – user3341592, Mar 22, 2018 at 15:48
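Regarding the --multiple suggestion: GNU coreutils basename accepts several paths per invocation via -a (--multiple), and find's + terminator batches many files into each call, so only a handful of basename processes are spawned (a sketch, assuming GNU basename):
find . -name '*.js' -type f -exec basename -a {} + | sort | uniq -c | grep -v "^[ \t]*1 "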
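And regarding the hard-coded exclusions: instead of chaining grep -v calls, a single grep -vE with an anchored alternation can drop the unwanted names from the sorted list before counting (a sketch using the names from the comment above):
filelist="$(find . -name '*.java' -type f -printf '%f\n' | sort)"
echo "$filelist" | grep -vE '^(package-info|AllTests|Constants)\.java$' | uniq -c | grep -v "^[ \t]*1 "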