I have the following script, which takes minutes to produce any output.
printf "\nDuplicate JS Filenames...\n"
(
find . -name '*.js' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 ";
echo "$(find . -type f -name '*.js' | wc -l) JS files in search directory";
echo "$(find . -name '*.js' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found";
)
printf "\nDuplicate Java Filenames...\n"
(
find . -name '*.java' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 ";
echo "$(find . -type f -name '*.java' | wc -l) Java files in search directory";
echo "$(find . -name '*.java' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found";
)
I know that I run the same command, or similar ones, a couple of times.
How could I optimize this, and maybe the base command itself? I'm surprised that find takes so long; or is it due to sort, uniq, and grep?
1 Answer
Aside from running essentially the same find command three times, the main issue is that you run a separate basename instance for every single found file.
If you are using GNU find (verify with find --version), you can get find to print the basenames directly:
find . -name '*.js' -type f -printf '%f\n'
On my system this is about 900 times faster than calling basename when run on a directory with about 200,000 files in it.
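If you want to reproduce the comparison, time(1) makes it easy (a rough sketch; the absolute numbers will depend on your filesystem and cache state):
time find . -name '*.js' -type f -exec basename {} \; > /dev/null
time find . -name '*.js' -type f -printf '%f\n' > /dev/null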
If your system does not come with GNU find (e.g. macOS, OpenBSD, FreeBSD) and you do not want to install it (the package is usually called findutils), you can use sed to do the same as basename, but for all found files at once:
find . -name '*.js' -type f | sed 's@.*/@@'
On my system this is only slightly slower than using -printf.
If you want to reduce the number of times you run find, you can simply save the output in a variable:
filelist="$(find . -name '*.js' -type f -printf '%f\n' | sort)"
echo "$filelist" | uniq -c | grep -v "^[ \t]*1 ";
echo "$(echo "$filelist" | wc -l) JS files in search directory";
echo "$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found"
Note that in bash you need to put double quotes around $filelist so that the newlines are not squashed.
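Since the JS and Java reports are identical apart from the extension, the whole thing could also be factored into a small function (a sketch, assuming GNU find; the function name report_duplicates and its parameters are my own invention):
report_duplicates() {
    # 1ドル = file extension (e.g. js), 2ドル = display label (e.g. JS)
    printf '\nDuplicate %s Filenames...\n' "2ドル"
    local filelist
    filelist="$(find . -name "*.1ドル" -type f -printf '%f\n' | sort)"
    echo "$filelist" | uniq -c | grep -v "^[ \t]*1 "
    echo "$(echo "$filelist" | wc -l) 2ドル files in search directory"
    echo "$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found"
}

report_duplicates js JS
report_duplicates java Java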
- Alternatively, check whether your basename accepts --multiple arguments (see the first sketch after these comments). – Toby Speight, Mar 22, 2018 at 15:02
- On my system, the answer is now given in 1.2 sec instead of 134.9 sec. Thanks a lot! And thanks for the explanations given, which allow me to learn at the same time... – user3341592, Mar 22, 2018 at 15:27
- A question: let's say I'm never interested in seeing some duplicate files whose names would be hard-coded, such as package-info.java, AllTests.java and Constants.java. How could I remove those lines from the output? I guess chaining grep -v commands one after the other is not the right solution... (see the second sketch after these comments) – user3341592, Mar 22, 2018 at 15:29
- Question posted as codereview.stackexchange.com/questions/190215/… – user3341592, Mar 22, 2018 at 15:48
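Regarding the --multiple suggestion: GNU coreutils basename accepts several paths per invocation via -a (--multiple), and find's + terminator batches many files into each call, so only a handful of basename processes are spawned (a sketch, assuming GNU basename):
find . -name '*.js' -type f -exec basename -a {} + | sort | uniq -c | grep -v "^[ \t]*1 "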
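And regarding the hard-coded exclusions: instead of chaining grep -v calls, a single grep -vE with an anchored alternation can drop the unwanted names from the sorted list before counting (a sketch using the names from the comment above):
filelist="$(find . -name '*.java' -type f -printf '%f\n' | sort)"
echo "$filelist" | grep -vE '^(package-info|AllTests|Constants)\.java$' | uniq -c | grep -v "^[ \t]*1 "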