\$\begingroup\$

I am trying to speed up the script below, which reports disk usage.

The timed find command toward the end is the line I am trying to speed up. The script runs on directories holding 6-7 TB of data and takes 16-18 hours; I want it to finish in under 8 hours. Can someone suggest ways to modify the script to achieve this?

# -disk_check.csh takes a dir name as a mandatory argument and an optional <num> or -verbose as a second argument.
# Ex1: disk_check <dir_name> - Reports out the disk usage per user and the total disk consumption
# Ex2: disk_check <dir_name> -verbose -Along with the above, it also lists all files by size in the given directory
# Ex3: disk_check <dir_name> -<num> -Similar to Ex2, But here it reports out the top <num> files by size in the given directory
if ($#argv == 0) then
 echo " Error : Dir path missing"
 echo " Syntax : disk_check <dir-name> <verbose>"
 echo " verbose gives a list of all files per individual sorted by size" 
 exit 0 
endif
set cwd = $argv[1]
if ($cwd =~ "-help") then
 echo " Error : Dir path missing"
 echo " Syntax : disk_check <dir-name> <-verbose>"
 echo " -verbose gives a list of all files per individual sorted by size" 
 exit 0
endif
if ($#argv > 1) then
 set opt = $argv[2]
#echo "opt : $opt"
endif
if ( -d $cwd ) then
set ava = `df -h $cwd | tail -1 | awk '{print $1'}`
set tot = `df -h $cwd | tail -1 | awk '{print $2'}`
set ad = `df -h $cwd | tail -1 | awk '{print $3'}`
set pcu = `df -h $cwd | tail -1 | awk '{print $4'}`
echo ""
echo "Summary for dir ${cwd}: $tot Used (${pcu})"
echo "-----------------------------------------------------------------------------"
echo " Total Volume $ava"
echo " Available on disk $ad "
echo " Percentage used $pcu"
echo ""
echo "Summary by User:"
printf "%sUser%15sSize%10sCount\n" ""
echo "---------------------------------------------"
# This is the command that takes a long time:
time find $cwd -type f -printf "%u %s\n" | awk '{user[$1]+=$2;count[$1]++}; END{ for( i in user) printf "%s%-13s%5s%-0.2f%s%5s%7s\n","", i, "", user[i]/1024**3,"GB", "", count[i]}'| sort -nk2 -r
if ($#argv > 1) then
 if ($opt =~ "-verbose") then
 echo "\nDetail, Sorted by size"
 printf " User%15sFile%15sSize\n" ""
 echo "---------------------------------------------------"
 find $cwd -type f -not -path '*/\.*' -printf "%-13u | %-50p | %-10s \n" | sort -nk5 -r 
endif
toolic
asked Dec 10, 2018 at 5:25
\$\endgroup\$
  • \$\begingroup\$ Have you considered turning on disk quotas? Then the filesystem keeps track of the usage per user, and you can run quota to get a report. \$\endgroup\$ Commented Dec 10, 2018 at 5:51
  • \$\begingroup\$ Have you profiled which part of the script is taking such a long time? \$\endgroup\$ Commented Dec 10, 2018 at 6:48
  • \$\begingroup\$ @Mast The specific find command has already been identified as a performance problem. \$\endgroup\$ Commented Dec 10, 2018 at 7:47
  • \$\begingroup\$ @200_success That specific line is the problem, yes. It's also doing most of the heavy lifting of the script, a find, sort and awk call. Given the amount of data it's used on, I'm not sure find is the only problem here. \$\endgroup\$ Commented Dec 10, 2018 at 8:38
  • \$\begingroup\$ @Mast The time taken by awk and sort are surely insignificant compared to scanning an entire filesystem! \$\endgroup\$ Commented Dec 10, 2018 at 12:07

1 Answer

\$\begingroup\$

Potential bugs

When I run the code without the -verbose option like this:

disk_check.csh /tmp

I see this message on stderr:

 then: then/endif not found.

I see two potential places in the code where there could be a missing endif. Here is one:

if ( -d $cwd ) then
set ava = `df -h $cwd | tail -1 | awk '{print $1'}`
set tot = `df -h $cwd | tail -1 | awk '{print $2'}`

Here is another:

if ($opt =~ "-verbose") then
 echo "\nDetail, Sorted by size"
 printf " User%15sFile%15sSize\n" ""

Both areas of the code should be reviewed.

Note that this message is easy to miss if it is mixed in with all the other expected output on stdout.
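
One way to make such a message stand out is to discard stdout so that only stderr reaches the terminal. The sketch below is a hedged illustration in POSIX sh, not csh: `noisy` is a hypothetical stand-in for the script, and the message text is copied from the stderr output quoted earlier. The plain `> /dev/null` redirection of stdout works the same way when the real script is launched from csh.

```shell
#!/bin/sh
# Hypothetical stand-in for disk_check.csh: one report line on stdout,
# one diagnostic on stderr.
noisy() {
    echo "Summary by User:"
    echo "then: then/endif not found." >&2
}

# Redirections apply left to right: stderr is first pointed at the capture
# pipe, then stdout is sent to /dev/null, so only stderr is captured.
stderr_only=$(noisy 2>&1 >/dev/null)

echo "$stderr_only"
```

Against the real script, the equivalent check would be to run `disk_check.csh /tmp > /dev/null` and see whether anything is still printed.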

DRY

It is great that you print out usage information. However, this code appears twice, almost verbatim. I say "almost" because the only difference I can see is:

verbose

vs.:

-verbose

Unfortunately, since you are using a shell scripting language with very limited programming capability, there is no clean way around this.

Comment

To reduce clutter, delete this commented-out code line:

#echo "opt : $opt"

Documentation

It is great that you added header comments to describe your code.

However, you mention an option you refer to as <num> and -<num>, but it is unclear how it should be used. It does not seem to have any effect for me. You should elaborate with more concrete examples.

The comments should mention the -help option since the code uses it.

answered Jan 17 at 12:15
\$\endgroup\$
