Edit: The goal of this question is only to seek code review, guidance for learning, and tips for improvement. I apologize if that was not clear or did not align with the goals of this forum.
Script goal: audit file existence and differences between two Red Hat Enterprise Linux servers.

- Place the information in SQL Server tables to use for analysis
- Maintain the ability to view the diff in a web page

Process steps:

- Set an array of directories to check
- Set an array of file types to look for
- Build a list of existing files for each environment

The script will answer the following:

- If a file exists on both servers, are the copies identical? (md5)
- If not, what is different? (diff)
- Does the file contain hard-coded values? (grep)
- View the diff in a web page (currently the text file is read into the site with PHP)
Code
```
DIR=(/Dir/Durp/DurpaDurp/Scriptdir1 /Dir/Durp/DurpaDurp/Scriptdir2 /Dir/Durp/DurpaDurp/Scriptdir3)
f_Type=("*.sh" "*.txt" "*.log")
f_Type2=("*.img" "*.rpt")

for((i=0; i<${#DIR[@]}; i++))
do
    echo "CHECKING: ${DIR[$i]}"
    cd "${DIR[$i]}"
    for((x=0; x<${#f_Type[@]}; x++))
    do
        echo "FOR FILE TYPE: ${f_Type[$x]}"
        find $PWD -type f -name "${f_Type[$x]}" | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"
    done
    if [[ ${DIR[$i]} == "/Dir/Durp/DurpaDurp/Script3" ]]; then
        for((y=0; y<${#f_Type[@]}; y++))
        do
            echo "FOR FILE TYPE: ${f_Type2[$y]}"
            find $PWD -type f -name "${f_Type2[$y]}" | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"
        done
    fi
done
```
After these files are written to the text files, the script deletes the existing data from a staging table in SQL Server 2008 R2 and inserts the new data.
```
for((i=0; i<${#ENV[@]}; i++))
do
    sqlcmd -S $_DB_CONN -d $_DB -Q "DELETE FROM ['$_DB']..['$_TABLE'] WHERE ENV = '${ENV[$i]}'"
done

bcp $_DB_CONN.."$_TABLE" in "$filelistENV1" -f "$_SCRIPTDIR/STG.fmt" -e $_ERRDIR/ERROR_STG$(date -d "today" +"%Y%m%d%H%M").txt -S $_DB_CONN -d "$_DB"
```
The format file creates two columns:

`AbsoluteFilePath | ENV`
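STG.fmt itself is not shown in the post. As a rough, hypothetical sketch only: a two-column non-XML bcp format file for staged lines shaped like `ENV1|/absolute/path` could be generated along these lines (the version line, field lengths, and collation are placeholder guesses, not values from the original script):

```
# Hypothetical sketch only: STG.fmt is not included in the post. Assumes each
# staged line looks like "ENV1|/absolute/path" and the table's columns are
# AbsoluteFilePath and ENV. Lengths and collation are placeholders.
cat > "$_SCRIPTDIR/STG.fmt" <<'EOF'
10.0
2
1  SQLCHAR  0  10   "|"   2  ENV               SQL_Latin1_General_CP1_CI_AS
2  SQLCHAR  0  512  "\n"  1  AbsoluteFilePath  SQL_Latin1_General_CP1_CI_AS
EOF
```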
Get a list of files from the database to compare:
```
sqlcmd -S $_DB_CONN -d $_DB -s "|" -h-1 -m -1 -W -i $_SCRIPTDIR/SQL/EXPORT_COMPARE.sql -o $_INPUT/comp_list.txt
```
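EXPORT_COMPARE.sql is not shown; its first line is `SET NOCOUNT ON;` (so the row-count messages stay out of the exported list), and it presumably selects the ENV1 paths from the staging table. A hypothetical sketch with placeholder table and column names:

```
# Hypothetical sketch: the real EXPORT_COMPARE.sql is not included in the post.
cat > "$_SCRIPTDIR/SQL/EXPORT_COMPARE.sql" <<'EOF'
SET NOCOUNT ON;
SELECT AbsoluteFilePath   -- placeholder column/table names
FROM STAGING_TABLE
WHERE ENV = 'ENV1';
EOF
```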
Compare the md5sums of the files:
```
for i in $(cat "$_INPUT/comp_list.txt")
do
    export filename=$(basename "$i")
    export path=$(dirname "$i")
    env1_md5sum=$(md5sum "$i")
    env1="${env1_md5sum%% *}"
    export tmpdir=("$_TMPDIR$path")
    if ssh "$_CONN" stat $path'$filename' \> /dev/null 2\>\&1
    then
        env2_md5sum=$(ssh $_CONN "cd $path; find -name '$filename' -exec md5sum {} \;")
        env2_md5="${env2_md5sum%% *}"
        if [[ $env1_md5 == $env2_md5 ]]; then
            echo $filename $path >> "$matchingMD5"
        else
            echo "md5 does not match, getting copy of file"
            echo "$i" >> "$no_matchMD5"
            mkdir -p $tmpdir
            scp $_CONN:$i $tmpdir
        fi
    fi
done
```
Run a diff on the files that do not match:
```
for x in $(cat "$no_matchMD5")
do
    comp_filename=$(basename "$x")
    env2file=(/"$ScriptsDir"/tmp"$x")
    DIFF=$(diff --ignore-all-space --ignore-blank-lines --brief "$x" "$env2file" &>/dev/null)
done
```
Performance pitfalls
Running `ssh` in a loop tends to be slow. Running it twice for every file in a list is probably extremely slow. There's no easy fix for this. You need to rethink how to solve the problem of matching paths between two systems.
Off the top of my head:
- Get the list of paths from both systems, and then try to match those locally. This can be done with one `ssh` call per system: a major improvement.
- For the list of matched paths, get the md5 sums. Again this can be done with one `ssh` call per system (see the sketch after this list).
- Compare the hashes, and build a new list of mismatched files.
- For the final comparison of files, you could fetch them one by one to conserve disk space. If the number of remaining files is expected to be small, then one `scp` call per file might be acceptable. Or if disk space is not an issue, then you could transfer all the files with one call.
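A minimal sketch of the one-call-per-system idea for the checksum step; the host, directory, glob, and output file names below are placeholders rather than the script's variables, and it assumes paths without embedded spaces:

```
# Sketch only: $REMOTE, the directory, the glob, and the file names are placeholders.
find /Dir/Durp -type f -name '*.sh' -exec md5sum {} + | sort -k2 > env1.md5
ssh "$REMOTE" "find /Dir/Durp -type f -name '*.sh' -exec md5sum {} +" | sort -k2 > env2.md5

# Join on the path (field 2) and keep only paths whose checksums differ.
join -j2 env1.md5 env2.md5 | awk '$2 != $3 {print $1}' > mismatched_paths.txt
```

Paths present on only one server drop out of the `join`; `join -v1`/`-v2` would list those, if missing files also need to be reported.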
A much smaller performance issue is running `sed ... | column ...` for each file type for each base directory. You could instead make the loop body output only the output of the multiple `find` calls, and run the `sed ...` pipeline on the entire loop (writing it as `done | sed ...`).
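Applied to the posted outer loop, that change might look like this (a sketch reusing the question's variables; the progress `echo`s are dropped so that only the `find` output enters the pipeline):

```
for ((i = 0; i < ${#DIR[@]}; i++)); do
    for ((x = 0; x < ${#f_Type[@]}; x++)); do
        find "${DIR[$i]}" -type f -name "${f_Type[$x]}"
    done
done | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"   # sed/column run once per script, not per file type
```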
Looping over arrays
Instead of this:
```
for((i=0; i<${#DIR[@]}; i++))
do
    echo "CHECKING: ${DIR[$i]}"
```
When you don't need the array indexes, just the elements, you can iterate like this:
```
for dir in "${DIR[@]}"
do
    echo "CHECKING: $dir"
```
Most of the loops in the posted script can be replaced with this simpler, more intuitive style.
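For example, the inner file-type loop from the question could be written as (a sketch keeping the question's variable names):

```
for pattern in "${f_Type[@]}"
do
    echo "FOR FILE TYPE: $pattern"
    find "$PWD" -type f -name "$pattern" | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"
done
```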
Simple mistakes
- Use `.` instead of `$PWD`
- Double-quote variables used in command arguments: instead of `find $var`, write `find "$var"`
- Don't `export` if you don't need to
- Don't create arrays if you need a simple variable: instead of `tmpdir=("$_TMPDIR$path")`, write `tmpdir="$_TMPDIR$path"`
- Strive for simple writing style: instead of `env2file=(/"$ScriptsDir"/tmp"$x")`, write `env2file="$ScriptsDir/tmp$x"`
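Put together, a slice of the posted md5 loop with those fixes applied might look like this (a sketch, keeping the question's variable names):

```
filename=$(basename "$i")      # no export needed
path=$(dirname "$i")
env1_md5sum=$(md5sum "$i")
env1_md5="${env1_md5sum%% *}"
tmpdir="$_TMPDIR$path"         # plain variable instead of a one-element array

mkdir -p "$tmpdir"             # double-quote variables used as arguments
scp "$_CONN:$i" "$tmpdir"
```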
Comments:

- "Thank you for the feedback, I will clarify and minimize my post." – Maggie, Feb 21, 2019 at 19:07
- "I tried to implement all of your suggestions to the best of my ability. It made a huge difference. Thanks again." – Maggie, Feb 23, 2019 at 12:18
- "@Maggie You followed the suggestions very well, nicely done! You could post your revised code as a new question, and get more reviews for further tips." – janos, Feb 23, 2019 at 12:37
While more work is needed, after implementing some of the suggested improvements the script's execution time dropped from 13 minutes to about 30 seconds. That is a massive improvement. (Only posting this as an answer to show the updated code.)

Most significant changes:

- Performing the md5sum is now done on each server, and the results are then sent to the database.
- Fetching files from env2 for comparison is now done in one sftp connection using a batch file.
```
FINDFILES() {
sqlcmd -S $_DB_CONN -Q "TRUNCATE TABLE [DASHDB]..[STAGING_TBL]"
ssh "$_CONN" "$_SHAREDPATH/scripts/findFiles.sh"
DIR=("/durp/durpdurp/scripts/apps" "/durp/durpdurp/tests/utilities" "/durp/audit/utilities" "/work")
f_Type=("*.sh" "*.sql" "*.log" "*.rpt" "*.php" "*.html")
Patterns="patternabc|anotherpattern123|someserver|somepassword"
for dir in "${DIR[@]}"
do
    for x in "${f_Type[@]}"
    do
        # use "$dir" so the output contains absolute paths (find . would print relative paths)
        find "$dir" -type f -name "$x" -exec md5sum {} + | sed 's/^/Q|/' | column -t >> "$_OUTPUT/md5sum.txt"
        find "$dir" -type f -name "$x" -print0 | xargs -0 grep --extended-regexp --files-with-matches "$Patterns" >> "$_OUTPUT/hardCoded.txt"
    done
done
sed --in-place 's/ /|/g' "$_OUTPUT/md5sum.txt"
# still not sure why quoting $_DB_CONN breaks this line but not sqlcmd
bcp DASHDB.dbo.STAGING_TBL in "$_OUTPUT/md5sum.txt" -S $_DB_CONN -t "|" -c -e "$_ERRDIR/Error$(date -d "today" +"%Y%m%d%H%M").txt"
rm --force "$_OUTPUT/md5sum.txt"
}
FINDFILES
DIFFCHECK(){
#Gets list of files in both servers with md5 that do not match
sqlcmd -S "$_DB_CONN" -s "|" -h-1 -m -1 -W -i "$_SCRIPTDIR/SQL/EXPORT_NOMD5_MATCH.sql" -o "$_OUTPUT/MD5_noMatch_compare.txt"
if rm --recursive --force "${_TMPDIR:?}/"*; then
    echo "$_TMPDIR subfolders removed"
else
    exit 1
fi
#Create SFTP file to fetch all files with one connection
while read -r dir; do
    echo "get $dir $_TMPDIR$dir" >> "$_OUTPUT/batchfile.txt"
    path=$(dirname "$dir")
    mkdir --parents "$_TMPDIR$path"
done < "$_OUTPUT"/MD5_noMatch_compare.txt
#Execute Batchfile
sftp -b "$_OUTPUT/batchfile.txt" "$_CONN"
rm --force "$_OUTPUT/batchfile.txt"
#narrow the list to compare, ignoring whitespace-only diffs
while read -r dir; do
    diff --ignore-all-space --ignore-blank-lines --brief "$dir" "$_TMPDIR$dir"
    result=$?
    if [[ $result -eq 1 ]]; then
        echo "$dir" >> "$_OUTPUT/diff_files.txt"
    fi
done < "$_OUTPUT"/MD5_noMatch_compare.txt
rm --force "$_OUTPUT"/MD5_noMatch_compare.txt
}
DIFFCHECK
endTime=$(date +%s)
runTime=$((endTime - startTime))
echo "Audit Has Ended: $((runTime / 60)) minutes and $((runTime % 60)) seconds have elapsed." >> "$_OUTPUT/findFilesRun.log"
exit 0
```
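The remote findFiles.sh invoked at the top of FINDFILES is not shown in the post. A hypothetical sketch, assuming it mirrors the local loop but tags rows for the second environment and writes to its own output location (the tag and output path are placeholders):

```
#!/bin/bash
# Hypothetical sketch of the remote findFiles.sh; the real file is not shown.
# Assumes it mirrors the local FINDFILES loop with a different environment tag
# ("P" here is a placeholder) and a placeholder output path.
DIR=("/durp/durpdurp/scripts/apps" "/durp/durpdurp/tests/utilities" "/durp/audit/utilities" "/work")
f_Type=("*.sh" "*.sql" "*.log" "*.rpt" "*.php" "*.html")

for dir in "${DIR[@]}"; do
    for pattern in "${f_Type[@]}"; do
        find "$dir" -type f -name "$pattern" -exec md5sum {} + | sed 's/^/P|/' | column -t
    done
done > /tmp/md5sum_env2.txt    # placeholder output location
```

From there, the remote rows could be loaded into the same staging table with another bcp call, or written to the shared path for the local script to pick up, matching the "md5sum is now done on each server, then sent to the DB" change.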