Edit: The goal of this question is only to seek code review, guidance for learning, and tips for improvement. I apologize if that was not clear or did not align with the goals of this forum.
Script goal: audit file existence and differences between two Red Hat Enterprise Linux servers.

- Place the information in SQL Server tables to use for analysis
- Maintain the ability to view the diff in a web page

Process steps:

- Set an array of directories to check
- Set an array of file types to look for
- Build a list of existing files for each environment

The script will answer the following:

- If a file exists on both servers, are the copies identical? (md5)
- If not, what is different? (diff)
- Does the file contain hard-coded values? (grep)
- View the diff in a web page (currently the text file is read into the site with PHP)
Code
```
DIR=(/Dir/Durp/DurpaDurp/Scriptdir1 /Dir/Durp/DurpaDurp/Scriptdir2 /Dir/Durp/DurpaDurp/Scriptdir3)
f_Type=("*.sh" "*.txt" "*.log")
f_Type2=("*.img" "*.rpt")

for((i=0; i<${#DIR[@]}; i++))
do
    echo "CHECKING: ${DIR[$i]}"
    cd "${DIR[$i]}"
    for((x=0; x<${#f_Type[@]}; x++))
    do
        echo "FOR FILE TYPE: ${f_Type[$x]}"
        find $PWD -type f -name "${f_Type[$x]}" | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"
    done
    if [[ ${DIR[$i]} == "/Dir/Durp/DurpaDurp/Script3" ]]; then
        for((y=0; y<${#f_Type[@]}; y++))
        do
            echo "FOR FILE TYPE: ${f_Type2[$y]}"
            find $PWD -type f -name "${f_Type2[$y]}" | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"
        done
    fi
done
```
After these files are written to the text files, the script deletes the existing data from a staging table in SQL Server 2008 R2 and inserts the new data.
```
for((i=0; i<${#ENV[@]}; i++))
do
    sqlcmd -S $_DB_CONN -d $_DB -Q "DELETE FROM ['$_DB']..['$_TABLE'] WHERE ENV = '${ENV[$i]}'"
done

bcp $_DB_CONN.."$_TABLE" in "$filelistENV1" -f "$_SCRIPTDIR/STG.fmt" -e $_ERRDIR/ERROR_STG$(date -d "today" +"%Y%m%d%H%M").txt -S $_DB_CONN -d "$_DB"
```
The format file creates two columns:

`AbsoluteFilePath | ENV`
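STG.fmt itself is not shown in the post. As a rough, hypothetical sketch only: a two-column non-XML bcp format file for staged lines shaped like `ENV1|/absolute/path` could be generated along these lines (the version line, field lengths, and collation are placeholder guesses, not values from the original script):

```
# Hypothetical sketch only: STG.fmt is not included in the post. Assumes each
# staged line looks like "ENV1|/absolute/path" and the table's columns are
# AbsoluteFilePath and ENV. Lengths and collation are placeholders.
cat > "$_SCRIPTDIR/STG.fmt" <<'EOF'
10.0
2
1  SQLCHAR  0  10   "|"   2  ENV               SQL_Latin1_General_CP1_CI_AS
2  SQLCHAR  0  512  "\n"  1  AbsoluteFilePath  SQL_Latin1_General_CP1_CI_AS
EOF
```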
Get a list of files from the database to compare:
```
sqlcmd -S $_DB_CONN -d $_DB -s "|" -h-1 -m -1 -W -i $_SCRIPTDIR/SQL/EXPORT_COMPARE.sql -o $_INPUT/comp_list.txt
```
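EXPORT_COMPARE.sql is not shown; its first line is `SET NOCOUNT ON;` (so the row-count messages stay out of the exported list), and it presumably selects the ENV1 paths from the staging table. A hypothetical sketch with placeholder table and column names:

```
# Hypothetical sketch: the real EXPORT_COMPARE.sql is not included in the post.
cat > "$_SCRIPTDIR/SQL/EXPORT_COMPARE.sql" <<'EOF'
SET NOCOUNT ON;
SELECT AbsoluteFilePath   -- placeholder column/table names
FROM STAGING_TABLE
WHERE ENV = 'ENV1';
EOF
```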
Compare the md5sums of the files:
```
for i in $(cat "$_INPUT/comp_list.txt")
do
    export filename=$(basename "$i")
    export path=$(dirname "$i")
    env1_md5sum=$(md5sum "$i")
    env1="${env1_md5sum%% *}"
    export tmpdir=("$_TMPDIR$path")
    if ssh "$_CONN" stat $path'$filename' \> /dev/null 2\>\&1
    then
        env2_md5sum=$(ssh $_CONN "cd $path; find -name '$filename' -exec md5sum {} \;")
        env2_md5="${env2_md5sum%% *}"
        if [[ $env1_md5 == $env2_md5 ]]; then
            echo $filename $path >> "$matchingMD5"
        else
            echo "md5 does not match, getting copy of file"
            echo "$i" >> "$no_matchMD5"
            mkdir -p $tmpdir
            scp $_CONN:$i $tmpdir
        fi
    fi
done
```
Run a diff on the files that do not match:
```
for x in $(cat "$no_matchMD5")
do
    comp_filename=$(basename "$x")
    env2file=(/"$ScriptsDir"/tmp"$x")
    DIFF=$(diff --ignore-all-space --ignore-blank-lines --brief "$x" "$env2file" &>/dev/null)
done
```
Performance pitfalls
Running `ssh` in a loop tends to be slow. Running it twice for every file in a list is probably extremely slow. There's no easy fix for this. You need to rethink how to solve the problem of matching paths between two systems.
Off the top of my head:
- Get the list of paths from both systems, and then try to match those locally. This can be done with one `ssh` call per system: a major improvement.
- For the list of matched paths, get the md5 sums. Again this can be done with one `ssh` call per system (see the sketch after this list).
- Compare the hashes, and build a new list of mismatched files.
- For the final comparison of files, you could fetch them one by one to conserve disk space. If the number of remaining files is expected to be small, then one `scp` call per file might be acceptable. Or if disk space is not an issue, then you could transfer all the files with one call.
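A minimal sketch of the one-call-per-system idea for the checksum step; the host, directory, glob, and output file names below are placeholders rather than the script's variables, and it assumes paths without embedded spaces:

```
# Sketch only: $REMOTE, the directory, the glob, and the file names are placeholders.
find /Dir/Durp -type f -name '*.sh' -exec md5sum {} + | sort -k2 > env1.md5
ssh "$REMOTE" "find /Dir/Durp -type f -name '*.sh' -exec md5sum {} +" | sort -k2 > env2.md5

# Join on the path (field 2) and keep only paths whose checksums differ.
join -j2 env1.md5 env2.md5 | awk '$2 != $3 {print $1}' > mismatched_paths.txt
```

Paths present on only one server drop out of the `join`; `join -v1`/`-v2` would list those, if missing files also need to be reported.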
A much smaller performance issue is running `sed ... | column ...` for each file type for each base directory. You could instead make the loop body output only the output of the multiple `find` calls, and run the `sed ...` pipeline on the entire loop (writing it as `done | sed ...`).
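Applied to the posted outer loop, that change might look like this (a sketch reusing the question's variables; the progress `echo`s are dropped so that only the `find` output enters the pipeline):

```
for ((i = 0; i < ${#DIR[@]}; i++)); do
    for ((x = 0; x < ${#f_Type[@]}; x++)); do
        find "${DIR[$i]}" -type f -name "${f_Type[$x]}"
    done
done | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"   # sed/column run once per script, not per file type
```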
Looping over arrays
Instead of this:
```
for((i=0; i<${#DIR[@]}; i++))
do
    echo "CHECKING: ${DIR[$i]}"
```
When you don't need the array indexes, just the elements, you can iterate like this:
```
for dir in "${DIR[@]}"
do
    echo "CHECKING: $dir"
```
Most of the loops in the posted script can be replaced with this simpler, more intuitive style.
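For example, the inner file-type loop from the question could be written as (a sketch keeping the question's variable names):

```
for pattern in "${f_Type[@]}"
do
    echo "FOR FILE TYPE: $pattern"
    find "$PWD" -type f -name "$pattern" | sed 's/^/ENV1|/' | column -t >> "$filelistENV1"
done
```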
Simple mistakes
- Use `.` instead of `$PWD`
- Double-quote variables used in command arguments: instead of `find $var`, write `find "$var"`
- Don't `export` if you don't need to
- Don't create arrays if you need a simple variable: instead of `tmpdir=("$_TMPDIR$path")`, write `tmpdir="$_TMPDIR$path"`
- Strive for simple writing style: instead of `env2file=(/"$ScriptsDir"/tmp"$x")`, write `env2file="$ScriptsDir/tmp$x"`
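Put together, a slice of the posted md5 loop with those fixes applied might look like this (a sketch, keeping the question's variable names):

```
filename=$(basename "$i")      # no export needed
path=$(dirname "$i")
env1_md5sum=$(md5sum "$i")
env1_md5="${env1_md5sum%% *}"
tmpdir="$_TMPDIR$path"         # plain variable instead of a one-element array

mkdir -p "$tmpdir"             # double-quote variables used as arguments
scp "$_CONN:$i" "$tmpdir"
```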
Comments:

- "Thank you for the feedback, I will clarify and minimize my post." – Maggie, Feb 21, 2019 at 19:07
- "I tried to implement all of your suggestions to the best of my ability. It made a huge difference. Thanks again." – Maggie, Feb 23, 2019 at 12:18
- "@Maggie You followed the suggestions very well, nicely done! You could post your revised code as a new question, and get more reviews for further tips." – janos, Feb 23, 2019 at 12:37
While more work is needed, after implementing some of the suggested improvements the script's execution time dropped from 13 minutes to about 30 seconds. That is a massive improvement. (Only posting this as an answer to show the updated code.)

Most significant changes:

- Performing the md5sum is now done on each server, and the results are then sent to the database.
- Fetching files from env2 for comparison is now done in one sftp connection using a batch file.
```
FINDFILES() {
sqlcmd -S $_DB_CONN -Q "TRUNCATE TABLE [DASHDB]..[STAGING_TBL]"
ssh "$_CONN" "$_SHAREDPATH/scripts/findFiles.sh"
DIR=("/durp/durpdurp/scripts/apps" "/durp/durpdurp/tests/utilities" "/durp/audit/utilities" "/work")
f_Type=("*.sh" "*.sql" "*.log" "*.rpt" "*.php" "*.html")
Patterns="patternabc|anotherpattern123|someserver|somepassword"
for dir in "${DIR[@]}"
do
    for x in "${f_Type[@]}"
    do
        # use "$dir" so the output contains absolute paths (find . would print relative paths)
        find "$dir" -type f -name "$x" -exec md5sum {} + | sed 's/^/Q|/' | column -t >> "$_OUTPUT/md5sum.txt"
        find "$dir" -type f -name "$x" -print0 | xargs -0 grep --extended-regexp --files-with-matches "$Patterns" >> "$_OUTPUT/hardCoded.txt"
    done
done
sed --in-place 's/ /|/g' "$_OUTPUT/md5sum.txt"
# still not sure why quoting $_DB_CONN breaks this line but not sqlcmd
bcp DASHDB.dbo.STAGING_TBL in "$_OUTPUT/md5sum.txt" -S $_DB_CONN -t "|" -c -e "$_ERRDIR/Error$(date -d "today" +"%Y%m%d%H%M").txt"
rm --force "$_OUTPUT/md5sum.txt"
}
FINDFILES
DIFFCHECK(){
#Gets list of files in both servers with md5 that do not match
sqlcmd -S "$_DB_CONN" -s "|" -h-1 -m -1 -W -i "$_SCRIPTDIR/SQL/EXPORT_NOMD5_MATCH.sql" -o "$_OUTPUT/MD5_noMatch_compare.txt"
if rm --recursive --force "${_TMPDIR:?}/"*; then
    echo "$_TMPDIR subfolders removed"
else
    exit 1
fi
#Create SFTP file to fetch all files with one connection
while read -r dir; do
    echo "get $dir $_TMPDIR$dir" >> "$_OUTPUT/batchfile.txt"
    path=$(dirname "$dir")
    mkdir --parents "$_TMPDIR$path"
done < "$_OUTPUT"/MD5_noMatch_compare.txt
#Execute Batchfile
sftp -b "$_OUTPUT/batchfile.txt" "$_CONN"
rm --force "$_OUTPUT/batchfile.txt"
#narrow the list to compare, ignoring whitespace-only diffs
while read -r dir; do
    diff --ignore-all-space --ignore-blank-lines --brief "$dir" "$_TMPDIR$dir"
    result=$?
    if [[ $result -eq 1 ]]; then
        echo "$dir" >> "$_OUTPUT/diff_files.txt"
    fi
done < "$_OUTPUT"/MD5_noMatch_compare.txt
rm --force "$_OUTPUT"/MD5_noMatch_compare.txt
}
DIFFCHECK
endTime=$(date +%s)
runTime=$((endTime - startTime))
echo "Audit Has Ended: $((runTime / 60)) minutes and $((runTime % 60)) seconds have elapsed." >> "$_OUTPUT/findFilesRun.log"
exit 0
```
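The remote findFiles.sh invoked at the top of FINDFILES is not shown in the post. A hypothetical sketch, assuming it mirrors the local loop but tags rows for the second environment and writes to its own output location (the tag and output path are placeholders):

```
#!/bin/bash
# Hypothetical sketch of the remote findFiles.sh; the real file is not shown.
# Assumes it mirrors the local FINDFILES loop with a different environment tag
# ("P" here is a placeholder) and a placeholder output path.
DIR=("/durp/durpdurp/scripts/apps" "/durp/durpdurp/tests/utilities" "/durp/audit/utilities" "/work")
f_Type=("*.sh" "*.sql" "*.log" "*.rpt" "*.php" "*.html")

for dir in "${DIR[@]}"; do
    for pattern in "${f_Type[@]}"; do
        find "$dir" -type f -name "$pattern" -exec md5sum {} + | sed 's/^/P|/' | column -t
    done
done > /tmp/md5sum_env2.txt    # placeholder output location
```

From there, the remote rows could be loaded into the same staging table with another bcp call, or written to the shared path for the local script to pick up, matching the "md5sum is now done on each server, then sent to the DB" change.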