1
\$\begingroup\$

I have a script which is having too many modules. The script is working perfectly, but the only issue is it is taking too much time and I need to reduce the complexity of the script.

I have a source file (it can have multiple rows of data, each column separated by a '|'). The script should check all the rows for all the columns and find the columns which is having incorrect data (e.g., null in a non-null column, space in a not space column, alphabets in a number only column, and any value which is not a valid value for that column).

Now I have a master file, which has those column names and the position of the occurrence of that column in the source file with some indicators which will decide what all we have to perform on that column.

Example of master file:

MBR_SRC_SYS,15,H,N,N,N,1,FAC|NSC|WGS|ACE|CS9|CHP
CDH_AMT,27,H,1,1,1,N,N

There are 15 columns in my source file will be MBR_SRC_SYS and 'H' Indicates I have to perform a hard error check. Later is the indicator for the other checks like not null, not space, number format, and valid value and the last contains the valid values this particular column can have.

Format of master file:

Clm Name,Position,HardErrorCheck,NullCheck,SpaceCheck,NumberCheck,ValidValueCheck,ValidValues

Example of source file:

DTL|CLM| |RMB_CLM_2015_V01|RMBFCSNSC|15135NSC|rmb_fcsdmsclm_n.dat|rmb_fcsdmsclm_n.trg|NSC
2015093QA01109920150514 4 3132PD|WFS|2015093QA011099|2015年05月14日
04:31:32|131|99|NAP||27002|6MB09| | |2014年10月04日
00:00:00|2015年05月15日 00:00:00|2015年05月15日 00:00:00|2015年05月15日
00:00:00|2015年05月15日
 00:00:00|97.26|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0|0|WIFE
|A|SMITH |603T40775 |270022222 
|603T40775 |20 |F|1966年12月02日
00:00:00|SUB|MD|SN|G2001|NTINCR|MFFFF| |91|91| |NSC|NAIIH11X|H| | | |
| | | | | | | | |2015年05月15日 00:00:00| | | | | | | | | | | | | |
|W|D|N|P|PF|OP|N|CHK|0001025000| 7021693440|351159676 |Y| |I~IN~
~ ~ ~ | ~ ~ | | | | | | | | |0.00|0.00|0.00| | | | | | | | | | | | | |
| | |

What I am doing is having 2 while loops. With one loop I am reading the lines of the source file. Now inside this while loop after reading I am having another while loop to read the master file and do all the necessary checks.

Suppose my source file is having 500 records with 100 columns. Total loop count will be 500*100 = 50000 iterations. This is too much, because my source file can have as many as 5 million records.

I am new to Unix, so I am not sure, but scripting languages like PHP have a timeout and will definitely not be able to run this long. So what should be the solution for this?

#Log file creation
#exec 1> $CODE/WCC_FOA_RMBHUB/logs/RMB_HARD_ERROR_CHECK_ROWMAJOR_$(date +"%Y%m%d_%H%M%S").log 2>&1
SrcFilePath=$PMDIR/SrcFiles/WCC_FOA_RMBHUB
ListFilePath=$CODE/WCC_FOA_RMBHUB/lists
SrcFileName=1ドル
HardErrorFile=HARD_ERR_RMB_TEST_FILE.dat
MasterFile=2ドル
NullCheck()
{
value="1ドル"
if [ ! -n "$value" ] ; then
 return 10
else
 return 1
fi
}
SpaceCheck()
{
value="1ドル"
NullCheck "$value"
t=$?
if [ "$t" != "10" ]; then
value=$(echo $value)
 if [ ! -z "$value" -a "$value" != " " ]; then
 return 1
 else
 return 10
 fi
else
 return 1
fi
}
NumberCheck()
{
value="1ドル"
if [ $value -eq $value 2>/dev/null ]; then
 return 1
else
 return 10
fi
}
ValidCheck()
{
value="1ドル"
# Replace space with 'SPACE'
SpaceCheck "$value"
t1=$?
if [ "$t1" == "10" ]; then
 value="SPACE"
fi
ValidFromMapping="2ドル"
needle="|"
NumberOfDelimeter=$(echo $ValidFromMapping | awk 'BEGIN{FS="|"} {print NF}')
v=1
vf=0
while [ $NumberOfDelimeter -gt 0 ]
do
 #echo YES
 ValidValue=$(echo "$ValidFromMapping" | cut -d'|' -f$v)
 #echo "Checking with the value $ValidValue."
 if [ "$ValidValue" == "$value" ]; then
 vf=1 
 fi
 v=$(expr $v + 1)
 NumberOfDelimeter=$(expr $NumberOfDelimeter - 1)
done
if [ $vf -eq 1 ]; then
 #echo "Match Found"
 return 1
else
 #echo "No Match Found"
 return 10
fi
}
SoftErrThreshHold=0
 #echo "Errrr... We got a hard error check column. We have to check $ColumnName in the whole god dam source table. This column is available in column $ColNum .We got to do it brah !!!\n"
 while read RowOne
 do
 HardErrorFlag=1
 SoftErrorFlag=1
 RecordColumn=$(echo "$RowOne" | cut -d'|' -f1)
 #echo "Started New Row"
 if [ "$RecordColumn" == "DTL" ]; then
 KeyColumn=$(echo "$RowOne" | cut -d'|' -f9)
 ErrorText="$KeyColumn |HARD ERROR OCCURRED IN THE FOLLOWING COLUMN(S):"
 SoftErrorText="$KeyColumn |SOFT ERROR OCCURRED IN THE FOLLOWING COLUMN(S):"
 while read line
 do
 ColumnName=`echo $line | cut -d',' -f1`
 ColNum=`echo $line | cut -d',' -f2`
 ErrorCheck=`echo $line | cut -d',' -f3`
 NullCheckStatus=`echo $line | cut -d',' -f4`
 SpaceCheckStatus=`echo $line | cut -d',' -f5`
 NumberCheckStatus=`echo $line | cut -d',' -f6`
 ValidCheckStatus=`echo $line | cut -d',' -f7`
 ValidValues=`echo $line | cut -d',' -f8`
 # Hard Error Check - 
 if [ "$ErrorCheck" == "H" ]; then
 #echo "Doing Hard Error"
 IsNull=1;
 IsNumber=1;
 IsSpace=1;
 IsValid=1;
 HardErrorCheckColumnValue=$(echo "$RowOne" | cut -d'|' -f$ColNum)
 #echo "Value is $HardErrorCheckColumnValue."
 #Funtions Return 10 if it is NULL/SPACE/NOT A NUMBER
 #echo $NullCheckStatus
 if [ "$NullCheckStatus" == "1" ]; then 
 NullCheck "$HardErrorCheckColumnValue"
 IsNull=$?
 fi
 #echo $NumberCheckStatus
 if [ "$NumberCheckStatus" == "1" ]; then 
 NumberCheck "$HardErrorCheckColumnValue"
 IsNumber=$?
 NullCheck "$HardErrorCheckColumnValue"
 IsNull=$?
 SpaceCheck "$HardErrorCheckColumnValue"
 IsSpace=$?
 fi
 #echo $SpaceCheckStatus
 if [ "$SpaceCheckStatus" == "1" ]; then 
 SpaceCheck "$HardErrorCheckColumnValue"
 IsSpace=$?
 fi
 #echo $ValidCheckStatus
 if [ "$ValidCheckStatus" == "1" ]; then
 ValidCheck "$HardErrorCheckColumnValue" "$ValidValues"
 IsValid=$? 
 fi
 if [ $IsNull -eq 10 ] || [ $IsNumber -eq 10 ] || [ $IsSpace -eq 10 ] || [ $IsValid -eq 10 ]; then
 HardErrorFlag=10
 if [ $IsNull -eq 10 ]; then
 ErrorText=$ErrorText" $ColumnName is Null or Invalid; "
 fi
 if [ $IsNumber -eq 10 ]; then
 ErrorText=$ErrorText" $ColumnName is not a valid number; "
 fi
 if [ $IsSpace -eq 10 ]; then
 ErrorText=$ErrorText" $ColumnName is Space or Invalid; "
 fi
 if [ $IsValid -eq 10 ]; then
 ErrorText=$ErrorText" $ColumnName is not a valid value; "
 fi
 fi 
 fi
 # Soft Error Check -
 if [ "$ErrorCheck" == "S" ]; then
 #echo "Doing Soft Error"
 IsSValid=1
 IsSNull=1;
 IsSNumber=1;
 IsSSpace=1;
 SoftErrorCheckColumnValue=$(echo "$RowOne" | cut -d'|' -f$ColNum)
 #echo "Value is $SoftErrorCheckColumnValue."
 #Funtions Return 10 if it is NULL/SPACE/NOT A NUMBER
 #echo $NullCheckStatus
 if [ "$NullCheckStatus" == "1" ]; then 
 NullCheck "$SoftErrorCheckColumnValue"
 IsSNull=$?
 fi
 #echo $NumberCheckStatus
 if [ "$NumberCheckStatus" == "1" ]; then 
 NumberCheck "$SoftErrorCheckColumnValue"
 IsSNumber=$?
 NullCheck "$SoftErrorCheckColumnValue"
 IsSNull=$?
 SpaceCheck "$SoftErrorCheckColumnValue"
 IsSSpace=$?
 fi
 #echo $SpaceCheckStatus
 if [ "$SpaceCheckStatus" == "1" ]; then 
 SpaceCheck "$SoftErrorCheckColumnValue"
 IsSpace=$?
 fi
 #echo $ValidCheckStatus
 if [ "$ValidCheckStatus" == "1" ]; then
 ValidCheck "$SoftErrorCheckColumnValue" "$ValidValues"
 IsSValid=$?
 fi
 if [ $IsSNull -eq 10 ] || [ $IsSNumber -eq 10 ] || [ $IsSSpace -eq 10 ] || [ $IsSValid -eq 10 ]; then
 SoftErrorFlag=10
 if [ $IsSNull -eq 10 ]; then
 SoftErrorText=$SoftErrorText" $ColumnName is Null or Invalid; "
 fi
 if [ $IsSNumber -eq 10 ]; then
 SoftErrorText=$SoftErrorText" $ColumnName is not a valid number; "
 fi
 if [ $IsSSpace -eq 10 ]; then
 SoftErrorText=$SoftErrorText" $ColumnName is Space or Invalid; "
 fi
 if [ $IsSValid -eq 10 ]; then
 SoftErrorText=$SoftErrorText" $ColumnName is not a valid value; "
 fi
 SoftErrThreshHold=$(expr $SoftErrThreshHold + 1)
 fi
 fi
 done <$ListFilePath/$MasterFile
 if [ $HardErrorFlag -eq 10 ]; then
 echo $ErrorText
 fi
 if [ $SoftErrorFlag -eq 10 ]; then
 echo $SoftErrorText
 fi
 fi
 done <$SrcFilePath/$SrcFileName
if [ $SoftErrThreshHold -eq 0 ]; then
 echo "No Soft Error was found in the file"
else
 echo "Total Count Of SofErrors Found : $SoftErrThreshHold"
fi
janos
113k15 gold badges154 silver badges396 bronze badges
asked May 25, 2015 at 13:28
\$\endgroup\$
2
  • \$\begingroup\$ potentially unpopular opinion: Any business logic in bash will 'feel' complicated and unmaintainable. Why not use a different programming language? For example: many languages have CSV modules, so you'll no longer have to do your own parsing. \$\endgroup\$ Commented May 25, 2015 at 13:59
  • \$\begingroup\$ Even i know it can be done easily with many other languages. But this is how my requirement is and i have to do it on UNIX only. \$\endgroup\$ Commented May 25, 2015 at 14:42

1 Answer 1

2
\$\begingroup\$

This is extremely overcomplicated Bash code, violating good UNIX principles in terms of using exit codes.

Get rid of NullCheck

This function is literally causing more harm than help. Here's a code using NullCheck:

NullCheck "$value"
t=$?
if [ "$t" != "10" ]; then

This code, without NullCheck, is actually exactly the same thing:

if [ "$value" ]; then

There are so many related problems here:

  • The convention is 0 exit code for success, and 1 is the most common exit code for failure. The NullCheck method using 1 and 10, respectively, for these purposes is very confusing. All users of NullCheck must remember this, and effectively memorize a poor design choice that goes against good UNIX practices

  • The implementation of NullCheck is 12 lines of messy code. The result of the function is not usable directly, users much check the value of $?, which is extra hassle, and typically at least an extra line of code, in contrast with a direct "$value" check as above

To get rid of NullCheck, you would need to rewrite all uses. The result will be simpler and more intuitive, but still a lot of tedious work. For example this kind of code:

NullCheck "$SoftErrorCheckColumnValue"
IsSNull=$?
# ...
if [ $IsSNull -eq 10 ] 

... could be rewritten as:

IsSNonNull="$SoftErrorCheckColumnValue"
# ...
if [ ! "$IsSNonNull" ] 

Notice how one line is saved, and no more magic number 10. However, the logic had to be reversed, from thinking in terms of IsSNull to IsSNonNull.

Cleaning up SpaceCheck

Here's a simpler implementation of this function:

SpaceCheck() {
 test $(echo 1ドル)
}

This exits with success (code 0) if the parameter contains non-whitespace characters, otherwise it exits with failure (code 1).

Instead of using the original SpaceCheck function like this:

SpaceCheck "$value"
t1=$?
if [ "$t1" == "10" ]; then
 value="SPACE"
fi

You could use like this:

if ! SpaceCheck "$value"; then
 value=SPACE
fi

Notice the ! there: success of SpaceCheck means the parameter contained non-space characters.

This usage of ! would not be possible with the original exit codes of 1 and 10. Non-0 exit codes all mean failure, so ! SpaceCheck with the old implementation is always false.

Improving performance

Bash scripts are not meant to do complicated things. Complicated Bash scripts are significantly harder to write well than programs written in higher level languages. To keep complexity down, it's absolutely crucial to write Bash scripts well, and this script would take a lot of work to get there.

My suggestions above should give you some good ideas that you can apply to other parts of the script as well. Before thinking about performance, I'd rather you rewrite the script by applying these ideas, and post the revised version as a new question.

That being said, Bash is the wrong tool for this job. To @Gerard's suggestion in a comment, you responded:

this is how my requirement is and i have to do it on UNIX only

However, at the very least, you can use awk, since your script is already using it, and it would be much better. awk is perfectly well-suited for processing | delimited files.

So my bottom-line suggestion is to abandon this script entirely. My usual recommendation for complex scripts that push Bash' limits is Python. If that's not an option for you, then try awk or perl instead.

answered May 25, 2015 at 15:14
\$\endgroup\$
4
  • \$\begingroup\$ Your SpaceCheck implementation does not seem to be robust. For example, what if 1ドル is a = b or 1 -lt 0? \$\endgroup\$ Commented Jun 17, 2015 at 10:51
  • \$\begingroup\$ I think my implementation behaves as the original, nothing more, nothing less. Do you disagree? \$\endgroup\$ Commented Jun 17, 2015 at 12:01
  • \$\begingroup\$ If so, the original was wrong too. Something like echo -n "1ドル" | grep -q [^[:space:]] is needed. \$\endgroup\$ Commented Jun 17, 2015 at 12:59
  • \$\begingroup\$ Given the implementation, I assumed the intention was to check that the parameter has something other than whitespace, and I guessed that the function is poorly named. I admittedly didn't try to see beyond for the ultimate purpose. Since the posted code is assumed to be fully working, I assumed that implementation == intention, and chose to disregard naming. \$\endgroup\$ Commented Jun 17, 2015 at 13:26

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.