I have a script that I need to optimize to reduce its execution time: currently, processing 6000 lines takes 1 minute and 38 seconds. I would like it to take much less, but I don't know how to optimize it any further. Here is the script:
#! /bin/bash
function charge_files ()
{
database="test.db"
file="Prueba"
file2="def"
schema="esquema" #schema sqlite3
XML="Prueba.xml" #telegramas.xml
tempschema="tempschema"
if [ -f $XML ]; then
echo "============================="
echo "| XML CHARGED |"
echo "============================="
else
echo "============================="
echo "| XML NOT CHARGED |"
echo "============================="
exit 1
fi
}
function extract ()
{
host=''
i=0
while IFS= read -r line; do
# Find out if it is a RecordStart line
if [ "$(echo "$line" | grep -c "RecordStart")" -eq 1 ]
then
# If host == '', then it is the first host we see.
# Otherwise, we are changing host, so print an empty line
if [ "$host" != '' ]
then
echo ""
fi
# Collect the ConnectionName information
connectioname=$(echo "$line" | awk '{print 5ドル}' | cut -d'=' -f2)
host=$connectioname # record that we have seen at least one host block
# Collect the ConnectorType information
connectortype=$(echo "$line" | awk '{print 7ドル}' | cut -d";" -f2 | cut -d"=" -f2)
# Done with this loop in the while, move on to the next
continue
fi
# Find if it is a Telegram line
if [ "$(echo "$line" | grep -c "Telegram")" -eq 1 ]
then
# Collect the Timestamp information
timestamp=$(echo "$line" | awk '{print 2ドル}' | cut -d"." -f1 | cut -d"=" -f2)
# Collect the service information
service=$(echo "$line" | awk '{print 3ドル}' | cut -d"=" -f2)
# Collect the FrameFormat information
frameformat=$(echo "$line" | awk '{print 4ドル}' | cut -d"=" -f2)
# Collect the RawData information
RawDatahost=$(echo "$line" | awk '{print 5ドル}' | cut -c 36-39)
# Collect the second RawData information
RawDatahost3=$(echo "$line" | awk '{print 5ドル}' | cut -c 50-53)
# Print the information
i=$((i + 1))
echo "$connectioname $connectortype $timestamp $service $frameformat $((16#$RawDatahost)) $((16#$RawDatahost3))" >> $file
# Done with this loop in the while, move on to the next
continue
fi
done <$XML
}
function clean() {
#Clean the file
cat $file | tr -d '"' | tr -s " ">> $file2
cat $file2 | tr ' ' ',' >> definitivo
}
function create_schema(){
if [ -f "$schema" ]; then
echo "============================="
echo "| LOADED SCHEMA |"
echo "============================="
else
echo 'CREATE TABLE test (
KKID INTEGER PRIMARY KEY,
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL DEFAULT (strftime("%Y-%m-%d %H:%M:%S")),
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);' >> $schema
fi
if [ -f "$tempschema" ]; then
echo "============================="
echo "| LOADED TEMPSCHEMA |"
echo "============================="
else
echo 'create table temp (
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL DEFAULT (strftime("%Y-%m-%d %H:%M:%S")),
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);
.separator ","
.import ./definitivo temp
.exit' >> $tempschema
fi
}
function upload() {
#upload the schema to sqlite3 database
echo "$(sqlite3 "$database" < "$schema")"
#Create a temp table with the script
echo "$(sqlite3 $database < $tempschema)"
#upload the csv to a temp table
echo -e ".separator ","\\n.import definitivo temp" | sqlite3 $database
#make an insert from temp into the main table to get the autoincrement attribute
echo "$(sqlite3 $database "insert into test (conection, ip, time, service, frameformat, id_dispositivo, id_valor) SELECT * FROM temp;")"
#delete the temp table
echo "$(sqlite3 $database "drop table if exists temp;")"
#remove duplicated fields
echo "$(sqlite3 $database "DELETE FROM test WHERE oid NOT IN (SELECT min(oid) FROM test GROUP BY conection, ip, time, service, frameformat, id_dispositivo, id_valor)")"
rm definitivo
rm "$file"
rm "$file2"
}
charge_files
extract
clean
create_schema
upload
For those reading the script: it converts an XML file to a very simple CSV format, removes the quotes, and leaves the data clean so that it can later be loaded into an SQLite database. The schema is embedded in the same code to keep things even simpler, and the script creates the files it needs for its execution.
I now attach the XML format. Bear in mind that the XML can reach more than 7000 lines, but here is an example:
<CommunicationLog xmlns="http:telegrams">
<RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<RecordStop Timestamp="" />
<RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<RecordStop Timestamp="" />
</CommunicationLog>
- Have you profiled your shell script in order to know what to optimize? (ceving, Nov 20, 2017)
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. (Mast ♦, Nov 22, 2017)
2 Answers
Doing I/O or spawning subshells (or doing just about anything) inside a while loop will be slow
By I/O I mean this line with the redirect and append:
echo "$connectioname $connectortype $timestamp $service $frameformat \
$((16#$RawDatahost)) $((16#$RawDatahost3))" >> $file
Spawning a lot of subshells for the pipelines would also really slow down your code, as ceving's answer has shown. If you must set up a pipeline, I suggest you consider the faster, Bash-only alternative: process substitution.
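To illustrate the difference, here is a minimal sketch (the data and variable names are made up): a pipeline into while runs the loop in a subshell, so variables set inside it are lost, while process substitution keeps the loop in the current shell.

```shell
#!/bin/bash
# Sketch: a while-read loop fed by process substitution runs in the current
# shell, so variables modified inside the loop keep their values afterwards.
count=0
while IFS= read -r line; do
 count=$((count + 1))
done < <(printf 'a\nb\nc\n') # process substitution, no pipeline subshell
echo "$count" # prints 3; with "printf ... | while ..." it would print 0
```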
So I turned to an awk magic one-liner that does both the extract and clean functions at once:
grep -e 'RecordStart' -e 'Telegram' "$XML" | \
awk -F '"' '{ if( NF > 10 ) { conname = 8ドル; contype = 12ドル } else \
printf("%s,%s,%s,%s,%s,%x,%x\n", \
conname, contype, 2,ドル 4,ドル 6,ドル substr(8,ドル 36, 4), substr(8,ドル 50, 4)\
); }' >> definitivo
I don't think I need to explain the grep part. -F sets the field delimiter to " (the double quote). If NF, the number of fields, is greater than 10, we're parsing the RecordStart tag, so we save the values for ConnectionName and ConnectorType. If NF is not greater than 10, we're parsing the Telegram tag, so we print the needed columns, including the two previously saved values, displaying the last two columns in hex format and the rest as strings.
This will be the single change that makes your program fly in the high-speed lane. That's pretty much it. No command substitutions, almost no pipelines, not even Bash builtins (which are still slow), and most importantly, no while read loop.
Minor notes:
- The traditional way to define a function foo is to write it as foo(). These days, people write function foo without the parentheses. There are minor differences between the two forms, but whichever you choose, don't write function foo().
- Prefer single quotes over double quotes unless you need the expansion.
- For cleanup actions, it's best to write them in a trap so they will always be executed unless a SIGKILL is received.
- [[ ]] is faster and safer than [ ], since it takes care of most of the quoting inside it.
- I used a here-document to write the database schemas. There's a variation, <<-, which allows you to indent the input lines and the line containing the delimiter with tabs. I opted not to use it because of whitespace issues when copying and pasting.
- In your upload function, I'm not sure what there is to echo, so I removed the echos in the revised code below.
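The trap note above can be sketched like this (the file contents are just for illustration):

```shell
#!/bin/bash
# Sketch of trap-based cleanup: the temporary file is removed on any exit
# path (normal exit, errors, most signals), not only at the end of a
# successful run. SIGKILL cannot be trapped, so it is the one exception.
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT
echo 'intermediate data' > "$tmpfile"
# ... work with "$tmpfile" ...
exit 0 # no explicit rm needed; the EXIT trap removes the file
```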
Code
#! /bin/bash
database='test.db'
schema='esquema'
XML='Prueba.xml'
tempschema='tempschema'
trap 'rm -f definitivo' EXIT
function charge_files
{
if [[ -f $XML ]]
then
echo '============================='
echo '| XML CHARGED |'
echo '============================='
else
echo '============================='
echo '| XML NOT CHARGED |'
echo '============================='
exit 1
fi
}
function extract
{
grep -e 'RecordStart' -e 'Telegram' "$XML" | \
awk -F '"' '{ if( NF > 10 ) { conname = 8ドル; contype = 12ドル } else \
printf("%s,%s,%s,%s,%s,%x,%x\n", \
conname, contype, 2,ドル 4,ドル 6,ドル substr(8,ドル 36, 4), substr(8,ドル 50, 4)\
); }' >> definitivo
}
function create_schema
{
if [[ -f $schema ]]
then
echo '============================='
echo '| LOADED SCHEMA |'
echo '============================='
else
cat >> "$schema" <<'EOF'
CREATE TABLE test (
KKID INTEGER PRIMARY KEY,
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL,
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);
EOF
fi
if [[ -f $tempschema ]]
then
echo '============================='
echo '| LOADED TEMPSCHEMA |'
echo '============================='
else
cat >> "$tempschema" <<'EOF'
CREATE TABLE temp (
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL,
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);
.separator ","
.import ./definitivo temp
.exit
EOF
fi
}
function upload
{
# Upload the schema to sqlite3 database
sqlite3 "$database" < "$schema"
# Create a temp table with the script
sqlite3 "$database" < "$tempschema"
# Upload the csv to a temp table
sqlite3 "$database" < <(printf '.separator ","\n.import definitivo temp\n')
# Make an insert from the temp to the database
# to get the attribute autoincrement
sqlite3 "$database" "INSERT INTO test (conection, ip, time, service, \
frameformat, id_dispositivo, id_valor) SELECT * FROM temp;"
# Delete the table temp
sqlite3 "$database" 'DROP TABLE IF EXISTS temp;'
# Remove duplicate fields
sqlite3 "$database" "DELETE FROM test WHERE oid NOT IN (\
SELECT MIN(oid) FROM test GROUP BY \
conection, ip, time, service, \
frameformat, id_dispositivo, id_valor);"
}
charge_files
extract
create_schema
upload
- I changed the line printf("%s,%s,%s,%s,%s,%x,%x\n", \ to printf("%s,%s,%s,%s,%s,%d,%d\n", \ because I need the values in decimal. (Nicole, Nov 22, 2017)
- @Nicole Ah yes, that $((16#$RawDatahost)) thing is new to me. How is the running time now? There's still more optimization you could do with grep, but I don't think it matters at this point. (Gao, Nov 22, 2017)
- The time has been reduced from 1m to 38 seconds, not bad. Now I just need to add more things to the code, because they are asking me for even more, but thanks for your help @Gao. (Nicole, Nov 23, 2017)
This is one of the most inefficient ways to do a sub-string match:
"$(echo "$line" | grep -c "RecordStart")" -eq 1
See for yourself.
This creates a small test file:
for ((i=0;i<999;i++)); do echo RecordStart; done > data
I took just your while loop with the if statement:
while IFS= read -r line; do
 if [ "$(echo "$line" | grep -c "RecordStart")" -eq 1 ]
 then
 echo true
 fi
done
On my system this takes more than 2 seconds:
$ time ./slow.sh < data | wc -l
999
real 0m2.311s
user 0m0.036s
sys 0m0.240s
The reason is that for each loop iteration one sub-shell gets started for the command substitution $(), one sub-shell gets started for the pipe, and an additional grep has to be forked. It is much faster to avoid the three processes per iteration by using Bash's built-in sub-string check:
while IFS= read -r line; do
case $line in
*RecordStart*)
echo true
;;
esac
done
This is way faster:
$ time ./fast.sh < data | wc -l
999
real 0m0.041s
user 0m0.028s
sys 0m0.008s
The same applies to your awk and cut commands. Try to use Bash's parameter expansion instead.
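For instance, a sketch of extracting one attribute value with parameter expansion alone (the sample line is made up):

```shell
#!/bin/bash
# Sketch: pulling an attribute value out of a line with parameter expansion,
# so no awk/cut processes are forked per line (sample line is illustrative).
line='<Telegram Timestamp="2017-11-20 12:00:00.123" Service="L_Data" />'
value=${line#*Timestamp=\"} # drop everything through the opening quote
value=${value%%\"*} # drop everything from the closing quote on
echo "$value" # prints: 2017-11-20 12:00:00.123
```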
Alternatively, you can replace the extract function with a Perl implementation. Maybe this helps:
#! /usr/bin/perl
use strict;
use warnings;
my ($name, $type);
while (<>)
{
if (/\bRecordStart\b.*\bConnectionName="([^"]*)".*\bConnectorType="([^"]*)"/) {
($name, $type) = (1,ドル 2ドル);
}
if (/\bTelegram\b.*\bTimestamp="([^"]*)".*\bService="([^"]*)".*\bFrameFormat="([^"]*)".*\bRawData="([^"]*)/) {
print "$name $type 1ドル 2ドル 3ドル ", hex("0x4ドル"), "\n";
}
}
- I have tried what you say, and for the moment I have only reduced the time from 7m to 5m. @ceving (Nicole, Nov 21, 2017)
- @Nicole Have you removed all the command substitutions executing awk and cut from the while loop? (ceving, Nov 21, 2017)
- The problem is I can't use Perl. @ceving (Nicole, Nov 22, 2017)