I have a script that I need to optimize to reduce its execution time: currently, processing 6000 lines takes 1 minute and 38 seconds. I would like it to take much less, but I don't know how to optimize it any further. Here is the script:
#! /bin/bash
function charge_files ()
{
database="test.db"
file="Prueba"
file2="def"
schema="esquema" #schema sqlite3
XML="Prueba.xml" #telegramas.xml
tempschema="tempschema"
if [ -f $XML ]; then
echo "============================="
echo "| XML CHARGED |"
echo "============================="
else
echo "============================="
echo "| XML NOT CHARGED |"
echo "============================="
exit 1
fi
}
function extract ()
{
host=''
i=0
while IFS= read -r line; do
# Find out if it is a RecordStart line
if [ "$(echo "$line" | grep -c "RecordStart")" -eq 1 ]
then
# If host == '', then it is the first host we see.
# Otherwise, we are changing host, so print an empty line
if [ "$host" != '' ]
then
echo ""
fi
# Collect the ConnectionName information
connectioname=$(echo "$line" | awk '{print 5ドル}' | cut -d'=' -f2)
host=$connectioname # record that we have seen at least one host block
# Collect the ConnectorType information
connectortype=$(echo "$line" | awk '{print 7ドル}' | cut -d";" -f2 | cut -d"=" -f2)
# Done with this loop in the while, move on to the next
continue
fi
# Find if it is a Telegram line
if [ "$(echo "$line" | grep -c "Telegram")" -eq 1 ]
then
# Collect the Timestamp information
timestamp=$(echo "$line" | awk '{print 2ドル}' | cut -d"." -f1 | cut -d"=" -f2)
# Collect the service information
service=$(echo "$line" | awk '{print 3ドル}' | cut -d"=" -f2)
# Collect the FrameFormat information
frameformat=$(echo "$line" | awk '{print 4ドル}' | cut -d"=" -f2)
# Collect the RawData information
RawDatahost=$(echo "$line" | awk '{print 5ドル}' | cut -c 36-39)
# Collect the second RawData information
RawDatahost3=$(echo "$line" | awk '{print 5ドル}' | cut -c 50-53)
# Print the information
i=$((i + 1))
echo "$connectioname $connectortype $timestamp $service $frameformat $((16#$RawDatahost)) $((16#$RawDatahost3))" >> $file
# Done with this loop in the while, move on to the next
continue
fi
done <$XML
}
function clean() {
#Clean the file
cat $file | tr -d '"' | tr -s " ">> $file2
cat $file2 | tr ' ' ',' >> definitivo
}
function create_schema(){
if [ -f "$schema" ]; then
echo "============================="
echo "| LOADED SCHEMA |"
echo "============================="
else
echo 'CREATE TABLE test (
KKID INTEGER PRIMARY KEY,
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL DEFAULT (strftime("%Y-%m-%d %H:%M:%S")),
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);' >> $schema
fi
if [ -f "$tempschema" ]; then
echo "============================="
echo "| LOADED TEMPSCHEMA |"
echo "============================="
else
echo 'create table temp (
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL DEFAULT (strftime("%Y-%m-%d %H:%M:%S")),
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);
.separator ","
.import ./definitivo temp
.exit' >> $tempschema
fi
}
function upload() {
#upload the schema to sqlite3 database
echo "$(sqlite3 "$database" < "$schema")"
#Create a temp table with the script
echo "$(sqlite3 $database < $tempschema)"
#upload the csv to a temp table
echo -e ".separator ","\\n.import definitivo temp" | sqlite3 $database
#make an insert from temp into the main table to get the autoincrement attribute
echo "$(sqlite3 $database "insert into test (conection, ip, time, service, frameformat, id_dispositivo, id_valor) SELECT * FROM temp;")"
#delete the temp table
echo "$(sqlite3 $database "drop table if exists temp;")"
#remove duplicated fields
echo "$(sqlite3 $database "DELETE FROM test WHERE oid NOT IN (SELECT min(oid) FROM test GROUP BY conection, ip, time, service, frameformat, id_dispositivo, id_valor)")"
rm definitivo
rm "$file"
rm "$file2"
}
charge_files
extract
clean
create_schema
upload
For those reading the script: it converts an XML file to a very simple CSV format, removes the quotes, and leaves the data clean so that it can later be loaded into an SQLite database. The schema is embedded in the same code to keep things even simpler, and the script creates the files it needs for its execution.
I now attach the XML format. Bear in mind that the XML can reach more than 7000 lines, but here is an example:
<CommunicationLog xmlns="http:telegrams">
<RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<RecordStop Timestamp="" />
<RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<RecordStop Timestamp="" />
</CommunicationLog>
- Have you profiled your shell script in order to know what to optimize? (ceving, Nov 20, 2017)
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. (Mast ♦, Nov 22, 2017)
2 Answers
Doing I/O or spawning subshells (or doing just about anything) inside a while loop will be slow
By I/O I mean this line with the redirect and append:
echo "$connectioname $connectortype $timestamp $service $frameformat \
$((16#$RawDatahost)) $((16#$RawDatahost3))" >> $file
Spawning a lot of subshells for the pipelines would also really slow down your code, as ceving's answer has shown. If you must set up a pipeline, I suggest you consider the faster, Bash-only alternative: process substitution.
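To illustrate the difference, here is a minimal sketch (the data and variable names are made up): a pipeline into while runs the loop in a subshell, so variables set inside it are lost, while process substitution keeps the loop in the current shell.

```shell
#!/bin/bash
# Sketch: a while-read loop fed by process substitution runs in the current
# shell, so variables modified inside the loop keep their values afterwards.
count=0
while IFS= read -r line; do
 count=$((count + 1))
done < <(printf 'a\nb\nc\n') # process substitution, no pipeline subshell
echo "$count" # prints 3; with "printf ... | while ..." it would print 0
```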
So I turned to an awk magic one-liner that does both the extract and clean functions at once:
grep -e 'RecordStart' -e 'Telegram' "$XML" | \
awk -F '"' '{ if( NF > 10 ) { conname = 8ドル; contype = 12ドル } else \
printf("%s,%s,%s,%s,%s,%x,%x\n", \
conname, contype, 2,ドル 4,ドル 6,ドル substr(8,ドル 36, 4), substr(8,ドル 50, 4)\
); }' >> definitivo
I don't think I need to explain the grep part. -F sets the field delimiter to " (the double quote). If NF, the number of fields, is greater than 10, we're parsing the RecordStart tag, so we save the values for ConnectionName and ConnectorType. If NF is not greater than 10, we're parsing the Telegram tag, so we print the needed columns, including the two previously saved values, displaying the last two columns in hex format and the rest as strings.
This will be the single change that makes your program fly in the high-speed lane. That's pretty much it. No command substitutions, almost no pipelines, not even Bash builtins (which are still slow), and most importantly, no while read loop.
Minor notes:
- The traditional way to define a function foo is to write it as foo(). These days, people write function foo without the parentheses. There are minor differences between the two forms, but whichever you choose, don't write function foo().
- Prefer single quotes over double quotes unless you need the expansion.
- For cleanup actions, it's best to write them in a trap so they will always be executed unless a SIGKILL is received.
- [[ ]] is faster and safer than [ ], since it takes care of most of the quoting inside it.
- I used a here-document to write the database schemas. There's a variation, <<-, which allows you to indent the input lines and the line containing the delimiter with tabs. I opted not to use it because of whitespace issues when copying and pasting.
- In your upload function, I'm not sure what there is to echo, so I removed the echos in the revised code below.
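The trap note above can be sketched like this (the file contents are just for illustration):

```shell
#!/bin/bash
# Sketch of trap-based cleanup: the temporary file is removed on any exit
# path (normal exit, errors, most signals), not only at the end of a
# successful run. SIGKILL cannot be trapped, so it is the one exception.
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT
echo 'intermediate data' > "$tmpfile"
# ... work with "$tmpfile" ...
exit 0 # no explicit rm needed; the EXIT trap removes the file
```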
Code
#! /bin/bash
database='test.db'
schema='esquema'
XML='Prueba.xml'
tempschema='tempschema'
trap 'rm -f definitivo' EXIT
function charge_files
{
if [[ -f $XML ]]
then
echo '============================='
echo '| XML CHARGED |'
echo '============================='
else
echo '============================='
echo '| XML NOT CHARGED |'
echo '============================='
exit 1
fi
}
function extract
{
grep -e 'RecordStart' -e 'Telegram' "$XML" | \
awk -F '"' '{ if( NF > 10 ) { conname = 8ドル; contype = 12ドル } else \
printf("%s,%s,%s,%s,%s,%x,%x\n", \
conname, contype, 2,ドル 4,ドル 6,ドル substr(8,ドル 36, 4), substr(8,ドル 50, 4)\
); }' >> definitivo
}
function create_schema
{
if [[ -f $schema ]]
then
echo '============================='
echo '| LOADED SCHEMA |'
echo '============================='
else
cat >> "$schema" <<'EOF'
CREATE TABLE test (
KKID INTEGER PRIMARY KEY,
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL,
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);
EOF
fi
if [[ -f $tempschema ]]
then
echo '============================='
echo '| LOADED TEMPSCHEMA |'
echo '============================='
else
cat >> "$tempschema" <<'EOF'
CREATE TABLE temp (
conection VARCHAR(20) NOT NULL,
ip VARCHAR(20) NOT NULL,
time DATETIME NOT NULL,
service VARCHAR(20) NOT NULL,
frameformat VARCHAR(20) NOT NULL,
id_dispositivo VARCHAR(20) NOT NULL,
id_valor VARCHAR(20) NOT NULL
);
.separator ","
.import ./definitivo temp
.exit
EOF
fi
}
function upload
{
# Upload the schema to sqlite3 database
sqlite3 "$database" < "$schema"
# Create a temp table with the script
sqlite3 "$database" < "$tempschema"
# Upload the csv to a temp table
sqlite3 "$database" < <(printf '.separator ","\n.import definitivo temp\n')
# Make an insert from the temp to the database
# to get the attribute autoincrement
sqlite3 "$database" "INSERT INTO test (conection, ip, time, service, \
frameformat, id_dispositivo, id_valor) SELECT * FROM temp;"
# Delete the table temp
sqlite3 "$database" 'DROP TABLE IF EXISTS temp;'
# Remove duplicate fields
sqlite3 "$database" "DELETE FROM test WHERE oid NOT IN (\
SELECT MIN(oid) FROM test GROUP BY \
conection, ip, time, service, \
frameformat, id_dispositivo, id_valor);"
}
charge_files
extract
create_schema
upload
- I changed the line printf("%s,%s,%s,%s,%s,%x,%x\n", \ to printf("%s,%s,%s,%s,%s,%d,%d\n", \ because I need the values in decimal. (Nicole, Nov 22, 2017)
- @Nicole Ah yes, that $((16#$RawDatahost)) thing is new to me. How is the running time now? There's still more optimization you could do with grep, but I don't think it matters at this point. (Gao, Nov 22, 2017)
- The time has been reduced from 1m to 38 seconds, not bad. Now I just need to add more things to the code, because they are asking me for even more, but thanks for your help @Gao. (Nicole, Nov 23, 2017)
This is one of the most inefficient ways to do a sub-string match:
"$(echo "$line" | grep -c "RecordStart")" -eq 1
See for yourself.
This creates a small test file:
for ((i=0;i<999;i++)); do echo RecordStart; done > data
I took just your while loop with the if statement:
while IFS= read -r line; do
 if [ "$(echo "$line" | grep -c "RecordStart")" -eq 1 ]
 then
 echo true
 fi
done
On my system this takes more than 2 seconds:
$ time ./slow.sh < data | wc -l
999
real 0m2.311s
user 0m0.036s
sys 0m0.240s
The reason is that for each loop iteration one sub-shell gets started for the command substitution $(), one sub-shell gets started for the pipe, and an additional grep has to be forked. It is much faster to avoid the three processes per iteration by using Bash's built-in sub-string check:
while IFS= read -r line; do
case $line in
*RecordStart*)
echo true
;;
esac
done
This is way faster:
$ time ./fast.sh < data | wc -l
999
real 0m0.041s
user 0m0.028s
sys 0m0.008s
The same applies to your awk and cut commands. Try to use Bash's parameter expansion instead.
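For instance, a sketch of extracting one attribute value with parameter expansion alone (the sample line is made up):

```shell
#!/bin/bash
# Sketch: pulling an attribute value out of a line with parameter expansion,
# so no awk/cut processes are forked per line (sample line is illustrative).
line='<Telegram Timestamp="2017-11-20 12:00:00.123" Service="L_Data" />'
value=${line#*Timestamp=\"} # drop everything through the opening quote
value=${value%%\"*} # drop everything from the closing quote on
echo "$value" # prints: 2017-11-20 12:00:00.123
```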
Alternatively, you can replace the extract function with a Perl implementation. Maybe this helps:
#! /usr/bin/perl
use strict;
use warnings;
my ($name, $type);
while (<>)
{
if (/\bRecordStart\b.*\bConnectionName="([^"]*)".*\bConnectorType="([^"]*)"/) {
($name, $type) = (1,ドル 2ドル);
}
if (/\bTelegram\b.*\bTimestamp="([^"]*)".*\bService="([^"]*)".*\bFrameFormat="([^"]*)".*\bRawData="([^"]*)/) {
print "$name $type 1ドル 2ドル 3ドル ", hex("0x4ドル"), "\n";
}
}
- I have tried what you say, and for the moment I have only reduced the time from 7m to 5m. @ceving (Nicole, Nov 21, 2017)
- @Nicole Have you removed all the command substitutions executing awk and cut from the while loop? (ceving, Nov 21, 2017)
- The problem is I can't use Perl. @ceving (Nicole, Nov 22, 2017)