Optimize and accelerate bash script to transform XML to SQLITE

Question 1

I have this script that reads an XML file, converts it to CSV, and at the end loads the result into SQLite:

#!/bin/bash
rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c27-30 > trdata
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $3 }'| cut -c1-19 > ttstamp
a=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g'| awk '{ print $15 }'`
b=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $12 }' | sed -e 's/;/ /g' | sed -e 's/=/ /g' | awk '{print $4 }'`
c=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $9,$10 }'`
touch rsctype rshost rscname
kk=`wc -l trdata | awk '{ print $1 }'`
for i in `seq 1 $kk`
do
 echo $a >> rsctype
 echo $b >> rshost
 echo $c >> rscname
done
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $5 }' > tservice
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $7 }' > tformat
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c41-44 > trdata2
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c31-34 > grhost
awk -Wposix '{printf("%d\n","0x" $1)}' trdata > trdata3
awk -Wposix '{printf("%d\n","0x" $1)}' trdata2 > trdata4
sed -i "s%^0%0/%g" grhost
cat grhost | cut -c1-3 > grhost2
sed -i "s%.\{4\}%/%g" grhost
pr -mts, grhost2 grhost > grhostfinal
sed -i "s/,//g" grhostfinal
cat grhostfinal | cut -c1-4 > grhostfinal1
cat grhostfinal | cut -c5 > grhostfinal2
awk -Wposix '{printf("%d\n","0x" $1)}' grhostfinal2 > grhostfinal3
pr -mts, grhostfinal1 grhostfinal3 > grhostfinal4
sed -i "s/,//g" grhostfinal4
pr -mts, rshost ttstamp rsctype tservice tformat trdata4 trdata3 rscname grhostfinal4 > conjunto.csv
sed -i "s|^|,|g" conjunto.csv
sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
cat data.csv | sort | uniq > data2.csv
for k in `cat data2.csv`
do
 grep "$k" conjunto.csv >> quitar
done
diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv
echo `sqlite3 test2.sqlite < testxml`
python csv2sqlite.py diferencia.csv test2.sqlite testxml4
rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2 trdata3 trdata4 grhost2 grhost grhostfinal3 grhostfinal1 grhostfinal2 grhostfinal grhostfinal4 a b c data.csv conjunto.csv data2.csv quitar

I have this XML (the data is private):

 <CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
 <RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <RecordStop Timestamp="" />
 <RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <RecordStop Timestamp="" />
</CommunicationLog>

Once the data has been analyzed, I write it to a CSV file and import it with the Python program csv2sqlite.py:

python csv2sqlite.py CSVFILE.csv DB.sqlite TABLESQLITE

My question is how can I make this script faster and more efficient, since it takes a long time to analyze all the data.

Question 2

related: codereview.stackexchange.com/q/180856/37660

Question 3

Can you provide the database schema and some dummy data for the XML? Like the CREATE TABLE statement you used and the format of the attributes. An explanation of what each file should contain would also be nice, as currently I'm getting RawData= for tservice and Service= for ttstamp, which is super confusing.

Question 4

Take-home message: pipelines in Bash are slow (compared to process substitution); loops in Bash are slow; Bash is slow.

Without a setup script and dummy data to test run it, I don't really understand exactly what your script is trying to achieve at each step, so I can only suggest the following improvements:

Avoid unnecessary pipelines

A lot of your code does cat $1 | grep Telegram | sed -e 's/"/ /g', which can be simplified to sed '/Telegram/!d; s/"/ /g' "$1", so you may want to save the result somewhere and extract from it when needed.
awk '{ print $9 }' | cut -c27-30 can be combined into awk '{print substr($9, 27, 4)}'.
In the command substitution that gets assigned to b, you have sed -e 's/;/ /g' | sed -e 's/=/ /g' which should really just be sed -e 's/;/ /g' -e 's/=/ /g' or even better just sed 's/[;=]/ /g'. You don't need the -e option if you're not combining expressions.
kk=`wc -l trdata | awk '{ print $1 }'`: kk=`wc -l < trdata` can do the job just fine.
cat grhostfinal | cut -c1-4 > grhostfinal1 is not as efficient as cut -c1-4 < grhostfinal > grhostfinal1
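
As a rough illustration of the points above, the Telegram lines can be filtered and sliced in a single awk pass instead of one pipeline per column. This is only a sketch: the sample file, its attribute values, and the output name telegrams.csv are made up, and splitting on the double quote (-F'"') is a swapped-in technique rather than what the original script does. A side effect of that choice is that empty attribute values no longer shift the field numbers.

```shell
#!/bin/bash
# Sketch only: sample data and file names are hypothetical.
workdir=$(mktemp -d)
xml="$workdir/sample.xml"

# Dummy input with the same shape as the XML in the question.
cat > "$xml" <<'EOF'
<CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
 <RecordStart Timestamp="2017-11-21T10:00:00.000Z" Mode="m" Host="h" ConnectionName="n" ConnectionOptions="o" ConnectorType="t" MediumType="x" />
 <Telegram Timestamp="2017-11-21T10:00:01.123Z" Service="svc1" FrameFormat="fmt1" RawData="012345678901234567890123451F2E" />
 <Telegram Timestamp="2017-11-21T10:00:02.456Z" Service="svc2" FrameFormat="" RawData="01234567890123456789012345ABCD" />
 <RecordStop Timestamp="2017-11-21T10:00:03.000Z" />
</CommunicationLog>
EOF

# With '"' as the field separator, attribute values are the even fields:
# $2 = Timestamp, $4 = Service, $6 = FrameFormat, $8 = RawData.
awk -F'"' -v OFS=',' '/<Telegram /{
    print substr($2, 1, 19), $4, $6, substr($8, 27, 4)
}' "$xml" > "$workdir/telegrams.csv"

cat "$workdir/telegrams.csv"
```

The file is read once and only one process is started, whereas the original starts five processes for every column it extracts.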

sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
cat data.csv | sort | uniq > data2.csv

is probably best done entirely in SQL:

sqlite3 test2.sqlite "SELECT DISTINCT fecha FROM testxml4 ORDER BY fecha;" > data.csv

Avoid loops if possible, or optimize for each iteration of the loop

```
for i in `seq 1 $kk`
do
 echo $a >> rsctype
done
```
is much slower than
```
printf "$a\n%.0s" `seq 1 $kk` >> rsctype
```
for small $kk, and much slower than
```
yes "$a" | head -n "$kk" >> rsctype
```
for large $kk. See https://superuser.com/questions/86340/linux-command-to-repeat-a-string-n-times. (Not sure if you actually want to repeat the same strings, but that's what your code does.)
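
A small self-contained check that the loop and the yes | head pipeline write the same lines. The values of a and kk and the file names are placeholders, not taken from the original data. The loop here redirects once after done rather than reopening the file on every iteration, which is a separate small saving.

```shell
#!/bin/bash
# Sketch only: a, kk, and the file names are placeholder values.
workdir=$(mktemp -d)
a='USB'
kk=1000

# Loop version: one echo per iteration, output file opened once.
for i in $(seq 1 "$kk"); do
    echo "$a"
done > "$workdir/rsctype.loop"

# yes repeats its argument forever; head stops it after kk lines.
yes "$a" | head -n "$kk" > "$workdir/rsctype.yes"

# cmp -s is silent and exits 0 only when the two files are identical.
if cmp -s "$workdir/rsctype.loop" "$workdir/rsctype.yes"; then
    echo "identical: $(wc -l < "$workdir/rsctype.yes") lines"
fi
```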

for k in `cat data2.csv`
do
 grep "$k" conjunto.csv >> quitar
done
diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv

looks like it could be done with just one diff? Maybe (haven't tested it):

diff --changed-group-format='%>' --unchanged-group-format='' data2.csv conjunto.csv\
 > diferencia.csv
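
If the intent is to keep only the rows of conjunto.csv whose timestamp is not already in the database, another option is a single grep with a pattern file. That reading of the intent is an assumption, and the rows below are invented; only the file names follow the question.

```shell
#!/bin/bash
# Sketch only: the rows are invented; file names follow the question.
workdir=$(mktemp -d)

# Timestamps assumed to be already stored in the database.
printf '%s\n' '2017-11-21T10:00:01' > "$workdir/data2.csv"

# Candidate rows produced by the XML-to-CSV step.
cat > "$workdir/conjunto.csv" <<'EOF'
,h,2017-11-21T10:00:01,t,svc1,fmt1
,h,2017-11-21T10:00:02,t,svc2,fmt2
,h,2017-11-21T10:00:03,t,svc3,fmt3
EOF

# -F: patterns are fixed strings, -f: read them from a file,
# -v: print only the lines that match none of them.
grep -v -F -f "$workdir/data2.csv" "$workdir/conjunto.csv" \
    > "$workdir/diferencia.csv"

cat "$workdir/diferencia.csv"
```

An empty line in data2.csv would act as an empty pattern and match every row, so the pattern file should not contain one.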

Non-performance-related notes

You should wrap your cleanup command in a trap and put it at the beginning of the script so that it will always be executed unless the program is terminated by a SIGKILL:
```
trap 'rm -f \
rs{host,cname,ctype} \
ttstamp tservice tformat trdata{,2,3,4} \
grhost{,2,final{,{1..4}}} \
data{,2}.csv conjunto.csv \
{a..c} quitar' \
'EXIT'
```
Please don't rm -r if you're not deleting directories. This is a dangerous command if you're not careful. I used brace expansion to shorten the list of input files I need to type, but I'm not sure if you really need to create that many temporary files.
You don't need to touch the files. Redirection will create the files if they don't exist.
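
One way to shorten the list of files the trap has to know about is to create all temporary files inside one scratch directory and remove that directory on exit. This is a sketch of that idea, not something the original script does; the subshell and the marker file exist only so the cleanup can be observed afterwards.

```shell
#!/bin/bash
# Sketch only: the subshell stands in for the whole script.
marker=$(mktemp)   # records the scratch path so it can be checked later

(
    workdir=$(mktemp -d)
    # Registered before any work is done; runs when the subshell exits.
    trap 'rm -rf -- "$workdir"' EXIT
    printf '%s\n' "$workdir" > "$marker"

    # Redirection creates the files; no touch is needed.
    : > "$workdir/trdata"
    : > "$workdir/ttstamp"
)

workdir_path=$(cat "$marker")
rm -f -- "$marker"

if [ -d "$workdir_path" ]; then
    echo "scratch directory still exists"
else
    echo "scratch directory was removed"
fi
```

Here rm -r is appropriate because the target really is a directory, and it is one the script created itself.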

I can probably offer more suggestions if I could try out the script. I'll add to this answer if I think of anything, but this should be enough for now.

Gao Gao · Accepted Answer · 2017-11-21 17:43:17Z
