I have this script that reads an XML file, converts it to CSV, and at the end loads the result into SQLite:
#!/bin/bash
rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c27-30 > trdata
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $3 }'| cut -c1-19 > ttstamp
a=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g'| awk '{ print $15 }'`
b=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $12 }' | sed -e 's/;/ /g' | sed -e 's/=/ /g' | awk '{print $4 }'`
c=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $9,$10 }'`
touch rsctype rshost rscname
kk=`wc -l trdata | awk '{ print $1 }'`
for i in `seq 1 $kk`
do
echo $a >> rsctype
echo $b >> rshost
echo $c >> rscname
done
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $5 }' > tservice
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $7 }' > tformat
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c41-44 > trdata2
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c31-34 > grhost
awk -Wposix '{printf("%d\n","0x" $1)}' trdata > trdata3
awk -Wposix '{printf("%d\n","0x" $1)}' trdata2 > trdata4
sed -i "s%^0%0/%g" grhost
cat grhost | cut -c1-3 > grhost2
sed -i "s%.\{4\}%/%g" grhost
pr -mts, grhost2 grhost > grhostfinal
sed -i "s/,//g" grhostfinal
cat grhostfinal | cut -c1-4 > grhostfinal1
cat grhostfinal | cut -c5 > grhostfinal2
awk -Wposix '{printf("%d\n","0x" $1)}' grhostfinal2 > grhostfinal3
pr -mts, grhostfinal1 grhostfinal3 > grhostfinal4
sed -i "s/,//g" grhostfinal4
pr -mts, rshost ttstamp rsctype tservice tformat trdata4 trdata3 rscname grhostfinal4 > conjunto.csv
sed -i "s|^|,|g" conjunto.csv
sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
cat data.csv | sort | uniq > data2.csv
for k in `cat data2.csv`
do
grep "$k" conjunto.csv >> quitar
done
diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv
echo `sqlite3 test2.sqlite < testxml`
python csv2sqlite.py diferencia.csv test2.sqlite testxml4
rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2 trdata3 trdata4 grhost2 grhost grhostfinal3 grhostfinal1 grhostfinal2 grhostfinal grhostfinal4 a b c data.csv conjunto.csv data2.csv quitar
I have this XML (the data is private, so the attribute values are left empty):
<CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
<RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<RecordStop Timestamp="" />
<RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
<RecordStop Timestamp="" />
</CommunicationLog>
Once the data has been analyzed, I write it to a CSV file and load it into SQLite with the Python program csv2sqlite.py:
python csv2sqlite.py CSVFILE.csv DB.sqlite TABLESQLITE
My question is: how can I make this script faster and more efficient? It takes a long time to analyze all the data.
1 Answer
Take-home message: pipelines in Bash are slow (compared to process substitution); loops in Bash are slow; Bash is slow.
Without a setup script and dummy data to test run it, I don't really understand exactly what your script is trying to achieve at each step, so I can only suggest the following improvements:
Avoid unnecessary pipelines
- A lot of your code does
  cat $1 | grep Telegram | sed -e 's/"/ /g'
  which can be simplified to
  sed '/Telegram/!d; s/"/ /g' "$1"
  so you may want to save the result somewhere and extract from it when needed.
- awk '{ print $9 }' | cut -c27-30
  can be combined into
  awk '{print substr($9, 27, 4)}'
- In the command substitution that gets assigned to b, you have
  sed -e 's/;/ /g' | sed -e 's/=/ /g'
  which should really just be
  sed -e 's/;/ /g' -e 's/=/ /g'
  or even better just
  sed 's/[;=]/ /g'
  You don't need the -e option if you're not combining expressions.
- kk=`wc -l trdata | awk '{ print $1 }'`
  Here,
  kk=`wc -l < trdata`
  can do the job just fine.
- cat grhostfinal | cut -c1-4 > grhostfinal1
  is not as efficient as
  cut -c1-4 < grhostfinal > grhostfinal1
- sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
  cat data.csv | sort | uniq > data2.csv
  is probably best done entirely in SQL:
  sqlite3 test2.sqlite "SELECT DISTINCT fecha FROM testxml4 ORDER BY fecha;" > data.csv
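The pipeline advice above can be taken one step further: a single awk pass can filter the Telegram lines, replace the quotes, and emit several columns at once, so the file is read once instead of once per column. The following is only a sketch under assumptions: the function name telegrams_to_csv and the sample attribute values are made up for illustration, and the field numbers ($3, $5, $7) are the ones the original script uses, which may not match the real data.

```shell
# Hypothetical helper (not part of the original script): print
# "timestamp,service,format" for every Telegram line of the file in $1.
telegrams_to_csv() {
    awk '/Telegram/ {
        gsub(/"/, " ")   # same effect as sed s/"/ /g; changing $0 re-splits the fields
        print substr($3, 1, 19) "," $5 "," $7
    }' "$1"
}

# Made-up sample input; the real attribute values are private.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
<RecordStart Timestamp="2020-01-01T10:00:00.000Z" Mode="ModeA" Host="HostA" />
<Telegram Timestamp="2020-01-01T10:00:01.123Z" Service="ServiceA" FrameFormat="FormatA" RawData="00" />
<Telegram Timestamp="2020-01-01T10:00:02.456Z" Service="ServiceB" FrameFormat="FormatA" RawData="01" />
<RecordStop Timestamp="2020-01-01T10:00:03.000Z" />
</CommunicationLog>
EOF

telegrams_to_csv "$sample"
# prints:
# 2020-01-01T10:00:01,ServiceA,FormatA
# 2020-01-01T10:00:02,ServiceB,FormatA
rm -f "$sample"
```

Writing the joined row directly also removes the need for one temporary file per column and the later pr -mts, step, at least for the columns that come straight from the Telegram lines.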
Avoid loops if possible, or optimize for each iteration of the loop
- for i in `seq 1 $kk`
  do
  echo $a >> rsctype
  done
  is much slower than
  printf "$a\n%.0s" `seq 1 $kk` >> rsctype
  for small $kk, and much slower than
  yes "$a" | head -n "$kk" >> rsctype
  for large $kk. See https://superuser.com/questions/86340/linux-command-to-repeat-a-string-n-times. (Not sure if you actually want to repeat the same strings, but that's what your code does.)
- for k in `cat data2.csv`
  do
  grep "$k" conjunto.csv >> quitar
  done
  diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv
  looks like it could be done with just one diff? Maybe (haven't tested it):
  diff --changed-group-format='%>' --unchanged-group-format='' data2.csv conjunto.csv \
  > diferencia.csv
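As a rough illustration of the two loop replacements, here is a sketch with two hypothetical helpers. repeat_line wraps the yes | head idiom from above. drop_known is an alternative to the grep loop plus diff, not the diff command suggested above: it assumes that the goal is "every line of conjunto.csv that contains none of the values in data2.csv" and that those values can be matched as fixed strings rather than regular expressions. The file contents are invented.

```shell
# Hypothetical helper: print the string $1 exactly $2 times, one per line.
repeat_line() {
    yes "$1" | head -n "$2"
}

# Hypothetical helper: print the lines of file $2 that contain none of the
# fixed strings listed (one per line) in file $1.
# -F: patterns are fixed strings, -f: read patterns from a file, -v: invert.
# Caveat: an empty line in the pattern file matches every line.
drop_known() {
    grep -v -F -f "$1" "$2"
}

# Made-up data standing in for data2.csv and conjunto.csv.
known=$(mktemp)
all=$(mktemp)
printf '%s\n' '2020-01-01T10:00:01' > "$known"
printf '%s\n' ',hostA,2020-01-01T10:00:01,x' ',hostA,2020-01-01T10:00:02,y' > "$all"

repeat_line 'typeA' 3            # three lines, each reading typeA
drop_known "$known" "$all"       # prints only the 10:00:02 row
rm -f "$known" "$all"
```

Note that grep exits with status 1 when it selects no lines, which matters if the script runs under set -e.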
Non-performance-related notes
- You should wrap your cleanup command in a trap and put it at the beginning of the script so that it will always be executed unless the program is terminated by a SIGKILL:
  trap 'rm -f \
  rs{host,cname,ctype} \
  ttstamp tservice tformat trdata{,2,3,4} \
  grhost{,2,final{,{1..4}}} \
  data{,2}.csv conjunto.csv \
  {a..c} quitar' \
  'EXIT'
  Please don't rm -r if you're not deleting directories. This is a dangerous command if you're not careful. I used brace expansion to shorten the list of input files I need to type, but I'm not sure if you really need to create that many temporary files.
- You don't need to touch the files. Redirection will create the files if they don't exist.
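If the temporary files are kept, one possible variation on the trap idea is to create them all inside a single scratch directory made with mktemp -d, so the cleanup does not have to list every file name. This is a sketch, not the original author's approach: the function name run_in_scratch_dir and the file written inside it are placeholders. Because the thing being removed here really is a directory, rm -r is appropriate in this one place.

```shell
# Hypothetical wrapper: do all work in a private scratch directory that is
# removed when the subshell exits. The function body is a subshell ( ... ),
# so the trap and the cd do not leak into the calling shell.
run_in_scratch_dir() (
    scratch=$(mktemp -d) || exit 1
    trap 'rm -rf -- "$scratch"' EXIT
    cd "$scratch" || exit 1
    printf 'example\n' > trdata      # stand-in for the real intermediate files
    printf '%s\n' "$scratch"         # report where the work happened
    exit 0                           # leaving the subshell fires the EXIT trap
)

dir=$(run_in_scratch_dir)
if [ -e "$dir" ]; then
    echo "scratch directory still present"
else
    echo "scratch directory cleaned up"
fi
```

The final result (for example diferencia.csv) would have to be written outside the scratch directory, or copied out before the subshell exits.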
I can probably offer more suggestions if I could try out the script. I'll add to this answer if I think of anything, but this should be enough for now.
CREATE TABLE statement you used and the format of the attributes. An explanation of what each file should contain would also be nice, as currently I'm getting RawData= for tservice and Service= for ttstamp, which is super confusing.