I have this script, which reads an XML file, converts it to CSV, and at the end loads the result into SQLite:

#!/bin/bash
rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c27-30 > trdata
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $3 }'| cut -c1-19 > ttstamp
a=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g'| awk '{ print $15 }'`
b=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $12 }' | sed -e 's/;/ /g' | sed -e 's/=/ /g' | awk '{print $4 }'`
c=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $9,$10 }'`
touch rsctype rshost rscname
kk=`wc -l trdata | awk '{ print $1 }'`
for i in `seq 1 $kk`
do
 echo $a >> rsctype
 echo $b >> rshost
 echo $c >> rscname
done
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $5 }' > tservice
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $7 }' > tformat
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c41-44 > trdata2
cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c31-34 > grhost
awk -Wposix '{printf("%d\n","0x" $1)}' trdata > trdata3
awk -Wposix '{printf("%d\n","0x" $1)}' trdata2 > trdata4
sed -i "s%^0%0/%g" grhost
cat grhost | cut -c1-3 > grhost2
sed -i "s%.\{4\}%/%g" grhost
pr -mts, grhost2 grhost > grhostfinal
sed -i "s/,//g" grhostfinal
cat grhostfinal | cut -c1-4 > grhostfinal1
cat grhostfinal | cut -c5 > grhostfinal2
awk -Wposix '{printf("%d\n","0x" $1)}' grhostfinal2 > grhostfinal3
pr -mts, grhostfinal1 grhostfinal3 > grhostfinal4
sed -i "s/,//g" grhostfinal4
pr -mts, rshost ttstamp rsctype tservice tformat trdata4 trdata3 rscname grhostfinal4 > conjunto.csv
sed -i "s|^|,|g" conjunto.csv
sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
cat data.csv | sort | uniq > data2.csv
for k in `cat data2.csv`
do
 grep "$k" conjunto.csv >> quitar
done
diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv
echo `sqlite3 test2.sqlite < testxml`
python csv2sqlite.py diferencia.csv test2.sqlite testxml4
rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2 trdata3 trdata4 grhost2 grhost grhostfinal3 grhostfinal1 grhostfinal2 grhostfinal grhostfinal4 a b c data.csv conjunto.csv data2.csv quitar

The XML looks like this (the attribute values are removed because the data is private):

 <CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
 <RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <RecordStop Timestamp="" />
 <RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
 <RecordStop Timestamp="" />
</CommunicationLog>

Once the data has been analyzed, I write it to a CSV file and load that file into the database with the Python program csv2sqlite.py:

python csv2sqlite.py CSVFILE.csv DB.sqlite TABLESQLITE

My question is: how can I make this script faster and more efficient? It takes a long time to analyze all the data.

asked Nov 21, 2017 at 8:38
  • related: codereview.stackexchange.com/q/180856/37660 (Commented Nov 21, 2017 at 13:21)
  • Can you provide the database schema and some dummy data for the XML? Like the CREATE TABLE statement you used and the format of the attributes. An explanation of what each file should contain would also be nice, as currently I'm getting RawData= for tservice and Service= for ttstamp, which is super confusing. (Commented Nov 21, 2017 at 19:15)

1 Answer


Take-home message: pipelines in Bash are slow (compared to process substitution); loops in Bash are slow; Bash itself is slow.

Without a setup script and dummy data to test run it, I don't really understand exactly what your script is trying to achieve at each step, so I can only suggest the following improvements:

Avoid unnecessary pipelines

  • A lot of your code does cat $1 | grep Telegram | sed -e 's/"/ /g', which can be simplified to sed '/Telegram/!d; s/"/ /g' "$1", so you may want to save the result somewhere and extract from it when needed.
  • awk '{ print $9 }' | cut -c27-30 can be combined into awk '{print substr($9, 27, 4)}'.
  • In the command substitution that gets assigned to b, you have sed -e 's/;/ /g' | sed -e 's/=/ /g' which should really just be sed -e 's/;/ /g' -e 's/=/ /g' or even better just sed 's/[;=]/ /g'. You don't need the -e option if you're not combining expressions.
  • kk=`wc -l trdata | awk '{ print $1 }'`: kk=`wc -l < trdata` can do the job just fine.
  • cat grhostfinal | cut -c1-4 > grhostfinal1 is not as efficient as cut -c1-4 < grhostfinal > grhostfinal1

  • sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
    cat data.csv | sort | uniq > data2.csv
    

    is probably best done entirely in SQL:

    sqlite3 test2.sqlite "SELECT DISTINCT fecha FROM testxml4 ORDER BY fecha;" > data.csv
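
The first two points above can be combined roughly as follows. This is only a sketch: the sample XML values and the file names sample.xml and telegrams.txt are invented for illustration (the real data is private), and it assumes attribute values contain no spaces, as the original field numbering also does.

```shell
#!/bin/bash
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample input; the real attribute values are private.
cat > sample.xml <<'EOF'
<CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
 <RecordStart Timestamp="2017-11-21T08:00:00.000Z" Mode="LiveBusMonitor" Host="PC1" ConnectionName="conn" ConnectionOptions="opt" ConnectorType="Usb" MediumType="TP" />
 <Telegram Timestamp="2017-11-21T08:00:01.123Z" Service="L_Busmon.ind" FrameFormat="CommonEmi" RawData="0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF" />
 <Telegram Timestamp="2017-11-21T08:00:02.456Z" Service="L_Busmon.ind" FrameFormat="CommonEmi" RawData="FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210" />
 <RecordStop Timestamp="2017-11-21T08:00:03.000Z" />
</CommunicationLog>
EOF

# Filter and normalise the Telegram lines once, instead of once per column.
sed '/Telegram/!d; s/"/ /g' sample.xml > telegrams.txt

# Each column is then a single awk call on the cached file;
# substr() replaces the separate cut step.
awk '{ print substr($3, 1, 19) }' telegrams.txt > ttstamp
awk '{ print $5 }'                telegrams.txt > tservice
awk '{ print substr($9, 27, 4) }' telegrams.txt > trdata
```

The input file is read and filtered once rather than once per output column, and each column needs one process instead of a four-stage pipeline.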
    

Avoid loops if possible, or optimize for each iteration of the loop

  • for i in `seq 1 $kk`
    do
     echo $a >> rsctype
    done
    

    is much slower than

    printf "$a\n%.0s" `seq 1 $kk` >> rsctype
    

    for small $kk, and much slower than

    yes "$a" | head -n "$kk" >> rsctype
    

    for large $kk. See https://superuser.com/questions/86340/linux-command-to-repeat-a-string-n-times. (Not sure if you actually want to repeat the same strings, but that's what your code does.)

  • for k in `cat data2.csv`
    do
     grep "$k" conjunto.csv >> quitar
    done
    diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv
    

    looks like it could be done with just one diff? Maybe (haven't tested it):

    diff --changed-group-format='%>' --unchanged-group-format='' data2.csv conjunto.csv\
     > diferencia.csv
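
Another option for the second loop, if the goal is to keep only the rows of conjunto.csv whose timestamp is not already in the database (an assumption on my part, since I can't run the script), would be a single grep that reads all patterns from a file. The rows below are invented stand-ins for the real files.

```shell
#!/bin/bash
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented rows standing in for conjunto.csv (leading comma, host, timestamp, ...).
printf '%s\n' \
  ',host1,2017-11-21T08:00:01,usb' \
  ',host1,2017-11-21T08:00:02,usb' \
  ',host1,2017-11-21T08:00:03,usb' > conjunto.csv

# Invented stand-in for data2.csv: timestamps already stored in the database.
printf '%s\n' '2017-11-21T08:00:02' > data2.csv

# -f reads one pattern per line, -F treats them as fixed strings,
# -v keeps only the lines that match none of them.
grep -v -F -f data2.csv conjunto.csv > diferencia.csv
```

Note two differences from the loop: -F matches the timestamps literally, whereas grep "$k" treats them as regular expressions, and an empty line in data2.csv would match every row, so the pattern file should not contain blank lines.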
    

Non-performance-related notes

  • You should wrap your cleanup command in a trap and put it at the beginning of the script so that it will always be executed unless the program is terminated by a SIGKILL:

    trap 'rm -f \
    rs{host,cname,ctype} \
    ttstamp tservice tformat trdata{,2,3,4} \
    grhost{,2,final{,{1..4}}} \
    data{,2}.csv conjunto.csv \
    {a..c} quitar' \
    'EXIT'
    
  • Please don't rm -r if you're not deleting directories. This is a dangerous command if you're not careful. I used brace expansion to shorten the list of input files I need to type, but I'm not sure if you really need to create that many temporary files.

  • You don't need to touch the files. Redirection will create the files if they don't exist.
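
If the temporary files are still needed, one way to avoid listing them all in the trap is to create them inside a scratch directory and remove that directory on exit. The following is a minimal sketch of that idea, not a drop-in replacement; the subshell stands in for the whole script, and marker and workdir_used are names made up here only to show that the cleanup ran.

```shell
#!/bin/bash
# Hypothetical helper file, used only to report the scratch directory's name.
marker=$(mktemp)

# The subshell stands in for the whole script; its EXIT trap fires when it ends.
(
  workdir=$(mktemp -d)
  # -r is appropriate here because the target really is a directory.
  trap 'rm -rf -- "$workdir"' EXIT
  echo "$workdir" > "$marker"

  # Intermediate files live inside $workdir instead of the current directory.
  echo 'example' > "$workdir/trdata"
)

# After the subshell has exited, the scratch directory is already gone.
workdir_used=$(cat "$marker")
rm -f -- "$marker"
```

With this layout the cleanup list never has to be kept in sync with the names of the intermediate files.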

I can probably offer more suggestions if I could try out the script. I'll add to this answer if I think of anything, but this should be enough for now.

answered Nov 21, 2017 at 17:43
