3
\$\begingroup\$

I am splitting a CSV file into several output files, each of which keeps the first 3 columns of the input plus one of the remaining columns.

input file:

h1 h2 h3 o1 o2 ....
a b c d e ....
a1 b1 c1 d1 e1 ....

output files:

o1.csv:

h1 h2 h3 o1
a b c d 
a1 b1 c1 d1 

o2.csv:

h1 h2 h3 o2
a b c e
a1 b1 c1 e1 

So if there are n columns in the input file, the code creates n-3 output files. However, my code is inefficient and quite slow: it takes 20 seconds for 50000 rows.

old_IFS=$IFS
START_TIME=`date`
DELIMITER=, 
# reading and writing headers 
headers_line=$(head -n 1 "$csv_file")
IFS=$DELIMITER read -r -a headers <<< $headers_line
common_headers=${headers[0]}$DELIMITER${headers[1]}$DELIMITER${headers[2]}
for header in "${headers[@]:3}"
do
 # writing headers to every file
 echo $common_headers$DELIMITER$header > "$header$START_TIME".csv
done
# reading csv file line by line
i=1
while IFS=$DELIMITER read -r -a row_data
do
 test $i -eq 1 && ((i++)) && continue # ignoring headers
 j=0
 common_data=${row_data[0]}$DELIMITER${row_data[1]}$DELIMITER${row_data[2]}
 for val in "${row_data[@]:3}"
 do
 # appending row to every new csv file
 echo $common_data$DELIMITER$val >> "${headers[(($j+3))]}$START_TIME".csv
 ((j++)) 
 done
done < $csv_file
IFS=${old_IFS}

Any suggestions are appreciated.

asked May 21, 2017 at 21:01
\$\endgroup\$

1 Answer

4
\$\begingroup\$

Bash is not efficient for processing large files line by line. For small data it's fine, but when a script starts to feel heavy, it's good to look for alternatives. Also note that processing a file line by line and splitting each line into columns is not easy to get right; I bet you spent quite some time on this. You wrote it well, but the result is not particularly easy to read, and I'm afraid this is as good as it gets with Bash.

So what's the alternative? Try cut in a loop. Yes, that implies reading the file n-3 times, but I bet it will be faster than the pure Bash solution. It will also be nicely readable, which is an extremely important benefit.
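As a rough illustration of the cut-in-a-loop idea, here is a minimal sketch. The function name split_columns and the optional output directory argument are made up for this example, and the timestamp suffix from the original file names is left out for brevity. It assumes a simple CSV with no quoted fields containing commas or newlines, and Unix line endings.

```shell
#!/usr/bin/env bash
# Sketch only: split_columns is a hypothetical helper name, not from the question.
# Assumes plain comma-separated data without quoted or embedded delimiters.
split_columns() {
    local csv_file=$1 out_dir=${2:-.} delimiter=,
    local -a headers
    local i

    # Read only the first line to learn the column names.
    IFS=$delimiter read -r -a headers < "$csv_file"

    # One pass of cut per extra column: fields 1-3 plus the current one.
    for ((i = 3; i < ${#headers[@]}; i++)); do
        # cut numbers fields from 1, so array index i is field i+1.
        cut -d "$delimiter" -f "1-3,$((i + 1))" "$csv_file" \
            > "$out_dir/${headers[i]}.csv"
    done
}

# Example call (paths are placeholders):
#   split_columns input.csv output_dir
```

Because cut copies the header line along with the data, there is no separate header-writing step. Whether this is actually faster on a given input is worth measuring with time rather than assuming.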

A few notes about technique:

  • Use $(...) instead of `...`; it nests without escaping and is easier to read.
  • You took care to save IFS and restore it at the end, but that was unnecessary: when you write var=... somecmd, the value of var is set only in the environment of somecmd and stays unchanged for the current script. That being said, what you did is safe, so it's fine.
  • The incrementing i variable in the loop is a bit misleading, because i is a common name in counting loops, and at first I thought the count itself had some purpose. It doesn't: the variable is used only to distinguish the first line from the others. I would write it differently, to make the intention perfectly obvious.
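The three notes above can be sketched as follows. The function names demo_ifs_scope and count_data_rows are invented for this illustration, and reading the header once before the loop is only one possible way to make the intent obvious; the answer does not say which form its author had in mind.

```shell
#!/usr/bin/env bash

# 1. $(...) instead of backticks: same result, but it nests without escaping.
start_time=$(date +%s)

# 2. An assignment prefixed to a command applies only to that command.
#    demo_ifs_scope is a hypothetical name used for this sketch.
demo_ifs_scope() {
    local before=$IFS
    local -a fields
    IFS=, read -r -a fields <<< "a,b,c"
    # IFS in the current shell is unchanged after the read.
    [[ $IFS == "$before" ]] && echo "IFS unchanged, got ${#fields[@]} fields"
}

# 3. Consume the header once before the loop instead of counting lines with i.
#    count_data_rows is a hypothetical name used for this sketch.
count_data_rows() {
    local csv_file=$1 header line rows=0
    {
        IFS= read -r header          # first line: the header, read and set aside
        while IFS= read -r line; do  # every remaining line is a data row
            rows=$((rows + 1))
        done
    } < "$csv_file"
    echo "$rows"
}
```

Grouping the header read and the loop in braces lets both share one redirection, so the loop starts at the second line without any counter.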
answered May 22, 2017 at 6:32
\$\endgroup\$
