I am splitting a csv file where the first 3 columns will be common for all the output files.
input file:
h1 h2 h3 o1 o2 ....
a b c d e ....
a1 b1 c1 d1 e1 ....
output files:
o1.csv:
h1 h2 h3 o1
a b c d
a1 b1 c1 d1
o2.csv:
h1 h2 h3 o2
a b c e
a1 b1 c1 e1
So if there are n columns in the input file, the code creates n-3 output files. However, my code is inefficient and quite slow: it takes 20 seconds for 50000 rows.
old_IFS=$IFS
START_TIME=`date`
DELIMITER=,
# reading and writing headers
headers_line=$(head -n 1 "$csv_file")
IFS=$DELIMITER read -r -a headers <<< $headers_line
common_headers=${headers[0]}$DELIMITER${headers[1]}$DELIMITER${headers[2]}
for header in "${headers[@]:3}"
do
    # writing headers to every file
    echo $common_headers$DELIMITER$header > "$header$START_TIME".csv
done
# reading csv file line by line
i=1
while IFS=$DELIMITER read -r -a row_data
do
    test $i -eq 1 && ((i++)) && continue # ignoring headers
    j=0
    common_data=${row_data[0]}$DELIMITER${row_data[1]}$DELIMITER${row_data[2]}
    for val in "${row_data[@]:3}"
    do
        # appending row to every new csv file
        echo $common_data$DELIMITER$val >> "${headers[(($j+3))]}$START_TIME".csv
        ((j++))
    done
done < $csv_file
IFS=${old_IFS}
Any suggestions are appreciated.
1 Answer
Bash is not efficient for processing large files line by line. For small data it's fine, but when a script starts to feel heavy, it's good to look for alternatives. Also note that line-by-line processing and breaking into columns is not easy to get right; I bet you spent quite some time on this. You wrote it well, but the result is not particularly easy to read, and I'm afraid this is as good as it gets with Bash.
So what's the alternative? Try with cut in a loop. Yes, that will imply reading the file n-3 times, but I bet it will be faster than the pure Bash solution. And it will be nicely readable too, which is an extremely important benefit.
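As a rough sketch of the cut-in-a-loop idea, something like the following could work. The function name split_csv and its out_dir and delimiter parameters are made up for illustration. It assumes a simple CSV (no quoted fields containing the delimiter, no CRLF line endings) with at least four columns, and it leaves out the timestamp suffix that the original script adds to file names.

```bash
#!/usr/bin/env bash
# Sketch: write one output file per extra column, letting cut do the splitting.
# split_csv, out_dir and delimiter are hypothetical names, not from the question.

split_csv() {
    local csv_file=$1 out_dir=${2:-.} delimiter=${3:-,}
    local -a headers
    local col

    # Read only the first line and split it into the headers array.
    IFS=$delimiter read -r -a headers < "$csv_file"

    # cut numbers fields from 1, so the first non-common column is field 4.
    for ((col = 4; col <= ${#headers[@]}; col++)); do
        # Fields 1-3 are the common columns; field $col is specific to this file.
        cut -d "$delimiter" -f "1-3,$col" "$csv_file" \
            > "$out_dir/${headers[col-1]}.csv"
    done
}
```

Calling split_csv input.csv out (with an existing out directory) would then create out/o1.csv, out/o2.csv and so on. The header row needs no special handling, because cut treats it like any other line.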
A few notes about technique:
- Use $(...) instead of `...`
- You took care to save IFS and then restore it at the end, but that was unnecessary: when you do var=... somecmd, the value of var is only set in the environment of somecmd; it is unchanged for the current script. That being said, what you did is safe, so it's fine.
- The incrementing i variable in the loop is a bit misleading, because i is a common name in counting loops, and at first I thought the count itself had some purpose. But it doesn't: this variable is used only to distinguish the first line from the others. I would write it differently, to make the intention perfectly obvious.
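The last two notes can be illustrated with a small sketch. The function names demo_ifs_scope and skip_header_demo are invented for this example, and reading the header with its own read before the loop is only one possible way to make the intent obvious, not necessarily the one the original script should adopt.

```bash
#!/usr/bin/env bash
# Sketch only: demo_ifs_scope and skip_header_demo are hypothetical helpers.

# Shows that "IFS=, read ..." changes IFS for that single read command only.
demo_ifs_scope() {
    local before=$IFS
    local -a fields
    IFS=, read -r -a fields <<< "a,b,c"
    # Back in the function body, IFS still holds its previous value.
    [[ $IFS == "$before" ]] && echo "IFS unchanged, fields=${#fields[@]}"
}

# Skips the header without a counter: the first line is consumed by a
# dedicated read, so the loop only ever sees data rows.
skip_header_demo() {
    local csv_file=$1
    local -a headers row_data
    {
        IFS=, read -r -a headers            # first line: the header
        while IFS=, read -r -a row_data; do
            printf 'row: %s\n' "${row_data[0]}"
        done
    } < "$csv_file"
}
```

Running demo_ifs_scope prints "IFS unchanged, fields=3", which is why the save and restore of IFS in the original script is not needed.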