Read integers from text file and write to CSV file

Question 1

Background:

I've recently started learning bash due to my new job in a VFX company. We backup all of our media to LTO tapes (one master and one clone). I was tasked with writing a script that split the tape list of master and clone to a CSV file. I feel I've done so in a crude manner as this was my first ever script and would love some feedback as to how I could improve the efficiency / syntax / code in general so I can learn from this experience.

Here is the text file: https://1drv.ms/t/s!AkWewdosAYGuhiaBbehlLBF64Qsg

I've been executing the script by calling it like this:
$ sh /scriptname.sh filename.txt

Code:

#!/bin/bash
# This script will split the presstore list of tapes into a .CSV file with two seperate coloumns.
file="1ドル"
echo "Splitting tape list....."
touch tempsplit.csv #creates temporary file for use later in script
while IFs= read line
do
 lastchar=$(echo $line | tail -c 2) 
 if [ "$lastchar" == : ] #Ommits any lines that end with : else error
 then
 echo -ne
 elif [ "$lastchar" -ge 0 -a "$lastchar" -le 9 ] #Selects lines that end in number
 then
 breakdown=$(echo "$line" | cut -d':' -f2,6) #selects fileds 2 & 6 containing tape numbers
 master=$(echo "$breakdown" | cut -d'a' -f1) #cuts first number
 clone=$(echo "$breakdown" | cut -d':' -f2) #cuts second number
 final=$(echo -e "$master,$clone" | tr -d ' ' >> tempsplit.txt) #outputs to a temp csv file
 fi
done < $file
touch tapelist_split.csv
awk 'NR % 2 == 0' tempsplit.txt | sort -n >> tapelist_split.csv #removes every 2nd line, sorts numerically, converts to a .CSV file
rm -rf tempsplit.txt #removes tempfile
echo "Complete"

Summary:

I'm not sure why the while read statement doesn't work if I don't call file="1ドル" as in line 3 and again output to that varaible in line 20 done < $file- an explanation on this would be amazing.

I realise I shouldn't have to call my first if statement, as I only care about the numbers but if I don't I get an error: "integer expected" when I run the script - does anyone know why this might be.

The reason I create the file "tempslipt.txt" is because my code to remove every 2nd line and sort the file wasn't working within the wile read statement so I figured this was a clean way of doing it.

My code may not be very efficient or good which is why I'm asking for tips on how I can refine and correct it so in future I can write much cleaner scripts.

Question 2

Brave attempt :-) This can be so much better ;-)

Running scripts

The script has the shebang #!/bin/bash, but you invoke it as sh script.sh. The purpose of a shebang is to make scripts runnable as ./script.sh. When invoked this way, the shell looks at the shebang, and runs the script with the specified executable. sh is often symlinked to Bash, but not always.

In short, if a script doesn't require Bash, then use the shebang line #!/bin/sh and run it as ./script.sh or sh ./script.sh. If it requires Bash, then use the shebang line #!/usr/bin/env bash and run it with ./script.sh or bash ./script.sh.

Filtering lines

Bash is not particularly well-suited to filter lines by patterns. grep is a great tool for that. So instead of reading line by line with Bash to filter, look for ways to use grep. In this example, replace the loop with:

grep '[0-9]$' "$file" | while IFS= read line; do
 ...
done

Empty conditional branches

I think your intention here was to do nothing when the condition is true:

if [ "$lastchar" == : ] 
 then
 echo -ne
elif ...

If you ever need to do nothing, you can use true or : like this:

if ...; then
 :
elif ...

But this is not a good example to use this trick, because a better solution exists. A better solution would have been to drop the if, and change the elif to if. An even better solution would have been to call continue if the line doesn't match the required pattern, so the loop body would be flatter:

if ! [ "$lastchar" -ge 0 -a "$lastchar" -le 9 ]; then
 continue
fi
breakdown=...

With my tip with grep in the previous point, you don't need a condition at all.

Extracting two numbers

This is very inefficient:

breakdown=$(echo "$line" | cut -d':' -f2,6)
master=$(echo "$breakdown" | cut -d'a' -f1)
clone=$(echo "$breakdown" | cut -d':' -f2)
final=$(echo -e "$master,$clone" | tr -d ' ' >> tempsplit.txt)

The problem is that for each line this runs many processes: echo, cut, tr, and multiple $(...) sub-shells.

I see a simpler way to achieve what you want. Looking at these sample lines:

tape with barcode: 000053 and is: offline at listed location: MCR Shelves and is: Full and is copy of tape with barcode: 000047
tape with barcode: 000044 and is: offline at listed location: MCR Shelves and is: Full and is copy of tape with barcode: 000042

We could extract the two numbers and put a comma in between like this:

Replace all non-digits at the beginning with empty string (= remove)
Replace all non-digits with a comma

Try this simple pipeline:

grep '[0-9]$' file | head | sed -e 's/^[^0-9]*//' -e 's/[^0-9][^0-9]*/,/'

This works for the head of the file, but not the entire file. On closer look, there are lines that have a third number between the target numbers, which breaks the above pattern:

tape with barcode: 000484 and is: online at listed location: i40 QUANTUM and is: Appendable and is copy of tape with barcode: 000483

After replacement this line becomes:

000484 and is: online at listed location: i40 QUANTUM and is: Appendable and is copy of tape with barcode: 000483

Observe that there is a space after the first number and before the last. So to handle such case, we could change the logic of the second step to: "replace everything between two spaces with a comma".

Let's give that a try, on lines containing "i40":

grep 'i40.*[0-9]$' file | head | sed -e 's/^[^0-9]*//' -e 's/ .* /,/'

It seems the script could be replaced with:

grep '[0-9]$' "$file" | \
 sed -e 's/^[^0-9]*//' -e 's/ .* /,/' | \
 awk 'NR % 2 == 0' | sort -n > tapelist_split.csv

No temporary files needed.

If you have the GNU version of sed (typically in Linux), then you can use it delete every 2nd line instead of awk, slightly simpler:

grep '[0-9]$' "$file" | \
 sed -e 's/^[^0-9]*//' -e 's/ .* /,/' -e '1~2d' | \
 sort -n > tapelist_split.csv

Quoting

When you use variables as command parameters, it's important to double-quote them to protect from word splitting and globbing. So a loop should read from $file like this:

while ...; do ...; done < "$file"

Understanding command parameters

Why did you use the -rf flags here?

rm -rf tempsplit.txt

There's no good reason to use those flags in the above script.

The -r flag is to delete directories recursively. But the parameter above is a single file. The -r flag serves no purpose.
The -f flag is to force deleting a file that might be protected, or to suppress error in case the file doesn't exist. Neither is the case here. The flag serves no purpose.

Flags in a script that serve no purpose or noise, confusing. Understand all the flags you are using, and make sure they have a reason to be there.

The field separator

The variable name is incorrect here:

while IFs= read line

Should have been:

while IFS= read line

Variable names in Bash are case sensitive.

Question 3

Thank you so much for this breakdown - exactly what I was looking for! I shall take onboard this advice in future and try to improve, I feel I may have over complicated the task at hand. The only further query I have is that; Whilst your sed solution is fantastic and certainly works for the most part, when lines such as:

tape with barcode: 600172 and is: online at listed location: i80 Quantum and is: Appendable and is copy of tape with barcode: 600171

this are the input, removing everything except numbers does not work - how best would you work around this?

Question 4

@oshhee I updated my answer to handle that case

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2018-03-22 19:16:30Z

Brave attempt :-) This can be so much better ;-)

Running scripts

The script has the shebang #!/bin/bash, but you invoke it as sh script.sh. The purpose of a shebang is to make scripts runnable as ./script.sh. When invoked this way, the shell looks at the shebang, and runs the script with the specified executable. sh is often symlinked to Bash, but not always.

In short, if a script doesn't require Bash, then use the shebang line #!/bin/sh and run it as ./script.sh or sh ./script.sh. If it requires Bash, then use the shebang line #!/usr/bin/env bash and run it with ./script.sh or bash ./script.sh.

Filtering lines

Bash is not particularly well-suited to filter lines by patterns. grep is a great tool for that. So instead of reading line by line with Bash to filter, look for ways to use grep. In this example, replace the loop with:

grep '[0-9]$' "$file" | while IFS= read line; do
 ...
done

Empty conditional branches

I think your intention here was to do nothing when the condition is true:

if [ "$lastchar" == : ] 
 then
 echo -ne
elif ...

If you ever need to do nothing, you can use true or : like this:

if ...; then
 :
elif ...

But this is not a good example to use this trick, because a better solution exists. A better solution would have been to drop the if, and change the elif to if. An even better solution would have been to call continue if the line doesn't match the required pattern, so the loop body would be flatter:

if ! [ "$lastchar" -ge 0 -a "$lastchar" -le 9 ]; then
 continue
fi
breakdown=...

With my tip with grep in the previous point, you don't need a condition at all.

Extracting two numbers

This is very inefficient:

breakdown=$(echo "$line" | cut -d':' -f2,6)
master=$(echo "$breakdown" | cut -d'a' -f1)
clone=$(echo "$breakdown" | cut -d':' -f2)
final=$(echo -e "$master,$clone" | tr -d ' ' >> tempsplit.txt)

The problem is that for each line this runs many processes: echo, cut, tr, and multiple $(...) sub-shells.

I see a simpler way to achieve what you want. Looking at these sample lines:

tape with barcode: 000053 and is: offline at listed location: MCR Shelves and is: Full and is copy of tape with barcode: 000047
tape with barcode: 000044 and is: offline at listed location: MCR Shelves and is: Full and is copy of tape with barcode: 000042

We could extract the two numbers and put a comma in between like this:

Replace all non-digits at the beginning with empty string (= remove)
Replace all non-digits with a comma

Try this simple pipeline:

grep '[0-9]$' file | head | sed -e 's/^[^0-9]*//' -e 's/[^0-9][^0-9]*/,/'

This works for the head of the file, but not the entire file. On closer look, there are lines that have a third number between the target numbers, which breaks the above pattern:

tape with barcode: 000484 and is: online at listed location: i40 QUANTUM and is: Appendable and is copy of tape with barcode: 000483

After replacement this line becomes:

000484 and is: online at listed location: i40 QUANTUM and is: Appendable and is copy of tape with barcode: 000483

Observe that there is a space after the first number and before the last. So to handle such case, we could change the logic of the second step to: "replace everything between two spaces with a comma".

Let's give that a try, on lines containing "i40":

grep 'i40.*[0-9]$' file | head | sed -e 's/^[^0-9]*//' -e 's/ .* /,/'

It seems the script could be replaced with:

grep '[0-9]$' "$file" | \
 sed -e 's/^[^0-9]*//' -e 's/ .* /,/' | \
 awk 'NR % 2 == 0' | sort -n > tapelist_split.csv

No temporary files needed.

If you have the GNU version of sed (typically in Linux), then you can use it delete every 2nd line instead of awk, slightly simpler:

grep '[0-9]$' "$file" | \
 sed -e 's/^[^0-9]*//' -e 's/ .* /,/' -e '1~2d' | \
 sort -n > tapelist_split.csv

Quoting

When you use variables as command parameters, it's important to double-quote them to protect from word splitting and globbing. So a loop should read from $file like this:

while ...; do ...; done < "$file"

Understanding command parameters

Why did you use the -rf flags here?

rm -rf tempsplit.txt

There's no good reason to use those flags in the above script.

The -r flag is to delete directories recursively. But the parameter above is a single file. The -r flag serves no purpose.
The -f flag is to force deleting a file that might be protected, or to suppress error in case the file doesn't exist. Neither is the case here. The flag serves no purpose.

Flags in a script that serve no purpose or noise, confusing. Understand all the flags you are using, and make sure they have a reason to be there.

The field separator

The variable name is incorrect here:

while IFs= read line

Should have been:

while IFS= read line

Variable names in Bash are case sensitive.

Thank you so much for this breakdown - exactly what I was looking for! I shall take onboard this advice in future and try to improve, I feel I may have over complicated the task at hand. The only further query I have is that; Whilst your sed solution is fantastic and certainly works for the most part, when lines such as: tape with barcode: 600172 and is: online at listed location: i80 Quantum and is: Appendable and is copy of tape with barcode: 600171 this are the input, removing everything except numbers does not work - how best would you work around this?

Stack Exchange Network

Read integers from text file and write to CSV file

1 Answer 1

Running scripts

Filtering lines

Empty conditional branches

Extracting two numbers

Quoting

Understanding command parameters

The field separator

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Read integers from text file and write to CSV file

1 Answer 1

Running scripts

Filtering lines

Empty conditional branches

Extracting two numbers

Quoting

Understanding command parameters

The field separator

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions