Background:
I've recently started learning bash due to my new job in a VFX company. We backup all of our media to LTO tapes (one master and one clone). I was tasked with writing a script that split the tape list of master and clone to a CSV file. I feel I've done so in a crude manner as this was my first ever script and would love some feedback as to how I could improve the efficiency / syntax / code in general so I can learn from this experience.
Here is the text file: https://1drv.ms/t/s!AkWewdosAYGuhiaBbehlLBF64Qsg
I've been executing the script by calling it like this:
$ sh /scriptname.sh filename.txt
Code:
#!/bin/bash
# This script will split the presstore list of tapes into a .CSV file with two seperate coloumns.
file="1ドル"
echo "Splitting tape list....."
touch tempsplit.csv #creates temporary file for use later in script
while IFs= read line
do
lastchar=$(echo $line | tail -c 2)
if [ "$lastchar" == : ] #Ommits any lines that end with : else error
then
echo -ne
elif [ "$lastchar" -ge 0 -a "$lastchar" -le 9 ] #Selects lines that end in number
then
breakdown=$(echo "$line" | cut -d':' -f2,6) #selects fileds 2 & 6 containing tape numbers
master=$(echo "$breakdown" | cut -d'a' -f1) #cuts first number
clone=$(echo "$breakdown" | cut -d':' -f2) #cuts second number
final=$(echo -e "$master,$clone" | tr -d ' ' >> tempsplit.txt) #outputs to a temp csv file
fi
done < $file
touch tapelist_split.csv
awk 'NR % 2 == 0' tempsplit.txt | sort -n >> tapelist_split.csv #removes every 2nd line, sorts numerically, converts to a .CSV file
rm -rf tempsplit.txt #removes tempfile
echo "Complete"
Summary:
I'm not sure why the while read statement doesn't work if I don't call file="1ドル"
as in line 3 and again output to that varaible in line 20 done < $file
- an explanation on this would be amazing.
I realise I shouldn't have to call my first if statement, as I only care about the numbers but if I don't I get an error: "integer expected" when I run the script - does anyone know why this might be.
The reason I create the file "tempslipt.txt" is because my code to remove every 2nd line and sort the file wasn't working within the wile read statement so I figured this was a clean way of doing it.
My code may not be very efficient or good which is why I'm asking for tips on how I can refine and correct it so in future I can write much cleaner scripts.
1 Answer 1
Brave attempt :-) This can be so much better ;-)
Running scripts
The script has the shebang #!/bin/bash
, but you invoke it as sh script.sh
.
The purpose of a shebang is to make scripts runnable as ./script.sh
.
When invoked this way, the shell looks at the shebang, and runs the script with the specified executable.
sh
is often symlinked to Bash, but not always.
In short, if a script doesn't require Bash, then use the shebang line #!/bin/sh
and run it as ./script.sh
or sh ./script.sh
.
If it requires Bash, then use the shebang line #!/usr/bin/env bash
and run it with ./script.sh
or bash ./script.sh
.
Filtering lines
Bash is not particularly well-suited to filter lines by patterns.
grep
is a great tool for that.
So instead of reading line by line with Bash to filter,
look for ways to use grep
.
In this example, replace the loop with:
grep '[0-9]$' "$file" | while IFS= read line; do
...
done
Empty conditional branches
I think your intention here was to do nothing when the condition is true:
if [ "$lastchar" == : ] then echo -ne elif ...
If you ever need to do nothing, you can use true
or :
like this:
if ...; then
:
elif ...
But this is not a good example to use this trick,
because a better solution exists.
A better solution would have been to drop the if
, and change the elif
to if
.
An even better solution would have been to call continue
if the line doesn't match the required pattern, so the loop body would be flatter:
if ! [ "$lastchar" -ge 0 -a "$lastchar" -le 9 ]; then
continue
fi
breakdown=...
With my tip with grep
in the previous point, you don't need a condition at all.
Extracting two numbers
This is very inefficient:
breakdown=$(echo "$line" | cut -d':' -f2,6) master=$(echo "$breakdown" | cut -d'a' -f1) clone=$(echo "$breakdown" | cut -d':' -f2) final=$(echo -e "$master,$clone" | tr -d ' ' >> tempsplit.txt)
The problem is that for each line this runs many processes: echo
, cut
, tr
, and multiple $(...)
sub-shells.
I see a simpler way to achieve what you want. Looking at these sample lines:
tape with barcode: 000053 and is: offline at listed location: MCR Shelves and is: Full and is copy of tape with barcode: 000047 tape with barcode: 000044 and is: offline at listed location: MCR Shelves and is: Full and is copy of tape with barcode: 000042
We could extract the two numbers and put a comma in between like this:
- Replace all non-digits at the beginning with empty string (= remove)
- Replace all non-digits with a comma
Try this simple pipeline:
grep '[0-9]$' file | head | sed -e 's/^[^0-9]*//' -e 's/[^0-9][^0-9]*/,/'
This works for the head
of the file, but not the entire file. On closer look, there are lines that have a third number between the target numbers, which breaks the above pattern:
tape with barcode: 000484 and is: online at listed location: i40 QUANTUM and is: Appendable and is copy of tape with barcode: 000483
After replacement this line becomes:
000484 and is: online at listed location: i40 QUANTUM and is: Appendable and is copy of tape with barcode: 000483
Observe that there is a space after the first number and before the last. So to handle such case, we could change the logic of the second step to: "replace everything between two spaces with a comma".
Let's give that a try, on lines containing "i40":
grep 'i40.*[0-9]$' file | head | sed -e 's/^[^0-9]*//' -e 's/ .* /,/'
It seems the script could be replaced with:
grep '[0-9]$' "$file" | \
sed -e 's/^[^0-9]*//' -e 's/ .* /,/' | \
awk 'NR % 2 == 0' | sort -n > tapelist_split.csv
No temporary files needed.
If you have the GNU version of sed
(typically in Linux), then you can use it delete every 2nd line instead of awk
, slightly simpler:
grep '[0-9]$' "$file" | \
sed -e 's/^[^0-9]*//' -e 's/ .* /,/' -e '1~2d' | \
sort -n > tapelist_split.csv
Quoting
When you use variables as command parameters,
it's important to double-quote them to protect from word splitting and globbing. So a loop should read from $file
like this:
while ...; do ...; done < "$file"
Understanding command parameters
Why did you use the -rf
flags here?
rm -rf tempsplit.txt
There's no good reason to use those flags in the above script.
- The
-r
flag is to delete directories recursively. But the parameter above is a single file. The-r
flag serves no purpose. - The
-f
flag is to force deleting a file that might be protected, or to suppress error in case the file doesn't exist. Neither is the case here. The flag serves no purpose.
Flags in a script that serve no purpose or noise, confusing. Understand all the flags you are using, and make sure they have a reason to be there.
The field separator
The variable name is incorrect here:
while IFs= read line
Should have been:
while IFS= read line
Variable names in Bash are case sensitive.
-
\$\begingroup\$ Thank you so much for this breakdown - exactly what I was looking for! I shall take onboard this advice in future and try to improve, I feel I may have over complicated the task at hand. The only further query I have is that; Whilst your sed solution is fantastic and certainly works for the most part, when lines such as:
tape with barcode: 600172 and is: online at listed location: i80 Quantum and is: Appendable and is copy of tape with barcode: 600171
this are the input, removing everything except numbers does not work - how best would you work around this? \$\endgroup\$oshhee– oshhee2018年03月26日 16:47:05 +00:00Commented Mar 26, 2018 at 16:47 -
\$\begingroup\$ @oshhee I updated my answer to handle that case \$\endgroup\$janos– janos2018年03月30日 03:46:53 +00:00Commented Mar 30, 2018 at 3:46