This is pretty horrible code, but it works. I don't code in bash π, usually code in Python.
This is a learning exercise for me, so that's why I'm using bash + that's why I don't use libraries.
I obviously don't expect anyone to correct the whole thing, but any main points on how to make this more bash-like would be greatly appreciated! π
Sample input:
1,3,John smith,"BA, Economics",,Economics
2,4,Brian Jones,Master of Science,,Economics
3,5,Bill jones,"MSc, Biology",,Biology
Sample output:
1,3,John Smith,"BA, Economics",[email protected],Economics
2,4,Brian Jones,Master of Science,[email protected],Economics
3,5,Bill Jones,"Msc, Biology",[email protected],Biology
Script:
#!/bin/bash
if [[ $# -ne 1 ]]; then
echo "Usage: 0γγ« <input_file>"
exit 1
fi
input_file="1γγ«"
output_file="students_new.csv"
declare -A email_count
parsed_line=()
get_corrected_surname() {
name="1γγ«"
corrected_name=$(echo "$name" | awk '{
for(i=1; i<=NF; i++) {
$i = toupper(substr($i,1,1)) substr($i,2)
}
print
}')
echo "$corrected_name"
}
get_email_prefix() {
local name=$(echo "1γγ«" | tr '[:upper:]' '[:lower:]')
local first_name=$(echo "$name" | awk '{print 1γγ«}')
local surname=$(echo "$name" | awk '{print 2γγ«}')
local initial=$(echo "$first_name" | cut -c 1)
local email_prefix="${initial}${surname}"
echo "$email_prefix"
}
get_email() {
name="1γγ«"
location_id="2γγ«"
email_prefix=$(get_email_prefix "$name")
if (( email_count["$email_prefix"] > 1 )); then
email_prefix="${email_prefix}${location_id}"
fi
email="${email_prefix}@ucla.edu"
echo "$email"
}
prep_email_count_array() {
while IFS=',' read -r id location_id full_name _rest; do
if [[ "$id" = "id" ]]; then
continue
fi
email_prefix=$(get_email_prefix "$full_name")
email_count["$email_prefix"]=$(( ${email_count["$email_prefix"]} + 1 ))
done < "$input_file"
}
parse_line() {
line="1γγ«"
parsed_line=()
current_string=""
inside_quotes=false
i=0
while [ $i -lt ${#line} ]; do
char="${line:i:1}"
if [[ "$char" == ',' && "$inside_quotes" == false ]]; then
parsed_line+=("$current_string")
current_string=""
elif [[ "$char" == '"' ]]; then
if [[ "$inside_quotes" == false ]]; then
inside_quotes=true
else
inside_quotes=false
fi
current_string+="$char"
if [[ "$inside_quotes" == false && -n "$current_string" ]]; then
parsed_line+=("$current_string")
current_string=""
(( i++ ))
fi
else
current_string+="$char"
fi
(( i++ ))
done
if [[ -n "$current_string" ]]; then
parsed_line+=("$current_string")
fi
}
parse_and_write() {
echo "id,class_id,name,degree,email,department" > "$output_file"
while IFS= read -r line; do
parse_line "$line"
id="${parsed_line[0]}"
# skip header
if [[ "$id" = "id" ]]; then
continue
fi
class_id="${parsed_line[1]}"
full_name="${parsed_line[2]}"
degree="${parsed_line[3]}"
email="${parsed_line[4]}"
department="${parsed_line[5]}"
corrected_full_name=$(get_corrected_surname "$full_name")
email=$(get_email "$full_name" "$class_id")
echo "$id,$class_id,$corrected_full_name,$degree,$email,$department" >> "$output_file"
done < "$input_file"
}
prep_email_count_array
parse_and_write
3 Answers 3
Agreed -- truly horrible code.
design of Public API
This is great:
corrected_full_name=$(get_corrected_surname "$full_name")
Very clear.
We start with a nice explicit name="1γγ«"
,
and return a result with echo "$corrected_name"
.
Thank you for writing in a clear style.
In contrast, prep_email_count_array
and especially parse_line
merit some # comments
explaining which of the several variables we set
are intended to be fortran COMMON globals being passed out.
CSV library
Dealing with quoted CSV fields is a solved problem,
but a fiddly one featuring multiple edge cases.
I feel the appropriate tradeoff here would have been to use
a mature library
to extract column values, mapping CSV --> TSV.
If ASCII 9 TAB is not convenient, then a common scenario
is to decide that e.g. |
pipe does not appear in the data,
and use that as a field separator, obviating any need for quoting.
design rationale
If a maintenance engineer interacts with this code in a few months, I
predict one of their initial steps will be to rip out the home grown
CSV parser, replacing it with another which is more battle tested.
If you wish to avoid such rework,
it's important to write down comments describing
the constraints this code base works within,
and the motivations for adopting both bash
and this CSV parsing approach.
automated tests
You don't have to diff
generated against expected output.
But you may as well, since that's one of the first actions a
maintenance engineer will take when they attempt any refactors.
This code base appears to achieve its design goals.
I would not be willing to delegate or accept maintenance tasks on it, in its current form.
Overall, much of this should arguably be rewritten in Awk or perhaps Python.
As a method for assigning unique email addresses, using an internal code number seems like a liability for the longer term. How about [email protected]
vs [email protected]
? Or let people choose their own email address if the one assigned by the system isnβt available.
(Any automatic assignment will fail sooner or later. At Bell Labs, they allegedly ended up with two Stephen R. Bournes in the same department.)
get_corrected_surname() {
name="1γγ«"
corrected_name=$(echo "$name" | awk '{
for(i=1; i<=NF; i++) {
$i = toupper(substr($i,1,1)) substr($i,2)
}
print
}')
echo "$corrected_name"
}
This doesn't "correct" anything. A better function name would be capitalize
. It will still do the wrong thing for names with idiosyncratic capitalization, like d'Souza or van Noord. (For better or for worse, it doesn't force the rest of the name to lower case, so it doesn't actually convert to proper case.)
The function pollutes the global namespace because you forgot to use local
. Adding insult to injury, there is no need for local variables here at all. Drop the useless echo
and reduce it to just
capitalize () {
echo "1γγ«" | awk '{
for(i=1; i<=NF; i++) {
$i = toupper(substr($i,1,1)) substr($i,2)
}
print }'
}
You have other functions where you forgot local
, though others where you didnβt.
get_email_prefix() {
local name=$(echo "1γγ«" | tr '[:upper:]' '[:lower:]')
local first_name=$(echo "$name" | awk '{print 1γγ«}')
local surname=$(echo "$name" | awk '{print 2γγ«}')
local initial=$(echo "$first_name" | cut -c 1)
local email_prefix="${initial}${surname}"
echo "$email_prefix"
}
This provides a number of unnecessary variables which are then never used. It can be rewritten in Awk entirely.
The term "prefix" is odd here; what is being produced is called the "local part" of an email address. RFC5321 and friends even calls it "localpart" as a single word.
If the name has more than two tokens, 2γγ«
is wrong. The final token $NF
is somewhat more likely to be correct, although again, there are odd names where blind automation will never get everything exactly right. Apologies to Andrew Lloyd Webber and maybe even to George Bush Jr. in spite of everything.
get_email_localpart() {
echo "$name" |
awk '{print tolower(substr(1,1,1γγ«) $NF)}')
}
In parse_and_write
you are needlessly closing and reopening the redirection every time through the loop. You can redirect once, at the end, to have the shell open the file and seek to the end just once.
parse_and_write() {
echo "id,class_id,name,degree,email,department" > "$output_file"
while IFS= read -r line; do
...
echo "$id,$class_id,$corrected_full_name,$degree,$email,$department"
done < "$input_file" >> "$output_file"
}
I haven't attempted to understand the CSV parser. The simplest fix might be to use an existing library if you insist on keeping this in Bash.
Perhaps see also Falsehoods Programmers Believe About Names
Some isolated issues, while I haven't studied the code in depth, and taking into account, that it is a bash exercise, not a investigaion for the best tool (awk, sed?):
Your
get_corrected_surname() { name="1γγ«" corrected_name=$(echo "$name" | awk '{ for(i=1; i<=NF; i++) { $i = toupper(substr($i,1,1)) substr($i,2) } print }') echo "$corrected_name" }
is shorter in Bash:
f () { echo ${1^}; }
f foo
# gives:
Foo
and, as such a short implementation, would not justify a function, so just replace:
corrected_full_name=${full_name^}
and similarly
# local name=$(echo "1γγ«" | tr '[:upper:]' '[:lower:]')
local name=${1,,}
The ${...} - syntax is described in man bash
in the chapter Parameter Expansion.
In scripts, you don't use semicolons but line breaks for nice formatting. From a done
, you search upwards to the do, so write
for ...
do
# ...
done
while # ...
do
# ...
done
if ...
then
# ...
fi
with do and then in a new line. This matters more in longer, nested code.
Then I would suggest for rounded parens around arithmetic expressions:
# if [[ $# -ne 1 ]]; then
if (($# != 1))
then