Bash script to count word occurrences in file.txt
How can this script be made more proper or more elegant?
Example output:
Ok source folder & file /home/x/Music/file.txt = source file
Ok /tmp/ folder & file /tmp/file1.txt = using RAM scratchpad, because speed & less drive wear
Example output:
15374 147603 849668 /tmp/file1.txt
Lines: 15374 Words: 147603 Characters: 849668 /tmp/file1.txt
Lines: 15374 Words: 147546 Characters: 846336 /tmp/file2.txt
Lines: 7116 Words: 14231 Characters: 114157 /tmp/file3.txt
Example output:
$'()+,-./0122002112232242252312342352402422472502512522753302303342456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ\n\tabcdefghijklmnopqrstuvwxyz 133 /tmp/file1.txt
$'()+,-./0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ\n\tabcdefghijklmnopqrstuvwxyz 79 /tmp/file2.txt
$'()+,-./0123456789:;=\nabcdefghijklmnopqrstuvwxyz 51 /tmp/file3.txt
Example output:
/tmp/file3.txt
11105 the
6740
6209 of
4501 a
4024 to
3395 or
3130 reg.
...
1 yields;
1 zero,
1 zero;
7116 Lines for count word occurrences in file.txt
Example output:
details about person
668 person
122 persons
108 personal
40 person,
33 personally
13 person.
11 person;
4 (person
3 in-person
3 persons,
3 persons.
2 personally,
2 persons:
1 (personne)
1 persons;
15 *person*
#! /bin/bash
# Bash script to count word occurrences in file.txt
# script tested on Kubuntu 22.04.1
# One example, How to run script?
# 1. set Terminal scrollback to greater than 1000 Lines, say 99000 Lines.
# 2. Put your text into /home/x/Music/file.txt
# 3. Run script /home/x/Music/count_words.sh
# x = whoami ~ Bob ~ user etc...
# time bash /home/x/Music/count_words.sh # copy and paste into Terminal
# Put text into a file called file.txt example, Go to web page:
# https://www.ontario.ca/laws/regulation/900194/v87
# RULES OF CIVIL PROCEDURE
# Ctrl-a = highlight all of web page
# Ctrl-v = paste all of web page into a file called file.txt
# file Location = /home/x/Music/file.txt
# source file = src1 = /home/x/Music/file.txt
# Question 1
# How can this script be made:
# - more proper or
# - more elegant?
# Question 2
# For Kubuntu 22.04.1, How to permanently set Terminal scrollback to unlimited?
# source file, change path and filename to fit your needs.
src1='/home/u3/Music/file.txt' ;
# variable #1 = person, after showing List, then more details about 1 word.
# count word occurrences of person persons personal person, person; ...
var1='person' ;
clear
echo "Bash script to count word occurrences in file.txt"
echo
# test for path, file.txt
test ! -f "$src1" && echo "Error source folder path or file" || echo "Ok source folder & file $src1 = source file" ;
# copy file.txt to /tmp/file1.txt
cp "$src1" /tmp/file1.txt || exit
# test for path, file1.txt, using RAM scratchpad, because less drive wear.
test ! -f "/tmp/file1.txt" && echo "Error source folder path or file" || echo "Ok /tmp/ folder & file /tmp/file1.txt = using RAM scratchpad, because speed & less drive wear" ;
echo
# align output of a basic count
echo " " |tr '\n' ' ' ; wc /tmp/file1.txt
# basic count plus verbage for /tmp/file1.txt
cat < /tmp/file1.txt |wc |awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacters: " $3 }' |tr -s '\n' ' ' ; echo "/tmp/file1.txt"
# clean up #1, remove unseen characters
sed "s/\r.*\r/ /g" /tmp/file1.txt |tr -cd '\11\12\40-\176' > /tmp/file2.txt
# basic count plus verbage for /tmp/file2.txt
cat < /tmp/file2.txt |wc |awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacters: " $3 }'|tr -s '\n' ' ' ; echo "/tmp/file2.txt"
# clean up #2, squeeze space, convert space to new line, convert to all Lower case.
cat < /tmp/file2.txt |tr -s " " |tr '[:space:]' '\n' |tr '[:upper:]' '[:lower:]' |sort |uniq -c |sort -k1,1nr > /tmp/file3.txt ;
# basic count plus verbage for /tmp/file3.txt, show progress made
cat < /tmp/file3.txt |wc |awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacters: " $3 }'|tr -s '\n' ' ' ; echo "/tmp/file3.txt"
echo ;
# 133 characters used in /tmp/file1.txt
echo "$(od -c /tmp/file1.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n')" |tr '\n' ' ' ; echo $(od -c /tmp/file1.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n') |wc -c |tr -s '\n' ' ' ; echo "/tmp/file1.txt" ;
# 79 characters used in /tmp/file2.txt
echo "$(od -c /tmp/file2.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n')" |tr '\n' ' ' ; echo $(od -c /tmp/file2.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n') |wc -c |tr -s '\n' ' ' ; echo "/tmp/file2.txt" ;
# 51 characters used in /tmp/file3.txt, shows progress of filters
echo "$(od -c /tmp/file3.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n')" |tr '\n' ' ' ; echo $(od -c /tmp/file3.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n') |wc -c |tr -s '\n' ' ' ; echo "/tmp/file3.txt" ;
echo ;
# title
echo "/tmp/file3.txt"
# result #1, List, count word occurrences
# cat -A /tmp/file3.txt # show all during testing
cat /tmp/file3.txt
cat < /tmp/file3.txt |wc -l |tr -s '\n' ' ' ; echo "Lines for count word occurrences in file.txt "
echo ;
# result #2, details about one word person count word occurrences
echo "details about $var1 " ;
grep -i $var1 /tmp/file3.txt ;
grep -c $var1 /tmp/file3.txt |tr -s '\n' ' ' ; echo "*$var1*"
echo
echo "short Navigation Legend Ctrl-Shift-Home Ctrl-Shift-UParrow Ctrl-Shift-F = Find" ;
exit
#
#
#
#
#
#
#
# Copyright September 2022
# count_words.sh version 1a
#
# this script was tested on
# 1. Kubuntu 22.04.1
#
# 2. https://www.shellcheck.net/
# 3 prompts of SC2005 (style): Useless echo?
# unsure how to fix
#
# 3. various Bibles in text format
#
# script posted to code review
# https://codereview.stackexchange.com/
#
# This program is free software:
# you can redistribute it and/or modify
# it under the terms of the GNU General Public License as
# published by the Free Software Foundation,
# either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY;
# without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the GNU General Public License for more details.
#
# You should have received a copy of the
# GNU General Public License along with this program.
# If not, see
# http://www.gnu.org/licenses/
# Software Disclaimer
# There are inherent dangers in the use of any software available for download on the Internet, and we caution you to make sure that you completely understand the potential risks before downloading any of the software.
#
# The Software and code samples available on this website are provided "as is" without warranty of any kind, either express or implied. Use at your own risk.
#
# The use of the software and scripts downloaded on this site is done at your own discretion and risk and with agreement that you will be solely responsible for any damage to your computer system or loss of data that results from such activities. You are solely responsible for adequate protection and backup of the data and equipment used in connection with any of the software, and we will not be liable for any damages that you may suffer in connection with using, modifying or distributing any of this software. No advice or information, whether oral or written, obtained by you from us or from this website shall create any warranty for the software.
#
# We make makes no warranty that
#
# the software will meet your requirements
# the software will be uninterrupted, timely, secure or error-free
# the results that may be obtained from the use of the software will be effective, accurate or reliable
# the quality of the software will meet your expectations
# any errors in the software obtained from us will be corrected.
# The software, code sample and their documentation made available on this website:
#
# could include technical or other mistakes, inaccuracies or typographical errors. We may make changes to the software or documentation made available on its web site at any time without prior-notice.
# may be out of date, and we make no commitment to update such materials.
# We assume no responsibility for errors or omissions in the software or documentation available from its web site.
#
# In no event shall we be liable to you or any third parties for any special, punitive, incidental, indirect or consequential damages of any kind, or any damages whatsoever, including, without limitation, those resulting from loss of use, data or profits, and on any theory of liability, arising out of or in connection with the use of this software.
#
#
#
2 Answers
Make the purpose clear
Comments and output in the script mention that it counts word occurrences.
There's already a standard program for that, wc.
The script does much more than wc.
It would be useful to explain what it is and why.
The script prints various statistics about file.txt,
about the occurrences of words that contain "person",
and other things which are not very clear.
It would be good to make it clear why it's doing what it does.
Make the script more generally usable
The script expects a text file at a specific path, and prints statistics about occurrences of "person" (among other things). It would be good to make these parameters of the script, for example:
if [ $# -ne 2 ]; then
    echo "usage: $0 path/to/textfile keyword" >&2
    exit 1
fi
path=$1
keyword=$2
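Once the file and keyword arrive as parameters, they need quoting wherever they are used, so that a path or keyword containing spaces or glob characters still works. A minimal sketch, assuming a hypothetical helper named keyword_counts that is not part of the reviewed script:

```shell
# keyword_counts FILE KEYWORD
# Hypothetical helper: prints the lines of FILE (a "count word" list such as
# the script's /tmp/file3.txt) that contain KEYWORD, ignoring case.
keyword_counts() {
    # -F treats the keyword as a fixed string rather than a regular expression;
    # -- ends option parsing, so a keyword starting with "-" is not read as a flag.
    grep -i -F -- "$2" "$1"
}
```

It would be called as keyword_counts /tmp/file3.txt "$keyword", in place of the unquoted grep -i $var1 in the original.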
Try to write one statement per line
It's easiest to read code from top to bottom. When there are multiple statements on a line, the reader is forced to scan to the right, which makes it harder. When the reader is forced to scroll to the right, that's extremely annoying, for example here:
test ! -f "$src1" && echo "Error source folder path or file" || echo "Ok source folder & file $src1 = source file" ;
This would be better as:
test ! -f "$src1" \
    && echo "Error source folder path or file" \
    || echo "Ok source folder & file $src1 = source file"
Another example:
cat < /tmp/file1.txt |wc |awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacters: " $3 }' |tr -s '\n' ' ' ; echo "/tmp/file1.txt"
Better as:
cat /tmp/file1.txt \
    | wc \
    | awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacters: " $3 }' \
    | tr -s '\n' ' '
echo "/tmp/file1.txt"
Especially that last echo was bad there, with no reason to be at the far right end of the line.
Finally, the worst offender:
echo "$(od -c /tmp/file1.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n')" |tr '\n' ' ' ; echo $(od -c /tmp/file1.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n') |wc -c |tr -s '\n' ' ' ; echo "/tmp/file1.txt" ;
It's trivial, and much easier to read, to move each statement onto its own line:
echo "$(od -c /tmp/file1.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n')" |tr '\n' ' '
echo $(od -c /tmp/file1.txt |grep -oP "^\d+ +\K.*" |tr -s ' ' '\n' |LC_ALL=C sort -u |tr -d '\n') |wc -c |tr -s '\n' ' '
echo "/tmp/file1.txt"
Don't repeat yourself
This kind of code appears several times in the script:
od -c /path/to/file | grep -oP "^\d+ +\K.*" | tr -s ' ' '\n' | LC_ALL=C sort -u | tr -d '\n'
It would be good to create a function for it with a descriptive name, perhaps:
sorted_unique_chars() {
    od -c "$1" | grep -oP "^\d+ +\K.*" | tr -s ' ' '\n' | LC_ALL=C sort -u | tr -d '\n'
}
printf "%s " "$(sorted_unique_chars /tmp/file1.txt)"
printf "%s " "$(sorted_unique_chars /tmp/file1.txt | wc -c)"
echo "/tmp/file1.txt"
It would be good to save the result of the function in a variable, rather than processing the file twice:
chars=$(sorted_unique_chars /tmp/file1.txt)
printf "%s " "$chars"
printf "%s " "$(printf "%s" "$chars" | wc -c)"
echo "/tmp/file1.txt"
The script does this repeatedly for 3 files. It would be useful to use a loop here:
for path in /tmp/file[123].txt; do
    chars=$(sorted_unique_chars "$path")
    printf "%s " "$chars"
    printf "%s " "$(printf "%s" "$chars" | wc -c)"
    echo "$path"
done
Make error messages more precise
Instead of:
test ! -f "$src1" && echo "Error source folder path or file" || ...
This is much more helpful:
test ! -f "$src1" && echo "File does not exist: $src1" || ...
Use descriptive names
var1 doesn't describe its purpose, which doesn't help readers understand the code.
keyword would be a better name.
Use mktemp for temporary files, and clean up on exit
The script creates files under /tmp.
Instead of assuming it's ok to use /tmp for such purpose,
it's better to use the mktemp tool to create a temporary directory at the recommended location.
This also makes it possible to run multiple instances of a script in parallel safely,
without overwriting the same files.
It's also good to delete the temporary files when the script is done working with them,
using trap:
workdir=$(mktemp -d)
cleanup() {
    rm -fr "$workdir"
}
trap cleanup EXIT
And then instead of the hardcoded /tmp/file1.txt and others,
use "$workdir/file1.txt" (with the double-quotes).
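To show the pieces working together, here is a minimal sketch that does a small job inside a private scratch directory and removes it afterwards. The function name count_words_in_scratch and the file name words.txt are made up for illustration; the function body is a subshell so that the EXIT trap fires when the function finishes instead of when the whole script ends.

```shell
# Hypothetical helper: counts the whitespace-separated words of the file given
# as $1, using a scratch directory from mktemp that is removed afterwards.
count_words_in_scratch() (
    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT

    # One word per line; grep -c . then counts the non-empty lines.
    tr -s '[:space:]' '\n' < "$1" > "$workdir/words.txt"
    count=$(grep -c . "$workdir/words.txt")
    echo "$count"
)
```

Because every call gets its own directory, two runs started at the same time do not touch each other's files.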
In addition to janos's fine answer:
Choose your shell carefully
Here, we make no use of Bash facilities that plain POSIX shell doesn't have. Using /bin/sh will make your script slightly faster and significantly more portable.
Use Shellcheck
The Shellcheck tool provides valuable insights:
279347.sh:114:6: note: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'. [SC2005]
279347.sh:114:121: warning: Quote this to prevent word splitting. [SC2046]
279347.sh:114:121: note: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'. [SC2005]
279347.sh:118:6: note: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'. [SC2005]
279347.sh:118:121: warning: Quote this to prevent word splitting. [SC2046]
279347.sh:118:121: note: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'. [SC2005]
279347.sh:122:6: note: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'. [SC2005]
279347.sh:122:121: warning: Quote this to prevent word splitting. [SC2046]
279347.sh:122:121: note: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'. [SC2005]
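The question mentions being unsure how to address SC2005. The note is about capturing a command's output only to echo it again: running the command directly produces the same text, and a separate echo supplies the final newline if one is wanted. A small sketch, using a stand-in pipeline (the function names here are hypothetical) in place of the script's od pipeline:

```shell
# Stand-in for the script's longer pipeline.
letters() {
    printf 'b\na\nb\n' | sort -u | tr -d '\n'
}

# Form flagged by SC2005: the output is captured only to be echoed again.
flagged() {
    echo "$(letters)"
}

# Same output without the extra echo: run the pipeline, then add the newline.
direct() {
    letters
    echo
}
```

The SC2046 warnings in the same report point at the second, unquoted echo $(...) on each of those lines; removing the echo wrapper resolves both messages at once.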
Don't use cat as a simple pass-through
There's no point starting a cat process just to pipe a single file into another process. Just have the receiving process read directly from the file. For example, instead of
cat < /tmp/file2.txt |wc
we can simply
wc /tmp/file2.txt
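One detail to keep in mind: when wc is given a file name it appends that name to its output, whereas the original pipeline fed wc through standard input and so printed only the numbers. Redirecting the file keeps the cat-free form and the original output. A minimal sketch with a hypothetical helper name:

```shell
# Hypothetical helper: prints only the word count of the file given as $1.
# With redirection, wc never sees a file name, so it prints just the number.
word_count() {
    wc -w < "$1" | tr -d ' '   # tr strips the padding some wc versions add
}
```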
Improve the user interface
The first thing the program does is:
clear
echo "Bash script to count word occurrences in file.txt"
echo
There are two annoyances here. Clearing the screen is something the user could choose to do, but usually doesn't want (perhaps when comparing results of different runs). And any intelligent user should already know the purpose of the script before running it, and doesn't want that information cluttering the output stream, where it just complicates further processing.
Error messages should go to the error stream, rather than mixed in with output:
echo "Error source folder path or file" >&2
##                                      🔺🔺🔺
At the end, we have:
echo "short Navigation Legend Ctrl-Shift-Home Ctrl-Shift-UParrow Ctrl-Shift-F = Find" ;
I'm not sure what that even means.
Simplify
Instead of converting echo's newline to space, prefer printf. So
echo " " |tr '\n' ' '
can be simply
printf '%6s'
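The same idea applies to the lines that pipe a count through tr to drop the newline before a following echo: printf can put the command's output and the label on one line directly. A sketch, with line_summary as a hypothetical helper name:

```shell
# Hypothetical helper: prints "<count> Lines for <file>" on a single line.
line_summary() {
    # Command substitution already drops wc's trailing newline;
    # tr only strips the padding some wc implementations add.
    printf '%s Lines for %s\n' "$(wc -l < "$1" | tr -d ' ')" "$1"
}
```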
There's a lot of stray ; at ends of lines that can be removed.
Improve error handling
It makes no sense to continue if $src1 isn't a plain file:
test ! -f "$src1" && echo "Error source folder path or file" || echo "Ok source folder & file $src1 = source file" ;
That would be better as
if test ! -f "$src1"
then
    echo "Error: $src1 not found" >&2
    exit 1
fi
Consider adding set -e (and probably also -u) to the beginning of your script, so that it doesn't continue after a failed command.
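A quick way to see what the two options change is to run a few one-line scripts with and without them. This is only an illustration; demo_var is a made-up variable name.

```shell
# Without set -e, the shell carries on after a failed command.
sh -c 'false; echo "continued after failure"'

# With set -e, it stops at the first command that fails.
sh -c 'set -e; false; echo "continued after failure"' || echo "stopped early"

# With set -u, expanding an unset variable is an error, not an empty string.
sh -c 'set -u; unset demo_var; echo "value: $demo_var"; echo "continued"' 2>/dev/null \
    || echo "stopped on unset variable"
```

Note that set -e does not apply to commands tested by if, while, && or ||, so explicit checks with a clear error message remain worthwhile.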
Use functions to reduce repetition
There's a lot of repeated lines that differ just in the filename processed. Use functions to reduce that repetition and ensure consistency. As it stands, it's very easy for the repeated lines to drift out of sync and accumulate unintentional differences.
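For instance, the three near-identical "basic count plus verbiage" lines could share one function. A minimal sketch, assuming a hypothetical name report_counts and file paths without backslashes (awk -v would interpret those as escapes):

```shell
# Hypothetical helper: prints labelled line, word and character counts for $1.
report_counts() {
    # wc reads standard input, so its output is just the three numbers.
    wc < "$1" | awk -v name="$1" \
        '{ print "Lines: " $1 "\tWords: " $2 "\tCharacters: " $3 " " name }'
}

# Intended use, once the three files exist:
#   for f in /tmp/file1.txt /tmp/file2.txt /tmp/file3.txt; do
#       report_counts "$f"
#   done
```

With a single definition, a change to the output format is made once and applies to every file.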
False optimisation
This is unnecessary:
cp "$src1" /tmp/file1.txt || exit # ... using RAM scratchpad, because less drive wear.
Any reasonable operating system will use buffer cache to avoid re-reading directly from disk. Copying (potentially to another filesystem) actively harms that. And there's no guarantee that /tmp is a RAM filesystem.
Take care in licencing
This comment looks wrong:
# We make makes no warranty that
It looks like you haven't particularly taken care here, which is worrying to any user.