Objective:
The script should pass each file present in the data
directory to the read_file
executable, and print its result and some statistics about the file in a nicely formatted manner. It should flush the caches before each run of the program.
Code:
#!/usr/bin/bash
for file in data/*; do
a=($(wc "$file"))
lines="${a[0]}"
words="${a[1]}"
size=$(du -h "$file" | cut -f1)
printf "%s: L:%s W:%s %s\n" "$(basename "$file")" "$lines" "$words" "$size"
free 1>/dev/null && sync && echo 3 >| /proc/sys/vm/drop_caches && free 1>/dev/null
printf " %s getline\n" "$(./read_file --getline "$file")"
free 1>/dev/null && sync && echo 3 >| /proc/sys/vm/drop_caches && free 1>/dev/null
printf " %s mmap\n\n" "$(./read_file --mmap "$file")"
done
which prints the desired output:
F_01: L:65748 W:65000 112M
1.678400s getline
0.920728s mmap
F_02: L:927277 W:917225 1.1G
11.716405s getline
10.950498s mmap
F_03: L:994391 W:983497 1.7G
24.733193s getline
21.603982s mmap
...
Review Request:
Anything, everything.
I ran shellcheck on the script, it suggested using mapfile or read -a to split command output for a=($(wc "$file"))
.
1 Answer 1
idiom
free 1>/dev/null
We write source code to convey technical ideas to collaborators.
I confess that I initially failed this riddle.
That is, I attempted $ free 1
to see what
it would say, and then consulted the
man page
which confirmed the utility accepts no such argument.
OTOH, I would have immediately parsed 2>/dev/null
as discarding stderr.
Humans typically don't write about stdout in that way. Prefer to phrase it
free > /dev/null
platform
The "/proc" FS soon makes it apparent that we're targetting just
unix-like hosts, and by the time we see "drop_caches" we have
ruled out the BSD variants such as Solaris.
Nonetheless, it would be polite for the ReadMe or for the Review Context
to mention that we're targeting IDK CentOS, or perhaps Debian.
Or have the code verify that some aspect of uname -a
comes back as expected.
Similarly, you might introduce what the ./read_file
benchmark tool
is all about.
OIC, that topic is separately addressed in
this
question.
friendly size
I can't imagine there's any real saving,
but consider offering explicit wc -lw
flags,
since we're ignoring the size in -c
characters.
Parsing unicode codepoints from utf8 has real overhead. In the files of interest for your use case, possibly it won't make a difference to the word count, or won't make a difference that you care about. I haven't benched it, but consider explicitly setting C locale. I know for certain that setting C locale can produce impressive speedups in time spent by /usr/bin/sort on comparisons, more than 2x.
The identifier file
is suggestive, but of course
one might find other things in a filesystem.
Consider testing whether we have an ordinary -f
file.
If we asked for du -s
summary, then at least
we'd be guaranteed to always produce exactly one line.
bail on error
We anticipate zero errors.
Consider using set -e
to make errors fatal, Just In Case.
Or even set -e -o pipeline
.
And on this topic: the Review Context didn't mention anything
about needing UID 0 root access.
Code which does not set -e
should probably
be testing the value of id -u
,
so it can offer some helpful advice about sudo
if need be.
This script does more than one thing,
and doesn't offer a convenient way for unit tests
to exercise an unprivileged subset of those things
while doing a test run as an ordinary user.
Consider making "drop buffers" optional,
so it only happens during production runs.
extract helper
I get why we sync
out dirty buffers,
so there's more cached buffers we can drop.
But I found the free
call puzzling.
Apparently we're evaluating for side effects?!?
I thought there are none.
Or maybe it occasionally displays some diagnostic
of interest to stderr?
Please add a #
comment.
I don't understand why we drop slabs + pagecache
with >|
instead of >
.
I believe "drop_caches" is a "regular file".
I'm reading this in the
bash man page:
If the redirection operator is >|, or the redirection operator is > and the noclobber option to the set builtin command is not enabled, the redirection is attempted even if the file named by word exists.
Am I to infer there is some non-standard aspect of your environment
which this works around?
Perhaps you set "noclobber" elsewhere?
Consider explicitly setting it or unsetting in this script.
Please do add a #
comment so future maintenance engineers
will understand this detail.
We drop caches more than once. Please define a function which we call more than once.
glob expansion
If data/*
corresponds to a great many files,
you may find it convenient to push most of
this script into a helper script or function,
driven by find data/ -type f | xargs
.
output format
The OP code produces your desired output, good.
Consider combining those three output lines into a single line,
for the convenience of awk
or CSV spreadsheet programs.
Consider running such tabular output through column -t
to align
columns, for the convenience of folks eyeballing the output.
-
1\$\begingroup\$ "Perhaps you set "noclobber" elsewhere?" ==> Yes, noclobber is set elsewhere. I'd comment why it is needed. \$\endgroup\$Madagascar– Madagascar2024年05月19日 17:22:25 +00:00Commented May 19, 2024 at 17:22
-
\$\begingroup\$ "Consider running such tabular output through column -t to align columns, for the convenience of folks eyeballing the output." ==> Could you provide an example how? \$\endgroup\$Madagascar– Madagascar2024年05月19日 18:02:49 +00:00Commented May 19, 2024 at 18:02
-
1\$\begingroup\$ Here is some reformatting of simple tabular data:
(echo day_of_week month day hms; date; date) | column -t
. I only mentioned it because I noticed your example output contained files which variously took 2 or 3 digits to express their length, something that becomes slightly troublesome when eyeballing lots of output and the columns keep wandering back and forth. Another approach some folks use is to insert TABs into each line, though that's less robust if a column wanders above and below a width of eight characters. It's also possible for generating program to insert right number of SPACEs. \$\endgroup\$J_H– J_H2024年05月19日 18:27:17 +00:00Commented May 19, 2024 at 18:27