Generating a Benchmarking Report for a Program

Question 1

Objective:

The script should pass each file present in the data directory to the read_file executable, and print its result and some statistics about the file in a nicely formatted manner. It should flush the caches before each run of the program.

Code:

#!/usr/bin/bash
for file in data/*; do
 a=($(wc "$file"))
 lines="${a[0]}"
 words="${a[1]}"
 size=$(du -h "$file" | cut -f1)
 printf "%s: L:%s W:%s %s\n" "$(basename "$file")" "$lines" "$words" "$size"
 free 1>/dev/null && sync && echo 3 >| /proc/sys/vm/drop_caches && free 1>/dev/null
 printf " %s getline\n" "$(./read_file --getline "$file")"
 free 1>/dev/null && sync && echo 3 >| /proc/sys/vm/drop_caches && free 1>/dev/null
 printf " %s mmap\n\n" "$(./read_file --mmap "$file")"
done

which prints the desired output:

F_01: L:65748 W:65000 112M
 1.678400s getline
 0.920728s mmap
F_02: L:927277 W:917225 1.1G
 11.716405s getline
 10.950498s mmap
F_03: L:994391 W:983497 1.7G
 24.733193s getline
 21.603982s mmap
...

Review Request:

Anything, everything.

I ran shellcheck on the script, it suggested using mapfile or read -a to split command output for a=($(wc "$file")).

Question 2

idiom

free 1>/dev/null

We write source code to convey technical ideas to collaborators. I confess that I initially failed this riddle. That is, I attempted $ free 1 to see what it would say, and then consulted the man page which confirmed the utility accepts no such argument. OTOH, I would have immediately parsed 2>/dev/null as discarding stderr.

Humans typically don't write about stdout in that way. Prefer to phrase it

free > /dev/null

platform

The "/proc" FS soon makes it apparent that we're targetting just unix-like hosts, and by the time we see "drop_caches" we have ruled out the BSD variants such as Solaris. Nonetheless, it would be polite for the ReadMe or for the Review Context to mention that we're targeting IDK CentOS, or perhaps Debian. Or have the code verify that some aspect of uname -a comes back as expected.

Similarly, you might introduce what the ./read_file benchmark tool is all about. OIC, that topic is separately addressed in this question.

friendly size

I can't imagine there's any real saving, but consider offering explicit wc -lw flags, since we're ignoring the size in -c characters.

Parsing unicode codepoints from utf8 has real overhead. In the files of interest for your use case, possibly it won't make a difference to the word count, or won't make a difference that you care about. I haven't benched it, but consider explicitly setting C locale. I know for certain that setting C locale can produce impressive speedups in time spent by /usr/bin/sort on comparisons, more than 2x.

The identifier file is suggestive, but of course one might find other things in a filesystem. Consider testing whether we have an ordinary -f file. If we asked for du -s summary, then at least we'd be guaranteed to always produce exactly one line.

bail on error

We anticipate zero errors. Consider using set -e to make errors fatal, Just In Case. Or even set -e -o pipeline.

And on this topic: the Review Context didn't mention anything about needing UID 0 root access. Code which does not set -e should probably be testing the value of id -u, so it can offer some helpful advice about sudo if need be. This script does more than one thing, and doesn't offer a convenient way for unit tests to exercise an unprivileged subset of those things while doing a test run as an ordinary user. Consider making "drop buffers" optional, so it only happens during production runs.

extract helper

I get why we sync out dirty buffers, so there's more cached buffers we can drop. But I found the free call puzzling. Apparently we're evaluating for side effects?!? I thought there are none. Or maybe it occasionally displays some diagnostic of interest to stderr? Please add a # comment.

I don't understand why we drop slabs + pagecache with >| instead of >. I believe "drop_caches" is a "regular file". I'm reading this in the bash man page:

If the redirection operator is >|, or the redirection operator is > and the noclobber option to the set builtin command is not enabled, the redirection is attempted even if the file named by word exists.

Am I to infer there is some non-standard aspect of your environment which this works around? Perhaps you set "noclobber" elsewhere? Consider explicitly setting it or unsetting in this script. Please do add a # comment so future maintenance engineers will understand this detail.

We drop caches more than once. Please define a function which we call more than once.

glob expansion

If data/* corresponds to a great many files, you may find it convenient to push most of this script into a helper script or function, driven by find data/ -type f | xargs.

output format

The OP code produces your desired output, good. Consider combining those three output lines into a single line, for the convenience of awk or CSV spreadsheet programs. Consider running such tabular output through column -t to align columns, for the convenience of folks eyeballing the output.

Question 3

"Perhaps you set "noclobber" elsewhere?" ==> Yes, noclobber is set elsewhere. I'd comment why it is needed.

Question 4

"Consider running such tabular output through column -t to align columns, for the convenience of folks eyeballing the output." ==> Could you provide an example how?

Question 5

Here is some reformatting of simple tabular data: (echo day_of_week month day hms; date; date) | column -t. I only mentioned it because I noticed your example output contained files which variously took 2 or 3 digits to express their length, something that becomes slightly troublesome when eyeballing lots of output and the columns keep wandering back and forth. Another approach some folks use is to insert TABs into each line, though that's less robust if a column wanders above and below a width of eight characters. It's also possible for generating program to insert right number of SPACEs.

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Accepted Answer · 2024-05-19 16:41:41Z

idiom

free 1>/dev/null

We write source code to convey technical ideas to collaborators. I confess that I initially failed this riddle. That is, I attempted $ free 1 to see what it would say, and then consulted the man page which confirmed the utility accepts no such argument. OTOH, I would have immediately parsed 2>/dev/null as discarding stderr.

Humans typically don't write about stdout in that way. Prefer to phrase it

free > /dev/null

platform

The "/proc" FS soon makes it apparent that we're targetting just unix-like hosts, and by the time we see "drop_caches" we have ruled out the BSD variants such as Solaris. Nonetheless, it would be polite for the ReadMe or for the Review Context to mention that we're targeting IDK CentOS, or perhaps Debian. Or have the code verify that some aspect of uname -a comes back as expected.

Similarly, you might introduce what the ./read_file benchmark tool is all about. OIC, that topic is separately addressed in this question.

friendly size

I can't imagine there's any real saving, but consider offering explicit wc -lw flags, since we're ignoring the size in -c characters.

Parsing unicode codepoints from utf8 has real overhead. In the files of interest for your use case, possibly it won't make a difference to the word count, or won't make a difference that you care about. I haven't benched it, but consider explicitly setting C locale. I know for certain that setting C locale can produce impressive speedups in time spent by /usr/bin/sort on comparisons, more than 2x.

The identifier file is suggestive, but of course one might find other things in a filesystem. Consider testing whether we have an ordinary -f file. If we asked for du -s summary, then at least we'd be guaranteed to always produce exactly one line.

bail on error

We anticipate zero errors. Consider using set -e to make errors fatal, Just In Case. Or even set -e -o pipeline.

And on this topic: the Review Context didn't mention anything about needing UID 0 root access. Code which does not set -e should probably be testing the value of id -u, so it can offer some helpful advice about sudo if need be. This script does more than one thing, and doesn't offer a convenient way for unit tests to exercise an unprivileged subset of those things while doing a test run as an ordinary user. Consider making "drop buffers" optional, so it only happens during production runs.

extract helper

I get why we sync out dirty buffers, so there's more cached buffers we can drop. But I found the free call puzzling. Apparently we're evaluating for side effects?!? I thought there are none. Or maybe it occasionally displays some diagnostic of interest to stderr? Please add a # comment.

I don't understand why we drop slabs + pagecache with >| instead of >. I believe "drop_caches" is a "regular file". I'm reading this in the bash man page:

If the redirection operator is >|, or the redirection operator is > and the noclobber option to the set builtin command is not enabled, the redirection is attempted even if the file named by word exists.

Am I to infer there is some non-standard aspect of your environment which this works around? Perhaps you set "noclobber" elsewhere? Consider explicitly setting it or unsetting in this script. Please do add a # comment so future maintenance engineers will understand this detail.

We drop caches more than once. Please define a function which we call more than once.

glob expansion

If data/* corresponds to a great many files, you may find it convenient to push most of this script into a helper script or function, driven by find data/ -type f | xargs.

output format

The OP code produces your desired output, good. Consider combining those three output lines into a single line, for the convenience of awk or CSV spreadsheet programs. Consider running such tabular output through column -t to align columns, for the convenience of folks eyeballing the output.

"Perhaps you set "noclobber" elsewhere?" ==> Yes, noclobber is set elsewhere. I'd comment why it is needed.
"Consider running such tabular output through column -t to align columns, for the convenience of folks eyeballing the output." ==> Could you provide an example how?
Here is some reformatting of simple tabular data: (echo day_of_week month day hms; date; date) | column -t. I only mentioned it because I noticed your example output contained files which variously took 2 or 3 digits to express their length, something that becomes slightly troublesome when eyeballing lots of output and the columns keep wandering back and forth. Another approach some folks use is to insert TABs into each line, though that's less robust if a column wanders above and below a width of eight characters. It's also possible for generating program to insert right number of SPACEs.

Stack Exchange Network

Generating a Benchmarking Report for a Program

Objective:

Code:

Review Request:

1 Answer 1

idiom

platform

friendly size

bail on error

extract helper

glob expansion

output format

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Generating a Benchmarking Report for a Program

Objective:

Code:

Review Request:

1 Answer 1

idiom

platform

friendly size

bail on error

extract helper

glob expansion

output format

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions