The following Bash script is meant to check, every second, the number of NEW (relative to the previous second) network socket files. At the end of the run it summarizes every 60 entries (which should cover 60 seconds) and outputs a file called verdict.csv that tells me how many new network sockets were opened in each minute (I run under the assumption that those sockets live for more than 1 second, and hence I don't miss new ones).
The problem starts when I run it on a busy server where a lot of new network sockets are being opened: the lsof_func iterations take much more than 1 second (sometimes even more than a minute), and then I cannot trust the output of this script.
#!/bin/bash
TIMETORUN=84600 # Time for the script to run in seconds
NEWCONNECTIONSPERMINUTE=600
# collect number of new socket files in the last second
lsof_func () {
echo "" > /tmp/lsof_test
while [[ $TIME -lt $TIMETORUN ]]; do
lsof -i -t > /tmp/lsof_test2
echo "$(date +"%Y-%m-%d %H:%M:%S"),$(comm -23 <(cat /tmp/lsof_test2|sort) <(cat /tmp/lsof_test|sort) | wc -l)" >> /tmp/results.csv # comm command is used as a set subtractor operator (lsof_test minus lsof_test2)
mv /tmp/lsof_test2 /tmp/lsof_test
TIME=$((TIME+1))
sleep 0.9
done
}
# Calculate the number of new connections per minute
verdict () {
cat /tmp/results.csv | uniq > /tmp/results_for_verdict.csv
echo "Timestamp,New Procs" > /tmp/verdict.csv
while [[ $(cat /tmp/results_for_verdict.csv | wc -l) -gt 60 ]]; do
echo -n $(cat /tmp/results_for_verdict.csv | head -n 1 | awk -F, '{print 1ドル}'), >> /tmp/verdict.csv
cat /tmp/results_for_verdict.csv | head -n 60 | awk -F, '{s+=2ドル} END {print s}' >> /tmp/verdict.csv
sed -n '61,$p' < /tmp/results_for_verdict.csv > /tmp/tmp_results_for_verdict.csv
mv /tmp/tmp_results_for_verdict.csv /tmp/results_for_verdict.csv
done
echo -n $(cat /tmp/results_for_verdict.csv | head -n 1 | awk -F, '{print 1ドル}'), >> /tmp/verdict.csv
cat /tmp/results_for_verdict.csv | head -n 60 | awk -F, '{s+=2ドル} END {print s}' >> /tmp/verdict.csv
}
lsof_func
verdict
#cleanup
rm /tmp/lsof_test
#rm /tmp/lsof_test2
rm /tmp/results.csv
rm /tmp/results_for_verdict.csv
How can I make the iterations of the lsof_func function more consistent / faster, so that it collects this data every second?
2 Answers
We have a simple bug: using lsof -t causes it to print one line per process rather than one line per socket. If we want to observe changes to the open sockets, as claimed in the question, then we'll want something like lsof -i -b -n -F 'n' | grep '^n'.
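To see the difference, compare the two counts on a live system (exact numbers will vary with whatever is running):

lsof -i -t | wc -l                 # one line per process with an Internet socket
lsof -i -b -n -F n | grep -c '^n'  # one line per socket ('n' is the name field)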
Instead of using lsof, it may be more efficient to use netstat; on my lightly-loaded system it's about 10-20 times as fast, but you should benchmark the two on your target system.
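For example, a rough comparison (repeat each a few times and ignore the first, cold-cache run):

time lsof -i -b -n -F n >/dev/null
time netstat -tn >/dev/null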
So instead of comparing subsequent runs of lsof -i -t | sort, we could compare runs of:
netstat -tn | awk '{print 4,ドル5ドル}' | sort
Some things to note here:
- netstat -t examines TCP connections over IPv4 and IPv6. I believe that's what's wanted.
- netstat -n, like lsof -n, saves a vast amount of time by not doing DNS reverse lookups.
- awk is more suitable than cut for selecting columns, since netstat uses a variable number of spaces to separate fields.
- Netstat includes a couple of header lines, but because these are the same in every invocation, they will disappear in the comparison. We could remove them if we really want: awk 'FNR>2 {print 4,ドル5ドル}'.
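Putting these together, a minimal sketch of the per-second measurement using netstat (the snapshot file names here are illustrative, and the loop runs until interrupted):

touch netstat_old  # first pass diffs against an empty snapshot
while sleep 1
do
    # current TCP connections as "local remote" pairs, sorted for comm
    netstat -tn | awk 'FNR>2 {print 4,ドル5ドル}' | sort >netstat_new
    # count connections present now but absent one second ago
    echo "$(date +'%F %T'),$(comm -23 netstat_new netstat_old | wc -l)"
    mv netstat_new netstat_old
done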
Temporary files
Don't assume that /tmp/ is the right location for temporary files: if $TMPDIR is set, we should prefer that directory instead (e.g. where pam_tmpdir is used for per-user temp directories).
It's a good idea to create a single directory for all the script's temporary files; then we can arrange for it to be cleaned up however the script exits:
export TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
By using the well-known TMPDIR environment variable for this, we also clean up any of our subprocesses' temporary files if they get left lying around.
We can also simplify the code by changing into the temporary directory (but always fail if cd is unsuccessful, either explicitly or by using set -e).
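Either style works; for instance:

cd "$TMPDIR" || exit 1   # explicit failure
# or simply, once 'set -e' is in effect:
cd "$TMPDIR"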
Variables
Prefer lower-case for non-exported shell variables, as they share a namespace with environment variables, and upper-case is conventionally used for communicating between programs.
Extending the comment would expose a bug:
timetorun=84600 # Time for the script to run (1 day)
We could then see that 86,400 seconds (24 × 60 × 60) was actually intended; 84,600 seconds is only 23½ hours.
NEWCONNECTIONSPERMINUTE certainly needs a comment, as it appears to be completely unused.
Unnecessary cat
There's no need to concatenate a single file like this:
cat results.csv | uniq > results_for_verdict.csv
Simply redirect standard input using <:
<results.csv uniq >results_for_verdict.csv
or pass the filenames as arguments to uniq:
uniq results.csv results_for_verdict.csv
An even more egregious case is the use of cat in process substitution here:
comm -23 <(cat lsof_test2|sort) <(cat lsof_test|sort)
The obvious transformation is to:
comm -23 <(sort lsof_test2) <(sort lsof_test)
But if we arrange for the files to be sorted (lsof -i -t | sort >lsof_test2), then we can just pass the file names directly, and we only sort each file once rather than twice:
comm -23 lsof_test2 lsof_test
Timing
Counting loop iterations is a very approximate method of timing. A better way is to calculate the time at which we should finish, and loop until that time is reached. We can use the Bash magic variable SECONDS to determine how long the script has been running, so our test becomes:
local -i endtime=$SECONDS+$timetorun
while [ $SECONDS -lt $endtime ]
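As a self-contained demonstration of the pattern (ten seconds here, just to show the mechanics):

declare -i timetorun=10
declare -i endtime=$SECONDS+$timetorun
while [ "$SECONDS" -lt "$endtime" ]
do
    date +%T   # roughly one tick per second
    sleep 1
done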
Multiple opens of output files
Instead of opening for append each time around the loop, we can redirect the entire loop's output:
while ...
⋮
done >results.csv
Or let lsof_func just write to its standard output, and pipe that into verdict (which can also write to its standard output):
lsof_func | verdict >verdict.csv
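As a tiny illustration of the pattern, with emit and summarize as hypothetical stand-ins for lsof_func and verdict:

emit() {
    local -i i
    for i in 1 2 3
    do echo "$i"   # the function's entire output goes to stdout
    done
}
summarize() {
    awk '{s+=1ドル} END {print s}'   # reads everything emit wrote
}
emit | summarize >out.txt   # out.txt now contains 6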
Splitting by minute
Instead of a shell loop to count lines, we could use the standard split utility to break our input into 60-line files. That simplifies our code a great deal:
verdict() {
echo "Timestamp,New Procs"
# Create 1 file per 60 seconds
uniq | split -l 60 - results_
for file in results_*
do
printf '%s' "$(head -n 1 "$file" | awk -F, '{print 1ドル}'),"
awk -F, '{s+=2ドル} END {print s}' "$file"
done
}
I don't think we need two separate awk programs here, as a single one can capture the initial timestamp and accumulate the values:
for file in results_*
do
awk -F, 'FNR==1{ts=1ドル} {s+=2ドル} END{OFS="," ; print ts,s}' "$file"
done
The full program would then be:
#!/bin/bash
set -eu
timetorun=86400 # Gather statistics for 1 day
export TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
cd "$TMPDIR" || exit $?
# collect number of new socket files in the last second
lsof_func() {
true > lsof_test
local -i endtime=$SECONDS+$timetorun
while [ $SECONDS -lt $endtime ]
do
lsof -i -t | sort >lsof_test2
date +"%F %T,$(comm -23 lsof_test2 lsof_test | wc -l)"
mv lsof_test2 lsof_test
sleep 0.9
done
}
# Calculate the number of new connections per minute
verdict() {
echo "Timestamp,New Procs"
# Create 1 file per 60 seconds
uniq | split -l 60 - results_
for file in results_*
do
awk -F, 'FNR==1{ts=1ドル} {s+=2ドル} END{OFS="," ; print ts,s}' "$file"
done
}
lsof_func | verdict
Alternative approach
Instead of writing to files and post-processing them with awk, we could simply hold all our data in memory, in a Bash array. We can add as we go, so that we're only storing one value per minute.
Easiest of all is if we care only about calendar minutes, and don't mind that we have a partial minute at start and end:
# collect number of new socket files in each second, and add to per-minute counter
true >lsof_test   # start with an empty snapshot so the first comm succeeds
declare -i endtime=$SECONDS+$timetorun
declare -A -i minute
while [ "$SECONDS" -lt "$endtime" ]
do
lsof -i -t | sort >lsof_test2
minute["$(date +'%F %R')"]+=$(comm -23 lsof_test2 lsof_test | wc -l)
mv lsof_test2 lsof_test
sleep 0.9
done
# Output the number of new connections per minute
echo "Timestamp,New Procs"
for t in "${!minute[@]}"
do
printf '%s,%u\n' "$t" "${minute[$t]}"
done
If we want to keep the current behaviour, we'll need to store into arbitrary 60-second chunks, and store the times separately:
true >lsof_test   # again, start with an empty snapshot
declare -i endtime=$SECONDS+$timetorun
declare -A -i minute
declare -A date
while [ $SECONDS -lt $endtime ]
do
lsof -i -t | sort >lsof_test2
declare -i m=$SECONDS/60
minute[$m]+=$(comm -23 lsof_test2 lsof_test | wc -l)
date[$m]=$(date -d -1min '+%F %T')
mv lsof_test2 lsof_test
sleep 0.9
done
# Output the number of new connections per minute
echo "Timestamp,New Procs"
for m in "${!minute[@]}"
do
printf '%s,%u\n' "${date[$m]}" "${minute[$m]}"
done | sort
Modified code
The full, rewritten version (with no Shellcheck warnings):
#!/bin/bash
set -eu -o pipefail
declare -i endtime=86400 # Stop after 1 day
TMPDIR=$(mktemp -d); export TMPDIR
trap 'rm -rf "$TMPDIR"' EXIT
cd "$TMPDIR"
list_sockets() {
lsof -i -t | sort
}
# Initial open sockets
list_sockets >sockets
# Update per-minute counters with new sockets
declare -A -i minute
declare -A date
while [ $SECONDS -lt $endtime ]
do
mv sockets sockets.old
list_sockets >sockets
declare -i m=$SECONDS/60
minute[$m]+=$(comm -23 sockets sockets.old | wc -l)
date[$m]=$(date -d -1min '+%F %T')
sleep 0.5
done
# Output the number of new connections per minute
echo "Timestamp,New Procs"
for m in "${!minute[@]}"
do printf '%s,%u\n' "${date[$m]}" "${minute[$m]}"
done | sort
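To reproduce the original behaviour of producing verdict.csv, just redirect the script's standard output when running it (the script name here is whatever you saved it as):

./socket-stats.sh >verdict.csv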
This is great, thank you very much for all the output. I didn't try everything, but I did use variables instead of files and that alone made performance much better; I will try to implement your other suggestions, thanks! – Noam Salit, Sep 5, 2021 at 11:36
– ... ss -s for a start. This could be enough for your purpose: you only need counters, not the full details, even if the information is aggregated after collection.
– Is 84600 a typo for 86400? If not, what's significant about 23½ hours that makes it a good sampling period?
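A minimal sketch of that ss -s idea, assuming the usual summary layout in which the TCP line starts with "TCP:" followed by the current socket count (note this samples a running total of open sockets rather than diffing per-socket lists):

while sleep 1
do
    # timestamp plus the aggregate TCP socket count, no per-socket listing
    printf '%s,%s\n' "$(date +'%F %T')" "$(ss -s | awk '/^TCP:/ {print 2ドル; exit}')"
done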