I wrote a simple algorithm that makes my logs more suitable for plotting. A logfile has the following structure:
timestamp bytes
and looks like this:
1485762953167032 517
1485762953167657 517
1485762953188416 517
1485762953188504 517
1485762953195641 151
1485762953196256 151
1485762953198736 216
1485762953200099 216
1485762953201115 1261
1485762953201658 151
1485762953201840 151
1485762953202040 1261
1485762953203387 216
1485762953204183 216
1485762953206935 549
1485762953207548 546
1485762953259335 306
1485762953260025 1448
1485762953260576 1448
1485762953261087 1448
1485762953261790 1500
1485762953263878 1500
1485762953264273 1448
1485762953264914 1500
I get my timestamp with gettimeofday(&t, NULL)
and build a microsecond-precision value with long long timestamp = (t.tv_sec * 1000000) + t.tv_usec
.
When I said before "makes my logs more suitable", I meant that the timestamp should be reduced to seconds precision instead of microseconds, so both columns will be modified.
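The conversion itself is just an integer division; as a minimal sketch in shell arithmetic (assuming a 64-bit shell such as Bash, and using the first sample line above):

```shell
# Truncate a microsecond timestamp to whole seconds via integer division.
ts_us=1485762953167032
ts_s=$(( ts_us / 1000000 ))
echo "$ts_s"
# 1485762953
```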
My naive algorithm is something like this (it is in Bash, and since maybe not everybody likes it, here is its pseudocode): read every line of the log file, transform the timestamp from microseconds to seconds by dividing by one million; if it equals the transformed timestamp of the previous line, update the counts of bytes and packets (one packet per row).
for every file.log {
    TIMESTAMP=0
    BYTES=0
    PACKETS=0
    echo "Creating file $LOGFILE.plt..."
    create file $LOGFILE.plt
    echo "Preparing $LOGFILE for plotting..."
    while read file.log LINE {
        # LINE is an array: LINE[0] contains timestamp, LINE[1] contains bytes
        if PACKETS == 0 {
            # first row
            TIMESTAMP = LINE[0] / 1000000
            BYTES = BYTES + LINE[1]
            PACKETS += 1
            echo TIMESTAMP " " BYTES " " PACKETS
        }
        TEMP_TIMESTAMP = LINE[0] / 1000000
        if TIMESTAMP == TEMP_TIMESTAMP {
            BYTES = BYTES + LINE[1]
            PACKETS += 1
        }
        else {
            if PACKETS != 0 {
                log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
            }
            TIMESTAMP = TEMP_TIMESTAMP
            BYTES=0
            PACKETS=1
        }
    }
}
Since it reads every line of the logfile it is at least O(n), and for a long logfile it may take a while: is there a way I can shorten this time?
I can't do this aggregation in my code: the program is built for realtime performance (i.e. streaming and other responsive services), so it has to do the least work possible; it already logs, it can't compute aggregates over the logs too.
-
As long as you have to read every line of input, this I/O will probably take so much time that micro-optimizing the computation behind it buys you nothing. Learn the memory hierarchy: processor registers and cache are fast, RAM is slower, network and disks are unspeakably slow in comparison. — Kilian Foth, Jan 30, 2017
-
The question is whether you always need all the information from the file. You might be able to keep multiple copies of your data, i.e. the full log file, four copies each keeping 6 hours of a 24-hour interval, eight copies with a three-hour interval, etc. If disk space is not an issue, keeping these redundant copies might help. — Markus, Jan 30, 2017
-
The simplest optimizations are to use binary number storage instead of text (eliminates the conversions), and to use a faster language than Bash. — Frank Hileman, Jan 31, 2017
2 Answers
awk seems purpose-built for tasks like that.
cat input_file | awk '{print int(1ドル/1000000 + 0.5), 2ドル}' > temp_file
Now temp_file contains per-second packet amounts, with many duplicate seconds values.
To coalesce the values, you can pipe them through this invocation:
awk 'BEGIN {tstamp = 0; bytes = 0};
{if (1ドル == tstamp) {bytes += 2ドル}
else {if (tstamp != 0) {print tstamp, bytes};
tstamp = 1ドル; bytes = 2ドル}}
END {if (tstamp != 0) {print tstamp, bytes}}'
(The END block writes out the last second, which would otherwise be dropped.)
I'd put these two awk scripts in files and use something like
cat input_file | awk -f divide.awk | awk -f coalesce.awk > output_file
I don't totally follow your pseudocode, but:
In the else branch I don't see how PACKETS could ever be zero, and setting BYTES=0 there is wrong (it loses the current line's bytes).
You also fail to write out the last group.
read file.log LINE
TIMESTAMP = LINE[0] / 1000000
BYTES = LINE[1]
PACKETS = 1
echo TIMESTAMP " " BYTES " " PACKETS
while read file.log LINE {
    # LINE is an array: LINE[0] contains timestamp, LINE[1] contains bytes
    TEMP_TIMESTAMP = LINE[0] / 1000000
    if TIMESTAMP == TEMP_TIMESTAMP {
        BYTES += LINE[1]
        PACKETS += 1
    }
    else {
        log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
        TIMESTAMP = TEMP_TIMESTAMP
        BYTES = LINE[1]
        PACKETS = 1
    }
}
log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
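As a runnable Bash sketch of that pseudocode: the file name and the sample data below are placeholders, a sentinel timestamp replaces the separate priming read (which also handles empty files), and the division truncates as in the question rather than rounding.

```shell
#!/usr/bin/env bash
# Placeholder log file with sample data so the sketch is self-contained.
LOGFILE=file.log
printf '1485762953167032 517\n1485762953167657 517\n1485762954100000 100\n' > "$LOGFILE"

TIMESTAMP=-1   # sentinel: no group accumulated yet
BYTES=0
PACKETS=0
while read -r ts b; do
    TEMP_TIMESTAMP=$(( ts / 1000000 ))
    if [ "$TEMP_TIMESTAMP" -eq "$TIMESTAMP" ]; then
        BYTES=$(( BYTES + b ))       # same second: accumulate
        PACKETS=$(( PACKETS + 1 ))
    else
        [ "$PACKETS" -gt 0 ] && echo "$TIMESTAMP $BYTES $PACKETS"
        TIMESTAMP=$TEMP_TIMESTAMP    # new second: start a fresh group
        BYTES=$b
        PACKETS=1
    fi
done < "$LOGFILE" > "$LOGFILE.plt"
# Flush the last group, which the loop never writes.
[ "$PACKETS" -gt 0 ] && echo "$TIMESTAMP $BYTES $PACKETS" >> "$LOGFILE.plt"

cat "$LOGFILE.plt"
# 1485762953 1034 2
# 1485762954 100 1
```

Note that `done < "$LOGFILE"` (rather than `cat | while`) keeps the loop in the current shell, so the accumulator variables survive for the final flush.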