I wrote a simple algorithm that makes my logs more suitable for plotting. A logfile has the following structure:
timestamp bytes
and looks like this:
1485762953167032 517
1485762953167657 517
1485762953188416 517
1485762953188504 517
1485762953195641 151
1485762953196256 151
1485762953198736 216
1485762953200099 216
1485762953201115 1261
1485762953201658 151
1485762953201840 151
1485762953202040 1261
1485762953203387 216
1485762953204183 216
1485762953206935 549
1485762953207548 546
1485762953259335 306
1485762953260025 1448
1485762953260576 1448
1485762953261087 1448
1485762953261790 1500
1485762953263878 1500
1485762953264273 1448
1485762953264914 1500
I get my timestamp with gettimeofday(&t, NULL)
and build a microsecond-precision value with long long timestamp = (t.tv_sec * 1000000) + t.tv_usec
.
When I said before "makes my logs more suitable", I meant that the timestamp should be reduced to seconds precision instead of microseconds, so both columns will be modified.
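The conversion itself is just an integer division; as a minimal sketch in shell arithmetic (assuming a 64-bit shell such as Bash, and using the first sample line above):

```shell
# Truncate a microsecond timestamp to whole seconds via integer division.
ts_us=1485762953167032
ts_s=$(( ts_us / 1000000 ))
echo "$ts_s"
# 1485762953
```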
My naive algorithm is something like this (it is in Bash, and since maybe not everybody likes it, here is its pseudocode): read every line of the log file, transform the timestamp from microseconds to seconds by dividing by one million; if it equals the transformed timestamp of the previous line, update the counts of bytes and packets (one packet per row).
for every file.log {
    TIMESTAMP=0
    BYTES=0
    PACKETS=0
    echo "Creating file $LOGFILE.plt..."
    create file $LOGFILE.plt
    echo "Preparing $LOGFILE for plotting..."
    while read file.log LINE {
        # LINE is an array: LINE[0] contains timestamp, LINE[1] contains bytes
        if PACKETS == 0 {
            # first row
            TIMESTAMP = LINE[0] / 1000000
            BYTES = BYTES + LINE[1]
            PACKETS += 1
            echo TIMESTAMP " " BYTES " " PACKETS
        }
        TEMP_TIMESTAMP = LINE[0] / 1000000
        if TIMESTAMP == TEMP_TIMESTAMP {
            BYTES = BYTES + LINE[1]
            PACKETS += 1
        }
        else {
            if PACKETS != 0 {
                log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
            }
            TIMESTAMP = TEMP_TIMESTAMP
            BYTES=0
            PACKETS=1
        }
    }
}
Since it reads every line of the logfile it is at least O(n), and for a long logfile it may take a while: is there a way I can shorten this time?
I can't do this aggregation in my code: the program is built for realtime performance (i.e. streaming and other responsive services), so it has to do the least work possible; it already logs, it can't compute aggregates over the logs too.
-
As long as you have to read every line of input, this I/O will probably take so much time that micro-optimizing the computation behind it buys you nothing. Learn the memory hierarchy: processor registers and cache are fast, RAM is slower, network and disks are unspeakably slow in comparison. — Kilian Foth, Jan 30, 2017
-
The question is whether you always need all the information from the file. You might be able to keep multiple copies of your data, i.e. the full log file, four copies each keeping 6 hours of a 24-hour interval, eight copies with a three-hour interval, etc. If disk space is not an issue, keeping these redundant copies might help. — Markus, Jan 30, 2017
-
The simplest optimizations are to use binary number storage instead of text (eliminates the conversions), and to use a faster language than Bash. — Frank Hileman, Jan 31, 2017
2 Answers
awk seems purpose-built for tasks like that.
cat input_file | awk '{print int(1ドル/1000000 + 0.5), 2ドル}' > temp_file
Now temp_file contains per-second packet amounts, with many duplicate seconds values.
To coalesce the values, you can pipe them through this invocation:
awk 'BEGIN {tstamp = 0; bytes = 0};
{if (1ドル == tstamp) {bytes += 2ドル}
else {if (tstamp != 0) {print tstamp, bytes};
tstamp = 1ドル; bytes = 2ドル}}
END {if (tstamp != 0) {print tstamp, bytes}}'
(The END block writes out the last second, which would otherwise be dropped.)
I'd put these two awk scripts in files and use something like
cat input_file | awk -f divide.awk | awk -f coalesce.awk > output_file
I don't totally follow your pseudocode, but:
In the else branch I don't see how PACKETS could ever be zero, and setting BYTES=0 there is wrong (it loses the current line's bytes).
You also fail to write out the last group.
read file.log LINE
TIMESTAMP = LINE[0] / 1000000
BYTES = LINE[1]
PACKETS = 1
echo TIMESTAMP " " BYTES " " PACKETS
while read file.log LINE {
    # LINE is an array: LINE[0] contains timestamp, LINE[1] contains bytes
    TEMP_TIMESTAMP = LINE[0] / 1000000
    if TIMESTAMP == TEMP_TIMESTAMP {
        BYTES += LINE[1]
        PACKETS += 1
    }
    else {
        log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
        TIMESTAMP = TEMP_TIMESTAMP
        BYTES = LINE[1]
        PACKETS = 1
    }
}
log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
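As a runnable Bash sketch of that pseudocode: the file name and the sample data below are placeholders, a sentinel timestamp replaces the separate priming read (which also handles empty files), and the division truncates as in the question rather than rounding.

```shell
#!/usr/bin/env bash
# Placeholder log file with sample data so the sketch is self-contained.
LOGFILE=file.log
printf '1485762953167032 517\n1485762953167657 517\n1485762954100000 100\n' > "$LOGFILE"

TIMESTAMP=-1   # sentinel: no group accumulated yet
BYTES=0
PACKETS=0
while read -r ts b; do
    TEMP_TIMESTAMP=$(( ts / 1000000 ))
    if [ "$TEMP_TIMESTAMP" -eq "$TIMESTAMP" ]; then
        BYTES=$(( BYTES + b ))       # same second: accumulate
        PACKETS=$(( PACKETS + 1 ))
    else
        [ "$PACKETS" -gt 0 ] && echo "$TIMESTAMP $BYTES $PACKETS"
        TIMESTAMP=$TEMP_TIMESTAMP    # new second: start a fresh group
        BYTES=$b
        PACKETS=1
    fi
done < "$LOGFILE" > "$LOGFILE.plt"
# Flush the last group, which the loop never writes.
[ "$PACKETS" -gt 0 ] && echo "$TIMESTAMP $BYTES $PACKETS" >> "$LOGFILE.plt"

cat "$LOGFILE.plt"
# 1485762953 1034 2
# 1485762954 100 1
```

Note that `done < "$LOGFILE"` (rather than `cat | while`) keeps the loop in the current shell, so the accumulator variables survive for the final flush.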