I made a simple algorithm that makes my logs more suitable to be plotted. A logfile has the following structure:

timestamp bytes

and looks like this:

1485762953167032 517
1485762953167657 517
1485762953188416 517
1485762953188504 517
1485762953195641 151
1485762953196256 151
1485762953198736 216
1485762953200099 216
1485762953201115 1261
1485762953201658 151
1485762953201840 151
1485762953202040 1261
1485762953203387 216
1485762953204183 216
1485762953206935 549
1485762953207548 546
1485762953259335 306
1485762953260025 1448
1485762953260576 1448
1485762953261087 1448
1485762953261790 1500
1485762953263878 1500
1485762953264273 1448
1485762953264914 1500

I get my timestamp with gettimeofday(&t, NULL) and build a microsecond-precision value: long long timestamp = (t.tv_sec * 1000000) + t.tv_usec.

By "more suitable" I mean that the timestamp should have seconds precision, not microseconds. Rows that fall in the same second get merged, so both columns will be modified.
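To make the unit change concrete, here is a minimal Python sketch that truncates microsecond timestamps to whole seconds with integer division. The function name to_seconds is only illustrative; the two values are the first and last timestamps of the sample log above.

```python
MICROS_PER_SECOND = 1_000_000

def to_seconds(timestamp_us: int) -> int:
    """Truncate a microsecond timestamp to whole seconds."""
    return timestamp_us // MICROS_PER_SECOND

# First and last timestamps of the sample log: both fall in the same second.
print(to_seconds(1485762953167032))
print(to_seconds(1485762953264914))
```

Every row of the sample therefore collapses into a single output row for second 1485762953.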

My naive algorithm is something like this (it is written in Bash, but since not everybody likes Bash, here is pseudocode): read every line of the log file, convert the timestamp from microseconds to seconds by dividing by one million, and if the result equals the converted timestamp of the previous line, update the byte and packet counts (one packet per row).

for every file.log {
    TIMESTAMP=0
    BYTES=0
    PACKETS=0
    echo "Creating file $LOGFILE.plt..."
    create file $LOGFILE.plt
    echo "Preparing $LOGFILE for plotting..."
    while read file.log LINE {
        # LINE is an array: LINE[0] contains timestamp, LINE[1] contains bytes
        if PACKETS == 0 {
            # first row
            TIMESTAMP = LINE[0] / 1000000
            BYTES = BYTES + LINE[1]
            PACKETS += 1
            echo TIMESTAMP " " BYTES " " PACKETS
        }
        TEMP_TIMESTAMP = LINE[0] / 1000000
        if TIMESTAMP == TEMP_TIMESTAMP {
            BYTES = BYTES + LINE[1]
            PACKETS += 1
        }
        else {
            if PACKETS != 0 {
                log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
            }
            TIMESTAMP = TEMP_TIMESTAMP
            BYTES=0
            PACKETS=1
        }
    }
}

Since it reads every line of the log file it is at least O(n), and for a long log file it may take a while. Is there a way to shorten this time?

I can't do this aggregation inside my program: it is built for realtime performance (i.e. streaming and other responsive services), so it has to do as little work as possible. It already logs; it can't post-process the logs too.

asked Jan 30, 2017 at 9:07
  • As long as you have to read every line of input, this I/O will probably take so much time that micro-optimizing the computation behind it buys you nothing. Learn the memory hierarchy: processor registers and cache are fast, RAM is slower, network and disks are unspeakably slow in comparison. Commented Jan 30, 2017 at 9:12
  • The question is whether you always need all the information from the file. You might be able to keep multiple copies of your data, e.g. the full log file, four copies each covering 6 hours of a 24-hour interval, eight copies each covering a 3-hour interval, etc. If disk space is not an issue, keeping these redundant copies might help. Commented Jan 30, 2017 at 15:27
  • The simplest optimizations are to use binary number storage instead of text (eliminates the conversions), and to use a faster language than Bash. Commented Jan 31, 2017 at 23:36
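As a rough illustration of the binary-storage suggestion, the sketch below packs each record into a fixed 12-byte layout: an unsigned 64-bit timestamp followed by an unsigned 32-bit byte count, little-endian. The layout and the names RECORD, pack_record, and unpack_records are assumptions made for this example, not something the logging program is known to use.

```python
import struct

# Assumed layout: little-endian uint64 timestamp (microseconds) + uint32 byte count.
RECORD = struct.Struct("<QI")

def pack_record(timestamp_us: int, nbytes: int) -> bytes:
    """Encode one log row as a fixed-size binary record."""
    return RECORD.pack(timestamp_us, nbytes)

def unpack_records(blob: bytes):
    """Decode a buffer of back-to-back records into (timestamp_us, nbytes) tuples."""
    return list(RECORD.iter_unpack(blob))

# Two rows taken from the sample log.
blob = pack_record(1485762953167032, 517) + pack_record(1485762953195641, 151)
print(RECORD.size, len(blob))
print(unpack_records(blob))
```

Fixed-size records avoid text-to-number parsing on read, at the cost of a log that is no longer human-readable.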

2 Answers

awk seems purpose-built for tasks like that.

cat input_file | awk '{print int($1/1000000 + 0.5), $2}' > temp_file

Now temp_file contains one line per packet, with the timestamp converted to seconds, so many consecutive lines share the same seconds value.
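One detail worth noting: int(x + 0.5) in the command above rounds to the nearest second, whereas the pseudocode in the question truncates. The Python sketch below shows the difference; the function names and the sample timestamp are made up for illustration. Dropping the + 0.5 from the awk expression would give truncation instead.

```python
def awk_style_round(timestamp_us: int) -> int:
    # Mirrors int($1/1000000 + 0.5): floating-point division, then truncation.
    return int(timestamp_us / 1_000_000 + 0.5)

def truncate(timestamp_us: int) -> int:
    # Mirrors the integer division used in the question's pseudocode.
    return timestamp_us // 1_000_000

made_up = 1485762953600000  # hypothetical timestamp, 0.6 s into the second
print(awk_style_round(made_up), truncate(made_up))
```

For timestamps in the second half of a second, the two approaches assign the packet to different seconds.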

To coalesce the values, you can pipe them through this invocation:

awk 'BEGIN {tstamp = 0; bytes = 0}
 {if ($1 == tstamp) {bytes += $2}
 else {if (tstamp != 0) {print tstamp, bytes}
 tstamp = $1; bytes = $2}}
 # without this END block the last second is never printed
 END {if (tstamp != 0) {print tstamp, bytes}}'

I'd put these two awk scripts in files and use something like

cat input_file | awk -f divide.awk | awk -f coalesce.awk > output_file 
answered Mar 1, 2017 at 16:32
I don't totally follow your pseudocode, but:

In the else branch, I don't see how PACKETS could ever be zero, and setting BYTES=0 is wrong: it throws away the bytes of the first row of the new second.

You also fail to write out the last group.

read file.log LINE
TIMESTAMP = LINE[0] / 1000000
BYTES = LINE[1]
PACKETS = 1
echo TIMESTAMP " " BYTES " " PACKETS
while read file.log LINE {
    # LINE is an array: LINE[0] contains timestamp, LINE[1] contains bytes
    TEMP_TIMESTAMP = LINE[0] / 1000000
    if TIMESTAMP == TEMP_TIMESTAMP {
        BYTES += LINE[1]
        PACKETS += 1
    }
    else {
        log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
        TIMESTAMP = TEMP_TIMESTAMP
        BYTES = LINE[1]
        PACKETS = 1
    }
}
log in file "TIMESTAMP BYTES PACKETS" >> $LOGFILE.plt
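
The pseudocode above can be sketched as runnable Python. This is one possible translation, not the asker's Bash script: the function name coalesce is made up for the example, it returns the rows instead of appending them to a .plt file, and the third sample row is invented to show a second boundary.

```python
def coalesce(lines):
    """Return [(second, total_bytes, packets)], one entry per run of rows in the same second."""
    result = []
    timestamp = None
    total_bytes = 0
    packets = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        second = int(fields[0]) // 1_000_000
        if second == timestamp:
            total_bytes += int(fields[1])
            packets += 1
        else:
            if timestamp is not None:
                result.append((timestamp, total_bytes, packets))
            timestamp = second
            total_bytes = int(fields[1])  # start from this row's bytes, not 0
            packets = 1
    if timestamp is not None:
        result.append((timestamp, total_bytes, packets))  # flush the last group
    return result

sample = [
    "1485762953167032 517",
    "1485762953167657 517",
    "1485762954000001 151",  # invented row in the next second
]
print(coalesce(sample))
```

It is still a single O(n) pass, which is the lower bound as long as every row has to be read.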
answered Jan 30, 2017 at 15:41
