
I need to run over a log file with plenty of entries (25+GB) in Python to extract some log times.

The snippet below works. On my test file, I'm ending up with roughly 9:30min (Python) and 4:30min (PyPy) on 28 million records (small log).

import datetime
import time
import json
import re
format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)
threshold = time.strptime('00:01:00,000'.split(',')[0],'%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour,minutes=threshold.tm_min,seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0,minutes=0,seconds=0)
zero_tick = zero_time.total_seconds()
format_date = '%d/%b/%Y:%H:%M:%S'
obj = {}
test = open('very/big/log','r')
for line in test:
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        user = chunk[1]
        if (user[0] == "C"):
            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError, e:
                machine = obj.setdefault(user,{"init":this_time,"last":this_time,"downtime":0})
            last = machine["last"]
            diff = (this_time-last).total_seconds()
            if (diff > tick):
                machine["downtime"] += diff-tick
            machine["last"] = this_time
    except Exception:
        pass

While I'm OK with the time the script takes, I'm wondering whether there are any obvious performance pitfalls I'm stepping into (still in my first weeks of Python).

Can I do anything to make this run faster?

EDIT:

Sample log entry:

['2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST [14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" 200 4 "-" "-" 102356']

EDIT 2:

One thing I just did was to check whether consecutive log entries have the same log time; if so, I reuse this_time from the previous iteration instead of calling:

this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)

I checked some of the logs; there are plenty of entries with the exact same time, so on the last run on PyPy I'm down to 2.43min.
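That caching idea can be sketched like this (a hypothetical helper, assuming consecutive entries often share the same timestamp string):

```python
import datetime

format_date = '%d/%b/%Y:%H:%M:%S'

# Remember the last timestamp string and its parsed result, so
# repeated log times skip the comparatively expensive strptime call.
_last_str = None
_last_time = None

def parse_time(time_str):
    global _last_str, _last_time
    if time_str != _last_str:
        _last_str = time_str
        _last_time = datetime.datetime.strptime(time_str, format_date)
    return _last_time
```

The cache is only one entry deep on purpose: log times arrive in order, so only runs of identical timestamps benefit, and there is no unbounded dict to grow.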

asked Mar 7, 2013 at 16:33

3 Answers

  1. You can use regular expressions with the re module to speed up parsing dramatically. For example, the following code extracts the user and time information using a regular expression (note that the user[0] == "C" condition is also checked by the regexp):

    import re
    import datetime
    match_record = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+)").match
    strptime = datetime.datetime.strptime
    f = open("very/big/log", "rb")
    for line in f:
        match = match_record(line)
        if match is not None:
            user, str_time = match.groups()
            time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
            print user, repr(time)
    

    I think you can make it even faster by reading bigger chunks of data and using the re.finditer method.

  2. Don't use dictionaries to store the per-user information in the obj dictionary. Tuples are the way to go here, and if you really need names for the fields you can use collections.namedtuple. In any case tuples will be faster than dictionaries (and lists) and will take less memory.

  3. Probably if user in obj will be a little bit faster than try/except KeyError here. You can check execution times for small code blocks with the timeit module.

  4. Check the Python performance tips. For example, the "Avoiding dots..." section can be relevant for your code.
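Points 2 and 3 combined might look roughly like this (a sketch with made-up names; since tuples are immutable, each update builds a replacement tuple):

```python
import datetime

tick = 60  # allowed gap between entries, in seconds
obj = {}

def record(user, this_time):
    # point 3: membership test instead of try/except KeyError
    if user in obj:
        # point 2: a plain (init, last, downtime) tuple per user
        init, last, downtime = obj[user]
        diff = (this_time - last).total_seconds()
        if diff > tick:
            downtime += diff - tick
        obj[user] = (init, this_time, downtime)
    else:
        obj[user] = (this_time, this_time, 0)

record("COMP01", datetime.datetime(2012, 11, 14, 9, 32, 31))
record("COMP01", datetime.datetime(2012, 11, 14, 9, 34, 31))
```

Whether rebuilding the tuple on every update beats mutating a dict in place is exactly the kind of thing timeit (point 3) can settle on real data.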

UPDATE

The following version of the code reads bigger chunks of data and uses re.finditer (not sure whether it's faster or not; also note the changes in the regexp):

import re
import datetime
iter_records = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+) .*$", re.M).finditer
strptime = datetime.datetime.strptime
f = open("very/big/log", "rb")
chunk = ""
while True:
    # Probably buffer size can be even bigger
    new = f.read(1048576)
    chunk += new
    end = 0
    for match in iter_records(chunk):
        user, str_time = match.groups()
        time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
        print user, repr(time)
        end = match.end()
    if end > 0:
        chunk = chunk[end:]
    elif not new:
        break
answered Mar 7, 2013 at 20:44
  • OK, using the regular expression with PyPy and 28M rows cuts 2.43min to 1.47min; if user in obj is 5 seconds slower on the test set, and namedtuples... I'm trying now. Commented Mar 8, 2013 at 9:22
  • Regarding namedtuples: the values in my tuple will be changed on every iteration, plus I need to add 3 more values once the loop is done. From what I understand, namedtuples are not changeable "out-of-the-box". So should I use a dict, a namedtuple or a list? Commented Mar 8, 2013 at 9:55
  • @frequent You would have to construct a new tuple with updated values and assign that to obj[user]. Commented Mar 8, 2013 at 10:06
  • @JanneKarila: so I'm thinking. But creating a new tuple and removing the old one is probably more resource-hungry than a list or dict (or set). Commented Mar 8, 2013 at 10:09
  • OK. Using a list vs a dict cuts processing to 1.30min. Commented Mar 8, 2013 at 10:34
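For reference, the "construct a new tuple" update mentioned in the comments goes through namedtuple's _replace method, which builds a fresh tuple with the changed fields (a minimal sketch; field names are illustrative):

```python
from collections import namedtuple

Machine = namedtuple('Machine', ['init', 'last', 'downtime'])

m = Machine(init=0, last=0, downtime=0)
# namedtuples are immutable; _replace returns a new tuple with the
# given fields changed, and the old instance is garbage-collected
m = m._replace(last=5, downtime=2)
```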
import datetime
import time
import json
import re
format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)

Python convention is ALL_CAPS for global constants. But as far as I can tell, you don't actually use this.

threshold = time.strptime('00:01:00,000'.split(',')[0],'%H:%M:%S')

Why are you putting in a string literal just to throw half of it away? Why are you converting from a string rather than creating the value directly?

tick = datetime.timedelta(hours=threshold.tm_hour,minutes=threshold.tm_min,seconds=threshold.tm_sec).total_seconds()

That's a way over-complicated way to accomplish this. Just say tick = 60 and skip all the rigmarole.

zero_time = datetime.timedelta(hours=0,minutes=0,seconds=0)
zero_tick = zero_time.total_seconds()

You don't seem to use this...

format_date = '%d/%b/%Y:%H:%M:%S'
obj = {}
test = open('very/big/log','r')

Really bad names. They don't tell me anything. Also, it's recommended to deal with files using a with statement.
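The with form, for illustration (a self-contained sketch using a throwaway file, since very/big/log obviously isn't available here):

```python
import os
import tempfile

# write a tiny stand-in log so the sketch can actually run
path = os.path.join(tempfile.mkdtemp(), 'log')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# the with statement guarantees the file is closed,
# even if the loop body raises an exception
lines = []
with open(path) as log_file:
    for line in log_file:
        lines.append(line.rstrip('\n'))
```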

for line in test:
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')

That seems way too complicated. There is probably a better way to extract the data, but without knowing what this log looks like I can't really suggest one.

        user = chunk[1]
        if (user[0] == "C"):

You don't need the ( and )

            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError, e:
                machine = obj.setdefault(user,{"init":this_time,"last":this_time,"downtime":0})

Don't use dictionaries for random attribute storage. Create a class and put the stuff in there.

            last = machine["last"]
            diff = (this_time-last).total_seconds()
            if (diff > tick):

You don't need the parens, and I'd combine the last two lines

                machine["downtime"] += diff-tick
            machine["last"] = this_time
    except Exception:
        pass

Don't ever do this. DON'T EVER DO THIS. You should never catch and then ignore all possible exceptions. Invariably, you'll end up with a bug that you miss because you are ignoring the exceptions.
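For illustration, a narrowed version might catch only what this parsing can actually raise (a hypothetical sketch; parse_user is not a name from the original code):

```python
def parse_user(line):
    # Only the failure we expect from a malformed line is swallowed:
    # IndexError from the chained splits. Any other exception is a
    # real bug and propagates instead of being silently ignored.
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        return chunk[1]
    except IndexError:
        return None
```

If strptime is also called inside the try, ValueError would be the second exception worth catching explicitly.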

As for making it run faster, we'd really need to know more about the log file's contents.

answered Mar 7, 2013 at 17:00
  • Wow, lots of input. Let me digest. I also added a sample record above, plus one thing I just did. Commented Mar 7, 2013 at 17:31
if (user[0] == "C"):

If the above statement skips a large majority of lines, you can speed things up by doing a preliminary check up front:

if ' - C' in line:

This may give false positives, so follow up with a better check. The regex proposed by hdima looks like the way to go.
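A sketch of that two-stage filter (the regex is borrowed from hdima's answer; the function name is made up):

```python
import re

# precise pattern from hdima's answer; .match bound once up front
match_record = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+)").match

def extract(line):
    # cheap substring test first; the vast majority of lines bail out here
    if ' - C' not in line:
        return None
    # possible false positives fall through to the precise regex
    m = match_record(line)
    return m.groups() if m else None
```

The `in` test is a plain C-level substring scan, so it costs far less per line than running the regex engine; only candidate lines pay for the full match.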

answered Mar 7, 2013 at 20:56
