
I need to run over a log file with plenty of entries (25+GB) in Python to extract some log times.

The snippet below works. On my test file, I'm ending up with roughly 9:30min (Python) and 4:30min (PyPy) on 28 million records (small log).

import datetime
import time
import json
import re
format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)
threshold = time.strptime('00:01:00,000'.split(',')[0],'%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour,minutes=threshold.tm_min,seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0,minutes=0,seconds=0)
zero_tick = zero_time.total_seconds()
format_date = '%d/%b/%Y:%H:%M:%S'
obj = {}
test = open('very/big/log','r')
for line in test:
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        user = chunk[1]
        if (user[0] == "C"):
            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError, e:
                machine = obj.setdefault(user,{"init":this_time,"last":this_time,"downtime":0})
            last = machine["last"]
            diff = (this_time-last).total_seconds()
            if (diff > tick):
                machine["downtime"] += diff-tick
            machine["last"] = this_time
    except Exception:
        pass

While I'm OK with the time the script takes, I'm wondering whether there are any obvious performance pitfalls I'm stepping into (still in my first weeks of Python).

Can I do anything to make this run faster?

EDIT:

Sample log entry:

['2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST [14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" 200 4 "-" "-" 102356']

EDIT 2:

One thing I just did was to check whether consecutive log entries have the same log time; if so, I reuse this_time from the previous iteration instead of calling:

this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)

I checked some of the logs; there are plenty of entries with the exact same time, so on the last run on PyPy I'm down to 2.43min.
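That caching idea can be sketched like this (a hypothetical helper, assuming consecutive entries often share the same timestamp string):

```python
import datetime

format_date = '%d/%b/%Y:%H:%M:%S'

# Remember the last timestamp string and its parsed result, so
# repeated log times skip the comparatively expensive strptime call.
_last_str = None
_last_time = None

def parse_time(time_str):
    global _last_str, _last_time
    if time_str != _last_str:
        _last_str = time_str
        _last_time = datetime.datetime.strptime(time_str, format_date)
    return _last_time
```

The cache is only one entry deep on purpose: log times arrive in order, so only runs of identical timestamps benefit, and there is no unbounded dict to grow.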

asked Mar 7, 2013 at 16:33

3 Answers

  1. You can use regular expressions with the re module to speed up parsing dramatically. For example, the following code extracts the user and time information using a regular expression (note that the user[0] == "C" condition is also checked by the regexp):

    import re
    import datetime
    match_record = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+)").match
    strptime = datetime.datetime.strptime
    f = open("very/big/log", "rb")
    for line in f:
        match = match_record(line)
        if match is not None:
            user, str_time = match.groups()
            time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
            print user, repr(time)
    

    I think you can make it even faster by reading bigger chunks of data and using the re.finditer method.

  2. Don't use dictionaries to store the per-user information in the obj dictionary. Tuples are the way to go here, and if you really need names for the fields you can use collections.namedtuple. In any case tuples will be faster than dictionaries (and lists) and will take less memory.

  3. Probably if user in obj will be a little bit faster than try/except KeyError here. You can check execution times for small code blocks with the timeit module.

  4. Check the Python performance tips. For example, the "Avoiding dots..." section can be relevant for your code.
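Points 2 and 3 combined might look roughly like this (a sketch with made-up names; since tuples are immutable, each update builds a replacement tuple):

```python
import datetime

tick = 60  # allowed gap between entries, in seconds
obj = {}

def record(user, this_time):
    # point 3: membership test instead of try/except KeyError
    if user in obj:
        # point 2: a plain (init, last, downtime) tuple per user
        init, last, downtime = obj[user]
        diff = (this_time - last).total_seconds()
        if diff > tick:
            downtime += diff - tick
        obj[user] = (init, this_time, downtime)
    else:
        obj[user] = (this_time, this_time, 0)

record("COMP01", datetime.datetime(2012, 11, 14, 9, 32, 31))
record("COMP01", datetime.datetime(2012, 11, 14, 9, 34, 31))
```

Whether rebuilding the tuple on every update beats mutating a dict in place is exactly the kind of thing timeit (point 3) can settle on real data.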

UPDATE

The following version of the code reads bigger chunks of data and uses re.finditer (not sure whether it's faster or not; also note the changes in the regexp):

import re
import datetime
iter_records = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+) .*$", re.M).finditer
strptime = datetime.datetime.strptime
f = open("very/big/log", "rb")
chunk = ""
while True:
    # Probably buffer size can be even bigger
    new = f.read(1048576)
    chunk += new
    end = 0
    for match in iter_records(chunk):
        user, str_time = match.groups()
        time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
        print user, repr(time)
        end = match.end()
    if end > 0:
        chunk = chunk[end:]
    elif not new:
        break
answered Mar 7, 2013 at 20:44
  • OK, using the regular expression with PyPy and 28M rows cuts 2.43min to 1.47min; if user in obj is 5 seconds slower on the test set, and namedtuples... I'm trying now. Commented Mar 8, 2013 at 9:22
  • Regarding namedtuples: the values in my tuple will be changed on every iteration, plus I need to add 3 more values once the loop is done. From what I understand, namedtuples are not changeable "out-of-the-box". So should I use a dict, a namedtuple or a list? Commented Mar 8, 2013 at 9:55
  • @frequent You would have to construct a new tuple with updated values and assign that to obj[user]. Commented Mar 8, 2013 at 10:06
  • @JanneKarila: so I'm thinking. But creating a new tuple and removing the old one is probably more resource-hungry than a list or dict (or set). Commented Mar 8, 2013 at 10:09
  • OK. Using a list vs a dict cuts processing to 1.30min. Commented Mar 8, 2013 at 10:34
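For reference, the "construct a new tuple" update mentioned in the comments goes through namedtuple's _replace method, which builds a fresh tuple with the changed fields (a minimal sketch; field names are illustrative):

```python
from collections import namedtuple

Machine = namedtuple('Machine', ['init', 'last', 'downtime'])

m = Machine(init=0, last=0, downtime=0)
# namedtuples are immutable; _replace returns a new tuple with the
# given fields changed, and the old instance is garbage-collected
m = m._replace(last=5, downtime=2)
```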
import datetime
import time
import json
import re
format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)

Python convention is ALL_CAPS for global constants. But as far as I can tell, you don't actually use this.

threshold = time.strptime('00:01:00,000'.split(',')[0],'%H:%M:%S')

Why are you putting in a string literal just to throw half of it away? Why are you converting from a string rather than creating the value directly?

tick = datetime.timedelta(hours=threshold.tm_hour,minutes=threshold.tm_min,seconds=threshold.tm_sec).total_seconds()

That's a way over-complicated way to accomplish this. Just say tick = 60 and skip all the rigmarole.

zero_time = datetime.timedelta(hours=0,minutes=0,seconds=0)
zero_tick = zero_time.total_seconds()

You don't seem to use this...

format_date = '%d/%b/%Y:%H:%M:%S'
obj = {}
test = open('very/big/log','r')

Really bad names. They don't tell me anything. Also, it's recommended to deal with files using a with statement.
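The with form, for illustration (a self-contained sketch using a throwaway file, since very/big/log obviously isn't available here):

```python
import os
import tempfile

# write a tiny stand-in log so the sketch can actually run
path = os.path.join(tempfile.mkdtemp(), 'log')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# the with statement guarantees the file is closed,
# even if the loop body raises an exception
lines = []
with open(path) as log_file:
    for line in log_file:
        lines.append(line.rstrip('\n'))
```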

for line in test:
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')

That seems way too complicated. There is probably a better way to extract the data, but without knowing what this log looks like I can't really suggest one.

        user = chunk[1]
        if (user[0] == "C"):

You don't need the ( and )

            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError, e:
                machine = obj.setdefault(user,{"init":this_time,"last":this_time,"downtime":0})

Don't use dictionaries for random attribute storage. Create a class and put the stuff in there.

            last = machine["last"]
            diff = (this_time-last).total_seconds()
            if (diff > tick):

You don't need the parens, and I'd combine the last two lines

                machine["downtime"] += diff-tick
            machine["last"] = this_time
    except Exception:
        pass

Don't ever do this. DON'T EVER DO THIS. You should never catch and then ignore all possible exceptions. Invariably, you'll end up with a bug that you miss because you are ignoring the exceptions.
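For illustration, a narrowed version might catch only what this parsing can actually raise (a hypothetical sketch; parse_user is not a name from the original code):

```python
def parse_user(line):
    # Only the failure we expect from a malformed line is swallowed:
    # IndexError from the chained splits. Any other exception is a
    # real bug and propagates instead of being silently ignored.
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        return chunk[1]
    except IndexError:
        return None
```

If strptime is also called inside the try, ValueError would be the second exception worth catching explicitly.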

As for making it run faster, we'd really need to know more about the log file's contents.

answered Mar 7, 2013 at 17:00
  • Wow, lots of input. Let me digest. I also added a sample record above, plus one thing I just did. Commented Mar 7, 2013 at 17:31
if (user[0] == "C"):

If the above statement skips a large majority of lines, you can speed things up by doing a preliminary check up front:

if ' - C' in line:

This may give false positives, so follow up with a better check. The regex proposed by hdima looks like the way to go.
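A sketch of that two-stage filter (the regex is borrowed from hdima's answer; the function name is made up):

```python
import re

# precise pattern from hdima's answer; .match bound once up front
match_record = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+)").match

def extract(line):
    # cheap substring test first; the vast majority of lines bail out here
    if ' - C' not in line:
        return None
    # possible false positives fall through to the precise regex
    m = match_record(line)
    return m.groups() if m else None
```

The `in` test is a plain C-level substring scan, so it costs far less per line than running the regex engine; only candidate lines pay for the full match.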

answered Mar 7, 2013 at 20:56
