I have files with over 100 million lines in each:
...
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:36:02 important data
some unimportant data
blahblah (also unimportant data)
some unimportant data
01-AUG-2012 02:40:15 important data
some unimportant data
...
As you can see, there are important data (starting with date and time) and unimportant data. Also in each second, there can be many lines of important data.
My goal is to count the number of "important data" lines in each second (or minute, or hour, ...) and to reformat the date/time. The script lets me choose the aggregation interval via options.dlen:

options.dlen = 10 takes YYYY-MM-DD
options.dlen = 13 takes YYYY-MM-DD HH
options.dlen = 16 takes YYYY-MM-DD HH:MM
options.dlen = 20 takes YYYY-MM-DD HH:MM:SS
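In other words, options.dlen is just the length of the prefix kept from the reformatted timestamp. For example:

>>> s = '2012-08-01 02:29:44'
>>> s[:10], s[:13], s[:16], s[:20]
('2012-08-01', '2012-08-01 02', '2012-08-01 02:29', '2012-08-01 02:29:44')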
I have written the following script (this is the main part - I skip all the file openings, parameters etc).
DATA = {}
# search for DD-MMM-YYYY HH:MM:SS
# e.g. "01-JUL-2012 02:29:36 important data"
pattern = re.compile('^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
DATA = defaultdict(int)
i = 0
f = open(options.infilename, 'r')
for line in f:
    if re.match(pattern, line):
        if options.verbose:
            i += 1
            # print out every 1000 iterations
            if i % 1000 == 0:
                print str(i) + '\r',
        # converts data date/time format to YYYY-MM-DD HH:MM:SS format (but still keep it as datetime !)
        d = datetime.strptime(line[0:20], '%d-%b-%Y %H:%M:%S')
        # converts d, which is datetime to string again
        day_string = d.strftime('%Y-%m-%d %H:%M:%S')
        DATA[str(day_string[0:int(options.dlen)])] += 1
f.close()
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1), reverse=True)
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1))
L2 = sorted(DATA.iteritems(), key=operator.itemgetter(0))
It takes about 3 hours to process over 100 million lines. Can you suggest performance improvements for this script?
Update: I have just used PyPy and the same task on the same server took 45 minutes. I will try to add profile statistics.
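(A minimal way to collect those statistics with the standard library, assuming the loop above is wrapped in a function called main() as mentioned in the comments below:)

import cProfile

# Run the whole job under the profiler; sort the report by internal time.
cProfile.run('main()', sort='time')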
4 Answers
1. Make a test case
The first thing to do is to establish the existence of the performance problem, so let's knock up some test data.
def make_test_file(filename, n, t, delta):
    """
    Write `n` lines of test data to `filename`, starting at `t` (a
    datetime object) and stepping by `delta` (a timedelta object) each
    line.
    """
    with open(filename, 'w') as f:
        for _ in xrange(n):
            f.write(t.strftime('%d-%b-%Y %H:%M:%S ').upper())
            f.write('important data\n')
            t += delta
>>> from datetime import datetime, timedelta
>>> make_test_file('data.txt', 10**5, datetime.now(), timedelta(seconds=1))
And then with the OP's code in the function aggregate1(filename, dlen):
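(For reference, aggregate1 here is assumed to be just the question's loop wrapped up in a function, with the verbose progress printing left out; roughly:)

import re
import operator
from collections import defaultdict
from datetime import datetime

def aggregate1(filename, dlen):
    # The question's loop, essentially unchanged apart from being wrapped in a function.
    pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
    data = defaultdict(int)
    f = open(filename, 'r')
    for line in f:
        if re.match(pattern, line):
            d = datetime.strptime(line[0:20], '%d-%b-%Y %H:%M:%S')
            day_string = d.strftime('%Y-%m-%d %H:%M:%S')
            data[str(day_string[0:int(dlen)])] += 1
    f.close()
    return sorted(data.iteritems(), key=operator.itemgetter(0))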
>>> import timeit
>>> timeit.timeit(lambda:aggregate1('data.txt', 16), number = 1)
5.786283016204834
So on the real file (1000 times bigger) that would take an hour and a half on my machine (or longer, if the time complexity is worse than linear). So yes, there's a real performance problem.
2. Clean up the code
Let's try a bunch of obvious minor improvements and optimizations (mostly as suggested in other answers):
- Convert dlen to an integer once (not once for every line).
- Write day_string[:dlen] instead of str(day_string[0:dlen]).
- Write pattern.match(line) instead of re.match(pattern, line).
- There's no need for key = operator.itemgetter(0), because the sort will proceed on the first element of the pair in any case.
- Rename DATA to count and day_string to s (it's really a date-time string, not a day string).
- Use with to ensure that the file is closed in the event of an error.
- Import the name strptime so that it doesn't have to be looked up for every line.
Let's try that:
def aggregate2(filename, dlen):
    strptime = datetime.datetime.strptime
    dlen = int(dlen)
    pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            if pattern.match(line):
                d = strptime(line[:20], '%d-%b-%Y %H:%M:%S')
                s = d.strftime('%Y-%m-%d %H:%M:%S')
                count[s[:dlen]] += 1
    return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate2('data.txt', 10), number = 1)
5.200263977050781
A small improvement, 10% or so, but clean code makes the next step easier.
3. Profile
>>> import cProfile
>>> cProfile.run("aggregate2('data.txt', 10)")
2700009 function calls in 6.262 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.262 6.262 <string>:1(<module>)
100000 0.088 0.000 1.020 0.000 _strptime.py:27(_getlang)
100000 2.098 0.000 4.033 0.000 _strptime.py:295(_strptime)
100000 0.393 0.000 0.642 0.000 locale.py:339(normalize)
100000 0.105 0.000 0.747 0.000 locale.py:407(_parse_localename)
100000 0.119 0.000 0.933 0.000 locale.py:506(getlocale)
100000 0.067 0.000 0.067 0.000 {_locale.setlocale}
100000 0.515 0.000 4.548 0.000 {built-in method strptime}
100000 0.079 0.000 0.079 0.000 {isinstance}
200000 0.035 0.000 0.035 0.000 {len}
100000 0.043 0.000 0.043 0.000 {method 'end' of '_sre.SRE_Match' objects}
300001 0.076 0.000 0.076 0.000 {method 'get' of 'dict' objects}
100000 0.276 0.000 0.276 0.000 {method 'groupdict' of '_sre.SRE_Match' objects}
100000 0.090 0.000 0.090 0.000 {method 'index' of 'list' objects}
100000 0.025 0.000 0.025 0.000 {method 'iterkeys' of 'dict' objects}
100000 0.046 0.000 0.046 0.000 {method 'lower' of 'str' objects}
200000 0.553 0.000 0.553 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
100000 1.144 0.000 1.144 0.000 {method 'strftime' of 'datetime.date' objects}
I've cut some of the output for clarity. It should be clear that the culprits are strptime (73% of runtime), strftime (18%), and match (9%). Everything else is either called by one of those, or negligible.
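As an aside, cProfile.run also accepts a sort argument, so the table can be ordered by cumulative time directly rather than by name:

>>> cProfile.run("aggregate2('data.txt', 10)", sort='cumulative')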
4. Pluck the low-hanging fruit
We can avoid calling both strptime and strftime if we recognize that the only things we achieve by calling these two functions are (a) to translate the months from names (AUG) to numbers (08), and (b) to reorder the components into ISO standard order. So let's do that ourselves:
def aggregate3(filename, dlen):
    dlen = int(dlen)
    months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
                  MAY = '05', JUN = '06', JUL = '07', AUG = '08',
                  SEP = '09', OCT = '10', NOV = '11', DEC = '12')
    pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
                         'important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            m = pattern.match(line)
            if m:
                s = '{3}-{0}-{1} {4}'.format(months[m.group(2)], *m.groups())
                count[s[:dlen]] += 1
    return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate3('data.txt', 10), number = 1)
0.5073871612548828
There you go: a 90% speedup! That should get you down from three hours to 20 minutes or so. There are a few more things one might try (for example, doing the aggregations for all the different values of dlen in a single pass), but I think this is enough to be going on with.
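For instance, the single-pass idea could keep one counter per prefix length. A rough, untested sketch along the lines of aggregate3 (the name aggregate_all and the default prefix lengths are just placeholders):

import re
from collections import defaultdict

def aggregate_all(filename, dlens=(10, 13, 16, 20)):
    # One counter per prefix length, all filled in a single pass over the file.
    months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
                  MAY = '05', JUN = '06', JUL = '07', AUG = '08',
                  SEP = '09', OCT = '10', NOV = '11', DEC = '12')
    pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
                         'important data')
    counts = dict((dlen, defaultdict(int)) for dlen in dlens)
    with open(filename, 'r') as f:
        for line in f:
            m = pattern.match(line)
            if m:
                s = '{3}-{0}-{1} {4}'.format(months[m.group(2)], *m.groups())
                for dlen in dlens:
                    counts[dlen][s[:dlen]] += 1
    return dict((dlen, sorted(c.iteritems())) for dlen, c in counts.items())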
DATA = {}
Python convention is for ALL_CAPS names to be reserved for constants.
# search for DD-MMM-YYYY HH:MM:SS
# e.g. "01-JUL-2012 02:29:36 important data"
pattern = re.compile('^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
DATA = defaultdict(int)
i = 0
f = open(options.infilename, 'r')
I recommend using with to make sure the file is closed.
for line in f:
You should put this loop in a function. Code inside a function runs faster than code at the top level.
if re.match(pattern, line):
Do you really need a regular expression? From the file listing you gave, maybe you should just be checking something like line[21:35] == 'important data' (the timestamp takes the first 20 characters, then a space).
Also, use pattern.match(line): re.match does work when passed a precompiled pattern, but I've found that it has much worse performance.
if options.verbose:
    i += 1
    # print out every 1000 iterations
    if i % 1000 == 0:
        print str(i) + '\r',
# converts data date/time format to YYYY-MM-DD HH:MM:SS format (but still keep it as datetime !)
d = datetime.strptime(line[0:20], '%d-%b-%Y %H:%M:%S')
# converts d, which is datetime to string again
day_string = d.strftime('%Y-%m-%d %H:%M:%S')
DATA[str(day_string[0:int(options.dlen)])] += 1
There's a good chance you might be better off storing the datetime object rather than the string. On the other hand, is the file already in sorted order? In that case, all you need to do is check whether the time string has changed, and you can avoid storing things in a dictionary.
f.close()
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1), reverse=True)
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1))
L2 = sorted(DATA.iteritems(), key=operator.itemgetter(0))
If the incoming file is already sorted, you'll save a lot of time by maintaining that sort.
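If the file really is sorted, a sketch of that run-counting idea might look like this (hypothetical; it keys on the raw 20-character timestamp, assumes equal timestamps are always consecutive, and uses the cheap digit test from the other answers instead of a full "important data" check):

def count_runs(filename):
    # Count consecutive runs of identical 20-character timestamps,
    # relying on the file being in chronological order; no dictionary needed.
    results = []                    # list of (raw timestamp, count) pairs
    current, n = None, 0
    with open(filename, 'r') as f:
        for line in f:
            if line[:2].isdigit():  # cheap test for "important" lines
                key = line[:20]     # e.g. '01-AUG-2012 02:29:44'
                if key != current:
                    if current is not None:
                        results.append((current, n))
                    current, n = key, 0
                n += 1
    if current is not None:
        results.append((current, n))
    return results

The per-second pairs can then be reformatted and rolled up to minutes or hours afterwards, which is cheap because there are far fewer pairs than lines.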
- pb100 (Aug 31, 2012): This code is in def main():, which is then run by if __name__ == "__main__": main(). Is that what you mean by putting it inside a function?
- Winston Ewert (Aug 31, 2012): @przemol, yes, that's what I mean by putting it inside a function. If you've already done that: good.
Use string operations instead of regular expression matching. The RE module brings in a fully featured matching engine, which is overkill for this situation.
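For example, based on the sample lines in the question (the exact offsets, 21 and 35, are an assumption and would need checking against the real data), the test could be a pair of slice comparisons:

def is_important(line):
    # The timestamp occupies the first 20 characters, then a space,
    # then the literal marker text 'important data'.
    return line[:2].isdigit() and line[21:35] == 'important data'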
Here are a few ideas, none of them tested:
- Use a quick test to skip lines that can't possibly match the format, e.g. if line[:2].isdigit():
- Skip the regular expression entirely and let strptime raise an exception if the format isn't correct.
- Skip strptime and strftime and use the original date string directly in your dictionary. Use a second step to convert the strings before you sort, or use a custom sort key and retain the original format. (A rough sketch combining these ideas follows below.)
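A sketch combining the first and last ideas (untested; it treats any line starting with two digits and a parseable timestamp as important, aggregates on the raw timestamp first, and converts only the distinct keys at the end):

from collections import defaultdict
from datetime import datetime

def aggregate_raw(filename, dlen):
    dlen = int(dlen)
    # Pass 1: count lines keyed by the original 20-character timestamp.
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            if line[:2].isdigit():          # quick pre-test, no regex
                count[line[:20]] += 1
    # Pass 2: convert only the distinct keys (far fewer than the lines),
    # letting strptime reject anything that isn't really a timestamp.
    result = defaultdict(int)
    for key, n in count.iteritems():
        try:
            d = datetime.strptime(key, '%d-%b-%Y %H:%M:%S')
        except ValueError:
            continue
        result[d.strftime('%Y-%m-%d %H:%M:%S')[:dlen]] += n
    return sorted(result.iteritems())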
- Comment: ... time(1) command.
- Comment: Filter the file with grep '^[0-9]' before your script gets to it, so that you only process the lines that are important.