I have files with over 100 million lines in each:
...
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:36:02 important data
some unimportant data
blahblah (also unimportant data)
some unimportant data
01-AUG-2012 02:40:15 important data
some unimportant data
...
As you can see, there are important data (starting with date and time) and unimportant data. Also in each second, there can be many lines of important data.
My goal is to count the number of "important data" lines in each second (or minute, or hour, ...) and to reformat the date/time. The script lets me choose the aggregation interval via options.dlen:

options.dlen = 10 takes YYYY-MM-DD
options.dlen = 13 takes YYYY-MM-DD HH
options.dlen = 16 takes YYYY-MM-DD HH:MM
options.dlen = 20 takes YYYY-MM-DD HH:MM:SS
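In other words, options.dlen is just the length of the prefix kept from the reformatted timestamp. For example:

>>> s = '2012-08-01 02:29:44'
>>> s[:10], s[:13], s[:16], s[:20]
('2012-08-01', '2012-08-01 02', '2012-08-01 02:29', '2012-08-01 02:29:44')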
I have written the following script (this is the main part - I skip all the file openings, parameters etc).
DATA = {}
# search for DD-MMM-YYYY HH:MM:SS
# e.g. "01-JUL-2012 02:29:36 important data"
pattern = re.compile('^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
DATA = defaultdict(int)
i = 0
f = open(options.infilename, 'r')
for line in f:
    if re.match(pattern, line):
        if options.verbose:
            i += 1
            # print out every 1000 iterations
            if i % 1000 == 0:
                print str(i) + '\r',
        # converts data date/time format to YYYY-MM-DD HH:MM:SS format (but still keep it as datetime !)
        d = datetime.strptime(line[0:20], '%d-%b-%Y %H:%M:%S')
        # converts d, which is datetime to string again
        day_string = d.strftime('%Y-%m-%d %H:%M:%S')
        DATA[str(day_string[0:int(options.dlen)])] += 1
f.close()
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1), reverse=True)
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1))
L2 = sorted(DATA.iteritems(), key=operator.itemgetter(0))
It takes about 3 hours to process over 100 million lines. Can you suggest performance improvements for this script?
Update: I have just used PyPy and the same task on the same server took 45 minutes. I will try to add profile statistics.
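(A minimal way to collect those statistics with the standard library, assuming the loop above is wrapped in a function called main() as mentioned in the comments below:)

import cProfile

# Run the whole job under the profiler; sort the report by internal time.
cProfile.run('main()', sort='time')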
4 Answers
1. Make a test case
The first thing to do is to establish the existence of the performance problem, so let's knock up some test data.
def make_test_file(filename, n, t, delta):
    """
    Write `n` lines of test data to `filename`, starting at `t` (a
    datetime object) and stepping by `delta` (a timedelta object) each
    line.
    """
    with open(filename, 'w') as f:
        for _ in xrange(n):
            f.write(t.strftime('%d-%b-%Y %H:%M:%S ').upper())
            f.write('important data\n')
            t += delta
>>> from datetime import datetime, timedelta
>>> make_test_file('data.txt', 10**5, datetime.now(), timedelta(seconds=1))
And then with the OP's code in the function aggregate1(filename, dlen):
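(For reference, aggregate1 here is assumed to be just the question's loop wrapped up in a function, with the verbose progress printing left out; roughly:)

import re
import operator
from collections import defaultdict
from datetime import datetime

def aggregate1(filename, dlen):
    # The question's loop, essentially unchanged apart from being wrapped in a function.
    pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
    data = defaultdict(int)
    f = open(filename, 'r')
    for line in f:
        if re.match(pattern, line):
            d = datetime.strptime(line[0:20], '%d-%b-%Y %H:%M:%S')
            day_string = d.strftime('%Y-%m-%d %H:%M:%S')
            data[str(day_string[0:int(dlen)])] += 1
    f.close()
    return sorted(data.iteritems(), key=operator.itemgetter(0))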
>>> import timeit
>>> timeit.timeit(lambda:aggregate1('data.txt', 16), number = 1)
5.786283016204834
So on the real file (1000 times bigger) that would take an hour and a half on my machine (or longer, if the time complexity is worse than linear). So yes, there's a real performance problem.
2. Clean up the code
Let's try a bunch of obvious minor improvements and optimizations (mostly as suggested in other answers):
- Convert dlen to an integer once (not once for every line).
- Write day_string[:dlen] instead of str(day_string[0:dlen]).
- Write pattern.match(line) instead of re.match(pattern, line).
- There's no need for key = operator.itemgetter(0), because the sort will proceed on the first element of the pair in any case.
- Rename DATA to count and day_string to s (it's really a date-time string, not a day string).
- Use with to ensure that the file is closed in the event of an error.
- Import the name strptime so that it doesn't have to be looked up for every line.
Let's try that:
def aggregate2(filename, dlen):
    strptime = datetime.datetime.strptime
    dlen = int(dlen)
    pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            if pattern.match(line):
                d = strptime(line[:20], '%d-%b-%Y %H:%M:%S')
                s = d.strftime('%Y-%m-%d %H:%M:%S')
                count[s[:dlen]] += 1
    return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate2('data.txt', 10), number = 1)
5.200263977050781
A small improvement, 10% or so, but clean code makes the next step easier.
3. Profile
>>> import cProfile
>>> cProfile.run("aggregate2('data.txt', 10)")
2700009 function calls in 6.262 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.262 6.262 <string>:1(<module>)
100000 0.088 0.000 1.020 0.000 _strptime.py:27(_getlang)
100000 2.098 0.000 4.033 0.000 _strptime.py:295(_strptime)
100000 0.393 0.000 0.642 0.000 locale.py:339(normalize)
100000 0.105 0.000 0.747 0.000 locale.py:407(_parse_localename)
100000 0.119 0.000 0.933 0.000 locale.py:506(getlocale)
100000 0.067 0.000 0.067 0.000 {_locale.setlocale}
100000 0.515 0.000 4.548 0.000 {built-in method strptime}
100000 0.079 0.000 0.079 0.000 {isinstance}
200000 0.035 0.000 0.035 0.000 {len}
100000 0.043 0.000 0.043 0.000 {method 'end' of '_sre.SRE_Match' objects}
300001 0.076 0.000 0.076 0.000 {method 'get' of 'dict' objects}
100000 0.276 0.000 0.276 0.000 {method 'groupdict' of '_sre.SRE_Match' objects}
100000 0.090 0.000 0.090 0.000 {method 'index' of 'list' objects}
100000 0.025 0.000 0.025 0.000 {method 'iterkeys' of 'dict' objects}
100000 0.046 0.000 0.046 0.000 {method 'lower' of 'str' objects}
200000 0.553 0.000 0.553 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
100000 1.144 0.000 1.144 0.000 {method 'strftime' of 'datetime.date' objects}
I've cut some of the output for clarity. It should be clear that the culprits are strptime (73% of runtime), strftime (18%), and match (9%). Everything else is either called by one of those, or negligible.
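As an aside, cProfile.run also accepts a sort argument, so the table can be ordered by cumulative time directly rather than by name:

>>> cProfile.run("aggregate2('data.txt', 10)", sort='cumulative')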
4. Pluck the low-hanging fruit
We can avoid calling both strptime and strftime if we recognize that the only things we achieve by calling these two functions are (a) to translate the months from names (AUG) to numbers (08), and (b) to reorder the components into ISO standard order. So let's do that ourselves:
def aggregate3(filename, dlen):
    dlen = int(dlen)
    months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
                  MAY = '05', JUN = '06', JUL = '07', AUG = '08',
                  SEP = '09', OCT = '10', NOV = '11', DEC = '12')
    pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
                         'important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            m = pattern.match(line)
            if m:
                s = '{3}-{0}-{1} {4}'.format(months[m.group(2)], *m.groups())
                count[s[:dlen]] += 1
    return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate3('data.txt', 10), number = 1)
0.5073871612548828
There you go: a 90% speedup! That should get you down from three hours to 20 minutes or so. There are a few more things one might try (for example, doing the aggregations for all the different values of dlen in a single pass), but I think this is enough to be going on with.
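For instance, the single-pass idea could keep one counter per prefix length. A rough, untested sketch along the lines of aggregate3 (the name aggregate_all and the default prefix lengths are just placeholders):

import re
from collections import defaultdict

def aggregate_all(filename, dlens=(10, 13, 16, 20)):
    # One counter per prefix length, all filled in a single pass over the file.
    months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
                  MAY = '05', JUN = '06', JUL = '07', AUG = '08',
                  SEP = '09', OCT = '10', NOV = '11', DEC = '12')
    pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
                         'important data')
    counts = dict((dlen, defaultdict(int)) for dlen in dlens)
    with open(filename, 'r') as f:
        for line in f:
            m = pattern.match(line)
            if m:
                s = '{3}-{0}-{1} {4}'.format(months[m.group(2)], *m.groups())
                for dlen in dlens:
                    counts[dlen][s[:dlen]] += 1
    return dict((dlen, sorted(c.iteritems())) for dlen, c in counts.items())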
DATA = {}
Python convention is for ALL_CAPS names to be reserved for constants.
# search for DD-MMM-YYYY HH:MM:SS
# e.g. "01-JUL-2012 02:29:36 important data"
pattern = re.compile('^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
DATA = defaultdict(int)
i = 0
f = open(options.infilename, 'r')
I recommend using with to make sure the file is closed.
for line in f:
You should put this loop in a function. Code inside a function runs faster than code at the top level.
if re.match(pattern, line):
Do you really need a regular expression? From the file listing you gave, maybe you should just be checking something like line[21:35] == 'important data' (the timestamp takes the first 20 characters, then a space).
Also, use pattern.match(line): re.match does work when passed a precompiled pattern, but I've found that it has much worse performance.
if options.verbose:
    i += 1
    # print out every 1000 iterations
    if i % 1000 == 0:
        print str(i) + '\r',
# converts data date/time format to YYYY-MM-DD HH:MM:SS format (but still keep it as datetime !)
d = datetime.strptime(line[0:20], '%d-%b-%Y %H:%M:%S')
# converts d, which is datetime to string again
day_string = d.strftime('%Y-%m-%d %H:%M:%S')
DATA[str(day_string[0:int(options.dlen)])] += 1
There's a good chance you might be better off storing the datetime object rather than the string. On the other hand, is the file already in sorted order? In that case, all you need to do is check whether the time string has changed, and you can avoid storing things in a dictionary.
f.close()
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1), reverse=True)
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1))
L2 = sorted(DATA.iteritems(), key=operator.itemgetter(0))
If the incoming file is already sorted, you'll save a lot of time by maintaining that sort.
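If the file really is sorted, a sketch of that run-counting idea might look like this (hypothetical; it keys on the raw 20-character timestamp, assumes equal timestamps are always consecutive, and uses the cheap digit test from the other answers instead of a full "important data" check):

def count_runs(filename):
    # Count consecutive runs of identical 20-character timestamps,
    # relying on the file being in chronological order; no dictionary needed.
    results = []                    # list of (raw timestamp, count) pairs
    current, n = None, 0
    with open(filename, 'r') as f:
        for line in f:
            if line[:2].isdigit():  # cheap test for "important" lines
                key = line[:20]     # e.g. '01-AUG-2012 02:29:44'
                if key != current:
                    if current is not None:
                        results.append((current, n))
                    current, n = key, 0
                n += 1
    if current is not None:
        results.append((current, n))
    return results

The per-second pairs can then be reformatted and rolled up to minutes or hours afterwards, which is cheap because there are far fewer pairs than lines.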
- pb100 (Aug 31, 2012): This code is in def main():, which is then run by if __name__ == "__main__": main(). Is that what you mean by putting it inside a function?
- Winston Ewert (Aug 31, 2012): @przemol, yes, that's what I mean by putting it inside a function. If you've already done that: good.
Use string operations instead of regular expression matching. The RE module brings in a fully featured matching engine, which is overkill for this situation.
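For example, based on the sample lines in the question (the exact offsets, 21 and 35, are an assumption and would need checking against the real data), the test could be a pair of slice comparisons:

def is_important(line):
    # The timestamp occupies the first 20 characters, then a space,
    # then the literal marker text 'important data'.
    return line[:2].isdigit() and line[21:35] == 'important data'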
Here are a few ideas, none of them tested:
- Use a quick test to skip lines that can't possibly match the format, e.g. if line[:2].isdigit():
- Skip the regular expression entirely and let strptime raise an exception if the format isn't correct.
- Skip strptime and strftime and use the original date string directly in your dictionary. Use a second step to convert the strings before you sort, or use a custom sort key and retain the original format. (A rough sketch combining these ideas follows below.)
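A sketch combining the first and last ideas (untested; it treats any line starting with two digits and a parseable timestamp as important, aggregates on the raw timestamp first, and converts only the distinct keys at the end):

from collections import defaultdict
from datetime import datetime

def aggregate_raw(filename, dlen):
    dlen = int(dlen)
    # Pass 1: count lines keyed by the original 20-character timestamp.
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            if line[:2].isdigit():          # quick pre-test, no regex
                count[line[:20]] += 1
    # Pass 2: convert only the distinct keys (far fewer than the lines),
    # letting strptime reject anything that isn't really a timestamp.
    result = defaultdict(int)
    for key, n in count.iteritems():
        try:
            d = datetime.strptime(key, '%d-%b-%Y %H:%M:%S')
        except ValueError:
            continue
        result[d.strftime('%Y-%m-%d %H:%M:%S')[:dlen]] += n
    return sorted(result.iteritems())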
- Comment: ... time(1) command.
- Comment: Filter the file with grep '^[0-9]' before your script gets to it, so that you only process the lines that are important.