Can you please help me with the following script?
At the moment the script is taking up to 20 mins to execute, depending on the amount of data being processed (each time the script is executed, it processes several thousand lines of data). I would like to make the script more responsive. Can you tell me how I can update the script for better performance?
See comments in each of the main parts of the script:
#!/opt/SP/mdp/home/SCRIPTS/tools/Python-2.6.2/bin/python
import glob
import re as regex
from datetime import datetime
# Here I am getting two sets of files which I am going to use to create two separate dictionaries
Cnfiles = glob.glob('/logs/split_logs/Cn*_generic_activity.log')
Prfiles = glob.glob('/logs/split_logs/Pr*_generic_activity.log')
# Output file
log = file('/logs/split_logs/processed_data.log', 'w')
# First dictionary, holds received records
Cn = {}
for logfile in Cnfiles:
    with open(logfile) as logfile:
        filecontent = logfile.xreadlines()
        for line in filecontent:
            if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
                line = regex.sub('<._', '', line)
                line = line.replace('<', '')
                line = line.replace('>', '')
                line = line.replace('.', ' ')
                line = line.replace(r'|', ' ')
                line = line.strip()
                field = line.split(' ')
                opco = field[4]
                service = field[5]
                status = field[6]
                jarid = field[10]
                Cn.setdefault(opco, {}).setdefault(service, {}).setdefault(status, {})[jarid] = jarid
# Second dictionary, holds the various stages the records go through
Pr = {}
for logfile in Prfiles:
    with open(logfile) as logfile:
        filecontent = logfile.xreadlines()
        for line in filecontent:
            if 'status 7 to 13' in line or 'status 9 to 13' in line or 'status 7 to 14' in line or 'status 9 to 14' in line or 'status 5 to 504' in line or 'status 7 to 505' in line:
                line = line.replace('<', '')
                line = line.replace('>', '')
                line = line.replace('.', ' ')
                line = line.strip()
                field = line.split(' ')
                jarid = field[8]
                status = field[13]
                Pr.setdefault(status, {})[jarid] = jarid
# Up to this point, the script performs quite well, even with big files.
# However, the next step compares the two dictionaries and, if it finds a record that is in both dicts,
# creates new sub-dictionaries to hold the various stages of the records. This is the step
# that is taking the most time! Do you know of a better way I can do this?
for opco in Cn.keys():
    for service in Cn[opco].keys():
        for status in Cn[opco][service].keys():
            for jarid in Cn[opco][service][status].keys():
                if jarid in Pr['13'].keys():
                    Cn[opco][service].setdefault('ACK', {})[jarid] = jarid
                elif jarid in Pr['14'].keys():
                    Cn[opco][service].setdefault('NACK', {})[jarid] = jarid
                else:
                    if jarid in Pr['504'].keys() or jarid in Pr['505'].keys():
                        Cn[opco][service].setdefault('RETRY', {})[jarid] = jarid
# Once the new sub-dictionaries are created, I am just counting the number of records in
# each one and writing the output to the log.
timestamp = (datetime.now()).strftime('%y%m%d %H:%M')
for opco in sorted(Cn.keys()):
    for service in sorted(Cn[opco].keys()):
        if 'RECV' in Cn[opco][service].keys():
            recvcount = len(Cn[opco][service]['RECV'])
        else:
            recvcount = ''
        if 'NACK' in Cn[opco][service].keys():
            nackcount = len(Cn[opco][service]['NACK'])
        else:
            nackcount = ''
        if 'ACK' in Cn[opco][service].keys():
            ackcount = len(Cn[opco][service]['ACK'])
        else:
            ackcount = ''
        if 'RETRY' in Cn[opco][service].keys():
            retrycount = len(Cn[opco][service]['RETRY'])
        else:
            retrycount = ''
        log.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (timestamp, opco, service, recvcount, ackcount, nackcount, retrycount))
log.close()
UPDATE
I have updated my script using most of the suggestions posted here. I am now using regexes. I am not using dict.keys()
to iterate through the dicts anymore and I have shortened my statements, etc. However, my script is still running very slowly.
for opco in Cn:
    for service in Cn[opco]:
        for jarid in Cn[opco][service]['RECV']:
            if jarid in Pr['13']:
                Cn[opco][service].setdefault('ACK', []).append(jarid)
            elif jarid in Pr['14']:
                Cn[opco][service].setdefault('NACK', []).append(jarid)
            else:
                if jarid in Pr['504'] or jarid in Pr['505']:
                    Cn[opco][service].setdefault('RETRY', []).append(jarid)
I have run my script with a profiler and I can see that this step in the script is taking the most time (1368 CPU secs last time I executed it!). Please note that I am new to Python and I still don't have a good grasp on everything that can be done with it.
UPDATE
I have managed to get my whole script to execute in under 10 seconds by changing the 'for loop' I posted above to the following:
for opco in Cn:
    for service in Cn[opco]:
        ack = set(Cn[opco][service]['RECV']) & set(Pr['13'])
        for jarid in ack:
            Cn[opco][service].setdefault('ACK', set()).add(jarid)
        nack = set(Cn[opco][service]['RECV']) & set(Pr['14'])
        for jarid in nack:
            Cn[opco][service].setdefault('NACK', set()).add(jarid)
        retry = set(Cn[opco][service]['RECV']) & set(Pr['504'])
        for jarid in retry:
            Cn[opco][service].setdefault('RETRY', set()).add(jarid)
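For reference, a slightly tighter variant of the same idea (this is only a sketch: it builds the RECV set once per service and creates a status key only when there are matches, so the report step still prints blank fields; the '13'/'14'/'504' keys are the ones used above):

for opco in Cn:
    for service in Cn[opco]:
        recv = set(Cn[opco][service]['RECV'])
        for status, label in (('13', 'ACK'), ('14', 'NACK'), ('504', 'RETRY')):
            # Set intersection is a single C-level operation per status.
            matched = recv & set(Pr[status])
            if matched:
                Cn[opco][service][label] = matched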
5 Answers
This:
jarid in Pr['13'].keys()
gets the keys of the dict as a list and then searches the list, which loses the speed advantage of a dict. Searching a list is O(n), whereas searching a dict is O(1). It's faster to do this:
jarid in Pr['13']
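A quick micro-benchmark sketch of the difference (the dict size and repeat count are arbitrary, just to make the gap visible):

import timeit

setup = "d = dict((str(i), i) for i in range(100000))"

# Membership test against the dict itself: a single hash lookup.
print timeit.timeit("'99999' in d", setup=setup, number=1000)

# Membership test against d.keys(): builds a new list every time, then scans it linearly.
print timeit.timeit("'99999' in d.keys()", setup=setup, number=1000)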
This:
if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
is equivalent to this:
if ('SERV1' in line or 'SERV2' in line or 'SERV3' in line) and 'RECV' in line:
This:
line = line.replace('<', '')
line = line.replace('>', '')
line = line.replace('.', ' ')
line = line.replace(r'|', ' ')
can be shortened to this:
line = regex.sub('<._', '', line)
line = regex.sub('[<>.|]', '', line)
This:
line = line.replace('<', '')
line = line.replace('>', '')
line = line.replace('.', ' ')
can be shortened to this:
line = regex.sub('[<>.]', '', line)
Pre-compiling the regexes may give a little speed improvement, but not much, because the regexes are cached by the re module.
- Wow! Thanks MRAB, these are really good suggestions! I will definitely use them in my script. – Junior, Jul 13, 2011 at 19:29
- As for keys() (which I didn't mention in my answer because it was already mentioned here): I believe it gets even worse. It's not only the list vs. dict lookup; a new list must be created first (and at first sight, this is happening a lot, even multiple times for the same dict). – Steven, Jul 13, 2011 at 20:13
Suggestions:
- compile your regexes once and use the compiled versions
- go over the log files once instead of three times.
- find the lines in the log using a regex - I don't know the exact syntax, but it should say "starts with a newline, ends with a newline and has ...". Then compile it and use re.findall().
- try to avoid the line = line.replace(..) calls you have. Since strings are immutable, each replace creates a new string. Maybe use indices instead, e.g. opco = line[x:y]. I don't know how complicated the logs are, but I think that having a few ifs instead of all those replacements should speed things up.

If I understand your comment correctly, basically what you have is two giant lists: one for the "Receiver Log" and one for the "Stages" log. As I see it (correct me if I'm wrong), what you're doing is (after first reducing the lists) going entry by entry through one list and checking if the entry appears in the second list. Maybe (don't shoot me if this is ridiculous ;) sort one of the lists and go entry by entry through the second list while using binary search, i.e.:
l1 = open("Receiver Log").readlines()
l2 = open("Stages log").readlines()
sort(l2) # on jarid
for entry in l1:
if binarySearch(l2, entry['jarid']) # again, on jarid # analyze the l2
Theoretically speaking, you're making a tremendous speed-up. Practically, I don't know if it'll work. Python has a builtin list.sort(), and according to https://stackoverflow.com/questions/212358/binary-search-in-python it's easy to implement binary search.
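If you want to try that, here is a rough sketch using the standard-library bisect module for the binary search (the Pr/Cn shapes are taken from the question; treat it as illustrative rather than a drop-in replacement):

import bisect

def contains(sorted_list, value):
    # Binary search: True if value is present in the already-sorted list.
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

# Assuming Pr['13'] holds the jarids that reached status 13, as in the question:
acked = sorted(Pr['13'])

for opco in Cn:
    for service in Cn[opco]:
        for jarid in Cn[opco][service].get('RECV', {}):
            if contains(acked, jarid):
                Cn[opco][service].setdefault('ACK', []).append(jarid)

In practice, the set-intersection approach from the question's second update achieves a similar win with less code.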
- Hi @Guy, I have no knowledge of binary search. Please see my original post above; I have updated it to ask for help implementing a solution using binary search. – Junior, Jul 19, 2011 at 3:14
- The link I pointed to has an implementation of binary search in Python. – user5679, Jul 19, 2011 at 8:12
First, is your process I/O bound or CPU bound? Look at your CPU meter when it's running. If it's I/O bound, the CPU will be at less than 100% and there might not be much you can do about it except get better hardware.
If you're CPU bound, a number of things could help. It's not a huge savings, but it will help to pre-compile your regular expressions. Instead of
line = regex.sub('<._', '', line)
start with
regex1 = regex.compile('<._')
outside your loop, and inside the loop do
line = regex1.sub('', line)
Also the long if tests will run faster as regular expressions. Instead of
if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
Do
regex2 = regex.compile('SERV[1-3].*RECV') # Outside of loop
if regex2.search(line):
The regular expression engine is very smart and highly optimized. Since that line is running on every log entry, it should be real savings.
Similarly, a regex that only captures the 5th, 6th, 7th and 11th fields will save you from allocating memory for the strings of all the others that you aren't using. For example, to capture the third and fifth words in a line, use this regex (understanding that \s matches whitespace, \S matches non-whitespace, and parentheses determine what gets captured and stored in the match object's group):
regex3 = regex.compile("^\S+\s+\S+\s+(\S+)\s+\S+\s+(\S+)")
m = regex3.search(line)
word3 = m.group(1)
word5 = m.group(2)
A lot harder to read, but it will run a lot faster. (Comment your code!) In fact, if you just replace that whole series of line.replace and split calls with one big fancy regex, that would be best. Without an example of what the input looks like, it's not practical to construct that regex.
Also, you could try running a profiler.
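One way to do that with the standard library is cProfile plus pstats (process_logs here is a hypothetical wrapper around the script's main work, not a function from the original script):

import cProfile
import pstats

# Profile the hot code path and dump the raw stats to a file.
cProfile.run('process_logs()', '/tmp/profile.out')

# Print the 20 most expensive calls, sorted by cumulative time.
pstats.Stats('/tmp/profile.out').sort_stats('cumulative').print_stats(20)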
- Hi Leopd, can you please tell me what would be the regex to capture only certain fields from a line? – Junior, Jul 14, 2011 at 0:15
- Edited my answer with an example regex. – Leopd, Jul 14, 2011 at 21:24
You didn't complain about this part of the code, but I have some nitpicks:
with open(logfile) as logfile:
    filecontent = logfile.xreadlines()
    for line in filecontent:

# The call to xreadlines() is wasteful; you can iterate over the file object directly.
# I also wouldn't use the same name for the filename and the file object.

with open(logfile) as f:
    for line in f:
Two more suggestions:

- Make use of the logging module (see the sketch after this list). That way you won't have to do the kind of (buggy and unpretty) custom implementation here:

log = file('/logs/split_logs/processed_data.log', 'w')
log.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (timestamp, opco, service, recvcount, ackcount, nackcount, retrycount))
log.close()

- Have a look at PEP 8. It has something to say about your overlong lines.
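A minimal sketch of the logging-based replacement (the path and tab-separated format are carried over from the question; the basicConfig call is just one way to set it up):

import logging

# One-time setup: write bare messages straight to the output file.
logging.basicConfig(filename='/logs/split_logs/processed_data.log',
                    filemode='w',
                    format='%(message)s',
                    level=logging.INFO)

# Inside the reporting loop, instead of log.write(...):
logging.info('%s\t%s\t%s\t%s\t%s\t%s\t%s',
             timestamp, opco, service, recvcount, ackcount, nackcount, retrycount)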
Some other suggestions in addition to what has already been mentioned:
- stuff like some_dict[jarid] = jarid suggests that you actually want a set instead of a dictionary?
- you're using the setdefault method quite a lot: maybe have a look at collections.defaultdict
- in the case of Cn[opco][service]['RECV'] and so on, you seem to use that only to get the number of occurrences? You don't even have to use a set for that: have a look at collections.Counter or use a defaultdict(int). (You could just do something like service_counter['RECV'] += 1; you don't even have to check whether you've seen the key before or not. A sketch of this is shown after the list.)
- you're doing a whole lot of lookups again and again: for example you first ask for Cn[opco].keys(), then you do Cn[opco][service].keys(). You're looking up Cn[opco] again, and then you look up the service (with the key you have just retrieved), and then you do the whole thing again, but with an extra level, and again... Use the dictionary's items() or iteritems() method to iterate over keys and values at the same time. For example:

for opco, opco_val in Cn.items():
    for service, service_val in opco_val.items():
        ...

That way, you don't have to look it up again (and certainly not the whole chain!)
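Here is a rough sketch of that defaultdict-based counting (the nested opco/service layout and the sample records are only illustrative; collections.Counter would work similarly but needs Python 2.7, while the script above targets 2.6):

from collections import defaultdict

# Nested structure: counts[opco][service][status] starts at 0 automatically,
# so no setdefault() or "key in dict" checks are needed.
counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

# Hypothetical parsed records (opco, service, status) standing in for the log lines:
records = [('UK', 'SERV1', 'RECV'), ('UK', 'SERV1', 'RECV'), ('DE', 'SERV2', 'RECV')]

for opco, service, status in records:
    counts[opco][service][status] += 1

# iteritems() yields keys and values together, so the chain is looked up only once:
for opco, services in sorted(counts.iteritems()):
    for service, status_counts in sorted(services.iteritems()):
        print opco, service, status_counts['RECV']

Note that this only covers the counting side of the report; the jarids themselves still have to be kept around for the ACK/NACK/RETRY comparison against Pr.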
- Hello @Steven, I have tried your suggestion of iterating through the keys and values in pairs using dict.items(); however, when I then try to use 'service' as my value, I am getting the dictionary held at that value (e.g. opco is giving me 'UK', which is correct, but service is giving me '{SERV1:{RECV:{eai123ea:2011年07月18日 21:09:13.251}}}', which is a sub-dict). Can you please give me another example of how I can use your approach to iterate through my dicts? – Junior, Jul 19, 2011 at 3:10