Can you please help me with the following script?
At the moment the script is taking up to 20 mins to execute, depending on the amount of data being processed (each time the script is executed, it processes several thousand lines of data). I would like to make the script more responsive. Can you tell me how I can update the script for better performance?
See comments in each of the main parts of the script:
#!/opt/SP/mdp/home/SCRIPTS/tools/Python-2.6.2/bin/python
import glob
import re as regex
from datetime import datetime
# Here I am getting two sets of files which I am going to use to create two separate dictionaries
Cnfiles = glob.glob('/logs/split_logs/Cn*_generic_activity.log')
Prfiles = glob.glob('/logs/split_logs/Pr*_generic_activity.log')
# Output file
log = file('/logs/split_logs/processed_data.log', 'w')
# First dictionary, holds received records
Cn = {}
for logfile in Cnfiles:
    with open(logfile) as logfile:
        filecontent = logfile.xreadlines()
        for line in filecontent:
            if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
                line = regex.sub('<._', '', line)
                line = line.replace('<', '')
                line = line.replace('>', '')
                line = line.replace('.', ' ')
                line = line.replace(r'|', ' ')
                line = line.strip()
                field = line.split(' ')
                opco = field[4]
                service = field[5]
                status = field[6]
                jarid = field[10]
                Cn.setdefault(opco, {}).setdefault(service, {}).setdefault(status, {})[jarid] = jarid
# Second dictionary, holds the various stages the records go through
Pr = {}
for logfile in Prfiles:
    with open(logfile) as logfile:
        filecontent = logfile.xreadlines()
        for line in filecontent:
            if 'status 7 to 13' in line or 'status 9 to 13' in line or 'status 7 to 14' in line or 'status 9 to 14' in line or 'status 5 to 504' in line or 'status 7 to 505' in line:
                line = line.replace('<', '')
                line = line.replace('>', '')
                line = line.replace('.', ' ')
                line = line.strip()
                field = line.split(' ')
                jarid = field[8]
                status = field[13]
                Pr.setdefault(status, {})[jarid] = jarid
# Up to this point, the script performs quite well, even with big files.
# However, the next step compares the two dictionaries and, if it finds a record that is in both dicts,
# creates new sub-dictionaries to hold the various stages of the records. This is the step
# that is taking the most time! Do you know of a better way I can do this?
for opco in Cn.keys():
    for service in Cn[opco].keys():
        for status in Cn[opco][service].keys():
            for jarid in Cn[opco][service][status].keys():
                if jarid in Pr['13'].keys():
                    Cn[opco][service].setdefault('ACK', {})[jarid] = jarid
                elif jarid in Pr['14'].keys():
                    Cn[opco][service].setdefault('NACK', {})[jarid] = jarid
                else:
                    if jarid in Pr['504'].keys() or jarid in Pr['505'].keys():
                        Cn[opco][service].setdefault('RETRY', {})[jarid] = jarid
# Once the new sub-dictionaries are created, I am just counting the number of records in
# each one and writing the output to the log.
timestamp = (datetime.now()).strftime('%y%m%d %H:%M')
for opco in sorted(Cn.keys()):
    for service in sorted(Cn[opco].keys()):
        if 'RECV' in Cn[opco][service].keys():
            recvcount = len(Cn[opco][service]['RECV'])
        else:
            recvcount = ''
        if 'NACK' in Cn[opco][service].keys():
            nackcount = len(Cn[opco][service]['NACK'])
        else:
            nackcount = ''
        if 'ACK' in Cn[opco][service].keys():
            ackcount = len(Cn[opco][service]['ACK'])
        else:
            ackcount = ''
        if 'RETRY' in Cn[opco][service].keys():
            retrycount = len(Cn[opco][service]['RETRY'])
        else:
            retrycount = ''
        log.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (timestamp, opco, service, recvcount, ackcount, nackcount, retrycount))
log.close()
UPDATE
I have updated my script using most of the suggestions posted here. I am now using regexes. I am not using dict.keys()
to iterate through the dicts anymore and I have shortened my statements, etc. However, my script is still running very slowly.
for opco in Cn:
    for service in Cn[opco]:
        for jarid in Cn[opco][service]['RECV']:
            if jarid in Pr['13']:
                Cn[opco][service].setdefault('ACK', []).append(jarid)
            elif jarid in Pr['14']:
                Cn[opco][service].setdefault('NACK', []).append(jarid)
            else:
                if jarid in Pr['504'] or jarid in Pr['505']:
                    Cn[opco][service].setdefault('RETRY', []).append(jarid)
I have run my script with a profiler and I can see that this step in the script is taking the most time (1368 CPU secs last time I executed it!). Please note that I am new to Python and I still don't have a good grasp on everything that can be done with it.
UPDATE
I have managed to get my whole script to execute in under 10 seconds by changing the 'for loop' I posted above to the following:
for opco in Cn:
    for service in Cn[opco]:
        ack = set(Cn[opco][service]['RECV']) & set(Pr['13'])
        for jarid in ack:
            Cn[opco][service].setdefault('ACK', set()).add(jarid)
        nack = set(Cn[opco][service]['RECV']) & set(Pr['14'])
        for jarid in nack:
            Cn[opco][service].setdefault('NACK', set()).add(jarid)
        retry = set(Cn[opco][service]['RECV']) & set(Pr['504'])
        for jarid in retry:
            Cn[opco][service].setdefault('RETRY', set()).add(jarid)
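For reference, a slightly tighter variant of the same idea (this is only a sketch: it builds the RECV set once per service and creates a status key only when there are matches, so the report step still prints blank fields; the '13'/'14'/'504' keys are the ones used above):

for opco in Cn:
    for service in Cn[opco]:
        recv = set(Cn[opco][service]['RECV'])
        for status, label in (('13', 'ACK'), ('14', 'NACK'), ('504', 'RETRY')):
            # Set intersection is a single C-level operation per status.
            matched = recv & set(Pr[status])
            if matched:
                Cn[opco][service][label] = matched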
5 Answers
This:
jarid in Pr['13'].keys()
gets the keys of the dict as a list and then searches the list, which loses the speed advantage of a dict. Searching a list is O(n), whereas searching a dict is O(1). It's faster to do this:
jarid in Pr['13']
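A quick micro-benchmark sketch of the difference (the dict size and repeat count are arbitrary, just to make the gap visible):

import timeit

setup = "d = dict((str(i), i) for i in range(100000))"

# Membership test against the dict itself: a single hash lookup.
print timeit.timeit("'99999' in d", setup=setup, number=1000)

# Membership test against d.keys(): builds a new list every time, then scans it linearly.
print timeit.timeit("'99999' in d.keys()", setup=setup, number=1000)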
This:
if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
is equivalent to this:
if ('SERV1' in line or 'SERV2' in line or 'SERV3' in line) and 'RECV' in line:
This:
line = line.replace('<', '')
line = line.replace('>', '')
line = line.replace('.', ' ')
line = line.replace(r'|', ' ')
can be shortened to this:
line = regex.sub('<._', '', line)
line = regex.sub('[<>.|]', '', line)
This:
line = line.replace('<', '')
line = line.replace('>', '')
line = line.replace('.', ' ')
can be shortened to this:
line = regex.sub('[<>.]', '', line)
Pre-compiling the regexes may give a little speed improvement, but not much, because the regexes are cached by the re module.
- Wow! Thanks MRAB, these are really good suggestions! I will definitely use them in my script. – Junior, Jul 13, 2011 at 19:29
- As for keys() (which I didn't mention in my answer because it was already mentioned here): I believe it gets even worse. It's not only the list vs. dict lookup; a new list must be created first (and at first sight, this is happening a lot, even multiple times for the same dict). – Steven, Jul 13, 2011 at 20:13
Suggestions:
- compile your regexes once and use the compiled versions
- go over the log files once instead of three times.
- find the lines in the log using a regex - I don't know the exact syntax, but it should say "starts with a newline, ends with a newline and has ...". Then compile it and use re.findall().
- try to avoid the line = line.replace(..) calls you have. Since strings are immutable, each replace creates a new string. Maybe use indices instead, e.g. opco = line[x:y]. I don't know how complicated the logs are, but I think that having a few ifs instead of all those replacements should speed things up.

If I understand your comment correctly, basically what you have is two giant lists: one for the "Receiver Log" and one for the "Stages" log. As I see it (correct me if I'm wrong), what you're doing is (after first reducing the lists) going entry by entry through one list and checking if the entry appears in the second list. Maybe (don't shoot me if this is ridiculous ;) sort one of the lists and go entry by entry through the second list while using binary search, i.e.:
l1 = open("Receiver Log").readlines()
l2 = open("Stages log").readlines()
sort(l2) # on jarid
for entry in l1:
if binarySearch(l2, entry['jarid']) # again, on jarid # analyze the l2
Theoretically speaking, you're making a tremendous speed-up. Practically, I don't know if it'll work. Python has a builtin list.sort(), and according to https://stackoverflow.com/questions/212358/binary-search-in-python it's easy to implement binary search.
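If you want to try that, here is a rough sketch using the standard-library bisect module for the binary search (the Pr/Cn shapes are taken from the question; treat it as illustrative rather than a drop-in replacement):

import bisect

def contains(sorted_list, value):
    # Binary search: True if value is present in the already-sorted list.
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

# Assuming Pr['13'] holds the jarids that reached status 13, as in the question:
acked = sorted(Pr['13'])

for opco in Cn:
    for service in Cn[opco]:
        for jarid in Cn[opco][service].get('RECV', {}):
            if contains(acked, jarid):
                Cn[opco][service].setdefault('ACK', []).append(jarid)

In practice, the set-intersection approach from the question's second update achieves a similar win with less code.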
- Hi @Guy, I have no knowledge of binary search. Please see my original post above; I have updated it to ask for help implementing a solution using binary search. – Junior, Jul 19, 2011 at 3:14
- The link I pointed to has an implementation of binary search in Python. – user5679, Jul 19, 2011 at 8:12
First, is your process I/O bound or CPU bound? Look at your CPU meter when it's running. If it's I/O bound, the CPU will be at less than 100% and there might not be much you can do about it except get better hardware.
If you're CPU bound, a number of things could help. It's not a huge savings, but it will help to pre-compile your regular expressions. Instead of
line = regex.sub('<._', '', line)
start with
regex1 = regex.compile('<._')
outside your loop, and inside the loop do
line = regex1.sub('', line)
Also the long if tests will run faster as regular expressions. Instead of
if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
Do
regex2 = regex.compile('SERV[1-3].*RECV') # Outside of loop
if regex2.search(line):
The regular expression engine is very smart and highly optimized. Since that line is running on every log entry, it should be real savings.
Similarly, a regex that only captures the 5th, 6th, 7th and 11th fields will save you from allocating memory for the strings of all the others that you aren't using. For example, to capture the third and fifth words in a line, use this regex (understanding that \s matches whitespace, \S matches non-whitespace, and parentheses determine what gets captured and stored in the match object's group):
regex3 = regex.compile("^\S+\s+\S+\s+(\S+)\s+\S+\s+(\S+)")
m = regex3.search(line)
word3 = m.group(1)
word5 = m.group(2)
A lot harder to read, but it will run a lot faster. (Comment your code!) In fact, if you just replace that whole series of line.replace and split calls with one big fancy regex, that would be best. Without an example of what the input looks like, it's not practical to construct that regex.
Also, you could try running a profiler.
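One way to do that with the standard library is cProfile plus pstats (process_logs here is a hypothetical wrapper around the script's main work, not a function from the original script):

import cProfile
import pstats

# Profile the hot code path and dump the raw stats to a file.
cProfile.run('process_logs()', '/tmp/profile.out')

# Print the 20 most expensive calls, sorted by cumulative time.
pstats.Stats('/tmp/profile.out').sort_stats('cumulative').print_stats(20)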
- Hi Leopd, can you please tell me what would be the regex to capture only certain fields from a line? – Junior, Jul 14, 2011 at 0:15
- Edited my answer with an example regex. – Leopd, Jul 14, 2011 at 21:24
You didn't complain about this part of the code, but I have some nitpicks:
with open(logfile) as logfile:
    filecontent = logfile.xreadlines()
    for line in filecontent:

# The call to xreadlines() is wasteful; you can iterate over the file object directly.
# I also wouldn't use the same name for the filename and the file object.

with open(logfile) as f:
    for line in f:
Two more suggestions:

- Make use of the logging module (see the sketch after this list). That way you won't have to do the kind of (buggy and unpretty) custom implementation here:

log = file('/logs/split_logs/processed_data.log', 'w')
log.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (timestamp, opco, service, recvcount, ackcount, nackcount, retrycount))
log.close()

- Have a look at PEP 8. It has something to say about your overlong lines.
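A minimal sketch of the logging-based replacement (the path and tab-separated format are carried over from the question; the basicConfig call is just one way to set it up):

import logging

# One-time setup: write bare messages straight to the output file.
logging.basicConfig(filename='/logs/split_logs/processed_data.log',
                    filemode='w',
                    format='%(message)s',
                    level=logging.INFO)

# Inside the reporting loop, instead of log.write(...):
logging.info('%s\t%s\t%s\t%s\t%s\t%s\t%s',
             timestamp, opco, service, recvcount, ackcount, nackcount, retrycount)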
Some other suggestions in addition to what has already been mentioned:
- stuff like some_dict[jarid] = jarid suggests that you actually want a set instead of a dictionary?
- you're using the setdefault method quite a lot: maybe have a look at collections.defaultdict
- in the case of Cn[opco][service]['RECV'] and so on, you seem to use that only to get the number of occurrences? You don't even have to use a set for that: have a look at collections.Counter or use a defaultdict(int). (You could just do something like service_counter['RECV'] += 1; you don't even have to check whether you've seen the key before or not. A sketch of this is shown after the list.)
- you're doing a whole lot of lookups again and again: for example you first ask for Cn[opco].keys(), then you do Cn[opco][service].keys(). You're looking up Cn[opco] again, and then you look up the service (with the key you have just retrieved), and then you do the whole thing again, but with an extra level, and again... Use the dictionary's items() or iteritems() method to iterate over keys and values at the same time. For example:

for opco, opco_val in Cn.items():
    for service, service_val in opco_val.items():
        ...

That way, you don't have to look it up again (and certainly not the whole chain!)
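Here is a rough sketch of that defaultdict-based counting (the nested opco/service layout and the sample records are only illustrative; collections.Counter would work similarly but needs Python 2.7, while the script above targets 2.6):

from collections import defaultdict

# Nested structure: counts[opco][service][status] starts at 0 automatically,
# so no setdefault() or "key in dict" checks are needed.
counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

# Hypothetical parsed records (opco, service, status) standing in for the log lines:
records = [('UK', 'SERV1', 'RECV'), ('UK', 'SERV1', 'RECV'), ('DE', 'SERV2', 'RECV')]

for opco, service, status in records:
    counts[opco][service][status] += 1

# iteritems() yields keys and values together, so the chain is looked up only once:
for opco, services in sorted(counts.iteritems()):
    for service, status_counts in sorted(services.iteritems()):
        print opco, service, status_counts['RECV']

Note that this only covers the counting side of the report; the jarids themselves still have to be kept around for the ACK/NACK/RETRY comparison against Pr.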
- Hello @Steven, I have tried your suggestion of iterating through the keys and values in pairs using dict.items(); however, when I then try to use 'service' as my value, I am getting the dictionary held at that value (e.g. opco is giving me 'UK', which is correct, but service is giving me '{SERV1:{RECV:{eai123ea:2011年07月18日 21:09:13.251}}}', which is a sub-dict). Can you please give me another example of how I can use your approach to iterate through my dicts? – Junior, Jul 19, 2011 at 3:10