I have several huge (100 MB) text files that I need to scan through to pick out frame numbers relating to a specific log packet of interest. My plan was to scan for these frame numbers and drop them into a list (often 6000+ frames per text file!). So far so good, but there is a second packet of interest that should accompany the first; these pairs of packets can only be matched by a frame number in my newly created list, which lets me avoid grabbing useless/blank data from non-matching frames. So I then re-scan the text file to get the packet data related to each frame. It was suggested on Stack Overflow that I go down the regex route, and while this method worked it was extremely slow, taking up to 3 minutes to process a single text file.
I was wondering if there is a suitable optimization to my code, or even a completely different approach to grabbing this data?
for root, subFolders, files in os.walk(path):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                print '\tProcessing file: ' + filename
                for line in f:
                    # First find the key packet and grab its frame number.
                    if 'KEY_FRAME details' in line:
                        chunk = [next(f) for x in xrange(5)]
                        FRAME = chunk[4].split()  # last of the five lines read
                        FRAME = FRAME[2]
                        # Drop the frame number into a list.
                        framelist.append(str(FRAME))
                # Return to the start of the file, and search for the next packet.
                f.seek(0)
                framed = re.compile('|'.join(framelist))
                framed = framed.pattern
                # Look for any frame number in the list, based on
                # 'FrameNumber = '+f and 'FN = '+f matches.
                sentences = f
                for s in sentences:
                    if any(('FrameNumber = ' + f) in s for f in framelist):
                        print 'first found'
                        # do stuff
                    if any(('FN = ' + f) in s for f in framelist):
                        print 'second found'
                        # do stuff
The data I scan through consists of multiple repetitions of various packets; some packets carry no info, while others carry paired info related to a particular frame.
17:29:50.040 AFP 276 Second_Packet details FN = 54332 Tp = TDSAA Te = True St = Test 17:29:50.040 TWR 765 KEY_FRAME details TAPP = 1 FrameNumber = 54332 Cap = 2 Tee = NA 17:29:50.040 AKK 347 Second_Packet details FN = 50000 Tp = KLA Te = True St = NA 17:29:50.040 AFP 276 Second_Packet details FN = 54367 Tp = Ax56 Te = True St = Test 17:29:50.040 YYY 765 KEY_FRAME details TAPP = 1 FrameNumber = 54367 Cap = 2 Tee = NA 17:29:50.040 YYY 765 KEY_FRAME details TAPP = 1 etc......
Just to make it tougher, the second packet often comes before the first key packet in the text files.
Finally, I am looking to present the data in packet groups like below, for each key frame number:
FrameNumber = 54367 TAPP = 1 Cap = 2 Tee = NA Tp = Ax56 Te = True St = Test
- So, for "Second_Packet" and "KEY_FRAME" to match, they should be in the same file? – Ashwini Chaudhary, Jan 21, 2015 at 17:30
- Hi Ashwini Chaudhary, yes all packets are in the same file, but there are other packets present in the same file with non-matching frame numbers, so I want to ignore those too. – MikG, Jan 21, 2015 at 18:21
3 Answers
It looks like you're storing the IDs of all frames ever found in a list called framelist, but you want to match a KEY_FRAME and its related Second_Packet only within the current file being iterated. So your code also iterates over frames that may never have appeared in the current file.

Instead, I'd suggest you keep track of every Second_Packet that appears before its matching KEY_FRAME of the current file in a dictionary (second_packets), using the FN as key; for Second_Packets whose FN is already present in the packets dictionary, we can merge the data right away. Afterwards we loop over the keys common to the second_packets and packets dictionaries and update the corresponding data in packets.

Also, since the frames appear to have a fixed set of fields, instead of keeping all of the frames in a list or dictionary it's better to store them in a database (the easiest option is sqlite3) or write them to a CSV file as soon as you complete processing a file. With this approach you only hold the current file's data in memory at a time, and all other global data goes to the database or CSV, which is going to save lots of RAM. For 100 MB files this is quite feasible.
from pprint import pprint

def process_data(file_object, n):
    '''
    Read n lines from file_object, split them at '=' and return
    a dictionary.
    '''
    processed_data = (next(file_object).split('=', 1) for _ in xrange(n))
    return {k.strip(): v.strip() for k, v in processed_data}

# Do this for each file.
with open('file.txt') as f:
    second_packets = {}
    packets = {}
    for line in f:
        if 'KEY_FRAME details' in line:
            data = process_data(f, 4)
            packets[data['FrameNumber']] = data
        if 'Second_Packet details' in line:
            data = process_data(f, 4)
            # If a frame with the current FN is already present in
            # packets then merge data with the data in packets, else
            # store this data with FN as key in second_packets.
            if data['FN'] in packets:
                packets[data['FN']].update(data)
            else:
                second_packets[data['FN']] = data
    # Iterate only over the common keys and update data in packets.
    for k in second_packets.viewkeys() & packets.viewkeys():
        packets[k].update(second_packets[k])
    # Now either write the packets dictionary to a CSV file or a database.
    pprint(packets)
Outputs:
{'54332': {'Cap': '2',
'FN': '54332',
'FrameNumber': '54332',
'St': 'Test',
'TAPP': '1',
'Te': 'True',
'Tee': 'NA',
'Tp': 'TDSAA'},
'54367': {'Cap': '2',
'FN': '54367',
'FrameNumber': '54367',
'St': 'Test',
'TAPP': '1',
'Te': 'True',
'Tee': 'NA',
'Tp': 'Ax56'}}
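To illustrate the "write to CSV as soon as you finish a file" suggestion, here is a minimal sketch using csv.DictWriter (written for Python 3, unlike the Python 2 code above; the field list and the packets.csv file name are assumptions for illustration):

```python
import csv

# Merged packet data in the shape produced above (values from the sample).
packets = {
    '54332': {'FrameNumber': '54332', 'FN': '54332', 'TAPP': '1',
              'Cap': '2', 'Tee': 'NA', 'Tp': 'TDSAA', 'Te': 'True', 'St': 'Test'},
    '54367': {'FrameNumber': '54367', 'FN': '54367', 'TAPP': '1',
              'Cap': '2', 'Tee': 'NA', 'Tp': 'Ax56', 'Te': 'True', 'St': 'Test'},
}

fields = ['FrameNumber', 'TAPP', 'Cap', 'Tee', 'Tp', 'Te', 'St']
with open('packets.csv', 'w', newline='') as out:
    # extrasaction='ignore' drops keys (like the redundant FN) that are
    # not in the chosen field list.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for frame in sorted(packets):
        writer.writerow(packets[frame])
```

Appending rows per file this way keeps only one file's packets in memory at any time.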
- Thanks @Ashwini Chaudhary, this is exactly what I was looking for; it now takes less than 10 seconds to process the file! – MikG, Jan 22, 2015 at 14:54
I would strongly recommend a different algorithm for processing your file. I would do it in a three-stage approach, requiring only two reads of the file:
In the first stage, we scan the file and do a few things:
- For each line, call stream.tell() on the file stream and remember the byte position.
- Identify all Second_Packet blocks and the FN they relate to.
- Store the FN and the byte position from the tell into a dictionary.
- Identify all KEY_FRAME blocks, and create an object to represent each one, with its number. Store these objects in a list.
The second stage involves processing the KEY_FRAME records and identifying where the matching Second_Packet records are. From the map, order the reads to happen in byte-position order in the stream.
Here we scan the file again, in order of the Second_Packet byte positions and the KEY_FRAME instances they belong to.
- seek in the file to the position of the first needed Second_Packet
- strip off whatever information you need to complete the record.
- update the KEY_FRAME instance of data with the required information
By performing only two scans through the file (the first is a full scan, the second is an in-order-but-random-and-selective scan) you reduce the amount of times you process the data.
In your current system, you are scanning the file many times (once, then an additional time for each KEY_FRAME record, and again for each second packet...). The loops you have at the end are very costly:
for s in sentences:
    if any(('FrameNumber = '+f) in s for f in framelist):
        print 'first found'
        #do stuff
    if any(('FN = '+f) in s for f in framelist):
        print 'second found'
        #do stuff
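The tell/seek approach described above can be sketched roughly as follows. This is not rolfl's code: it is a Python 3 sketch, simplified to one packet per line, and the field layout is assumed from the sample data in the question:

```python
import io

SAMPLE = """\
Second_Packet details FN = 54332 Tp = TDSAA
KEY_FRAME details FrameNumber = 54332 Cap = 2
Second_Packet details FN = 50000 Tp = KLA
Second_Packet details FN = 54367 Tp = Ax56
KEY_FRAME details FrameNumber = 54367 Cap = 2
"""

def value_after(fields, key):
    # In 'FN = 54332' the value sits two tokens after the key name.
    return fields[fields.index(key) + 2]

def two_pass(f):
    # Pass 1: remember the byte offset of every Second_Packet line, keyed
    # by FN, and collect the KEY_FRAME records. readline() is used instead
    # of iteration so that tell() stays valid between lines.
    second_offsets = {}   # FN -> byte position of its line
    key_frames = {}       # FrameNumber -> partial record
    pos = f.tell()
    line = f.readline()
    while line:
        fields = line.split()
        if 'Second_Packet' in line:
            second_offsets[value_after(fields, 'FN')] = pos
        elif 'KEY_FRAME' in line:
            fn = value_after(fields, 'FrameNumber')
            key_frames[fn] = {'FrameNumber': fn,
                              'Cap': value_after(fields, 'Cap')}
        pos = f.tell()
        line = f.readline()

    # Pass 2: revisit only the Second_Packet lines that match a KEY_FRAME,
    # in byte order, so the second scan moves forward through the file.
    wanted = sorted((second_offsets[fn], fn) for fn in key_frames
                    if fn in second_offsets)
    for offset, fn in wanted:
        f.seek(offset)
        fields = f.readline().split()
        key_frames[fn]['Tp'] = value_after(fields, 'Tp')
    return key_frames

packets = two_pass(io.StringIO(SAMPLE))
```

Note how the non-matching frame (FN = 50000) is skipped entirely in the second pass.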
- Thanks @rolfl for your answer; it looks a little beyond my current Python expertise at the moment. I shall have to take some time to research this and get back to you. – MikG, Jan 21, 2015 at 16:55
If you know that the two packets cannot be very far apart, you can do this in one pass over the file.
- Keep a sufficiently large, constant size FIFO buffer of the recently seen packets.
- Identify key frames before putting them into the buffer and make an entry for them in a dictionary.
- Identify second packets when they pop out from the buffer and update the corresponding entry in the dictionary if one exists.
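A rough Python 3 sketch of that single-pass buffer, using collections.deque with a fixed maxlen. The window size and line layout are assumptions, and this is a simplified variant that scans the buffer when a key frame arrives rather than on eviction:

```python
from collections import deque

WINDOW = 100  # assumed: max lines the two packets of a pair can be apart

def value_after(fields, key):
    # In 'FN = 54332' the value sits two tokens after the key name.
    return fields[fields.index(key) + 2]

def one_pass(lines):
    buffer = deque(maxlen=WINDOW)   # recently seen Second_Packet lines
    packets = {}                    # FrameNumber -> merged record
    for line in lines:
        fields = line.split()
        if 'KEY_FRAME' in line:
            fn = value_after(fields, 'FrameNumber')
            packets[fn] = {'FrameNumber': fn}
            # A matching Second_Packet may already be in the buffer.
            for old in buffer:
                if value_after(old, 'FN') == fn:
                    packets[fn]['Tp'] = value_after(old, 'Tp')
        elif 'Second_Packet' in line:
            fn = value_after(fields, 'FN')
            if fn in packets:               # key frame already seen
                packets[fn]['Tp'] = value_after(fields, 'Tp')
            else:                           # key frame not seen yet
                buffer.append(fields)
    return packets

lines = [
    'Second_Packet details FN = 54332 Tp = TDSAA',
    'KEY_FRAME details FrameNumber = 54332 Cap = 2',
    'KEY_FRAME details FrameNumber = 54367 Cap = 2',
    'Second_Packet details FN = 54367 Tp = Ax56',
]
packets = one_pass(lines)
```

Because deque discards the oldest entry once maxlen is reached, memory stays bounded no matter how large the file is; the catch is that a pair separated by more than WINDOW lines will be missed.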