I have several huge (100 MB) text files that I need to scan through to pick out frame numbers relating to a specific log packet of interest. My plan was to scan for these frame numbers and drop them into a list (often 6000+ frames per text file!). So far so good, but there is a second packet of interest that should accompany the first; these pairs of packets can only be matched by a frame number in my newly created list, which lets me avoid grabbing useless/blank data from non-matching frames. So I then re-scan the text file to get the packet data related to each frame. It was suggested on Stack Overflow that I go down the regex route, and while this method worked it was extremely slow, taking up to 3 minutes to process a single text file.
I was wondering if there is a suitable optimization to my code, or even a completely different approach to grabbing this data?
for root, subFolders, files in os.walk(path):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                print '\tProcessing file: ' + filename
                for line in f:
                    # First find the key packet and grab its frame number.
                    if 'KEY_FRAME details' in line:
                        chunk = [next(f) for x in xrange(5)]
                        FRAME = chunk[4].split()  # last of the five lines read
                        FRAME = FRAME[2]
                        # Drop the frame number into a list.
                        framelist.append(str(FRAME))
                # Return to the start of the file, and search for the next packet.
                f.seek(0)
                framed = re.compile('|'.join(framelist))
                framed = framed.pattern
                # Look for any frame number in the list, based on
                # 'FrameNumber = '+f and 'FN = '+f matches.
                sentences = f
                for s in sentences:
                    if any(('FrameNumber = ' + f) in s for f in framelist):
                        print 'first found'
                        # do stuff
                    if any(('FN = ' + f) in s for f in framelist):
                        print 'second found'
                        # do stuff
The data I scan through consists of multiple repetitions of various packets; some packets carry no info, while others carry paired info related to a particular frame.
17:29:50.040 AFP 276 Second_Packet details FN = 54332 Tp = TDSAA Te = True St = Test 17:29:50.040 TWR 765 KEY_FRAME details TAPP = 1 FrameNumber = 54332 Cap = 2 Tee = NA 17:29:50.040 AKK 347 Second_Packet details FN = 50000 Tp = KLA Te = True St = NA 17:29:50.040 AFP 276 Second_Packet details FN = 54367 Tp = Ax56 Te = True St = Test 17:29:50.040 YYY 765 KEY_FRAME details TAPP = 1 FrameNumber = 54367 Cap = 2 Tee = NA 17:29:50.040 YYY 765 KEY_FRAME details TAPP = 1 etc......
Just to make it tougher, the second packet often comes before the first key packet in the text files.
Finally, I am looking to present the data in packet groups like below, for each key frame number:
FrameNumber = 54367 TAPP = 1 Cap = 2 Tee = NA Tp = Ax56 Te = True St = Test
- So, for "Second_Packet" and "KEY_FRAME" to match, they should be in the same file? – Ashwini Chaudhary, Jan 21, 2015 at 17:30
- Hi Ashwini Chaudhary, yes all packets are in the same file, but there are other packets present in the same file with non-matching frame numbers, so I want to ignore those too. – MikG, Jan 21, 2015 at 18:21
3 Answers
It looks like you're storing the IDs of all frames ever found in a list called framelist, but you want to match a KEY_FRAME and its related Second_Packet only within the current file being iterated. So your code also iterates over frames that may never have appeared in the current file.

Instead, I'd suggest you keep track of every Second_Packet that appears before its matching KEY_FRAME of the current file in a dictionary (second_packets), using the FN as key; for Second_Packets whose FN is already present in the packets dictionary, we can merge the data right away. Afterwards we loop over the keys common to the second_packets and packets dictionaries and update the corresponding data in packets.

Also, since the frames appear to have a fixed set of fields, instead of keeping all of the frames in a list or dictionary it's better to store them in a database (the easiest option is sqlite3) or write them to a CSV file as soon as you complete processing a file. With this approach you only hold the current file's data in memory at a time, and all other global data goes to the database or CSV, which is going to save lots of RAM. For 100 MB files this is quite feasible.
from pprint import pprint

def process_data(file_object, n):
    '''
    Read n lines from file_object, split them at '=' and return
    a dictionary.
    '''
    processed_data = (next(file_object).split('=', 1) for _ in xrange(n))
    return {k.strip(): v.strip() for k, v in processed_data}

# Do this for each file.
with open('file.txt') as f:
    second_packets = {}
    packets = {}
    for line in f:
        if 'KEY_FRAME details' in line:
            data = process_data(f, 4)
            packets[data['FrameNumber']] = data
        if 'Second_Packet details' in line:
            data = process_data(f, 4)
            # If a frame with the current FN is already present in
            # packets then merge data with the data in packets, else
            # store this data with FN as key in second_packets.
            if data['FN'] in packets:
                packets[data['FN']].update(data)
            else:
                second_packets[data['FN']] = data
    # Iterate only over the common keys and update data in packets.
    for k in second_packets.viewkeys() & packets.viewkeys():
        packets[k].update(second_packets[k])
    # Now either write the packets dictionary to a CSV file or a database.
    pprint(packets)
Outputs:
{'54332': {'Cap': '2',
'FN': '54332',
'FrameNumber': '54332',
'St': 'Test',
'TAPP': '1',
'Te': 'True',
'Tee': 'NA',
'Tp': 'TDSAA'},
'54367': {'Cap': '2',
'FN': '54367',
'FrameNumber': '54367',
'St': 'Test',
'TAPP': '1',
'Te': 'True',
'Tee': 'NA',
'Tp': 'Ax56'}}
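To illustrate the "write to CSV as soon as you finish a file" suggestion, here is a minimal sketch using csv.DictWriter (written for Python 3, unlike the Python 2 code above; the field list and the packets.csv file name are assumptions for illustration):

```python
import csv

# Merged packet data in the shape produced above (values from the sample).
packets = {
    '54332': {'FrameNumber': '54332', 'FN': '54332', 'TAPP': '1',
              'Cap': '2', 'Tee': 'NA', 'Tp': 'TDSAA', 'Te': 'True', 'St': 'Test'},
    '54367': {'FrameNumber': '54367', 'FN': '54367', 'TAPP': '1',
              'Cap': '2', 'Tee': 'NA', 'Tp': 'Ax56', 'Te': 'True', 'St': 'Test'},
}

fields = ['FrameNumber', 'TAPP', 'Cap', 'Tee', 'Tp', 'Te', 'St']
with open('packets.csv', 'w', newline='') as out:
    # extrasaction='ignore' drops keys (like the redundant FN) that are
    # not in the chosen field list.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for frame in sorted(packets):
        writer.writerow(packets[frame])
```

Appending rows per file this way keeps only one file's packets in memory at any time.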
- Thanks @Ashwini Chaudhary, this is exactly what I was looking for; it now takes less than 10 seconds to process the file! – MikG, Jan 22, 2015 at 14:54
I would strongly recommend a different algorithm for processing your file. I would do it in a three-stage approach, requiring only two reads of the file:
In the first stage, we scan the file and do a few things:
- For each line, call stream.tell() on the file stream and remember the byte position.
- Identify all Second_Packet blocks and the FN they relate to.
- Store the FN and the byte position from the tell into a dictionary.
- Identify all KEY_FRAME blocks, and create an object to represent each one, with its number. Store these objects in a list.
The second stage involves processing the KEY_FRAME records and identifying where the matching Second_Packet records are. From the map, order the reads to happen in byte-position order in the stream.
Here we scan the file again, in order of the Second_Packet byte positions and the KEY_FRAME instances they belong to.
- seek in the file to the position of the first needed Second_Packet
- strip off whatever information you need to complete the record.
- update the KEY_FRAME instance of data with the required information
By performing only two scans through the file (the first is a full scan, the second is an in-order-but-random-and-selective scan) you reduce the amount of times you process the data.
In your current system, you are scanning the file many times (once, then an additional time for each KEY_FRAME record, and again for each second packet...). The loops you have at the end are very costly:
for s in sentences:
    if any(('FrameNumber = '+f) in s for f in framelist):
        print 'first found'
        #do stuff
    if any(('FN = '+f) in s for f in framelist):
        print 'second found'
        #do stuff
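The tell/seek approach described above can be sketched roughly as follows. This is not rolfl's code: it is a Python 3 sketch, simplified to one packet per line, and the field layout is assumed from the sample data in the question:

```python
import io

SAMPLE = """\
Second_Packet details FN = 54332 Tp = TDSAA
KEY_FRAME details FrameNumber = 54332 Cap = 2
Second_Packet details FN = 50000 Tp = KLA
Second_Packet details FN = 54367 Tp = Ax56
KEY_FRAME details FrameNumber = 54367 Cap = 2
"""

def value_after(fields, key):
    # In 'FN = 54332' the value sits two tokens after the key name.
    return fields[fields.index(key) + 2]

def two_pass(f):
    # Pass 1: remember the byte offset of every Second_Packet line, keyed
    # by FN, and collect the KEY_FRAME records. readline() is used instead
    # of iteration so that tell() stays valid between lines.
    second_offsets = {}   # FN -> byte position of its line
    key_frames = {}       # FrameNumber -> partial record
    pos = f.tell()
    line = f.readline()
    while line:
        fields = line.split()
        if 'Second_Packet' in line:
            second_offsets[value_after(fields, 'FN')] = pos
        elif 'KEY_FRAME' in line:
            fn = value_after(fields, 'FrameNumber')
            key_frames[fn] = {'FrameNumber': fn,
                              'Cap': value_after(fields, 'Cap')}
        pos = f.tell()
        line = f.readline()

    # Pass 2: revisit only the Second_Packet lines that match a KEY_FRAME,
    # in byte order, so the second scan moves forward through the file.
    wanted = sorted((second_offsets[fn], fn) for fn in key_frames
                    if fn in second_offsets)
    for offset, fn in wanted:
        f.seek(offset)
        fields = f.readline().split()
        key_frames[fn]['Tp'] = value_after(fields, 'Tp')
    return key_frames

packets = two_pass(io.StringIO(SAMPLE))
```

Note how the non-matching frame (FN = 50000) is skipped entirely in the second pass.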
- Thanks @rolfl for your answer; it looks a little beyond my current Python expertise at the moment. I shall have to take some time to research this and get back to you. – MikG, Jan 21, 2015 at 16:55
If you know that the two packets cannot be very far apart, you can do this in one pass over the file.
- Keep a sufficiently large, constant size FIFO buffer of the recently seen packets.
- Identify key frames before putting them into the buffer and make an entry for them in a dictionary.
- Identify second packets when they pop out from the buffer and update the corresponding entry in the dictionary if one exists.
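A rough Python 3 sketch of that single-pass buffer, using collections.deque with a fixed maxlen. The window size and line layout are assumptions, and this is a simplified variant that scans the buffer when a key frame arrives rather than on eviction:

```python
from collections import deque

WINDOW = 100  # assumed: max lines the two packets of a pair can be apart

def value_after(fields, key):
    # In 'FN = 54332' the value sits two tokens after the key name.
    return fields[fields.index(key) + 2]

def one_pass(lines):
    buffer = deque(maxlen=WINDOW)   # recently seen Second_Packet lines
    packets = {}                    # FrameNumber -> merged record
    for line in lines:
        fields = line.split()
        if 'KEY_FRAME' in line:
            fn = value_after(fields, 'FrameNumber')
            packets[fn] = {'FrameNumber': fn}
            # A matching Second_Packet may already be in the buffer.
            for old in buffer:
                if value_after(old, 'FN') == fn:
                    packets[fn]['Tp'] = value_after(old, 'Tp')
        elif 'Second_Packet' in line:
            fn = value_after(fields, 'FN')
            if fn in packets:               # key frame already seen
                packets[fn]['Tp'] = value_after(fields, 'Tp')
            else:                           # key frame not seen yet
                buffer.append(fields)
    return packets

lines = [
    'Second_Packet details FN = 54332 Tp = TDSAA',
    'KEY_FRAME details FrameNumber = 54332 Cap = 2',
    'KEY_FRAME details FrameNumber = 54367 Cap = 2',
    'Second_Packet details FN = 54367 Tp = Ax56',
]
packets = one_pass(lines)
```

Because deque discards the oldest entry once maxlen is reached, memory stays bounded no matter how large the file is; the catch is that a pair separated by more than WINDOW lines will be missed.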