
I have a big log file (say 1-3 GB) that I need to parse, extract data from, and save in a CSV file.

Text File Data

 * D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
 * D:40027C5C rd-byte 00 *core0\Global\Ypf_OILL_OilLvlOn 20.342us
 * D:40010044 rd-word 0FE2 *l\u2SAD_OILLVS_RecoveryCounter 0.160us
 * D:40010044 wr-word 0FE1 *l\u2SAD_OILLVS_RecoveryCounter 0.040us
 * D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us

I have to extract the variable name that appears after the last \, count the reads and writes per core along with the datatype, and store the result in a CSV file.

CSV File Result

 Variable           Datatype   CORE 0          CORE 1          CORE X
                               Read   Write    Read    Write   Read   Write
 OS_inKernel        byte       0      0        111768  111878  0      0
 OS_globalIntLevel  long       0      0        281604  237901  0      0

The problem is that it takes too much time. Can you please look at the code below and suggest ways to make it faster?

import string
import sys
import time

MyFile = open("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\core1_sram_ReadWrite.txt")  # core0_sram_ReadWrite_rawdata
GeneratedFile = open(str(("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\") + 'ParsedOutput.csv'), 'w')
try:
    MyVariableList = []
    TimeStartTest = time.time()  # starting time
    GeneratedFile.write('\nVariable')
    GeneratedFile.write(', Datatype')
    GeneratedFile.write(', CORE 0')
    GeneratedFile.write(',, CORE 1')
    GeneratedFile.write(',, CORE X')
    GeneratedFile.write('\n,, Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write(', Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write(', Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write('\n')
    for CurrentLine in MyFile:
        NoofSpaces = 0
        if CurrentLine.find('\\') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('\\') + 1:].split(' ')[0]
        elif CurrentLine.find('*\\') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('*\\') + 1:].split(' ')[0]
        elif CurrentLine.find('*') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('*') + 1:].split(' ')[0]
        VariableFound = 0
        MyVariableList.sort()
        Lowerbound = 0
        Upperbound = len(MyVariableList) - 1
        # Binary search for MyVariable in the sorted list
        while Lowerbound <= Upperbound and VariableFound == 0:
            middle_pos = (Lowerbound + Upperbound) // 2
            if MyVariableList[middle_pos] < MyVariable:
                Lowerbound = middle_pos + 1
            elif MyVariableList[middle_pos] > MyVariable:
                Upperbound = middle_pos - 1
            else:
                VariableFound = 1
        if VariableFound == 0:
            MyVariableList.append(MyVariable)
            try:
                # Rescan the whole file counting accesses for this variable
                MyFile1 = open("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\core1_sram_ReadWrite.txt")  # core0_sram_ReadWrite_rawdata
                Core0_ReadCount = 0
                Core0_WriteCount = 0
                Core1_ReadCount = 0
                Core1_WriteCount = 0
                CoreX_ReadCount = 0
                CoreX_WriteCount = 0
                for CurrentLine1 in MyFile1:
                    if CurrentLine1.find(MyVariable) != -1:
                        ## CORE 0 ##
                        if CurrentLine1.find("0\\Global") != -1:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                Core0_ReadCount = Core0_ReadCount + 1
                            elif DataOperation == 'wr':
                                Core0_WriteCount = Core0_WriteCount + 1
                        ## CORE 1 ##
                        elif CurrentLine1.find("1\\Global") != -1:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                Core1_ReadCount = Core1_ReadCount + 1
                            elif DataOperation == 'wr':
                                Core1_WriteCount = Core1_WriteCount + 1
                        ## CORE X ##
                        else:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                CoreX_ReadCount = CoreX_ReadCount + 1
                            elif DataOperation == 'wr':
                                CoreX_WriteCount = CoreX_WriteCount + 1
                GeneratedFile.write('\n %s' % MyVariable)
                GeneratedFile.write(', %s' % DataType)
                GeneratedFile.write(', %d' % Core0_ReadCount)
                GeneratedFile.write(', %d' % Core0_WriteCount)
                GeneratedFile.write(', %d' % Core1_ReadCount)
                GeneratedFile.write(', %d' % Core1_WriteCount)
                GeneratedFile.write(', %d' % CoreX_ReadCount)
                GeneratedFile.write(', %d' % CoreX_WriteCount)
                GeneratedFile.write('\n')
            finally:
                MyFile1.close()
except:
    print sys.exc_info()
finally:
    GeneratedFile.close()
    MyFile.close()
    TimeStopTest = time.time()
    print str(int((TimeStopTest - TimeStartTest) / 60))  # elapsed minutes
asked Apr 8, 2015 at 8:21
3 Comments

  • Have you profiled it? Commented Apr 8, 2015 at 8:35
  • Are you sure that this code works? elif CurrentLine.find('*\\') != -1: couldn't possibly happen. Commented Apr 8, 2015 at 10:03
  • Please check your indentation. The entire body of the try block is not indented. Commented Apr 8, 2015 at 10:35

2 Answers


Strategy

The cause of the performance problem is that for every line where you encounter a new variable name, you open the same file again and scan through, looking at every line that contains the same variable name. That is a process that takes \$O(n^2)\$ time, where \$n\$ is the number of lines in the input file. For large input files, anything slower than \$O(n)\$ would be unacceptable.

A root cause of your problem is that you aren't using data structures effectively. For example, you maintain a MyVariableList array, which you periodically sort so that you can do a binary search on it. What you want is a set, or if you want to preserve the order of appearance of the variables in the output, an OrderedDict.
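For illustration, here is the same membership bookkeeping done with a set: no per-line sort() call, and lookups are O(1) on average (a minimal sketch, with variable names taken from the sample data):

seen = set()
for var in ['u4TimeHiCnt', 'Ypf_OILL_OilLvlOn', 'u4TimeHiCnt']:
    if var not in seen:  # average O(1) membership test
        seen.add(var)    # no sort, no binary search
print(sorted(seen))      # ['Ypf_OILL_OilLvlOn', 'u4TimeHiCnt']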

A reasonable approach would make just one pass through the file, accumulating statistics in a data structure as you go, and then write a summary of that data structure when you reach the end of the input.

The outline of your program should look like this. A few points to note:

  • Organize code into functions
  • Use with blocks to open files, so that they will automatically be closed
  • Take advantage of Python's csv library
from collections import OrderedDict
import csv
import re

def analyze_log(f):
    stats = OrderedDict()
    for line in f:
        re.search(...)
        ...
    return stats

def write_stats(stats, f):
    out = csv.writer(f)
    out.writerow(...)
    for var in stats:
        out.writerow(...)

def main(input_filename, output_filename):
    with open(input_filename) as input_file:
        stats = analyze_log(input_file)
    with open(output_filename, 'w') as output_file:
        write_stats(stats, output_file)

if __name__ == '__main__':
    main(r'C:\Users\AEC_FULL\...\core1_sram_ReadWrite.txt',
         r'C:\Users\AEC_FULL\...\ParsedOutput.csv')

Even better, take the input from fileinput.input() and write the output to sys.stdout. Specify the input files on the shell command line and redirect the output to a file, so you avoid hard-coding the input and output filenames in your script. Then you could just write

import fileinput
import sys

if __name__ == '__main__':
    write_stats(analyze_log(fileinput.input()), sys.stdout)
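You would then invoke it from the shell along the lines of python log_stats.py core1_sram_ReadWrite.txt > ParsedOutput.csv (log_stats.py being whatever you name the script), and fileinput will happily accept several log files at once.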

Parsing

Slicing each line with find() and split() calls all over the place is confusing. You would be much better off trying to "make sense" of the input file's fields, then using regular expression matches.

def analyze_log(f):
    stats = OrderedDict()
    for line in f:
        _, _, rw_datatype, _, core_varname, _ = line.split()
        match = re.search(r'.*[*\\](.*)', core_varname)
        if not match:
            continue
        var = match.group(1)
        match = re.search(r'([01])\\Global', core_varname)
        core = match and match.group(1) or 'X'
        rw, datatype = rw_datatype.split('-', 1)
        var_stats = stats.get(var, {'rd': {'0': 0, '1': 0, 'X': 0},
                                    'wr': {'0': 0, '1': 0, 'X': 0},
                                    'type': datatype})
        stats[var] = var_stats
        var_stats[rw][core] += 1
    return stats
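Hand-tracing this over the five sample lines in the question (traced by hand, not executed), stats should come out as an OrderedDict along these lines:

{'u4TimeHiCnt': {'rd': {'0': 0, '1': 0, 'X': 0},
                 'wr': {'0': 2, '1': 0, 'X': 0}, 'type': 'long'},
 'Ypf_OILL_OilLvlOn': {'rd': {'0': 1, '1': 0, 'X': 0},
                       'wr': {'0': 0, '1': 0, 'X': 0}, 'type': 'byte'},
 'u2SAD_OILLVS_RecoveryCounter': {'rd': {'0': 0, '1': 0, 'X': 1},
                                  'wr': {'0': 0, '1': 0, 'X': 1}, 'type': 'word'}}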

Writing

Using the CSV library would result in tidier code than writing commas and newlines.

def write_stats(stats, f):
    out = csv.writer(f)
    out.writerow(['Variable', 'Datatype',
                  'CORE 0', None, 'CORE 1', None, 'CORE X', None])
    out.writerow([None, None] + ['Read', 'Write'] * 3)
    for var in stats:
        out.writerow([var, stats[var]['type'],
                      stats[var]['rd']['0'], stats[var]['wr']['0'],
                      stats[var]['rd']['1'], stats[var]['wr']['1'],
                      stats[var]['rd']['X'], stats[var]['wr']['X']])
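One caveat: per the csv module's documentation, the output file should be opened in binary mode ('wb') on Python 2, or with newline='' on Python 3, so that the writer controls line endings itself.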
answered Apr 8, 2015 at 11:58

OK, I think I have another solution, but it's pure AWK. Given this content in armEtm.txt:

 * D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
 * D:40027C5C rd-byte 00 *core0\Global\Ypf_OILL_OilLvlOn 20.342us
 * D:40010044 rd-word 0FE2 *l\u2SAD_OILLVS_RecoveryCounter 0.160us
 * D:40010044 wr-word 0FE1 *l\u2SAD_OILLVS_RecoveryCounter 0.040us
 * D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us 

and this code in armEtmToCsv.awk:

#!/usr/bin/awk -f
BEGIN {
    FS = " "
}
{
    n = split(5,ドル a, "\\")
    m = split(3,ドル b, "-")
    rw = b[1]
    t = b[2]
    if (match(5,ドル /0\\Global/)) c = "0";
    else if (match(5,ドル /1\\Global/)) c = "1";
    else c = "X";
    key = a[n] "*" rw "*" t "*" c
    arr[key]++
}
END {
    for (k in arr) {
        split(k, d, "*")
        print d[1] "," d[2] "," d[3] "," d[4] "," arr[k]
    }
}

running it as ./armEtmToCsv.awk armEtm.txt gives this output:

Ypf_OILL_OilLvlOn,rd,byte,0,1
u2SAD_OILLVS_RecoveryCounter,wr,word,X,1
u2SAD_OILLVS_RecoveryCounter,rd,word,X,1
u4TimeHiCnt,wr,long,0,2

Isn't that almost what you wanted? Of course there's a for loop in the END block, but it's a short one: it visits each key exactly once. Turn the output into a pandas DataFrame with the Python script below and you're done; you may have to use the pivot-table feature in pandas to get the final layout (see the sketch after the script):

import io
import subprocess

import pandas as pd

cmd = './armEtmToCsv.awk armEtm.txt'
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
o, e = p.communicate()  # waits for the script and collects its stdout/stderr
r = p.returncode
i = io.StringIO(o.decode())
# The AWK output has no header row, so name the columns explicitly.
df = pd.read_csv(i, header=None,
                 names=['variable', 'op', 'datatype', 'core', 'count'])
print('Parsing result in Pandas dataframe:')
print(df)

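If you then need the question's Read/Write-per-core layout, a pivot along these lines should get close (a sketch, untested; the column names are the ones passed to names= above):

# One row per (variable, datatype), one column per (op, core) pair,
# zeros where a variable was never accessed by that core.
table = df.pivot_table(index=['variable', 'datatype'],
                       columns=['op', 'core'],
                       values='count',
                       fill_value=0,
                       aggfunc='sum')
print(table)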
I really think this approach is very fast, but I can't test it...

answered Apr 8, 2015 at 14:38
