I have a big log file (say 1-3 GB) that I need to parse, extract data from, and save in a CSV file.
Text File Data
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
* D:40027C5C rd-byte 00 *core0\Global\Ypf_OILL_OilLvlOn 20.342us
* D:40010044 rd-word 0FE2 *l\u2SAD_OILLVS_RecoveryCounter 0.160us
* D:40010044 wr-word 0FE1 *l\u2SAD_OILLVS_RecoveryCounter 0.040us
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
I have to extract the variable name (the part after the last \), then count the reads and writes along with the datatype, and store it all in a CSV file.
CSV File Result
Variable            Datatype   CORE 0          CORE 1           CORE X
                               Read   Write    Read    Write    Read   Write
OS_inKernel         byte       0      0        111768  111878   0      0
OS_globalIntLevel   long       0      0        281604  237901   0      0
The problem is that it takes too much time. Can you please look at the code below and suggest ways to make it faster?
import string
import sys
import time

MyFile = open("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\core1_sram_ReadWrite.txt") #core0_sram_ReadWrite_rawdata
GeneratedFile = open(str(("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\")+'ParsedOutput.csv'),'w')

try:
    MyVariableList = []
    TimeStartTest = time.time() #Starting Time
    GeneratedFile.write('\nVariable')
    GeneratedFile.write(', Datatype')
    GeneratedFile.write(', CORE 0')
    GeneratedFile.write(',, CORE 1')
    GeneratedFile.write(',, CORE X')
    GeneratedFile.write('\n,, Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write(', Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write(', Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write('\n')

    for CurrentLine in MyFile:
        NoofSpaces = 0
        if CurrentLine.find('\\') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('\\')+1:].split(' ')[0]
        elif CurrentLine.find('*\\') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('*\\')+1:].split(' ')[0]
        elif CurrentLine.find('*') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('*')+1:].split(' ')[0]

        VariableFound = 0
        MyVariableList.sort()
        Lowerbound = 0
        Upperbound = len(MyVariableList)-1
        while Lowerbound <= Upperbound and VariableFound == 0:
            middle_pos = (Lowerbound+Upperbound) // 2
            if MyVariableList[middle_pos] < MyVariable:
                Lowerbound = middle_pos + 1
            elif MyVariableList[middle_pos] > MyVariable:
                Upperbound = middle_pos - 1
            else:
                VariableFound = 1

        if VariableFound == 0:
            MyVariableList.append(MyVariable)
            try:
                MyFile1 = open("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\core1_sram_ReadWrite.txt") #core0_sram_ReadWrite_rawdata
                Core0_ReadCount = 0
                Core0_WriteCount = 0
                Core1_ReadCount = 0
                Core1_WriteCount = 0
                CoreX_ReadCount = 0
                CoreX_WriteCount = 0
                for CurrentLine1 in MyFile1:
                    if CurrentLine1.find(MyVariable) != -1:
                        ## CORE 0 ##
                        if CurrentLine1.find("0\\Global") != -1:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                Core0_ReadCount = Core0_ReadCount + 1
                            elif DataOperation == 'wr':
                                Core0_WriteCount = Core0_WriteCount + 1
                        ## CORE 1 ##
                        elif CurrentLine1.find("1\\Global") != -1:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                Core1_ReadCount = Core1_ReadCount + 1
                            elif DataOperation == 'wr':
                                Core1_WriteCount = Core1_WriteCount + 1
                        ## CORE X ##
                        else:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                CoreX_ReadCount = CoreX_ReadCount + 1
                            elif DataOperation == 'wr':
                                CoreX_WriteCount = CoreX_WriteCount + 1
                GeneratedFile.write('\n %s' %MyVariable)
                GeneratedFile.write(', %s' %DataType)
                GeneratedFile.write(', %d' %Core0_ReadCount)
                GeneratedFile.write(', %d' %Core0_WriteCount)
                GeneratedFile.write(', %d' %Core1_ReadCount)
                GeneratedFile.write(', %d' %Core1_WriteCount)
                GeneratedFile.write(', %d' %CoreX_ReadCount)
                GeneratedFile.write(', %d' %CoreX_WriteCount)
                GeneratedFile.write('\n')
            finally:
                MyFile1.close()
except:
    print sys.exc_info()
finally:
    GeneratedFile.close()
    MyFile.close()
    TimeStopTest = time.time()
    print str(int((TimeStopTest - TimeStartTest)/60))
2 Answers
Strategy
The cause of the performance problem is that for every line where you encounter a new variable name, you open the same file again and scan through it, looking at every line that contains that variable name. That is a process that takes \$O(n^2)\$ time, where \$n\$ is the number of lines in the input file. For large input files, anything slower than \$O(n)\$ would be unacceptable.
A root cause of your problem is that you aren't using data structures effectively. For example, you maintain a MyVariableList array, which you periodically sort so that you can do a binary search on it. What you want is a set, or, if you want to preserve the order of appearance of the variables in the output, an OrderedDict.
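For instance, a set turns the "have I seen this variable before?" check into a single constant-time membership test. A minimal sketch (the parsing expression here is only illustrative):

seen = set()
with open('core1_sram_ReadWrite.txt') as log_file:
    for line in log_file:
        var = line.rsplit('\\', 1)[-1].split()[0]  # name after the last backslash
        if var not in seen:  # O(1) average-case membership test, no sorting needed
            seen.add(var)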
A reasonable approach would make just one pass through the file, accumulating statistics in a data structure as you go, and then write a summary of that data structure when you reach the end of the input.
The outline of your program should look like this. To note:
- Organize code into functions.
- Use with blocks to open files, so that they will automatically be closed.
- Take advantage of Python's csv library.
from collections import OrderedDict
import csv
import re

def analyze_log(f):
    stats = OrderedDict()
    for line in f:
        re.search(...)
        ...
    return stats

def write_stats(stats, f):
    out = csv.writer(f)
    out.writerow(...)
    for var in stats:
        out.writerow(...)

def main(input_filename, output_filename):
    with open(input_filename) as input_file:
        stats = analyze_log(input_file)
    with open(output_filename, 'w') as output_file:
        write_stats(stats, output_file)

if __name__ == '__main__':
    main(r'C:\Users\AEC_FULL\...\core1_sram_ReadWrite.txt',
         r'C:\Users\AEC_FULL\...\ParsedOutput.csv')
Even better, take the input from fileinput.input(), and write the output to sys.stdout. Use the shell command line to specify the input files and redirect the output to a file, avoiding hard-coded input and output filenames in your script. Then you could just write:
import fileinput
import sys

if __name__ == '__main__':
    write_stats(analyze_log(fileinput.input()), sys.stdout)
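You would then invoke it from the shell like this (the script name is illustrative):

python trace_stats.py core0_sram_ReadWrite.txt core1_sram_ReadWrite.txt > ParsedOutput.csv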
Parsing
Slicing each line with find() and split() calls all over the place is confusing. You would be much better off making sense of the input file's fields first, then using regular-expression matches.
def analyze_log(f):
    stats = OrderedDict()
    for line in f:
        _, _, rw_datatype, _, core_varname, _ = line.split()
        match = re.search(r'.*[*\\](.*)', core_varname)
        if not match:
            continue
        var = match.group(1)
        match = re.search(r'([01])\\Global', core_varname)
        core = match and match.group(1) or 'X'
        rw, datatype = rw_datatype.split('-', 1)
        var_stats = stats.get(var, {'rd': {'0': 0, '1': 0, 'X': 0},
                                    'wr': {'0': 0, '1': 0, 'X': 0},
                                    'type': datatype})
        stats[var] = var_stats
        var_stats[rw][core] += 1
    return stats
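For the five sample lines in the question, analyze_log would accumulate a structure like this (an illustration, formatted for readability):

OrderedDict([
    ('u4TimeHiCnt',
     {'rd': {'0': 0, '1': 0, 'X': 0}, 'wr': {'0': 2, '1': 0, 'X': 0}, 'type': 'long'}),
    ('Ypf_OILL_OilLvlOn',
     {'rd': {'0': 1, '1': 0, 'X': 0}, 'wr': {'0': 0, '1': 0, 'X': 0}, 'type': 'byte'}),
    ('u2SAD_OILLVS_RecoveryCounter',
     {'rd': {'0': 0, '1': 0, 'X': 1}, 'wr': {'0': 0, '1': 0, 'X': 1}, 'type': 'word'}),
])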
Writing
Using the csv library results in tidier code than writing commas and newlines by hand.
def write_stats(stats, f):
    out = csv.writer(f)
    out.writerow(['Variable', 'Datatype',
                  'CORE 0', None, 'CORE 1', None, 'CORE X', None])
    out.writerow([None, None] + ['Read', 'Write'] * 3)
    for var in stats:
        out.writerow([var, stats[var]['type'],
                      stats[var]['rd']['0'], stats[var]['wr']['0'],
                      stats[var]['rd']['1'], stats[var]['wr']['1'],
                      stats[var]['rd']['X'], stats[var]['wr']['X']])
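For the sample stats shown earlier, this would emit something like the following (None cells become empty fields):

Variable,Datatype,CORE 0,,CORE 1,,CORE X,
,,Read,Write,Read,Write,Read,Write
u4TimeHiCnt,long,0,2,0,0,0,0
Ypf_OILL_OilLvlOn,byte,1,0,0,0,0,0
u2SAD_OILLVS_RecoveryCounter,word,0,0,0,0,1,1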
OK, I think I have another solution, but it's pure AWK. Put this content in armEtm.txt:
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
* D:40027C5C rd-byte 00 *core0\Global\Ypf_OILL_OilLvlOn 20.342us
* D:40010044 rd-word 0FE2 *l\u2SAD_OILLVS_RecoveryCounter 0.160us
* D:40010044 wr-word 0FE1 *l\u2SAD_OILLVS_RecoveryCounter 0.040us
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
And this code in armEtmToCsv.awk:
#!/usr/bin/awk -f
BEGIN {
    FS = " "
}
{
    n = split(5,ドル a, "\\")
    m = split(3,ドル b, "-")
    rw = b[1]
    t = b[2]
    if (match(5,ドル /0\\Global/)) c = "0";
    else if (match(5,ドル /1\\Global/)) c = "1";
    else c = "X";
    key = a[n] "*" rw "*" t "*" c
    arr[key]++
}
END {
    for (k in arr) {
        split(k, d, "*")
        print d[1] "," d[2] "," d[3] "," d[4] "," arr[k]
    }
}
It gives this output:
Ypf_OILL_OilLvlOn,rd,byte,0,1
u2SAD_OILLVS_RecoveryCounter,wr,word,X,1
u2SAD_OILLVS_RecoveryCounter,rd,word,X,1
u4TimeHiCnt,wr,long,0,2
Isn't that almost what you wanted? Of course there's a for loop in the END block, but it's a short one: it visits each key exactly once. Turn this into a pandas DataFrame with the Python script below and you're done; you may have to use the pivot-table feature in pandas (see the sketch after the script):
import io
import subprocess

import pandas as pd

cmd = './armEtmToCsv.awk armEtm.txt'
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
o, e = p.communicate()  # communicate() already waits for the process to finish
r = p.returncode

# The AWK script emits no header row, so supply column names here
# (these particular names are only a suggestion):
i = io.StringIO(o.decode())
df = pd.read_csv(i, header=None,
                 names=['Variable', 'Operation', 'Datatype', 'Core', 'Count'])
print('Parsing result in Pandas dataframe:')
print(df)
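A minimal sketch of that pivot step, assuming the column names supplied to read_csv above:

# One row per variable, Read/Write counts per core, missing cells filled with 0.
wide = df.pivot_table(index=['Variable', 'Datatype'],
                      columns=['Core', 'Operation'],
                      values='Count',
                      aggfunc='sum',
                      fill_value=0)
print(wide)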
I really think this approach is very fast, but I can't test it...
Comments

- The elif CurrentLine.find('*\\') != -1: branch couldn't possibly happen: any line containing *\ also contains \, so the first if branch already catches it.
- As posted, the try block is not indented.