I have a big log file (say 1-3 GB) that I need to parse, extract data from, and save in a CSV file.
Text File Data
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
* D:40027C5C rd-byte 00 *core0\Global\Ypf_OILL_OilLvlOn 20.342us
* D:40010044 rd-word 0FE2 *l\u2SAD_OILLVS_RecoveryCounter 0.160us
* D:40010044 wr-word 0FE1 *l\u2SAD_OILLVS_RecoveryCounter 0.040us
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
I have to extract the variable name (the part after the last \), then count the reads and writes along with the datatype, and store it all in a CSV file.
CSV File Result
Variable            Datatype   CORE 0          CORE 1           CORE X
                               Read   Write    Read    Write    Read   Write
OS_inKernel         byte       0      0        111768  111878   0      0
OS_globalIntLevel   long       0      0        281604  237901   0      0
The problem is that it takes too much time. Can you please look at the code below and suggest ways to make it faster?
import string
import sys
import time

MyFile = open("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\core1_sram_ReadWrite.txt") #core0_sram_ReadWrite_rawdata
GeneratedFile = open(str(("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\")+'ParsedOutput.csv'),'w')

try:
    MyVariableList = []
    TimeStartTest = time.time() #Starting Time
    GeneratedFile.write('\nVariable')
    GeneratedFile.write(', Datatype')
    GeneratedFile.write(', CORE 0')
    GeneratedFile.write(',, CORE 1')
    GeneratedFile.write(',, CORE X')
    GeneratedFile.write('\n,, Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write(', Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write(', Read ')
    GeneratedFile.write(', Write ')
    GeneratedFile.write('\n')

    for CurrentLine in MyFile:
        NoofSpaces = 0
        if CurrentLine.find('\\') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('\\')+1:].split(' ')[0]
        elif CurrentLine.find('*\\') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('*\\')+1:].split(' ')[0]
        elif CurrentLine.find('*') != -1:
            MyVariable = CurrentLine[CurrentLine.rfind('*')+1:].split(' ')[0]

        VariableFound = 0
        MyVariableList.sort()
        Lowerbound = 0
        Upperbound = len(MyVariableList)-1
        while Lowerbound <= Upperbound and VariableFound == 0:
            middle_pos = (Lowerbound+Upperbound) // 2
            if MyVariableList[middle_pos] < MyVariable:
                Lowerbound = middle_pos + 1
            elif MyVariableList[middle_pos] > MyVariable:
                Upperbound = middle_pos - 1
            else:
                VariableFound = 1

        if VariableFound == 0:
            MyVariableList.append(MyVariable)
            try:
                MyFile1 = open("C:\\Users\\AEC_FULL\\Saravanan\\Workspace\\Trace32Log_Parser\\core1_sram_ReadWrite.txt") #core0_sram_ReadWrite_rawdata
                Core0_ReadCount = 0
                Core0_WriteCount = 0
                Core1_ReadCount = 0
                Core1_WriteCount = 0
                CoreX_ReadCount = 0
                CoreX_WriteCount = 0
                for CurrentLine1 in MyFile1:
                    if CurrentLine1.find(MyVariable) != -1:
                        ## CORE 0 ##
                        if CurrentLine1.find("0\\Global") != -1:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                Core0_ReadCount = Core0_ReadCount + 1
                            elif DataOperation == 'wr':
                                Core0_WriteCount = Core0_WriteCount + 1
                        ## CORE 1 ##
                        elif CurrentLine1.find("1\\Global") != -1:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                Core1_ReadCount = Core1_ReadCount + 1
                            elif DataOperation == 'wr':
                                Core1_WriteCount = Core1_WriteCount + 1
                        ## CORE X ##
                        else:
                            DataType = CurrentLine1.split(' ')[0].split('-')[1]
                            DataOperation = CurrentLine1.split(' ')[0].split('-')[0].split(' ')[-1]
                            if DataOperation == 'rd':
                                CoreX_ReadCount = CoreX_ReadCount + 1
                            elif DataOperation == 'wr':
                                CoreX_WriteCount = CoreX_WriteCount + 1
                GeneratedFile.write('\n %s' %MyVariable)
                GeneratedFile.write(', %s' %DataType)
                GeneratedFile.write(', %d' %Core0_ReadCount)
                GeneratedFile.write(', %d' %Core0_WriteCount)
                GeneratedFile.write(', %d' %Core1_ReadCount)
                GeneratedFile.write(', %d' %Core1_WriteCount)
                GeneratedFile.write(', %d' %CoreX_ReadCount)
                GeneratedFile.write(', %d' %CoreX_WriteCount)
                GeneratedFile.write('\n')
            finally:
                MyFile1.close()
except:
    print sys.exc_info()
finally:
    GeneratedFile.close()
    MyFile.close()
    TimeStopTest = time.time()
    print str(int((TimeStopTest - TimeStartTest)/60))
2 Answers
Strategy
The cause of the performance problem is that for every line where you encounter a new variable name, you open the same file again and scan through it, looking at every line that contains that variable name. That is a process that takes \$O(n^2)\$ time, where \$n\$ is the number of lines in the input file. For large input files, anything slower than \$O(n)\$ would be unacceptable.
A root cause of your problem is that you aren't using data structures effectively. For example, you maintain a MyVariableList array, which you periodically sort so that you can do a binary search on it. What you want is a set, or, if you want to preserve the order of appearance of the variables in the output, an OrderedDict.
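For instance, a set turns the "have I seen this variable before?" check into a single constant-time membership test. A minimal sketch (the parsing expression here is only illustrative):

seen = set()
with open('core1_sram_ReadWrite.txt') as log_file:
    for line in log_file:
        var = line.rsplit('\\', 1)[-1].split()[0]  # name after the last backslash
        if var not in seen:  # O(1) average-case membership test, no sorting needed
            seen.add(var)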
A reasonable approach would make just one pass through the file, accumulating statistics in a data structure as you go, and then write a summary of that data structure when you reach the end of the input.
The outline of your program should look like this. To note:
- Organize code into functions.
- Use with blocks to open files, so that they will automatically be closed.
- Take advantage of Python's csv library.
from collections import OrderedDict
import csv
import re

def analyze_log(f):
    stats = OrderedDict()
    for line in f:
        re.search(...)
        ...
    return stats

def write_stats(stats, f):
    out = csv.writer(f)
    out.writerow(...)
    for var in stats:
        out.writerow(...)

def main(input_filename, output_filename):
    with open(input_filename) as input_file:
        stats = analyze_log(input_file)
    with open(output_filename, 'w') as output_file:
        write_stats(stats, output_file)

if __name__ == '__main__':
    main(r'C:\Users\AEC_FULL\...\core1_sram_ReadWrite.txt',
         r'C:\Users\AEC_FULL\...\ParsedOutput.csv')
Even better, take the input from fileinput.input(), and write the output to sys.stdout. Use the shell command line to specify the input files and redirect the output to a file, avoiding hard-coded input and output filenames in your script. Then you could just write:
import fileinput
import sys

if __name__ == '__main__':
    write_stats(analyze_log(fileinput.input()), sys.stdout)
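You would then invoke it from the shell like this (the script name is illustrative):

python trace_stats.py core0_sram_ReadWrite.txt core1_sram_ReadWrite.txt > ParsedOutput.csv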
Parsing
Slicing each line with find() and split() calls all over the place is confusing. You would be much better off making sense of the input file's fields first, then using regular-expression matches.
def analyze_log(f):
    stats = OrderedDict()
    for line in f:
        _, _, rw_datatype, _, core_varname, _ = line.split()
        match = re.search(r'.*[*\\](.*)', core_varname)
        if not match:
            continue
        var = match.group(1)
        match = re.search(r'([01])\\Global', core_varname)
        core = match and match.group(1) or 'X'
        rw, datatype = rw_datatype.split('-', 1)
        var_stats = stats.get(var, {'rd': {'0': 0, '1': 0, 'X': 0},
                                    'wr': {'0': 0, '1': 0, 'X': 0},
                                    'type': datatype})
        stats[var] = var_stats
        var_stats[rw][core] += 1
    return stats
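For the five sample lines in the question, analyze_log would accumulate a structure like this (an illustration, formatted for readability):

OrderedDict([
    ('u4TimeHiCnt',
     {'rd': {'0': 0, '1': 0, 'X': 0}, 'wr': {'0': 2, '1': 0, 'X': 0}, 'type': 'long'}),
    ('Ypf_OILL_OilLvlOn',
     {'rd': {'0': 1, '1': 0, 'X': 0}, 'wr': {'0': 0, '1': 0, 'X': 0}, 'type': 'byte'}),
    ('u2SAD_OILLVS_RecoveryCounter',
     {'rd': {'0': 0, '1': 0, 'X': 1}, 'wr': {'0': 0, '1': 0, 'X': 1}, 'type': 'word'}),
])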
Writing
Using the csv library results in tidier code than writing commas and newlines by hand.
def write_stats(stats, f):
    out = csv.writer(f)
    out.writerow(['Variable', 'Datatype',
                  'CORE 0', None, 'CORE 1', None, 'CORE X', None])
    out.writerow([None, None] + ['Read', 'Write'] * 3)
    for var in stats:
        out.writerow([var, stats[var]['type'],
                      stats[var]['rd']['0'], stats[var]['wr']['0'],
                      stats[var]['rd']['1'], stats[var]['wr']['1'],
                      stats[var]['rd']['X'], stats[var]['wr']['X']])
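For the sample stats shown earlier, this would emit something like the following (None cells become empty fields):

Variable,Datatype,CORE 0,,CORE 1,,CORE X,
,,Read,Write,Read,Write,Read,Write
u4TimeHiCnt,long,0,2,0,0,0,0
Ypf_OILL_OilLvlOn,byte,1,0,0,0,0,0
u2SAD_OILLVS_RecoveryCounter,word,0,0,0,0,1,1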
OK, I think I have another solution, but it's pure AWK. Put this content in armEtm.txt:
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
* D:40027C5C rd-byte 00 *core0\Global\Ypf_OILL_OilLvlOn 20.342us
* D:40010044 rd-word 0FE2 *l\u2SAD_OILLVS_RecoveryCounter 0.160us
* D:40010044 wr-word 0FE1 *l\u2SAD_OILLVS_RecoveryCounter 0.040us
* D:40035FC8 wr-long 00000008 \\core0\Global\u4TimeHiCnt 1.000us
And this code in armEtmToCsv.awk:
#!/usr/bin/awk -f
BEGIN {
    FS = " "
}
{
    n = split(5,ドル a, "\\")
    m = split(3,ドル b, "-")
    rw = b[1]
    t = b[2]
    if (match(5,ドル /0\\Global/)) c = "0";
    else if (match(5,ドル /1\\Global/)) c = "1";
    else c = "X";
    key = a[n] "*" rw "*" t "*" c
    arr[key]++
}
END {
    for (k in arr) {
        split(k, d, "*")
        print d[1] "," d[2] "," d[3] "," d[4] "," arr[k]
    }
}
It gives this output:
Ypf_OILL_OilLvlOn,rd,byte,0,1
u2SAD_OILLVS_RecoveryCounter,wr,word,X,1
u2SAD_OILLVS_RecoveryCounter,rd,word,X,1
u4TimeHiCnt,wr,long,0,2
Isn't that almost what you wanted? Of course there's a for loop in the END block, but it's a short one: it visits each key exactly once. Turn this into a pandas DataFrame with the Python script below and you're done; you may have to use the pivot-table feature in pandas (see the sketch after the script):
import io
import subprocess

import pandas as pd

cmd = './armEtmToCsv.awk armEtm.txt'
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
o, e = p.communicate()  # communicate() already waits for the process to finish
r = p.returncode

# The AWK script emits no header row, so supply column names here
# (these particular names are only a suggestion):
i = io.StringIO(o.decode())
df = pd.read_csv(i, header=None,
                 names=['Variable', 'Operation', 'Datatype', 'Core', 'Count'])
print('Parsing result in Pandas dataframe:')
print(df)
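A minimal sketch of that pivot step, assuming the column names supplied to read_csv above:

# One row per variable, Read/Write counts per core, missing cells filled with 0.
wide = df.pivot_table(index=['Variable', 'Datatype'],
                      columns=['Core', 'Operation'],
                      values='Count',
                      aggfunc='sum',
                      fill_value=0)
print(wide)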
I really think this approach is very fast, but I can't test it...
Comments

- The elif CurrentLine.find('*\\') != -1: branch couldn't possibly happen: any line containing *\ also contains \, so the first if branch already catches it.
- As posted, the try block is not indented.