Function responsible for parsing two CSV text files, comparing them, and writing CSV text files as output:
def Ana_exc():
    global global_dic, missing_key_w, out_put_defult, ffd_ana_exception_path_w, ana_exc_input_path, ana_5min_input_path, min_flag
    count_path1 = 0
    count_path2 = 0
    meow2 = ''
    ana_exc_time = ''
    ana_ffm_track = []
    ana_exc_missing = []
    time_track = []
    ana_exc_ffm_header = True
    with open(ffm_all_w + 'ana_ffm.txt', 'w') as ana_ffm, open(missing_key_w + 'ana_missint_keys.txt', 'w') as ana_missing_keys:
        for i in range(len(ana_5min_input_path)):
            if not count_path1 > len(ana_5min_input_path):
                with open(ana_5min_input_path[count_path1], 'r') as ana_5min:
                    count_path1 = count_path1 + 1
                    for x in range(len(ana_exc_input_path)):
                        if not count_path2 > len(ana_exc_input_path):
                            with open(ana_exc_input_path[count_path2], 'r') as ana_exc, open(ffd_ana_exception_path_w + 'ana_ffd.txt' + str(count_path2), 'w') as ffd_ana:
                                count_path2 = count_path2 + 1
                                ana_ffd_header = True
                                # for every 2 input files there is one metadata file; this writes the header of that txt file
                                if ana_exc_ffm_header:
                                    ana_ffm.write('header' + ',' + '1' + '\n')
                                    ana_exc_ffm_header = False
                                # reads and processes file 1 (random time stamps)
                                for line in ana_exc:
                                    min_flag = True
                                    # split the fields of the csv txt file
                                    col = line.split(",")
                                    # ignore stray rows that contain only random numbers
                                    if str(line[2]).startswith('/'):
                                        # build a unique key to allow comparison between files
                                        ana_exc_key = (col[1] + '|' + col[2] + '|' + col[3] + '|' + col[4])
                                        # extract the time stamp field
                                        ana_exc_time = col[0]
                                        # match against a cross-reference dictionary to ensure the point is acceptable
                                        if ana_exc_key in global_dic:
                                            # convert human-readable time to a unix time stamp
                                            meow = datetime.datetime.strptime(ana_exc_time, "%d/%m/%Y %H:%M:%S")  # change str time to date/time obj
                                            unix_timestamp = calendar.timegm(meow.timetuple())  # do the conversion to unix stamp
                                            time_ms1 = unix_timestamp * 1000
                                            time_exc = time_ms1
                                            # write the metadata file, after checking the point has not been written before
                                            if ana_exc_key not in ana_ffm_track:
                                                ana_ffm.write('point' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['SCADA Key']) + ',' + str(global_dic[ana_exc_key]['Point Name']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                                ana_ffm_track.append(ana_exc_key)
                                            # if the time stamp in file 1 matches a time stamp in file 2, process the matching line from file 2 instead of file 1
                                            if meow.minute % 5 or meow.minute == 00 and time_ms1 not in time_track:
                                                min_flag = False
                                                for line2 in ana_5min:
                                                    col2 = line2.split(",")
                                                    if str(line2[2]).startswith('/'):
                                                        ana_5min_key = (col2[1] + '|' + col2[2] + '|' + col2[3] + '|' + col2[4])
                                                        ana_5min_time = col2[0]
                                                        if ana_5min_key in global_dic:
                                                            meow2 = datetime.datetime.strptime(ana_5min_time, "%d/%m/%Y %H:%M:%S")  # change str time to date/time obj
                                                            unix_timestamp = calendar.timegm(meow2.timetuple())  # do the conversion to unix stamp
                                                            time_ms = unix_timestamp * 1000
                                                            time_ana = time_ms
                                                            if ana_ffd_header:
                                                                ffd_ana.write('header' + ',' + str(time_ms) + ',' + '1' + '\n')
                                                                ana_ffd_header = False
                                                            ffd_ana.write('value' + ',' + str(global_dic[ana_5min_key]['cpKey']) + ',' + str(global_dic[ana_5min_key]['SCADA Key']) + ',' + str(col2[6]) + ',' + str(time_ana) + ',' + str(time_ana) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                                                            if ana_5min_key not in ana_ffm_track:
                                                                ana_ffm.write('point' + ',' + str(global_dic[ana_5min_key]['cpKey']) + ',' + str(global_dic[ana_5min_key]['SCADA Key']) + ',' + str(global_dic[ana_5min_key]['Point Name']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                                                ana_ffm_track.append(ana_5min_key)
                                                        else:
                                                            if ana_5min_key not in ana_exc_missing:
                                                                ana_missing_keys.write(ana_5min_key + '\n')
                                                                ana_exc_missing.append(ana_5min_key)
                                                        if meow.hour != meow2.hour or meow.minute != meow2.minute or meow.second != meow2.second:
                                                            break
                                                time_track.append(time_ms1)
                                            if ana_ffd_header:
                                                ffd_ana.write('header' + ',' + str(time_exc) + ',' + '1' + '\n')
                                                ana_ffd_header = False
                                            if time_ms1 not in time_track:
                                                ffd_ana.write('value' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['SCADA Key']) + ',' + str(col[6]) + ',' + str(time_exc) + ',' + str(time_exc) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                                        else:
                                            if ana_exc_key not in ana_exc_missing:
                                                ana_missing_keys.write(ana_exc_key + '\n')
                                                ana_exc_missing.append(ana_exc_key)
                        else:
                            break
            else:
                break
    return None
I need help cleaning up the function posted above. It opens multiple txt files, reads and extracts some information, then writes back to multiple txt files. The code is too dirty and at times very slow.
- The function will be handling files with millions of lines.
- I'm new to coding.
- The txt files are comma delimited.
- I'm especially concerned about the section of the code that handles file opening and closing.
- The code does work; it just needs clean-up and improvements.
Summary of the code: It opens files to read from and write to. It looks at each line of one file and separates it into columns. The first column in both files is a time stamp: in the first file the time stamps are random, while the second file contains lines with time stamps in 5-minute increments. Whenever the first file has a time stamp that matches a time stamp in the second file, the line from the second file is processed instead; otherwise the line from the first file is processed. Lines from both file 1 and file 2 are only processed if they have a match in global_dic (a dictionary); otherwise their keys are written to the missing-keys file. A key constructed from multiple fields of each line serves as a unique identifier.
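For clarity, the time-stamp handling inside the function amounts to this conversion (to_epoch_ms is just an illustrative name, not a function from the code above):

```python
import calendar
import datetime

def to_epoch_ms(stamp):
    """Convert a 'dd/mm/yyyy HH:MM:SS' string (treated as UTC)
    to a Unix timestamp in milliseconds."""
    parsed = datetime.datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
    # timegm() interprets the struct_time as UTC, unlike time.mktime()
    return calendar.timegm(parsed.timetuple()) * 1000

print(to_epoch_ms("01/01/1970 00:00:00"))  # prints 0
```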
File example:
file 1 (csv format, time stamps random; some lines are stray numbers that get skipped):
time-stamp0,field1.1,field1.2,field1.3,field1.4,field1.5,...
3
time-stamp1,field2.1,field2.2,field2.3,field2.4,field2.5,...
5
time-stampn,fieldn.1,fieldn.2,fieldn.3,fieldn.4,fieldn.5,...
12
.......
file 2 (csv format, time stamps in 5-minute increments):
time-stamp0,field1.1,field1.2,field1.3,field1.4,field1.5,...
1
time-stamp1,field2.1,field2.2,field2.3,field2.4,field2.5,...
5, 6
time-stampn,fieldn.1,fieldn.2,fieldn.3,fieldn.4,fieldn.5,...
.......
1 Answer
The code is currently way too hard to follow to help you all the way through, but there are some things I can suggest right away:
- Break this into multiple functions that each accomplish one task and one task only, and name the functions to explain exactly what they do. That will help reduce some of the indenting and make it easier to read in the future. If you find yourself indenting more than a couple of times, you should ask how you can refactor the ifs or for loops into functions.
- It's slow because you are reading the entirety of two "files with millions of lines" before you even start. Don't do that.
- Use Python's built-in csv module. As you can see in this StackOverflow question, it will help by reading files one line at a time. Hopefully one line at a time is OK for comparing the two files (it's not clear from the code above).
- Always try to look for built-in or 3rd-party modules before you write your own code to do the same. Once you get more familiar with coding, then you can fall into the trap of thinking everyone else's code is dumb and you need to reinvent the wheel. For now, find existing wheels that have been tested and cleaned up and use them. To that point, there is no reason not to start tasks with searches like "python diff two csv files" to see what's out there. The top result for that search is a package called csvdiff. I think I might scrap the code you have and use that. If it works, take a look at the code itself and see how they accomplished the task. It may be a bit hard to follow at times because the package has to handle a bunch of different issues, but you probably will learn something.
- In Python (and languages like it), looping over a number instead of the collection itself (like you're doing with all these range functions) is a "code smell". You usually don't need the number, and it's a performance drain. If you do need a number inside the loop, consider enumerate(my_collection), which will give you a counter and a collection element as you go.
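To make the csv-module and looping points concrete, here is a rough sketch of the shape such helpers could take. The names keyed_rows and split_known are mine, not from your code or any library, and this is only a starting point rather than a drop-in replacement:

```python
import csv

def keyed_rows(lines):
    """Stream (key, timestamp, row) triples from an iterable of CSV
    lines, skipping lines that do not start with a dd/mm/yyyy stamp."""
    for row in csv.reader(lines):
        # data rows start with a timestamp like 02/08/2016 16:00:00,
        # so the third character of the first field is '/'
        if row and len(row[0]) > 2 and row[0][2] == '/' and len(row) > 4:
            yield '|'.join(row[1:5]), row[0], row

def split_known(rows, known_keys):
    """Separate rows whose key appears in the reference dictionary
    from keys that belong in the missing-keys file."""
    matched, missing = [], set()
    for key, stamp, row in rows:
        if key in known_keys:
            matched.append((key, stamp, row))
        else:
            missing.add(key)
    return matched, missing

demo = [
    "02/08/2016 16:00:00,a,b,c,d,x,1.5",
    "3",  # stray number line, ignored like in your startswith('/') check
    "02/08/2016 16:05:00,e,f,g,h,y,2.5",
]
matched, missing = split_known(keyed_rows(demo), {"a|b|c|d": {}})
# matched holds the row keyed a|b|c|d; missing holds {"e|f|g|h"}
```

Because keyed_rows is a generator over any iterable of lines, you can pass it an open file handle and it will only ever hold one line in memory at a time.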
- Thank you for your response. I cannot use many of the modules and 3rd-party packages available. The system that will be running this code does not have internet access, nor administrative access to add anything to the available Python packages. – Dirty-Santa, Aug 2, 2016 at 15:59
- But you can download the package into the folder your code lives in and reference it that way. If you can upload your code, you can upload the other files as well. – Tom, Aug 2, 2016 at 16:00
- The system and OS do not give access to anyone. The code will be moved as one file on a flash drive; I also will not physically touch the system. I have made a request to add PANDA and CSV, but I have to ensure I have a reasonable solution in the meantime. The code will process 2 years' worth of data, each day containing GBs worth of files. – Dirty-Santa, Aug 2, 2016 at 16:09
- "I have made a request to add PANDA and CSV": this is confusing to me. Pandas is a 3rd-party module, but csv is a Python built-in which should be available to you with import csv. As for the other restrictions, there's nothing stopping you from making one really big file that includes any code you like (though obviously trying to cram all of Pandas in would be silly, you probably can steal what you need from csvdiff and jam it into your file). – Tom, Aug 2, 2016 at 16:25
- Please point me to the area where you believe I'm reading the entire file (I really don't know which part of the code you're referring to). – Dirty-Santa, Aug 2, 2016 at 16:30