Function responsible for parsing two CSV text files, comparing them, and writing CSV text files as output:
def Ana_exc():
    global global_dic, missing_key_w, out_put_defult, ffd_ana_exception_path_w, ana_exc_input_path, ana_5min_input_path, min_flag
    count_path1 = 0
    count_path2 = 0
    meow2 = ''
    ana_exc_time = ''
    ana_ffm_track = []
    ana_exc_missing = []
    time_track = []
    ana_exc_ffm_header = True
    with open(ffm_all_w + 'ana_ffm.txt', 'w') as ana_ffm, open(missing_key_w + 'ana_missint_keys.txt', 'w') as ana_missing_keys:
        for i in range(len(ana_5min_input_path)):
            if not count_path1 > len(ana_5min_input_path):
                with open(ana_5min_input_path[count_path1], 'r') as ana_5min:
                    count_path1 = count_path1 + 1
                    for x in range(len(ana_exc_input_path)):
                        if not count_path2 > len(ana_exc_input_path):
                            with open(ana_exc_input_path[count_path2], 'r') as ana_exc, open(ffd_ana_exception_path_w + 'ana_ffd.txt' + str(count_path2), 'w') as ffd_ana:
                                count_path2 = count_path2 + 1
                                ana_ffd_header = True
                                # for every 2 input files there is one metadata file; this writes the header of that txt file
                                if ana_exc_ffm_header:
                                    ana_ffm.write('header' + ',' + '1' + '\n')
                                    ana_exc_ffm_header = False
                                # reads and processes file 1 (random time stamps)
                                for line in ana_exc:
                                    min_flag = True
                                    # split the fields of the csv txt file
                                    col = line.split(",")
                                    # ignore stray rows that contain only random numbers
                                    if str(line[2]).startswith('/'):
                                        # build a unique key to allow comparison between files
                                        ana_exc_key = (col[1] + '|' + col[2] + '|' + col[3] + '|' + col[4])
                                        # extract the time stamp field
                                        ana_exc_time = col[0]
                                        # match against a cross-reference dictionary to ensure the point is acceptable
                                        if ana_exc_key in global_dic:
                                            # convert human-readable time to a unix time stamp
                                            meow = datetime.datetime.strptime(ana_exc_time, "%d/%m/%Y %H:%M:%S")  # change str time to date/time obj
                                            unix_timestamp = calendar.timegm(meow.timetuple())  # do the conversion to unix stamp
                                            time_ms1 = unix_timestamp * 1000
                                            time_exc = time_ms1
                                            # write the metadata file, after checking the point has not been written before
                                            if ana_exc_key not in ana_ffm_track:
                                                ana_ffm.write('point' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['SCADA Key']) + ',' + str(global_dic[ana_exc_key]['Point Name']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                                ana_ffm_track.append(ana_exc_key)
                                            # if the time stamp in file 1 matches a time stamp in file 2, process the matching line from file 2 instead of file 1
                                            if meow.minute % 5 or meow.minute == 00 and time_ms1 not in time_track:
                                                min_flag = False
                                                for line2 in ana_5min:
                                                    col2 = line2.split(",")
                                                    if str(line2[2]).startswith('/'):
                                                        ana_5min_key = (col2[1] + '|' + col2[2] + '|' + col2[3] + '|' + col2[4])
                                                        ana_5min_time = col2[0]
                                                        if ana_5min_key in global_dic:
                                                            meow2 = datetime.datetime.strptime(ana_5min_time, "%d/%m/%Y %H:%M:%S")  # change str time to date/time obj
                                                            unix_timestamp = calendar.timegm(meow2.timetuple())  # do the conversion to unix stamp
                                                            time_ms = unix_timestamp * 1000
                                                            time_ana = time_ms
                                                            if ana_ffd_header:
                                                                ffd_ana.write('header' + ',' + str(time_ms) + ',' + '1' + '\n')
                                                                ana_ffd_header = False
                                                            ffd_ana.write('value' + ',' + str(global_dic[ana_5min_key]['cpKey']) + ',' + str(global_dic[ana_5min_key]['SCADA Key']) + ',' + str(col2[6]) + ',' + str(time_ana) + ',' + str(time_ana) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                                                            if ana_5min_key not in ana_ffm_track:
                                                                ana_ffm.write('point' + ',' + str(global_dic[ana_5min_key]['cpKey']) + ',' + str(global_dic[ana_5min_key]['SCADA Key']) + ',' + str(global_dic[ana_5min_key]['Point Name']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                                                ana_ffm_track.append(ana_5min_key)
                                                        else:
                                                            if ana_5min_key not in ana_exc_missing:
                                                                ana_missing_keys.write(ana_5min_key + '\n')
                                                                ana_exc_missing.append(ana_5min_key)
                                                        if meow.hour != meow2.hour or meow.minute != meow2.minute or meow.second != meow2.second:
                                                            break
                                                time_track.append(time_ms1)
                                            if ana_ffd_header:
                                                ffd_ana.write('header' + ',' + str(time_exc) + ',' + '1' + '\n')
                                                ana_ffd_header = False
                                            if time_ms1 not in time_track:
                                                ffd_ana.write('value' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['SCADA Key']) + ',' + str(col[6]) + ',' + str(time_exc) + ',' + str(time_exc) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                                        else:
                                            if ana_exc_key not in ana_exc_missing:
                                                ana_missing_keys.write(ana_exc_key + '\n')
                                                ana_exc_missing.append(ana_exc_key)
                        else:
                            break
            else:
                break
    return None
I need help cleaning up the function posted above. It opens multiple txt files, reads and extracts some information, then writes back to multiple txt files. The code is too dirty and at times very slow.
- The function will be handling files with millions of lines.
- I'm new to coding.
- The txt files are comma delimited.
- I'm especially concerned about the section of the code that handles file opening and closing.
- The code does work; it just needs clean-up and improvements.
Summary of the code: It opens files to read from and write to. It looks at each line of one file and separates it into columns. The first column in both files is a time stamp: in the first file the time stamps are random, while the second file contains lines with time stamps in 5-minute increments. Whenever the first file has a time stamp that matches a time stamp in the second file, the line from the second file is processed instead; otherwise the line from the first file is processed. Lines from both file 1 and file 2 are only processed if they have a match in global_dic (a dictionary); otherwise their keys are written to the missing-keys file. A key constructed from multiple fields of each line serves as a unique identifier.
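For clarity, the time-stamp handling inside the function amounts to this conversion (to_epoch_ms is just an illustrative name, not a function from the code above):

```python
import calendar
import datetime

def to_epoch_ms(stamp):
    """Convert a 'dd/mm/yyyy HH:MM:SS' string (treated as UTC)
    to a Unix timestamp in milliseconds."""
    parsed = datetime.datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
    # timegm() interprets the struct_time as UTC, unlike time.mktime()
    return calendar.timegm(parsed.timetuple()) * 1000

print(to_epoch_ms("01/01/1970 00:00:00"))  # prints 0
```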
File example:
file 1 (csv format, time stamps random; some lines are stray numbers that get skipped):
time-stamp0,field1.1,field1.2,field1.3,field1.4,field1.5,...
3
time-stamp1,field2.1,field2.2,field2.3,field2.4,field2.5,...
5
time-stampn,fieldn.1,fieldn.2,fieldn.3,fieldn.4,fieldn.5,...
12
.......
file 2 (csv format, time stamps in 5-minute increments):
time-stamp0,field1.1,field1.2,field1.3,field1.4,field1.5,...
1
time-stamp1,field2.1,field2.2,field2.3,field2.4,field2.5,...
5, 6
time-stampn,fieldn.1,fieldn.2,fieldn.3,fieldn.4,fieldn.5,...
.......
1 Answer
The code is currently way too hard to follow to help you all the way through, but there are some things I can suggest right away:
- Break this into multiple functions that each accomplish one task and one task only, and name the functions to explain exactly what they do. That will help reduce some of the indenting and make it easier to read in the future. If you find yourself indenting more than a couple of times, you should ask how you can refactor the ifs or for loops into functions.
- It's slow because you are reading the entirety of two "files with millions of lines" before you even start. Don't do that.
- Use Python's built-in csv module. As you can see in this StackOverflow question, it will help by reading files one line at a time. Hopefully one line at a time is OK for comparing the two files (it's not clear from the code above).
- Always try to look for built-in or 3rd-party modules before you write your own code to do the same. Once you get more familiar with coding, then you can fall into the trap of thinking everyone else's code is dumb and you need to reinvent the wheel. For now, find existing wheels that have been tested and cleaned up and use them. To that point, there is no reason not to start tasks with searches like "python diff two csv files" to see what's out there. The top result for that search is a package called csvdiff. I think I might scrap the code you have and use that. If it works, take a look at the code itself and see how they accomplished the task. It may be a bit hard to follow at times because the package has to handle a bunch of different issues, but you probably will learn something.
- In Python (and languages like it), looping over a number instead of the collection itself (like you're doing with all these range functions) is a "code smell". You usually don't need the number, and it's a performance drain. If you do need a number inside the loop, consider enumerate(my_collection), which will give you a counter and a collection element as you go.
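To make the csv-module and looping points concrete, here is a rough sketch of the shape such helpers could take. The names keyed_rows and split_known are mine, not from your code or any library, and this is only a starting point rather than a drop-in replacement:

```python
import csv

def keyed_rows(lines):
    """Stream (key, timestamp, row) triples from an iterable of CSV
    lines, skipping lines that do not start with a dd/mm/yyyy stamp."""
    for row in csv.reader(lines):
        # data rows start with a timestamp like 02/08/2016 16:00:00,
        # so the third character of the first field is '/'
        if row and len(row[0]) > 2 and row[0][2] == '/' and len(row) > 4:
            yield '|'.join(row[1:5]), row[0], row

def split_known(rows, known_keys):
    """Separate rows whose key appears in the reference dictionary
    from keys that belong in the missing-keys file."""
    matched, missing = [], set()
    for key, stamp, row in rows:
        if key in known_keys:
            matched.append((key, stamp, row))
        else:
            missing.add(key)
    return matched, missing

demo = [
    "02/08/2016 16:00:00,a,b,c,d,x,1.5",
    "3",  # stray number line, ignored like in your startswith('/') check
    "02/08/2016 16:05:00,e,f,g,h,y,2.5",
]
matched, missing = split_known(keyed_rows(demo), {"a|b|c|d": {}})
# matched holds the row keyed a|b|c|d; missing holds {"e|f|g|h"}
```

Because keyed_rows is a generator over any iterable of lines, you can pass it an open file handle and it will only ever hold one line in memory at a time.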
- Thank you for your response. I cannot use many of the modules and 3rd-party packages available. The system that will be running this code does not have internet access, nor administrative access to add anything to the available Python packages. – Dirty-Santa, Aug 2, 2016 at 15:59
- But you can download the package into the folder your code lives in and reference it that way. If you can upload your code, you can upload the other files as well. – Tom, Aug 2, 2016 at 16:00
- The system and OS do not give access to anyone. The code will be moved as one file on a flash drive; I also will not physically touch the system. I have made a request to add PANDA and CSV, but I have to ensure I have a reasonable solution in the meantime. The code will process 2 years' worth of data, each day containing GBs worth of files. – Dirty-Santa, Aug 2, 2016 at 16:09
- "I have made a request to add PANDA and CSV": this is confusing to me. Pandas is a 3rd-party module, but csv is a Python built-in which should be available to you with import csv. As for the other restrictions, there's nothing stopping you from making one really big file that includes any code you like (though obviously trying to cram all of Pandas in would be silly, you probably can steal what you need from csvdiff and jam it into your file). – Tom, Aug 2, 2016 at 16:25
- Please point me to the area where you believe I'm reading the entire file (I really don't know which part of the code you're referring to). – Dirty-Santa, Aug 2, 2016 at 16:30