CSV file parser and compare

Question 1

This may look like a lot of material, but I only need help with two small parts. The code works; I have included the rest of the information in case someone can offer further suggestions. I am using Python 3.4.

The code below compares multiple CSV files against a cross-reference file and creates a metadata file, information files, and a file that keeps track of points that had no match in the cross-reference file.

It compares files that are organized by day; each day has one 5min-file, three exc-files, one ala-file, and one accu-file.

It produces one file that holds the points, one file that holds the points with their timestamps, and one file that holds the points that have no match in the cross-reference file.

The code works fine.

# cross reference file:
header1, header2, header3, header4, header5, header6
aaaaaaa1, bbbbbbb1, ccccccc1, ddddddd1, eeeeeee1, x42, trg, zxc, dfg 
aaaaaaa2, bbbbbbb2, ccccccc2, ddddddd2, eeeeeee2, fffffff2, zxc, hjg
aaaaaaa3, bbbbbbb3, ccccccc3, ddddddd3, eeeeeee3, fffffff3, vcx, hhf
aaaaaaa5, bbbbbbb5, ccccccc5, ddddddd5, eeeeeee5, fffffff5, vcx, hhf
...

# exc-file: (all time stamps start from 0)
1/1/2014 12:00:00 AM, aaaaaaa2, bbbbbbb2, ccccccc2, ddddddd2, eeeeeee2, v2
1/1/2014 12:00:00 AM, aaaaaaa3, bbbbbbb3, ccccccc3, ddddddd3, eeeeeee3, x3
6, 8 # lines like this should be ignored
1/1/2014 12:00:01 AM, aaaaaaa4, bbbbbbb4, ccccccc4, ddddddd4, eeeeeee4, i4
1/1/2014 12:00:00 AM, aaaaaaa5, bbbbbbb5, ccccccc5, ddddddd5, eeeeeee5, o5
1/1/2014 12:00:01 AM, aaaaaaa6, bbbbbbb6, ccccccc6, ddddddd6, eeeeeee6, p6
3, 22, 14 # lines like this should be ignored
1/1/2014 12:00:00 AM, aaaaaaa7, bbbbbbb7, ccccccc7, ddddddd7, eeeeeee7, l7
...

# 5min_file:(all time stamps are 5 minute increments and start from 0)
1/1/2014 12:00:00 AM, aaaaaaa2, bbbbbbb2, ccccccc2, ddddddd2, eeeeeee2, h2
1 # lines like this should be ignored
1/1/2014 12:00:00 AM, aaaaaaa3, bbbbbbb3, ccccccc3, ddddddd3, eeeeeee3, g3
1/1/2014 12:00:00 AM, aaaaaaa5, bbbbbbb5, ccccccc5, ddddddd5, eeeeeee5, t5
43, 12, 14 # lines like this should be ignored
1/1/2014 12:00:00 AM, aaaaaaa7, bbbbbbb7, ccccccc7, ddddddd7, eeeeeee7, y7
...

# ala and acu files have the same format as exc-file
...

# ffm output file:
header1, earliest time stamp (in unix), 1
aaaaaaa2, bbbbbbb1, ccccccc1, ddddddd1, eeeeeee1, fffffff1
aaaaaaa3, bbbbbbb1, ccccccc1, ddddddd1, eeeeeee1, fffffff1
aaaaaaa4, bbbbbbb1, ccccccc1, ddddddd1, eeeeeee1, fffffff1
...

# ffd output file:
%m/%d/%Y %H:%M:%S1, aaaaaaa2, bbbbbbb2, ccccccc2, ddddddd2, eeeeeee2, h2
%m/%d/%Y %H:%M:%S1, aaaaaaa3, bbbbbbb3, ccccccc3, ddddddd3, eeeeeee3, g3
%m/%d/%Y %H:%M:%S1.1, aaaaaaa4, bbbbbbb4, ccccccc4, ddddddd4, eeeeeee4, i4
%m/%d/%Y %H:%M:%S2, aaaaaaa5, bbbbbbb5, ccccccc5, ddddddd5, eeeeeee5, t5
%m/%d/%Y %H:%M:%S2.1, aaaaaaa6, bbbbbbb6, ccccccc6, ddddddd6, eeeeeee6, p6
%m/%d/%Y %H:%M:%S3, aaaaaaa7, bbbbbbb7, ccccccc7, ddddddd7, eeeeeee7, y7
...

# missing:
aaer45, bber45, ccer45, dder45, eeeeeee1, fffffff1 ---> NO MATCH
aaaaa3, bbbbbbb1, ccdc90, ddddddd1, eeeeeee1, fffffff1 ----> NO MATCH
...

I would like help with the following two points, and a pointer in the right direction. (The full code is included below.)

1- In the analog_exc part of the code I'm opening multiple files (both for reading and for writing). Is there a cleaner way to do this? (The relevant chunk of code is right below.)

with open(ffm_all_w + 'ana_ffm.txt', 'w') as ana_ffm, open(missing_key_w + 'ana_missint_keys.txt', 'w') as ana_missing_keys:
    for x in range(len(ana_exc_input_path)):
        if not count_path2 > len(ana_exc_input_path):
            with open(ana_exc_input_path[count_path2], 'r') as ana_exc, open(ffd_ana_exception_path_w + file_name_analog[count_path2] + '.txt' + str(count_path2), 'w') as ffd_ana:

2- Comparing and writing the ana_5min and ana_exc data takes too long. Is there a better way to do this?

def Analog_5_min():
    global ana_5min_dic, global_dic, ana_5min_input_path
    counter = 0
    with open(ana_5min_input_path[counter], 'r') as file0:
        counter += 1
        for line in file0:
            if '/' in str(line):
                row = line.split(',')
                key1 = row[1] + '|' + row[2] + '|' + row[3] + '|' + row[4]
                if key1 in global_dic:
                    ana_5min_dic[key1] = {'time': row[0], 'value': row[6]}

# excerpt of the comparison loop (ana_exc and ana_ffm are the open files from the snippet above)
def compare_func():
    for line in ana_exc:
        col = line.split(",")
        ana_exc_key = (col[1] + '|' + col[2] + '|' + col[3] + '|' + col[4])
        ana_exc_time = col[0]
        if ana_exc_key in ana_5min_dic:
            if ana_exc_key not in ana_ffm_track:
                ana_ffm.write('point' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['header7']) + ',' + str(global_dic[ana_exc_key]['header5']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                ana_ffm_track.append(ana_exc_key)
            # change str time to date/time obj
            meow = datetime.datetime.strptime(ana_exc_time, '%m/%d/%Y %H:%M:%S')
            unix_timestamp = calendar.timegm(meow.timetuple()) # do the conversion to unix stamp
            time_ms1 = unix_timestamp * 1000
            # afterwards it writes files as described above

Full code, in case someone has other suggestions or wants to look at it:

import csv, datetime, calendar, time, os, argparse, sys, fnmatch # some of these are for later use
global_dic = {}
ana_5min_dic = {}
ffd_ana_5min_path_w = ''
ffd_ana_exception_path_w = ''
missing_key_w = ''
ffd_ana_hourly_path_w = ''
ffm_all_w = ''
ffd_alarm_path = ''
ffd_digital_path = ''
ffd_aacu_path = ''
out_put_defult = False
min_flag = False
ana_5min_input_path = []
ana_exc_input_path = []
# ana_1hr_input_path = []
alam_exc_input_path = []
acu_exc_input_path = []
dig_exc_input_path = []
ana_ffm_track = []
file_name_analog = []
file_name_digital = []
file_name_accu = []
file_name_alarms = []
# create files and path for output
def make_output_dir(output_path):
    global ffd_ana_5min_path_w, ffd_ana_exception_path_w, missing_key_w, ffm_all_w, out_put_defult, ffd_alarm_path, ffd_digital_path, ffd_aacu_path
    try:
        if out_put_defult:
            path = str(os.getcwd()) + '\\' + 'output'
        else:
            path = str(output_path)
        root_path = 'D:\\good_data\\output' + '\\' + str(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
        folders = ['ffd_ana_exception', 'missing_keys', 'ffm_all', 'ffd_alarm_exception', 'ffd_digital_exception', 'ffd_accu_exception']
        ffd_ana_exception_path_w = os.path.join(str(root_path), 'ffd_ana_exception' + '\\')
        ffd_alarm_path = os.path.join(str(root_path), 'ffd_alarm_exception' + '\\')
        ffd_digital_path = os.path.join(str(root_path), 'ffd_digital_exception' + '\\')
        ffd_aacu_path = os.path.join(str(root_path), 'ffd_accu_exception' + '\\')
        ffm_all_w = os.path.join(str(root_path), 'ffm_all' + '\\')
        missing_key_w = os.path.join(str(root_path), 'missing_keys' + '\\')
        for folder in folders:
            if not os.path.exists(folder):
                os.makedirs(os.path.join(root_path, folder))
    except FileExistsError:
        print('Cannot create a file when that file already exists')
        pass
    return None
# Walk the directory and find needed files
def file_search(input_path):
    global ana_5min_input_path, ana_exc_input_path, ana_1hr_input_path, alam_exc_input_path, acu_exc_input_path, dig_exc_input_path, file_name_analog, file_name_accu, file_name_alarms, file_name_digital
    for root, dirnames, filenames in os.walk('C:\\Users\\data_meow'):
        for filename in fnmatch.filter(filenames, '*.csv'):
            if filename.startswith("Accumulators"):
                file_name_accu.append(filename.strip('.csv'))
                acu_exc_input_path.append(os.path.join(root, filename))
            elif filename.startswith("Alarms"):
                file_name_alarms.append(filename.strip('.csv'))
                alam_exc_input_path.append(os.path.join(root, filename))
            elif filename.startswith("Analog_exp"):
                file_name_analog.append(filename.strip('.csv'))
                ana_exc_input_path.append(os.path.join(root, filename))
            elif filename.startswith("Analog_per_5_min"):
                ana_5min_input_path.append(os.path.join(root, filename))
            elif filename.startswith("Digital_exc"):
                file_name_digital.append(filename.strip('.csv'))
                dig_exc_input_path.append(os.path.join(root, filename))
    return None
# create a dictionary from the cross-reference file
def xref():
    global global_dic
    with open('NPPD_XREF.cbt', 'r') as file0:
        reader1 = csv.reader(file0, delimiter='\t')
        header = next(reader1)
        for row in reader1:
            key = (row[0] + '|' + row[1] + '|' + row[2] + '|' + row[3])
            global_dic[key] = {header[0]: row[0], header[1]: row[1], header[2]: row[2], header[3]: row[3], header[4]: row[4], header[5]: row[5], header[6]: row[6], header[7]: row[7], header[8]: row[8], header[9]: row[9]}
    return None
# compare the exception analog file with the cross-reference file; if there is a matching point, then compare with the 5-minute analog file,
# where the time stamps of the exception-analog file and the 5-minute analog file match, write output from the 5-minute analog file,
# otherwise use the exception-analog file.
# keeps track of points that do not have a match in the cross-reference file and creates a txt file for later review
# creates 2 output files for later use
def Analog_5_min():
    global ana_5min_dic, global_dic, ana_5min_input_path
    counter = 0
    with open(ana_5min_input_path[counter], 'r') as file0:
        counter += 1
        for line in file0:
            if '/' in str(line):
                row = line.split(',')
                key1 = row[1] + '|' + row[2] + '|' + row[3] + '|' + row[4]
                if key1 in global_dic:
                    ana_5min_dic[key1] = {'time': row[0], 'value': row[6]}
# compare the exception analog file with the cross-reference dictionary; if there is a matching point, then compare the point with the 5-minute analog dictionary,
# where the time stamps of the exception-analog file and the 5-minute analog dictionary match, write output from the 5-minute analog file,
# otherwise use the exception-analog file.
# keeps track of points that do not have a match in the cross-reference file and creates a txt file for later review
# creates 2 output files for later use
def Ana_exc():
    global global_dic, missing_key_w, out_put_defult, ffd_ana_exception_path_w, ana_exc_input_path, ana_ffm_track, ana_5min_dic, file_name_analog
    count_path2 = 0
    ana_exc_missing = []
    ana_exc_ffm_header = True
    with open(ffm_all_w + 'ana_ffm.txt', 'w') as ana_ffm, open(missing_key_w + 'ana_missint_keys.txt', 'w') as ana_missing_keys:
        for x in range(len(ana_exc_input_path)):
            if not count_path2 > len(ana_exc_input_path):
                with open(ana_exc_input_path[count_path2], 'r') as ana_exc, open(ffd_ana_exception_path_w + file_name_analog[count_path2] + '.txt' + str(count_path2), 'w') as ffd_ana:
                    count_path2 = count_path2 + 1
                    ana_ffd_header = True
                    if ana_exc_ffm_header:
                        ana_ffm.write('header' + ',' + '1' + '\n')
                        ana_exc_ffm_header = False
                    for line in ana_exc:
                        col = line.split(",")
                        ana_exc_key = (col[1] + '|' + col[2] + '|' + col[3] + '|' + col[4])
                        ana_exc_time = col[0]
                        if ana_exc_key in ana_5min_dic:
                            if ana_exc_key not in ana_ffm_track:
                                ana_ffm.write('point' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['header7']) + ',' + str(global_dic[ana_exc_key]['header5']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                ana_ffm_track.append(ana_exc_key)
                            meow = datetime.datetime.strptime(ana_exc_time, '%m/%d/%Y %H:%M:%S') # change str time to date/time obj
                            unix_timestamp = calendar.timegm(meow.timetuple()) # do the conversion to unix stamp
                            time_ms1 = unix_timestamp * 1000
                            if ana_ffd_header:
                                ffd_ana.write('header' + ',' + str(time_ms1) + ',' + '1' + '\n')
                                ana_ffd_header = False
                            ffd_ana.write('value' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['header5']) + ',' + str(ana_5min_dic[ana_exc_key]['value']) + ',' + str(time_ms1) + ',' + str(time_ms1) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                        else:
                            if '/' in str(line): # only process the lines that start with time stamps
                                if ana_exc_key in global_dic:
                                    if ana_exc_key not in ana_ffm_track: # keep track of the points in an output file (metadata file)
                                        ana_ffm.write('point' + ',' + str(global_dic[ana_exc_key]['cpKey']) + ',' + str(global_dic[ana_exc_key]['header5']) + ',' + str(global_dic[ana_exc_key]['header7']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                        ana_ffm_track.append(ana_exc_key)
                                    meow = datetime.datetime.strptime(str(ana_exc_time), '%m/%d/%Y %H:%M:%S') # change str time to date/time obj
                                    unix_timestamp = calendar.timegm(meow.timetuple()) # do the conversion to unix stamp
                                    time_ms1 = unix_timestamp * 1000
                                    if ana_ffd_header: # out-file1 header
                                        ffd_ana.write('header' + ',' + str(time_ms1) + ',' + '1' + '\n')
                                        ana_ffd_header = False
                                    ffd_ana.write('value' + ',' + str(global_dic[ana_exc_key]['header8']) + ',' + str(global_dic[ana_exc_key]['header5']) + ',' + str(col[6]) + ',' + str(time_ms1) + ',' + str(time_ms1) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                                else:
                                    if ana_exc_key not in ana_exc_missing:
                                        ana_missing_keys.write(ana_exc_key + '\n')
                                        ana_exc_missing.append(ana_exc_key)
            else:
                break
    return None
# looks at alarm files and, if the points have a match in the cross-reference dictionary, creates an output
# keeps track of points that do not have a match in the cross-reference file and creates a txt file for later review
def Alarm_points():
    global alam_exc_input_path, global_dic, ffd_alarm_path, missing_key_w, ffm_all_w, ana_ffm_track, file_name_alarms
    count_path = 0
    ana_alarm_missing = []
    with open(ffm_all_w + 'ana_ffm.txt', 'a') as ana_ffm, open(missing_key_w + 'ana_alarm_missing_keys.txt', 'w') as ana_alarm_missing_keys:
        for i in range(len(ana_5min_input_path)):
            if not count_path > len(alam_exc_input_path):
                with open(alam_exc_input_path[count_path], 'r') as ana_alarm, open(ffd_alarm_path + file_name_alarms[count_path] + '.txt' + str(count_path), 'w') as ffd_alarm:
                    count_path += 1
                    ana_alarm_ffd_header = True
                    for line in ana_alarm:
                        col = line.split(",")
                        if str(line[2]).startswith('/'):
                            ana_alarm_key = (col[2] + '|' + col[3] + '|' + col[4] + '|' + col[5])
                            ana_alarm_time = str(col[0])
                            if ana_alarm_key in global_dic:
                                if ana_alarm_key not in ana_ffm_track:
                                    ana_ffm.write('point' + ',' + str(global_dic[ana_alarm_key]['header8']) + ',' + str(global_dic[ana_alarm_key]['header5']) + ',' + str(global_dic[ana_alarm_key]['header7']) + ',' + 'alarm' + ',' + ',' + '1' + '\n')
                                    ana_ffm_track.append(str(ana_alarm_key))
                                meow = datetime.datetime.strptime(ana_alarm_time, "%m/%d/%Y %H:%M:%S") # change str time to date/time obj
                                unix_timestamp = calendar.timegm(meow.timetuple()) # do the conversion to unix stamp
                                time_ms = unix_timestamp * 1000
                                if ana_alarm_ffd_header:
                                    ffd_alarm.write('header' + ',' + str(time_ms) + ',' + '1' + '\n')
                                    ana_alarm_ffd_header = False
                                ffd_alarm.write('alarm' + ',' + str(global_dic[ana_alarm_key]['header5']) + ',' + str(col[6]) + ',' + str(time_ms) + ',' + str(time_ms) + ',' + str(col[12]) + ',' + str(col[7]) + ',' + '1' + ',' + global_dic[ana_alarm_key]['header8'] + ',' + '1' + ',' + '0' + ',' + global_dic[ana_alarm_key]['header7'] + ','+ global_dic[ana_alarm_key]['Point Name'] + ',' + '\n')
                            else:
                                if ana_alarm_key not in ana_alarm_missing:
                                    ana_alarm_missing_keys.write(str(ana_alarm_key) + '\n')
                                    ana_alarm_missing.append(ana_alarm_key)
            else:
                break
    return None
# looks at digital files and, if the points have a match in the cross-reference dictionary, creates an output
# keeps track of points that do not have a match in the cross-reference file and creates a txt file for later review
def Digital_points():
    global dig_exc_input_path, global_dic, ffd_digital_path, missing_key_w, ffm_all_w, file_name_digital
    count_path = 0
    ana_digital_missing = []
    ana_ffm_dup = []
    with open(ffm_all_w + 'ana_ffm.txt', 'a') as ana_ffm, open(missing_key_w + 'ana_digital_missing_keys.txt', 'w') as ana_digital_missing_keys:
        for i in range(len(dig_exc_input_path)):
            if not count_path > len(dig_exc_input_path):
                with open(dig_exc_input_path[count_path], 'r') as ana_digital, open(ffd_digital_path + file_name_digital[count_path] +'.txt' + str(count_path), 'w') as ffd_digital:
                    count_path += 1
                    ana_digital_ffd_header = True
                    for line in ana_digital:
                        col = line.split(",")
                        if str(line[2]).startswith('/'):
                            ana_digital_key = (col[2] + '|' + col[3] + '|' + col[4] + '|' + col[5])
                            ana_digital_time = str(col[0])
                            if ana_digital_key in global_dic:
                                if ana_digital_key not in ana_ffm_dup:
                                    ana_ffm.write('point' + ',' + str(global_dic[ana_digital_key]['header8']) + ',' + str(global_dic[ana_digital_key]['header5']) + ',' + str(global_dic[ana_digital_key]['header7']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                    ana_ffm_dup.append(str(ana_digital_key))
                                meow = datetime.datetime.strptime(ana_digital_time, "%m/%d/%Y %H:%M:%S") # change str time to date/time obj
                                unix_timestamp = calendar.timegm(meow.timetuple()) # do the conversion to unix stamp
                                time_ms = unix_timestamp * 1000
                                if ana_digital_ffd_header:
                                    ffd_digital.write('header' + ',' + str(time_ms) + ',' + '1' + '\n')
                                    ana_digital_ffd_header = False
                                ffd_digital.write('value' + ',' + str(global_dic[ana_digital_key]['header8']) + ',' + str(global_dic[ana_digital_key]['header5']) + ',' + str(col[7]) + ',' + str(time_ms) + ',' + str(time_ms) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                            else:
                                if ana_digital_key not in ana_digital_missing:
                                    ana_digital_missing_keys.write(str(ana_digital_key) + '\n')
                                    ana_digital_missing.append(ana_digital_key)
            else:
                break
    return None
# looks at accumulator files and, if the points have a match in the cross-reference dictionary, creates an output
# keeps track of points that do not have a match in the cross-reference file and creates a txt file for later review
def Accumulators():
    global acu_exc_input_path, global_dic, ffd_aacu_path, missing_key_w, ffm_all_w, file_name_accu
    count_path = 0
    ana_accu_missing = []
    ana_ffm_dup = []
    with open(ffm_all_w + 'ana_ffm.txt', 'a') as ana_ffm, open(missing_key_w + 'ana_accu_missing_keys.txt', 'w') as ana_accu_missing_keys:
        for i in range(len(acu_exc_input_path)):
            if not count_path > len(acu_exc_input_path):
                with open(acu_exc_input_path[count_path], 'r') as ana_accu, open(ffd_aacu_path + file_name_accu[count_path] + '.txt', 'w') as ffd_accu:
                    count_path += 1
                    ana_accu_ffd_header = True
                    for line in ana_accu:
                        col = line.split(",")
                        if str(line[2]).startswith('/'):
                            ana_accu_key = (col[2] + '|' + col[3] + '|' + col[4] + '|' + col[5])
                            ana_accu_time = str(col[0])
                            if ana_accu_key in global_dic:
                                if ana_accu_key not in ana_ffm_dup:
                                    ana_ffm.write('point' + ',' + str(global_dic[ana_accu_key]['header8']) + ',' + str(global_dic[ana_accu_key]['header6']) + ',' + str(global_dic[ana_accu_key]['header7']) + ',' + 'analog' + ',' + ',' + '1' + '\n')
                                    ana_ffm_dup.append(str(ana_accu_key))
                                meow = datetime.datetime.strptime(ana_accu_time, "%m/%d/%Y %H:%M:%S") # change str time to date/time obj
                                unix_timestamp = calendar.timegm(meow.timetuple()) # do the conversion to unix stamp
                                time_ms = unix_timestamp * 1000
                                if ana_accu_ffd_header:
                                    ffd_accu.write('header' + ',' + str(time_ms) + ',' + '1' + '\n')
                                    ana_accu_ffd_header = False
                                ffd_accu.write('value' + ',' + str(global_dic[ana_accu_key]['header8']) + ',' + str(global_dic[ana_accu_key]['header5']) + ',' + str(col[7]) + ',' + str(time_ms) + ',' + str(time_ms) + ',' + '0' + ',' + '0' + ',' + '0' + '\n')
                            else:
                                if ana_accu_key not in ana_accu_missing:
                                    ana_accu_missing_keys.write(str(ana_accu_key) + '\n')
                                    ana_accu_missing.append(ana_accu_key)
            else:
                break
    return None
def main():
    out_path = ''
    input_path = ''
    start_time = time.time()
    make_output_dir(out_path)
    file_search(input_path)
    xref()
    Analog_5_min()
    Ana_exc()
    Alarm_points()
    Digital_points()
    Accumulators()
    print("took", time.time() - start_time, "to run")

main()

Question 2

Is there a reason you can't use pandas for this? Also, it would be helpful if your examples actually had value data (like timestamps), because currently we can't really run your code with the data you gave.

Question 3

@TheBlackCat, I have made the requested changes to the 5_min file and the ana_exc file. If they work, the rest follow the same type of logic. As for your question, I was asked not to use pandas. If there is any way for me to upload txt files, please let me know.

Question 4

I can only use the Python standard library.

Question 5

My advice:

Use tuples for keys, not string concatenation

One thing I can suggest: don't build your keys with string concatenation. Each + creates a new intermediate string, so this operation allocates memory and copies data over and over.

For example, instead of:

k = col[2] + '|' + col[3] + '|' + col[4] + '|' + col[5]

It's much better to use a tuple (which is hashable). You allocate less memory and you don't copy the strings the way the concatenation does. You'll save time if you perform that operation a lot.

Replacement key:

k = tuple(col[2:6])

You'll have to change this in several places in your code, and since your keys seem to use consecutive indices, you could write a "list2key" function like this:

def list2key(l, start, end):
    return tuple(l[start:end+1])

k = list2key(col, 2, 5)
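
To make this concrete, here is a minimal sketch of a lookup with tuple keys. It assumes a cross-reference dictionary keyed on four consecutive columns; the sample rows and the names `xref_rows`, `xref` and `found` are made up for illustration and are not taken from the original code.

```python
def list2key(fields, start, end):
    """Build a hashable key from fields[start] through fields[end] (inclusive)."""
    return tuple(fields[start:end + 1])

# Hypothetical cross-reference rows: the first four columns identify a point.
xref_rows = [
    ["aaaaaaa2", "bbbbbbb2", "ccccccc2", "ddddddd2", "eeeeeee2"],
    ["aaaaaaa3", "bbbbbbb3", "ccccccc3", "ddddddd3", "eeeeeee3"],
]
xref = {list2key(row, 0, 3): row for row in xref_rows}

# A data line in the exc-file layout: timestamp first, then the key columns.
line = "1/1/2014 12:00:00 AM,aaaaaaa2,bbbbbbb2,ccccccc2,ddddddd2,eeeeeee2,v2"
col = line.split(",")
key = list2key(col, 1, 4)

# Tuples hash like any other immutable value, so the lookup works as before.
found = key in xref
print(key, found)
```

The dictionary that stores the keys and every place that looks them up must build the key the same way, otherwise nothing will match. If the real files separate fields with a comma followed by a space, as the excerpts in the question suggest, each field would probably need a `strip()` before it goes into the key.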

Avoid useless casts to string

I see an obvious one (several times in your code):

if '/' in str(line):

Since line is already a string (read from the file), the str() call is redundant: it returns the same string and only adds a function call. Just do:

if '/' in line:
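
As a small sketch of that filter in action, the snippet below keeps only the timestamped lines. The sample lines are modelled on the exc-file excerpt in the question, and `sample_lines` and `data_lines` are illustrative names, not part of the original code.

```python
# Lines as they would come out of iterating over a text file: each is a str.
sample_lines = [
    "1/1/2014 12:00:00 AM,aaaaaaa2,bbbbbbb2,ccccccc2,ddddddd2,eeeeeee2,v2\n",
    "6,8\n",
    "1/1/2014 12:00:01 AM,aaaaaaa4,bbbbbbb4,ccccccc4,ddddddd4,eeeeeee4,i4\n",
    "3,22,14\n",
]

# The membership test works directly on the string; no str() call is needed.
data_lines = [line for line in sample_lines if "/" in line]

print(len(data_lines))
```

Doing this check before splitting the line and building the key also means the short lines (the ones that should be ignored) are skipped before any column is accessed.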

Jean-François Fabre · Accepted Answer · 2016-10-20 19:43:10Z
