I am a Python beginner, and I wrote a simple Python program, which does the following:
- search the pattern in the lines of a file (contains messages)
- pick the information from the lines and save it to disk
- remove the messages which matches with the regex (regular expression)
- saves remaining messages to another file,
Regexps are not available, so I pick one message -> create regex for it -> remove the matching messages and repeat the same with remaining messages.
# coding: utf-8
# In[50]:
import re
import csv
# ### Run this part only once in the starting. From here
# In[2]:
# ### Change the directory to working folder and give the right filename (hdfcbk),
# ### if unsure what to do go to your folder and right click and copy the filen here, it will look like /home/XYZ/.../Your_folder_name/hdfcbk
smsFile = open('hdfcbk', 'r')
data = smsFile.read()
data = data.split('\n')
main_data = data
regex_list = []
regl = []
# In[3]:
def regex_search(pattern, file_name):
remove_arr = []
res = []
remain_sms = []
for sms in file_name:
j= re.match(pattern,sms)
if j != None:
res.append(j.groupdict())
remove_arr.append(sms)
else:
remain_sms.append(sms)
return res, remove_arr, remain_sms
# In[4]:
def write_to_csv(result,csv_name):
keys = result[0][0].keys()
with open(csv_name, 'wb') as output_file:
dict_writer = csv.DictWriter(output_file, keys, dialect='excel')
dict_writer.writeheader()
dict_writer.writerows(result[0])
# In[12]:
# ### To here, now the repetitive run start
# ### Update this pattern file
# In[1]:
pat1 = 'INR (?P<Amount>(.*)) deposited to A\/c No (?P<AccountNo>(.*)) towards (?P<Towards>(.*)) Val (?P<Date>(.*)). Clr Bal is INR (?P<Balance>(.*)) subject to clearing.'
# In[8]:
A = regex_search(pat1,main_data)
# ### Updating main_data to remaining messages
# In[11]:
main_data = A[2]
# ### Writing remaining sms to a file, you don't need to change the file name as it will be updated everything as you run the script. Just look at the remaining sms and make new regex
# In[21]:
with open('remaining_sms.txt', 'w') as fp:
fp.write('\n'.join('%s' % x for x in main_data))
# ### Update the csv file
# In[ ]:
write_to_csv(A, 'hdfc_test_3.csv')
# ### Keeping all the regexes in one list, update the index number in [i,pat1]
# In[52]:
regl.append([1,pat1])
# ### Wrting the regex index to csv, run this part in the end, or if you're unsure that you will make the mistake run this part and keep changing the output file name
# In[53]:
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(regl)
I commented everything in the code. Now the thing is, I need to send this task to some people, who does not know anything in coding. That's why I commented so much.
Can you please review my code and suggest what else I can do to improve the code so that people can run the code without any hassle?
1 Answer 1
10 Tips, Problems, etc.
- Don't begin comments with
# ###
Put the code that you had outside of functions in a main() function. This makes it so that if the module is imported, this code won't be run inadvertently. You can use this standard framework:
def main(): # put code here that you want to run when file is executed if __name__ == "__main__": main()
Spell words in your comments correctly, and wrap them to a new line if they are getting too long. (After a few years of coding I have found that the vast majority of comments needn't be longer than one line.)
- Use
with open("file.txt", 'r') as f:
whenever working with files - Always end your file with a new line
- When comparing to
None
useis
andnot
instead of==
and!=
- Actually catch the three return values of the
regex_search
method explicitly with a well-named variable - No capital letters in variable names according to pep-8 guidelines
I find it much easier to use print than write when writing to a file inside of a
with open()
statement:print(some_text, file=some_file, flush=True)
Use string.format() instead of the deprecated old string formatting tools (%s...)
# coding: utf-8
import re
import csv
def regex_search(pattern, file_name):
remove_arr = []
res = []
remain_sms = []
for sms in file_name:
j = re.match(pattern, sms)
if j is not None:
res.append(j.groupdict())
remove_arr.append(sms)
else:
remain_sms.append(sms)
return res, remove_arr, remain_sms
def write_to_csv(result, csv_name):
keys = result[0][0].keys()
with open(csv_name, 'wb') as output_file:
dict_writer = csv.DictWriter(output_file, keys, dialect='excel')
dict_writer.writeheader()
dict_writer.writerows(result[0])
def main():
# Run this part only once in the starting. From here
# change the directory to working folder and give the right filename (hdfcbk),
# if unsure what to do go to your folder and right click and copy the file here,
# it will look like /home/XYZ/.../Your_folder_name/hdfcbk
with open('hdfcbk', 'r') as smsFile:
data = smsFile.read()
data = data.split('\n')
main_data = data
regl = []
pat1 = 'INR (?P<Amount>(.*)) deposited to A\/c No (?P<AccountNo>(.*)) towards (?P<Towards>(.*)) Val (?P<Date>(.*)). Clr Bal is INR (?P<Balance>(.*)) subject to clearing.'
# TODO - Use much more descriptive names...no idea what's going on here without searching for a while
a, b, c = regex_search(pat1, main_data)
# Updating main_data to remaining messages
main_data = c
# Writing remaining sms to a file, you don't need to change the file name as it will be updated
# everything as you run the script. Just look at the remaining sms and make new regex.
with open('remaining_sms.txt', 'w') as fp:
fp.write('\n'.join('{}'.format(x) for x in main_data))
# Update the csv file
write_to_csv([a, b, c], 'hdfc_test_3.csv')
# Keeping all the regexes in one list, update the index number in [i, pat1]
regl.append([1, pat1])
# Writing the regex index to csv, run this part in the end, or if you're unsure that you will
# make the mistake run this part and keep changing the output file name.
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(regl)
if __name__ == "__main__":
main()
-
\$\begingroup\$ BTW I didn't know what all of the
# In[x]
things were all about, so I removed those too. \$\endgroup\$Jake Dube– Jake Dube2016年10月28日 12:15:06 +00:00Commented Oct 28, 2016 at 12:15 -
\$\begingroup\$ I used jupyter notebook (ipython notebook), when I downloaded the notebook as .py file. It shows executed parts as
# In[x]
. \$\endgroup\$Dhruv Rana– Dhruv Rana2016年10月28日 12:59:04 +00:00Commented Oct 28, 2016 at 12:59 -
\$\begingroup\$ Can you explain the
def main()
part. I understood it will not execute while loading but then how I will execute the parts in the main method? Sorry for so many edits in comment. First time posting questions and comments. \$\endgroup\$Dhruv Rana– Dhruv Rana2016年10月28日 13:13:03 +00:00Commented Oct 28, 2016 at 13:13 -
\$\begingroup\$
main()
gets called when the__name__ == "__main__"
or when the file is being executed by itself. If you print out the__name__
from a file that is being executed it will be"__main__"
, but when it is imported as a module, the__name__
will be the actual name of the file in most cases. \$\endgroup\$Jake Dube– Jake Dube2016年10月28日 14:56:17 +00:00Commented Oct 28, 2016 at 14:56