Performing a regex search and saving results to CSV

Question 1

I am a Python beginner, and I wrote a simple Python program, which does the following:

search the pattern in the lines of a file (contains messages)
pick the information from the lines and save it to disk
remove the messages which matches with the regex (regular expression)
saves remaining messages to another file,

Regexps are not available, so I pick one message -> create regex for it -> remove the matching messages and repeat the same with remaining messages.

# coding: utf-8
# In[50]:
import re
import csv
# ### Run this part only once in the starting. From here 
# In[2]:
# ### Change the directory to working folder and give the right filename (hdfcbk), 
# ### if unsure what to do go to your folder and right click and copy the filen here, it will look like /home/XYZ/.../Your_folder_name/hdfcbk
smsFile = open('hdfcbk', 'r')
data = smsFile.read()
data = data.split('\n')
main_data = data
regex_list = []
regl = []
# In[3]:
def regex_search(pattern, file_name):
 remove_arr = []
 res = []
 remain_sms = []
 for sms in file_name:
 j= re.match(pattern,sms)
 if j != None:
 res.append(j.groupdict())
 remove_arr.append(sms)
 else:
 remain_sms.append(sms)
 return res, remove_arr, remain_sms
# In[4]:
def write_to_csv(result,csv_name):
 keys = result[0][0].keys()
 with open(csv_name, 'wb') as output_file:
 dict_writer = csv.DictWriter(output_file, keys, dialect='excel')
 dict_writer.writeheader()
 dict_writer.writerows(result[0])
# In[12]: 
# ### To here, now the repetitive run start
# ### Update this pattern file
# In[1]:
pat1 = 'INR (?P<Amount>(.*)) deposited to A\/c No (?P<AccountNo>(.*)) towards (?P<Towards>(.*)) Val (?P<Date>(.*)). Clr Bal is INR (?P<Balance>(.*)) subject to clearing.'
# In[8]:
A = regex_search(pat1,main_data)
# ### Updating main_data to remaining messages
# In[11]:
main_data = A[2]
# ### Writing remaining sms to a file, you don't need to change the file name as it will be updated everything as you run the script. Just look at the remaining sms and make new regex
# In[21]:
with open('remaining_sms.txt', 'w') as fp:
 fp.write('\n'.join('%s' % x for x in main_data))
# ### Update the csv file
# In[ ]:
write_to_csv(A, 'hdfc_test_3.csv')
# ### Keeping all the regexes in one list, update the index number in [i,pat1]
# In[52]:
regl.append([1,pat1])
# ### Wrting the regex index to csv, run this part in the end, or if you're unsure that you will make the mistake run this part and keep changing the output file name
# In[53]:
with open("output.csv", "wb") as f:
 writer = csv.writer(f)
 writer.writerows(regl)

I commented everything in the code. Now the thing is, I need to send this task to some people, who does not know anything in coding. That's why I commented so much.

Can you please review my code and suggest what else I can do to improve the code so that people can run the code without any hassle?

Question 2

10 Tips, Problems, etc.

Don't begin comments with # ###
Put the code that you had outside of functions in a main() function. This makes it so that if the module is imported, this code won't be run inadvertently. You can use this standard framework:
```
def main():
 # put code here that you want to run when file is executed
if __name__ == "__main__":
 main()
```
Spell words in your comments correctly, and wrap them to a new line if they are getting too long. (After a few years of coding I have found that the vast majority of comments needn't be longer than one line.)
Use with open("file.txt", 'r') as f: whenever working with files
Always end your file with a new line
When comparing to None use is and not instead of == and !=
Actually catch the three return values of the regex_search method explicitly with a well-named variable
No capital letters in variable names according to pep-8 guidelines
I find it much easier to use print than write when writing to a file inside of a with open() statement:
```
print(some_text, file=some_file, flush=True)
```
Use string.format() instead of the deprecated old string formatting tools (%s...)

# coding: utf-8
import re
import csv
def regex_search(pattern, file_name):
 remove_arr = []
 res = []
 remain_sms = []
 for sms in file_name:
 j = re.match(pattern, sms)
 if j is not None:
 res.append(j.groupdict())
 remove_arr.append(sms)
 else:
 remain_sms.append(sms)
 return res, remove_arr, remain_sms
def write_to_csv(result, csv_name):
 keys = result[0][0].keys()
 with open(csv_name, 'wb') as output_file:
 dict_writer = csv.DictWriter(output_file, keys, dialect='excel')
 dict_writer.writeheader()
 dict_writer.writerows(result[0])
def main():
 # Run this part only once in the starting. From here
 # change the directory to working folder and give the right filename (hdfcbk),
 # if unsure what to do go to your folder and right click and copy the file here,
 # it will look like /home/XYZ/.../Your_folder_name/hdfcbk
 with open('hdfcbk', 'r') as smsFile:
 data = smsFile.read()
 data = data.split('\n')
 main_data = data
 regl = []
 pat1 = 'INR (?P<Amount>(.*)) deposited to A\/c No (?P<AccountNo>(.*)) towards (?P<Towards>(.*)) Val (?P<Date>(.*)). Clr Bal is INR (?P<Balance>(.*)) subject to clearing.'
 # TODO - Use much more descriptive names...no idea what's going on here without searching for a while
 a, b, c = regex_search(pat1, main_data)
 # Updating main_data to remaining messages
 main_data = c
 # Writing remaining sms to a file, you don't need to change the file name as it will be updated
 # everything as you run the script. Just look at the remaining sms and make new regex.
 with open('remaining_sms.txt', 'w') as fp:
 fp.write('\n'.join('{}'.format(x) for x in main_data))
 # Update the csv file
 write_to_csv([a, b, c], 'hdfc_test_3.csv')
 # Keeping all the regexes in one list, update the index number in [i, pat1]
 regl.append([1, pat1])
 # Writing the regex index to csv, run this part in the end, or if you're unsure that you will
 # make the mistake run this part and keep changing the output file name.
 with open("output.csv", "wb") as f:
 writer = csv.writer(f)
 writer.writerows(regl)
if __name__ == "__main__":
 main()

Question 3

BTW I didn't know what all of the # In[x] things were all about, so I removed those too.

Question 4

I used jupyter notebook (ipython notebook), when I downloaded the notebook as .py file. It shows executed parts as # In[x].

Question 5

Can you explain the def main() part. I understood it will not execute while loading but then how I will execute the parts in the main method? Sorry for so many edits in comment. First time posting questions and comments.

Question 6

main() gets called when the __name__ == "__main__" or when the file is being executed by itself. If you print out the __name__ from a file that is being executed it will be "__main__", but when it is imported as a module, the __name__ will be the actual name of the file in most cases.

Jake Dube Jake Dube 2502 silver badges11 bronze badges · Accepted Answer · 2016-10-28 12:06:06Z

10 Tips, Problems, etc.

Don't begin comments with # ###
Put the code that you had outside of functions in a main() function. This makes it so that if the module is imported, this code won't be run inadvertently. You can use this standard framework:
```
def main():
 # put code here that you want to run when file is executed
if __name__ == "__main__":
 main()
```
Spell words in your comments correctly, and wrap them to a new line if they are getting too long. (After a few years of coding I have found that the vast majority of comments needn't be longer than one line.)
Use with open("file.txt", 'r') as f: whenever working with files
Always end your file with a new line
When comparing to None use is and not instead of == and !=
Actually catch the three return values of the regex_search method explicitly with a well-named variable
No capital letters in variable names according to pep-8 guidelines
I find it much easier to use print than write when writing to a file inside of a with open() statement:
```
print(some_text, file=some_file, flush=True)
```
Use string.format() instead of the deprecated old string formatting tools (%s...)

# coding: utf-8
import re
import csv
def regex_search(pattern, file_name):
 remove_arr = []
 res = []
 remain_sms = []
 for sms in file_name:
 j = re.match(pattern, sms)
 if j is not None:
 res.append(j.groupdict())
 remove_arr.append(sms)
 else:
 remain_sms.append(sms)
 return res, remove_arr, remain_sms
def write_to_csv(result, csv_name):
 keys = result[0][0].keys()
 with open(csv_name, 'wb') as output_file:
 dict_writer = csv.DictWriter(output_file, keys, dialect='excel')
 dict_writer.writeheader()
 dict_writer.writerows(result[0])
def main():
 # Run this part only once in the starting. From here
 # change the directory to working folder and give the right filename (hdfcbk),
 # if unsure what to do go to your folder and right click and copy the file here,
 # it will look like /home/XYZ/.../Your_folder_name/hdfcbk
 with open('hdfcbk', 'r') as smsFile:
 data = smsFile.read()
 data = data.split('\n')
 main_data = data
 regl = []
 pat1 = 'INR (?P<Amount>(.*)) deposited to A\/c No (?P<AccountNo>(.*)) towards (?P<Towards>(.*)) Val (?P<Date>(.*)). Clr Bal is INR (?P<Balance>(.*)) subject to clearing.'
 # TODO - Use much more descriptive names...no idea what's going on here without searching for a while
 a, b, c = regex_search(pat1, main_data)
 # Updating main_data to remaining messages
 main_data = c
 # Writing remaining sms to a file, you don't need to change the file name as it will be updated
 # everything as you run the script. Just look at the remaining sms and make new regex.
 with open('remaining_sms.txt', 'w') as fp:
 fp.write('\n'.join('{}'.format(x) for x in main_data))
 # Update the csv file
 write_to_csv([a, b, c], 'hdfc_test_3.csv')
 # Keeping all the regexes in one list, update the index number in [i, pat1]
 regl.append([1, pat1])
 # Writing the regex index to csv, run this part in the end, or if you're unsure that you will
 # make the mistake run this part and keep changing the output file name.
 with open("output.csv", "wb") as f:
 writer = csv.writer(f)
 writer.writerows(regl)
if __name__ == "__main__":
 main()

BTW I didn't know what all of the # In[x] things were all about, so I removed those too.
I used jupyter notebook (ipython notebook), when I downloaded the notebook as .py file. It shows executed parts as # In[x].
Can you explain the def main() part. I understood it will not execute while loading but then how I will execute the parts in the main method? Sorry for so many edits in comment. First time posting questions and comments.
main() gets called when the __name__ == "__main__" or when the file is being executed by itself. If you print out the __name__ from a file that is being executed it will be "__main__", but when it is imported as a module, the __name__ will be the actual name of the file in most cases.

Stack Exchange Network

Performing a regex search and saving results to CSV

1 Answer 1

10 Tips, Problems, etc.

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Performing a regex search and saving results to CSV

1 Answer 1

10 Tips, Problems, etc.

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions