The code below works fine for me. Is there any way that I can improve this code more? Note: delimiter is single whitespace only for email address.
getmail.py
#!/usr/bin/python
import re
handleone = open("scratchmail.txt", "r")
f1 = handleone.readlines()
handletwo = open("mailout.txt", "a+")
match_list = [ ]
for ArrayItem in f1:
match_list = re.findall(r'[\w\.-]+@[\w\.-]+', ArrayItem)
if(len(match_list)>0):
handletwo.write(match_list[0]+"\r\n")
Input file scratchmail.txt:
tickets.cgi:5141|5141Z4063442957942364067088508|1219588377|1219588377||PND||[email protected]|Mjpqafml|JgALKGXCuasMph|
tickets.cgi:358236|358236Z24989798132452492828304439|1433071308|1433071308||PND||[email protected]|Edison|CpwxPARuQaqPR|
tickets.cgi:86805|86805Z25422507290694218605033173|1232345784|1232345784||PND||[email protected]|Agfaodki|jPdSNVlbXi|
...
Output file mailout.txt:
[email protected]
[email protected]
[email protected]
1 Answer 1
Working with file handles
You should always close the file handles that you open.
You didn't close any of them.
In addition,
you should use with open(...) as ...
syntax when working with files,
so that they will be automatically closed when leaving the scope.
Checking if a list is empty
There is a much simpler way to check if a list is empty, instead of this:
if(len(match_list)>0):
The Pythonic way to write is simply this:
if match_list:
Unnecessary initializations and data storage
There's no need to initialize match_list
before the loops.
You reassign it anyway inside.
You don't need to store all the lines from the first file into a list. You can process the input line by line and write output line by line. That can save a lot of memory, as you only need to keep one line in memory at a time, not the entire input file.
Poor naming
The file handles are very poorly named.
handleone
and handletwo
don't convey that one is for input the other is for output.
ArrayItem
doesn't follow the recommended naming convention of snake_case
(see PEP8),
and it doesn't describe what it really is.
But f1
wins the trophy of worst name in the posted code.
lines
would have been better.
Minor optimization
Instead of re-evaluating regular expressions in every loop,
I use re.compile
to compile them first, to make all uses efficient.
However, this might not be necessary,
as it seems recent versions of Python have a caching mechanism.
Suggested implementation
With the above suggestions applied, the code becomes simpler and cleaner:
import re
re_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
with open("input.txt") as fh_in:
with open("mailout.txt", "a+") as fh_out:
for line in fh_in:
match_list = re_pattern.findall(line)
if match_list:
fh_out.write(match_list[0]+"\r\n")
-
4\$\begingroup\$ See Multiple variables in Python 'with' statement. \$\endgroup\$200_success– 200_success2015年06月12日 08:16:49 +00:00Commented Jun 12, 2015 at 8:16
cut -d \| -f 8 scratchmail.txt >> mailout.txt
. \$\endgroup\$