Extracting emails from a file and writing them to another file

Question 1

The code below works fine for me. Is there any way that I can improve this code more? Note: delimiter is single whitespace only for email address.

getmail.py

#!/usr/bin/python
import re
handleone = open("scratchmail.txt", "r")
f1 = handleone.readlines()
handletwo = open("mailout.txt", "a+")
match_list = [ ] 
for ArrayItem in f1:
 match_list = re.findall(r'[\w\.-]+@[\w\.-]+', ArrayItem)
 if(len(match_list)>0):
 handletwo.write(match_list[0]+"\r\n")

Input file scratchmail.txt:

tickets.cgi:5141|5141Z4063442957942364067088508|1219588377|1219588377||PND||[email protected]|Mjpqafml|JgALKGXCuasMph|
tickets.cgi:358236|358236Z24989798132452492828304439|1433071308|1433071308||PND||[email protected]|Edison|CpwxPARuQaqPR|
tickets.cgi:86805|86805Z25422507290694218605033173|1232345784|1232345784||PND||[email protected]|Agfaodki|jPdSNVlbXi|
...

Output file mailout.txt:

[email protected]
[email protected]
[email protected]

Question 2

Or, you know, you could use *nix tools, e.g. cut -d \| -f 8 scratchmail.txt >> mailout.txt.

Question 3

yeah, i tried with cut and awk -F'|' '{ print 8ドル }' input.txt >>mailout.txt they both work.

Question 4

Working with file handles

You should always close the file handles that you open. You didn't close any of them. In addition, you should use with open(...) as ... syntax when working with files, so that they will be automatically closed when leaving the scope.

Checking if a list is empty

There is a much simpler way to check if a list is empty, instead of this:

if(len(match_list)>0):

The Pythonic way to write is simply this:

if match_list:

Unnecessary initializations and data storage

There's no need to initialize match_list before the loops. You reassign it anyway inside.

You don't need to store all the lines from the first file into a list. You can process the input line by line and write output line by line. That can save a lot of memory, as you only need to keep one line in memory at a time, not the entire input file.

Poor naming

The file handles are very poorly named. handleone and handletwo don't convey that one is for input the other is for output.

ArrayItem doesn't follow the recommended naming convention of snake_case (see PEP8), and it doesn't describe what it really is.

But f1 wins the trophy of worst name in the posted code. lines would have been better.

Minor optimization

Instead of re-evaluating regular expressions in every loop, I use re.compile to compile them first, to make all uses efficient. However, this might not be necessary, as it seems recent versions of Python have a caching mechanism.

Suggested implementation

With the above suggestions applied, the code becomes simpler and cleaner:

import re
re_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
with open("input.txt") as fh_in: 
 with open("mailout.txt", "a+") as fh_out:
 for line in fh_in:
 match_list = re_pattern.findall(line)
 if match_list:
 fh_out.write(match_list[0]+"\r\n")

Question 5

See Multiple variables in Python 'with' statement.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2015-06-12 08:14:58Z

Working with file handles

You should always close the file handles that you open. You didn't close any of them. In addition, you should use with open(...) as ... syntax when working with files, so that they will be automatically closed when leaving the scope.

Checking if a list is empty

There is a much simpler way to check if a list is empty, instead of this:

if(len(match_list)>0):

The Pythonic way to write is simply this:

if match_list:

Unnecessary initializations and data storage

There's no need to initialize match_list before the loops. You reassign it anyway inside.

You don't need to store all the lines from the first file into a list. You can process the input line by line and write output line by line. That can save a lot of memory, as you only need to keep one line in memory at a time, not the entire input file.

Poor naming

The file handles are very poorly named. handleone and handletwo don't convey that one is for input the other is for output.

ArrayItem doesn't follow the recommended naming convention of snake_case (see PEP8), and it doesn't describe what it really is.

But f1 wins the trophy of worst name in the posted code. lines would have been better.

Minor optimization

Instead of re-evaluating regular expressions in every loop, I use re.compile to compile them first, to make all uses efficient. However, this might not be necessary, as it seems recent versions of Python have a caching mechanism.

Suggested implementation

With the above suggestions applied, the code becomes simpler and cleaner:

import re
re_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
with open("input.txt") as fh_in: 
 with open("mailout.txt", "a+") as fh_out:
 for line in fh_in:
 match_list = re_pattern.findall(line)
 if match_list:
 fh_out.write(match_list[0]+"\r\n")

4

\$\begingroup\$ See Multiple variables in Python 'with' statement. \$\endgroup\$

200_success
– 200_success

2015年06月12日 08:16:49 +00:00
Commented Jun 12, 2015 at 8:16

Stack Exchange Network

Extracting emails from a file and writing them to another file

getmail.py

1 Answer 1

Working with file handles

Checking if a list is empty

Unnecessary initializations and data storage

Poor naming

Minor optimization

Suggested implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Extracting emails from a file and writing them to another file

getmail.py

1 Answer 1

Working with file handles

Checking if a list is empty

Unnecessary initializations and data storage

Poor naming

Minor optimization

Suggested implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions