
I'm working on a Python script that reads a text file and extracts all the unique email addresses. The code works as intended: it produces the correct output, throws no errors, and I've tested it on multiple text files.

I'm looking for feedback on:

  1. Code efficiency
  2. Best practices
  3. Any improvements in readability or performance
import re

def extract_unique_emails(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            content = file.read()
    except FileNotFoundError:
        print(f"File '{filename}' not found.")
        return []
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    emails = re.findall(email_pattern, content)
    unique_emails = sorted(set(emails))
    return unique_emails

# Example usage
if __name__ == "__main__":
    emails = extract_unique_emails("sample.txt")
    for email in emails:
        print(email)
asked Jun 17 at 15:49
  • You may want to read RFC 5322, as your regex won't match quoted-pairs or comments. There may be even more email features you just won't match. IIRC, a non-regex greedy match followed by a BNF/regex strict match is more performant. (Commented Jun 17 at 16:10)
  • Very enlightening reading. (Commented Jun 18 at 3:44)
  • @Peilonrayz: I'm usually a big fan of RFC compliance, but in the case of email addresses that old standard is just not useful. Nobody in their right mind uses comments or spaces in their mail addresses, but (at least here in Europe) companies do use non-ASCII characters, both in the local part and the domain part. (Commented Jun 18 at 6:01)
  • A kibitz from a non-Python person: depending on the quantity expected, I think I'd prefer a big list sorted by provider, sub-sorted by email account name, or perhaps an option for either sort sequence. Just a suggestion. (I truly hope this is not in aid of some new "scammer" data-harvesting project...) (Commented Jun 18 at 6:04)
  • Sorry, can't get '4GB' and regex out of my head. With full confidence and admiration for regex and its wizardry, a bunch of email addresses in an ocean of other stuff may be found faster with a PET-scan analogy. My "C background" approach would be to look only for each magical '@', glomming its regions of prefix and suffix "strings". I just don't know whether regex would try character-by-character checking of every "word-ish" string, rejecting most and "getting lucky" every so often. Let '@' be the radioactive marker that shouts for attention and focus; the rest is just blather that can be ignored. (Commented Jun 18 at 23:22)
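The sort-by-provider suggestion from the comments can be sketched with a sort key that splits on the last '@'; the function name and sample addresses below are made up for illustration:

```python
def provider_sort(emails):
    # Sort by domain (case-insensitive) first, then by the full address.
    return sorted(emails, key=lambda e: (e.rsplit('@', 1)[-1].lower(), e.lower()))

print(provider_sort(['bob@zmail.test', 'alice@ymail.test', 'carol@ymail.test']))
# → ['alice@ymail.test', 'carol@ymail.test', 'bob@zmail.test']
```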

6 Answers


Explicit contract

Returning an empty list to indicate "file not found" is probably the only big sin here. As written, your function returns the same empty list in both of the following cases:

  • File does not exist
  • File is OK and contains no email addresses

If the file wasn't found, it also prints to stdout, which is both non-standard (error messages should go to stderr) and inconvenient (what if I want to use this function in a larger application with normal logging configuration?).

This is... surprising. I'd prefer such a function to tell me that a file does not exist. And that's an exceptional case, so FileNotFoundError is just the right way to signal it.

Also note that your function still isn't exception-free: files that can't be decoded as UTF-8 and files that can't be accessed due to filesystem permissions will still cause exceptions (UnicodeDecodeError and PermissionError, respectively; try running your script on /bin/ls to see the former).

So let's just propagate file exceptions to the caller instead.

with open(filename, 'r', encoding='utf-8') as file:
    content = file.read()

Use modern APIs

The standard library has a better way to spell "read this file into memory".

from pathlib import Path
content = Path(filename).read_text(encoding='utf-8')

Nice, huh? Knowing about Path, it's probably better to use it from the very beginning; see my final suggestion below. Doing that would also prevent callers from confusing filename and content (possible if both are strings).

Compile once

You define the pattern as a raw string, good job with that! However, your function may still have to parse the pattern on every invocation. Let's compile it once:

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# ...
emails = EMAIL_PATTERN.findall(content)

As noted by @Booboo, your function has a good chance of reusing the compiled expression anyway (see the note from the docs quoted below), but that depends on the other regexes used in the program and is not guaranteed. Since there's no penalty for compiling explicitly once, I'd recommend extracting the regex regardless. (And performance probably isn't critical here anyway.)

The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

Type hints

If it's intended as a small standalone script, skip this section - typing small codebases is not worth the effort.

If it's part of a library, however, consider documenting what your function accepts and returns:

def extract_unique_emails(filename: str) -> list[str]:
 ...

Usability

The if-main guard is good. Consider providing a simple CLI interface that takes the filename from arguments.

Correctness

Your regex does not accept all valid emails and accepts some invalid ones. However, it really does seem to cover almost all "good" emails encountered in the wild, and it only accepts a few invalid ones (e.g. [email protected] is invalid if I'm not mistaken). This should be documented, but it is most likely fine for your use case.

Rewrite suggestion

import re
import sys
from pathlib import Path

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def extract_unique_emails(file_path: Path) -> list[str]:
    content = file_path.read_text(encoding='utf-8')
    emails = EMAIL_PATTERN.findall(content)
    return sorted(set(emails))

def _parse_cli_args():
    from argparse import ArgumentParser
    parser = ArgumentParser(description="Extract unique email addresses from file")
    parser.add_argument("filename", help="File to process", type=Path)
    return parser.parse_args()

if __name__ == "__main__":
    args = _parse_cli_args()
    try:
        emails = extract_unique_emails(args.filename)
    except FileNotFoundError:
        # Common case, offer a tailored error message
        print(f"File not found at '{args.filename}'", file=sys.stderr)
    except OSError as exc:
        # Other filesystem errors - permissions, directory, etc.
        print(f"Cannot open file at '{args.filename}': {exc}", file=sys.stderr)
    except UnicodeDecodeError as exc:
        print(
            f"File at '{args.filename}' is not valid UTF-8 text: {exc}",
            file=sys.stderr,
        )
    else:
        for email in emails:
            print(email)

And now

$ printf '[email protected]' >/tmp/sample.txt
$ for file in /tmp /tmp/doesnotexist /root/.viminfo /bin/ls /tmp/sample.txt; do
    printf '\n%s:\n' "$file"
    python /tmp/a.py "$file"
done

/tmp:
Cannot open file at '/tmp': [Errno 21] Is a directory: '/tmp'

/tmp/doesnotexist:
File not found at '/tmp/doesnotexist'

/root/.viminfo:
Cannot open file at '/root/.viminfo': [Errno 13] Permission denied: '/root/.viminfo'

/bin/ls:
File at '/bin/ls' is not valid UTF-8 text: 'utf-8' codec can't decode byte 0xd8 in position 96: invalid continuation byte

/tmp/sample.txt:
[email protected]
answered Jun 17 at 18:08
  • See this note for re.compile, which states (the emphasis is mine): Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions. (Commented Jun 18 at 13:57)
  • @Booboo yes, but that depends on the fact that other parts of the program are not regex-heavy. Performance likely isn't a big concern here anyway, but there's no reason not to compile it explicitly in advance :) (Commented Jun 18 at 14:02)
  • True enough. But you stated, "... your function still has to parse the pattern on every invocation," and that is not strictly true, and often it will not be true. So, consider incorporating your comment to me in your answer. (Commented Jun 18 at 14:07)
  • For potentially-large files, I'd avoid content = file_path.read_text(encoding='utf-8') and instead loop over with something like open(file_path, 'r', encoding='utf-8') as f: for line in f: FindAddresses(line) - that is, (Commented Jun 18 at 17:31)
  • @DewiMorgan I have a false memory that this was already addressed in another answer... Anyway, sometimes this is more than acceptable, when you're working with files of several KB and under, which is common for e.g. HTML pages. I'd prefer concise code that simply reads the whole thing into memory in such a case. (Commented Jun 18 at 20:58)

High level advice: the RFCs have many subtle details — prefer to rely on a well tested library that understands the spec before you roll your own.

design of public API

 except FileNotFoundError:

It's not obvious to me that the caller really wants to get back an empty list in that case. DbC suggests that the caller was responsible for passing in a good filespec, and if they didn't, well, they get what they deserve. Raising an error seems a more appropriate way of notifying the caller than returning []. As written, the OP code doesn't let the caller distinguish between the results for "/dev/null" and "/no/such/file.txt".

But fine, suppose that "unreadable file" should produce an empty container. The OP code covers one case, while ignoring e.g. PermissionError and IsADirectoryError. Maybe the parent OSError would be a more suitable except target? And I don't even know what gets raised in Windows land when another app has readme.txt open for read, with a lock on it, so our python app is unable to read it.

Prefer to go the simpler route of just letting the exception bubble up the call stack.

case insensitive

Prefer re.IGNORECASE over repeated use of a-zA-Z.
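A sketch of that change (the character classes are the OP's, only the flag is new):

```python
import re

# Same classes as the OP's pattern, with the a-zA-Z ranges
# collapsed via the re.IGNORECASE flag.
EMAIL_PATTERN = re.compile(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}', re.IGNORECASE)

print(EMAIL_PATTERN.findall("Contact Foo.Bar@Example.COM for details"))
# → ['Foo.Bar@Example.COM']
```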

wrong regex

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

TLD

\.[a-zA-Z]{2,}

This is just wrong.

It omits valid characters, such as digits. As of this writing I count 143 IANA assigned TLDs that contain digits. The first is XN--11B4C3D; the last is XN--ZFR164B. They use IDNA Punycode to express the notion of "website" in Arabic and Chinese: موقع and 中文网.

Which brings us to IDNA: Your regex prohibits the - dash character in TLDs, which prohibits the XN-- prefix.

SLD through hostname

@[a-zA-Z0-9.-]+\.

What, we don't like _ underscore?

validating

If you're keen to validate the domain-part FQDN, ask the DNS for an {MX, A, AAAA} record. Ultimately some stage of your pipeline is going to do that, and it wouldn't hurt to do it up front in this stage.

localpart

[a-zA-Z0-9._%+-]+@

Thank you, thank you for admitting <[email protected]>. But why are we even delving so deep into local-part? We are following the SHOULD aspect of the spec when we reject a Quoted-string; I can get behind that. But there are lots of other valid characters that we're rejecting here. For example, "[email protected]" doesn't match.

More pragmatically, in an e-commerce setting I have seen lots of prospects register with local-part written in Farsi and other non-European scripts. Even in western Europe we might have "bü[email protected]". The OP code seems very US ASCII centric, perhaps due to the locale its business needs focus on.

length restriction

SMTP says that local-part shall not exceed 64 characters. So you might want to just look for a non-blank sequence, of at most 64 chars, followed by at-sign.
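One way to sketch that: keep the match itself loose, then post-filter on length. The helper name below is mine, not from any library, and it approximates the RFC's octet limit with a character count:

```python
def plausible_local_part(address: str) -> bool:
    # RFC 5321 caps the local part at 64 octets (approximated
    # here as 64 characters); split on the first at-sign.
    local, sep, _domain = address.partition('@')
    return bool(sep) and 1 <= len(local) <= 64

print(plausible_local_part('a' * 64 + '@example.test'))  # True
print(plausible_local_part('a' * 65 + '@example.test'))  # False
```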

answered Jun 18 at 4:42

A couple of points on your function:

import re

def extract_unique_emails(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            content = file.read()
    except FileNotFoundError:
        print(f"File '{filename}' not found.")
        return []
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    emails = re.findall(email_pattern, content)
    unique_emails = sorted(set(emails))
    return unique_emails

You should probably be printing error messages to standard error rather than standard output. This allows someone running your code to redirect these using standard CLI tools, if they so desire.

import sys
import re

def extract_unique_emails(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            content = file.read()
    except FileNotFoundError:
        print(f"File '{filename}' not found.", file=sys.stderr)
        return []
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    emails = re.findall(email_pattern, content)
    unique_emails = sorted(set(emails))
    return unique_emails

Secondly, your function may be doing too much: it handles the FileNotFoundError exception by printing an error and returning an empty list. This makes it very difficult or impossible for the caller to know whether there actually were no email addresses in the file, or whether the file didn't exist at all.

It would be both easier and more useful to let that exception propagate to the caller and the caller can then decide how to handle it.

import re

def extract_unique_emails(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    emails = re.findall(email_pattern, content)
    unique_emails = sorted(set(emails))
    return unique_emails
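With the exception propagating, the caller can pick its own policy, e.g. fall back to an empty result with a message on stderr. This is a sketch; the missing-file path is made up:

```python
import re
import sys

def extract_unique_emails(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    return sorted(set(re.findall(email_pattern, content)))

try:
    # Hypothetical path: almost certainly does not exist.
    emails = extract_unique_emails("/no/such/sample.txt")
except FileNotFoundError:
    print("Input file is missing.", file=sys.stderr)
    emails = []
```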
answered Jun 18 at 0:14

Documentation

The PEP 8 style guide recommends adding docstrings for functions. For example:

def extract_unique_emails(filename):
    """
    Extract unique email addresses from a text file.

    The input is a path to a text file.
    The function returns a list of email addresses.
    """

Simpler

These 2 lines:

unique_emails = sorted(set(emails))
return unique_emails

can be combined into 1 line:

return sorted(set(emails))

Similarly, these lines:

emails = extract_unique_emails("sample.txt")
for email in emails:

can be combined:

for email in extract_unique_emails("sample.txt"):

The regular expression:

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

can be simplified using \d, \w and the IGNORECASE flag:

email_pattern = r'[\w.%+-]+@[a-z\d.-]+\.[a-z]{2,}'
emails = re.findall(email_pattern, content, re.IGNORECASE)
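As a quick check that the simplified pattern behaves like the original on a made-up input:

```python
import re

email_pattern = r'[\w.%+-]+@[a-z\d.-]+\.[a-z]{2,}'
text = "Write to alice@example.test or BOB@EXAMPLE.TEST."
print(re.findall(email_pattern, text, re.IGNORECASE))
# → ['alice@example.test', 'BOB@EXAMPLE.TEST']
```

One caveat: \w also matches underscores and, by default, non-ASCII word characters, so this class is slightly wider than the original; add re.ASCII if that matters.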
answered Jun 17 at 16:21
  • If we're critiquing regular expressions, it may be worth adding a requirement that the email address be bracketed by word boundaries. (Commented Jun 18 at 5:28)

One thing that always annoys me about sites collecting email addresses. [email protected] and [email protected] are the same address. [email protected] is not necessarily the same address, though it may be at the discretion of example.com. (Only postmaster is required to be accepted in any case.) Many companies just assume they can upper-case the whole thing.

So, if you want unique addresses, you need to normalize the domain names. This might include applying the appropriate Punycode variants if you match non-ASCII glyphs. You might consider keeping lists of domains with known case-sensitivity for the local part.
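A minimal normalization sketch using the standard library's idna codec (the function name is mine; this handles only the domain part, under IDNA 2003 rules):

```python
def normalize_email(address: str) -> str:
    # Lower-case and Punycode-encode the domain; leave the local
    # part untouched, since its case-sensitivity is up to the host.
    local, _, domain = address.rpartition('@')
    return local + '@' + domain.lower().encode('idna').decode('ascii')

print(normalize_email('Alice@Example.COM'))       # Alice@example.com
print(normalize_email('kontakt@München.example'))
```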

answered Jun 18 at 21:13
  • The email-validator library might be useful. (Commented Jun 18 at 22:38)

We have seen several answers about how to improve your python, and why using a regex to detect a mail address is a bad idea, but there's one more thing I want to point out:

Reading a complete file in one big "slurp" is almost always a bad idea. It means you need as much RAM as the file's size, and in languages that use 32-bit characters internally, with ASCII or UTF-8 files, you need four times as much RAM as the file's size.

This is OK for data that needs to be present at all times while your program is running, for example a configuration file. It's also OK for a one-shot program with a known small(-ish) data file. But it can cause lots of problems in any kind of production code.

Running your program on your 16 GB PC against a 100 MB data file will work well. Running it on your 256 MB VPS server against a 4 GB log file will result in one of

  • your kernel killing your program while it's reading the file, to keep the machine working
  • your machine grinding to an unresponsive halt while it's desperately trying to swap your memory out
  • (worst case) your cloud provider assigning more RAM to your machine "because you needed it" and presenting you with a huge bill for that.

You can avoid this kind of problem by reading the input file line by line, processing each line, until you reach the end of the file.

<rant>

Unfortunately, there are many, many tutorials out there from people who don't really know what they're doing, think they've figured out how to do it, and share their newly acquired wisdom with the world by writing a "tutorial" or making a YouTube video. Don't learn from those people. I've seen many a junior developer fall into this kind of trap and then wonder why the code that worked so well on their machine fails miserably in production.

</rant>

Now, I don't know Python well enough to give you a working example (and I refuse to ask ChatGPT for one), but what you want to do is akin to this pseudocode:

set the result set to empty
open the file
while (reading a line from the file does not result in EOF)
    get the set of emails from this line
    merge the line-set into the result set, eliminating duplicates
end while
close the file
emit the result set

That way, you only ever need RAM for one line plus the set of found emails, which should be a lot less than the input file itself.
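In Python, the pseudocode above comes out roughly as follows (the regex is the OP's; the streaming shape is the point of this sketch):

```python
import re

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def extract_unique_emails(filename):
    unique = set()
    # Iterating the file object yields one line at a time, so memory
    # use is bounded by the longest line plus the result set.
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            unique.update(EMAIL_PATTERN.findall(line))
    return sorted(unique)
```

As the comments note, this still assumes no single pathological multi-gigabyte line.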

toolic
answered Jun 19 at 9:12
  • Now run this against a file with a single line of 4GB. :) (Commented Jun 19 at 17:45)
  • @HagenvonEitzen Quite right. In this case, you need to have a maximum read size, and then, if you got a partial read, prepend the end of the current read to the beginning of the next read, so that you don't miss something that crosses a block. (Commented Jun 19 at 22:00)
