Return to Answer

added 151 characters in body

edited Jun 18 at 14:45

41.8k
3
38
157

High level advice: the RFCs have many subtle details — prefer to rely on a well tested library that understands the spec before you roll your own.

Source Link

answered Jun 18 at 4:42

J_H

answered Jun 18 at 4:42

J_H

41.8k
3
38
157

design of Public API

 except FileNotFoundError:

It's not obvious to me that caller really wants to get back "empty list" in that case. DbC suggests that caller was responsible for passing in a good filespec, and if they didn't, well, they get what they deserve. Raising an error seems a more appropriate way of notifying the caller than returning []. As written, the OP code doesn't let caller distinguish between the results for "/dev/null" and "/no/such/file.txt".

But fine, suppose that "unreadable file" should produce an empty container. The OP code covers one case, while ignoring e.g. PermissionError and IsADirectoryError. Maybe the parent OSError would be a more suitable except target? And I don't even know what gets raised in Windows land when another app has readme.txt open for read, with a lock on it, so our python app is unable to read it.

Prefer to go the simpler route of just letting the exception bubble up the call stack.

case insensitive

Prefer re.IGNORECASE over repeated use of a-zA-Z.

wrong regex

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

TLD

\.[a-zA-Z]{2,}

This is just wrong.

It omits valid characters, such as digits. As of this writing I count 143 IANA assigned TLDs that contain digits. The first is XN--11B4C3D; the last is XN--ZFR164B. They use IDNA Punycode to express the notion of "website" in Arabic and Chinese: موقع and 中文网.

Which brings us to IDNA: Your regex prohibits the - dash character in TLDs, which prohibits the XN-- prefix.

SLD through hostname

@[a-zA-Z0-9.-]+\.

What, we don't like _ underscore?

validating

If you're keen to validate the global-part FQDN, ask the DNS for an {MX, A, AAAA} record. Ultimately some stage of your pipeline is going to do that, and it wouldn't hurt to do it up front in this stage.

localpart

[a-zA-Z0-9._%+-]+@

Thank you, thank you for admitting of <[email protected]>. But why are we even delving so deep into local-part? We are following the SHOULD aspect of the spec when we reject a Quoted-string; I can get behind that. But there's lots of other valid characters that we're rejecting here. For example "[email protected]" doesn't match.

More pragmatically, in an e-commerce setting I have seen lots of prospects register with local-part written in Farsi and other non-European scripts. Even in western Europe we might have "bü[email protected]". The OP code seems very US ASCII centric, perhaps due to the locale its business needs focus on.

length restriction

SMTP says that local-part shall not exceed 64 characters. So you might want to just look for a non-blank sequence, of at most 64 chars, followed by at-sign.

lang-py