High level advice: the RFCs have many subtle details — prefer to rely on a well tested library that understands the spec before you roll your own.
High level advice: the RFCs have many subtle details — prefer to rely on a well tested library that understands the spec before you roll your own.
design of Public API
except FileNotFoundError:
It's not obvious to me that caller really wants to
get back "empty list" in that case.
DbC
suggests that caller was responsible for passing in a
good filespec, and if they didn't, well, they get
what they deserve.
Raising an error seems a more appropriate way of
notifying the caller than returning []
.
As written, the OP code doesn't let caller distinguish
between the results for "/dev/null"
and "/no/such/file.txt"
.
But fine, suppose that "unreadable file" should
produce an empty container.
The OP code covers one case,
while ignoring e.g. PermissionError and IsADirectoryError.
Maybe the parent OSError would be a more
suitable except
target?
And I don't even know what gets raised in
Windows land when another app has readme.txt
open for read, with a lock on it,
so our python app is unable to read it.
Prefer to go the simpler route of just letting the exception bubble up the call stack.
case insensitive
Prefer re.IGNORECASE
over repeated use of a-zA-Z
.
wrong regex
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
TLD
\.[a-zA-Z]{2,}
This is just wrong.
It omits valid characters, such as digits. As of this writing I count 143 IANA assigned TLDs that contain digits. The first is XN--11B4C3D; the last is XN--ZFR164B. They use IDNA Punycode to express the notion of "website" in Arabic and Chinese: موقع and 中文网.
Which brings us to IDNA: Your regex prohibits
the -
dash character in TLDs, which prohibits
the XN-- prefix.
SLD through hostname
@[a-zA-Z0-9.-]+\.
What, we don't like _
underscore?
validating
If you're keen to validate the global-part FQDN, ask the DNS for an {MX, A, AAAA} record. Ultimately some stage of your pipeline is going to do that, and it wouldn't hurt to do it up front in this stage.
localpart
[a-zA-Z0-9._%+-]+@
Thank you, thank you for admitting of <[email protected]>
.
But why are we even delving so deep into
local-part?
We are following the SHOULD aspect of the spec
when we reject a Quoted-string; I can get behind that.
But there's lots of other valid characters
that we're rejecting here.
For example "[email protected]" doesn't match.
More pragmatically, in an e-commerce setting I have seen lots of prospects register with local-part written in Farsi and other non-European scripts. Even in western Europe we might have "bü[email protected]". The OP code seems very US ASCII centric, perhaps due to the locale its business needs focus on.
length restriction
SMTP says that local-part shall not exceed 64 characters. So you might want to just look for a non-blank sequence, of at most 64 chars, followed by at-sign.