Format any mail into readable text

Question 1

Scope is to throw any mail, HTML or plain or multipart into this and receive a readable formatted text that deals with special characters and HTML issues. Field of usage: System that imports mails into a mysql db, or prints them to PDF, etc.

What can I improve here?

import imapclient, pyzmail, html2text
"""
Connect to IMAP. Login and select folder.
retreive UID of Mails and fetch the latest.
return the body of the latest message
"""
def latestMail():
 imapObj = imapclient.IMAPClient('imap.server.com', ssl=False)
 imapObj.login('username', 'password')
 imapObj.select_folder('INFO', readonly=True)
 UIDs = imapObj.search(criteria='ALL', charset=None)
 rawMessages = imapObj.fetch(UIDs[0], ['BODY[]', 'FLAGS'])
 message = pyzmail.PyzMessage.factory(rawMessages[UIDs[0]]['BODY[]'])
 return message
"""
Open message object, check if content is html or plain or both.
If both, only plain will be taken. for Plain, decodes the
message if charset is given, if not we will use system default
for html, additionally format html so the information
is exportable to any db.
"""
def parser(message):
 if message.text_part is not None and message.html_part is not None:
 multipart = True
 else:
 multipart = False
 if message.text_part is not None:
 try:
 body = message.text_part.get_payload().decode(message.text_part.charset)
 except TypeError:
 body = message.text_part.get_payload()
 if message.html_part is not None and multipart is False:
 try:
 body = html2text.html2text(message.html_part.get_payload().decode(message.html_part.charset))
 except Exception:
 raise Systemexit
 return body
print parser(latestMail())

Question 2

docstrings

This is a lovely module-level docstring.

"""
Connect to IMAP. Login and select folder.
retreive UID of Mails and fetch the latest.
return the body of the latest message
"""

Alas, it appears you intended for it to apply to this function:

def latestMail():
 imapObj = ...

To accomplish that, you would need to write a function-level docstring:

def latestMail():
 """
 your text goes here
 """
 imapObj = ...

Then e.g. help(latestMail) would display appropriate help.

Similarly for the parser() docstring. As it stands you computed a constant (a string), and then discarded it.

Also, PEP 8 asks that you name it latest_mail(), imap_obj, etc. Though a name like client would be better, yet.

credentials don't belong in source code

This looks like trouble:

 imapObj.login('username', 'password')

I assume your git repo has 'Secret123' or some other credential checked in. Avoid doing that. Prefer os.environ['IMAP_PASSWORD'], or dotenv.

expensive query

 UIDs = imapObj.search(criteria='ALL', charset=None)

Given that you literally only look at the most recent unique ID, you'd be better off with criterion='(RECENT)', so both the server and the network perform less work.

Also, my reading of ancient python2 docs suggests you'd want a tuple unpack there:

typ, msgnums = M.search(None, '(FROM "LDJ")')

assign a bool

 if message.text_part is not None and message.html_part is not None:
 multipart = True
 else:
 multipart = False

Consider rephrasing as

 multipart = (message.text_part is not None
 and message.html_part is not None)

or even

 multipart = bool(message.text_part and message.html_part)

fatal exception

This seems unhelpful:

 except Exception:
 raise Systemexit

Better to not try at all, so when we bail out it's at least apparent what went wrong.

Also, I found the except TypeError: "handler" a bit frightening. I suppose that if .decode() failed, instead of a string we return raw bytes?!? I am soooo glad python3 made incompatible changes which have spared a generation of programmers from worrying about such details.

Question 3

It looks good. I only have a few coding style suggestions.

Naming

The PEP 8 style guide recommends snake_case for function and variable names.

latestMail would be latest_mail.

rawMessages would be raw_messages.

The name parser is a bit generic. email_parser is more specific.

Documentation

The docstrings are very helpful. In the following, I would make some minor adjustments to capitalize "HTML" and explicitly mention "plain text":

"""
Open message object, check if content is HTML or plain text or both.
If both, only plain will be taken. For plain, decodes the
message if charset is given, if not we will use system default
for HTML, additionally format HTML so the information
is exportable to any db.
"""

Portability

I realize this question was posted many years ago when Python version 2.x was prevalent, but now that it is deprecated, consider porting to 3.x.

Simpler

These 2 lines :

message = pyzmail.PyzMessage.factory(rawMessages[UIDs[0]]['BODY[]'])
return message

can be simplified as:

return pyzmail.PyzMessage.factory(rawMessages[UIDs[0]]['BODY[]'])

This eliminates the intermediate message variable.

When I see double negatives (not None) as in this code:

if message.text_part is not None and message.html_part is not None:

I think it is easier to read by inverting the logic as:

if message.text_part is None or message.html_part is None:
 multipart = False
else:
 multipart = True

J_H J_H 41.8k3 gold badges38 silver badges157 bronze badges · Answer 1 · 2025-03-14 23:11:32Z

docstrings

This is a lovely module-level docstring.

"""
Connect to IMAP. Login and select folder.
retreive UID of Mails and fetch the latest.
return the body of the latest message
"""

Alas, it appears you intended for it to apply to this function:

def latestMail():
 imapObj = ...

To accomplish that, you would need to write a function-level docstring:

def latestMail():
 """
 your text goes here
 """
 imapObj = ...

Then e.g. help(latestMail) would display appropriate help.

Similarly for the parser() docstring. As it stands you computed a constant (a string), and then discarded it.

Also, PEP 8 asks that you name it latest_mail(), imap_obj, etc. Though a name like client would be better, yet.

credentials don't belong in source code

This looks like trouble:

 imapObj.login('username', 'password')

I assume your git repo has 'Secret123' or some other credential checked in. Avoid doing that. Prefer os.environ['IMAP_PASSWORD'], or dotenv.

expensive query

 UIDs = imapObj.search(criteria='ALL', charset=None)

Given that you literally only look at the most recent unique ID, you'd be better off with criterion='(RECENT)', so both the server and the network perform less work.

Also, my reading of ancient python2 docs suggests you'd want a tuple unpack there:

typ, msgnums = M.search(None, '(FROM "LDJ")')

assign a bool

 if message.text_part is not None and message.html_part is not None:
 multipart = True
 else:
 multipart = False

Consider rephrasing as

 multipart = (message.text_part is not None
 and message.html_part is not None)

or even

 multipart = bool(message.text_part and message.html_part)

fatal exception

This seems unhelpful:

 except Exception:
 raise Systemexit

Better to not try at all, so when we bail out it's at least apparent what went wrong.

Also, I found the except TypeError: "handler" a bit frightening. I suppose that if .decode() failed, instead of a string we return raw bytes?!? I am soooo glad python3 made incompatible changes which have spared a generation of programmers from worrying about such details.

toolic toolic 15.2k5 gold badges29 silver badges213 bronze badges · Answer 2 · 2025-03-13 17:56:15Z

It looks good. I only have a few coding style suggestions.

Naming

The PEP 8 style guide recommends snake_case for function and variable names.

latestMail would be latest_mail.

rawMessages would be raw_messages.

The name parser is a bit generic. email_parser is more specific.

Documentation

The docstrings are very helpful. In the following, I would make some minor adjustments to capitalize "HTML" and explicitly mention "plain text":

"""
Open message object, check if content is HTML or plain text or both.
If both, only plain will be taken. For plain, decodes the
message if charset is given, if not we will use system default
for HTML, additionally format HTML so the information
is exportable to any db.
"""

Portability

I realize this question was posted many years ago when Python version 2.x was prevalent, but now that it is deprecated, consider porting to 3.x.

Simpler

These 2 lines :

message = pyzmail.PyzMessage.factory(rawMessages[UIDs[0]]['BODY[]'])
return message

can be simplified as:

return pyzmail.PyzMessage.factory(rawMessages[UIDs[0]]['BODY[]'])

This eliminates the intermediate message variable.

When I see double negatives (not None) as in this code:

if message.text_part is not None and message.html_part is not None:

I think it is easier to read by inverting the logic as:

if message.text_part is None or message.html_part is None:
 multipart = False
else:
 multipart = True

Stack Exchange Network

Format any mail into readable text

2 Answers 2

docstrings

credentials don't belong in source code

expensive query

assign a bool

fatal exception

Naming

Documentation

Portability

Simpler

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Format any mail into readable text

2 Answers 2

docstrings

credentials don't belong in source code

expensive query

assign a bool

fatal exception

Naming

Documentation

Portability

Simpler

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions