I have a working script that I wrote in Python which for e-mail messages that consist only of a plain text part + an HTML part, discards the HTML and keeps only the plain text part.
The script is not exactly elegant and, as can be seen from the code, it smells like it is C (in particular, I simulate the use of bitmasks) and I am not exactly satisfied with some points of it.
I know that the script has some issues (like code duplication and the hack mentioned above), but I don't know the idiomatic Pythonic way of writing it and I would appreciate any kind of criticism to improve it in any way to make the code elegant.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Author: Rogério Theodoro de Brito <[email protected]>
License: GPL-2+
Copyright: 2010-2012 Rogério Theodoro de Brito
drop-alternatives is a simple Python script for those who hate emails in
HTML and who prefer their inbox to have as many messages in pure text as
feasible. This script is generally meant to be run as a filter with procmail
or some other mail delivery agent.
It tries to be moderately conservative and only act when things are
moderately safe:
* If the message is `multipart` and has a `text/plain` and a `text/html`
part, keep the `text/plain` part only.
* In all other cases keep the message intact.
"""
import email
import email.message
def compose_message(orig, body):
"""
Create new message with headers from `orig` and body from `body`.
* `orig`: The original message.
* `body`: The body that we want the new message to have.
* Returns a new message.
`compose_message` creates a new message with most of the fields from
`orig`, with fields from `body` (if any) and with the payload of
`body`. The fields excluded from `orig` are the following:
* `content-length`
* `content-type`
* `lines`
* `status`
"""
wanted = email.message.Message()
wanted.set_payload(body.get_payload())
unwanted_fields = ["content-length", "content-type", "lines", "status"]
# The dictionaries `orig` and `body` have only headers as their items.
for field in unwanted_fields:
del orig[field]
for k, v in orig.items():
wanted[k] = v
for k, v in body.items():
wanted[k] = v
return wanted
def sanitize(msg):
"""
Given an RFC-2822 message `msg`, generate its 'sanitized' version.
* `msg`: The message to be sanitized.
* Returns a sanitized version of `msg`.
`sanitize` tries to be moderately conservative and only act when things
are moderately safe:
* If the message is multipart and has a `text/plain` and a `text/html`
part, keep the `text/plain` part only.
* In all other cases keep the message intact.
"""
if not msg.is_multipart():
return msg
# 'composition' is a bitmask containing the kind of the parts
TEXTPLAIN = 1 # text/plain
TEXTHTML = 2 # text/html
MISCPARTS = 4 # anything else
composition = 0
text_taken = False
for part in msg.walk():
if (part.get_content_maintype() == "multipart" or
part.get_content_type() == "message/external-body" or
part.get_payload() == ""):
continue
elif part.get_content_type() == "text/plain":
if not text_taken:
text_taken = True
body = part
composition |= TEXTPLAIN
else:
# if we are seeing a second text/plain part, stop throwing
# things
composition |= MISCPARTS
break
elif part.get_content_type() == "text/html":
composition |= TEXTHTML
else:
composition |= MISCPARTS
if composition == (TEXTPLAIN + TEXTHTML) or composition == TEXTPLAIN:
return compose_message(msg, body)
else:
return msg
if __name__ == "__main__":
import sys
res = sanitize(email.message_from_file(sys.stdin))
print res.as_string(unixfrom=False),
2 Answers 2
I think the code is quite fine as it is. Some small notes:
if composition == (TEXTPLAIN + TEXTHTML) or composition == TEXTPLAIN:
Just to make sure: this code tests that TEXTPLAIN
is set and that MISCPARTS
is not set. I would make that explicit, otherwise this code is hard to understand (in particular, it’s easy to miss that MISCPARTS
mustn’t be set):
if (composition & TEXTPLAIN) != 0 and (composition & MISCPARTS) == 0:
If you’re not happy with bit operations, you can a set
instead.
composition = set()
# ...
if TEXTPLAIN in composition and MISCPARTS not in composition:
... and define TEXTPLAIN
etc. as simple consecutive constants rather than bitmasks.
This is more pythonic certainly, but I actually find the use of bit masks entirely appropriate here.
-
\$\begingroup\$ Thanks, @Konrad. I thought that the bitfields would be condemned in python, but as you mention right now, they even feel a bit natural. BTW, in C I would use an enum for the constants. Is there any Pythonic idiom with the same "flavour" as C's enums? \$\endgroup\$rbrito– rbrito2012年06月24日 15:58:37 +00:00Commented Jun 24, 2012 at 15:58
Your code is fine as it is, and there isn't much I can recommend. However, I believe that this may profit from being made into a class.
#!/usr/bin/env python
import email
import email.message
class MyMail:
unwanted_fields = ["content-length", "content-type", "lines", "status"]
def __init__(self, fp):
self.res = self.sanitize(email.message_from_file(fp))
def display(self):
print self.res.as_string(unixfrom=False)
def compose_message(self, orig, body):
wanted = email.message.Message()
wanted.set_payload(body.get_payload())
# The dictionaries `orig` and `body` have only headers as their items.
for field in self.unwanted_fields: del orig[field]
for k, v in orig.items() + body.items(): wanted[k] = v
return wanted
Your code does not really care about text/html
mime, other than to skip it. Other than that, it cares that text/plain
is seen only once. This seemed to be an overkill for bit fiddling. Removing that,
def sanitize(self, msg):
if not msg.is_multipart(): return msg
compose = False
text_taken = False
for part in msg.walk():
if (part.get_content_maintype() == "multipart" or
part.get_content_type() == "message/external-body" or
part.get_content_type() == "text/html" or
part.get_payload() == ""):
continue
elif part.get_content_type() == "text/plain" and not text_taken:
body = part
compose = True
text_taken = True
# if we are seeing a second text/plain part, stop throwing
# things
else: return msg
return self.compose_message(msg, body) if compose else msg
if __name__ == "__main__":
import sys
s = MyMail(sys.stdin)
s.display()
-
1\$\begingroup\$ I’m not sure I agree. While this looks OK and conventional, there’s this great talk arguing that you should Stop Writing Classes. It makes a lot of good points. Well, it could be argued that the presence of
display
makes a class here worthwhile. Anyway, +1. \$\endgroup\$Konrad Rudolph– Konrad Rudolph2012年06月24日 08:46:22 +00:00Commented Jun 24, 2012 at 8:46 -
\$\begingroup\$ Thanks for analysing my code and concluding that it is not utter terrible. :) I think that, as with @Konrad, I will still keep the code outside of a class (I may change my mind in the future), but I will adopt your absense of bitfields. Your solution is more elegant, IMVHO. \$\endgroup\$rbrito– rbrito2012年06月24日 16:10:48 +00:00Commented Jun 24, 2012 at 16:10