4
\$\begingroup\$

I have a working script that I wrote in Python which for e-mail messages that consist only of a plain text part + an HTML part, discards the HTML and keeps only the plain text part.

The script is not exactly elegant and, as can be seen from the code, it smells like it is C (in particular, I simulate the use of bitmasks) and I am not exactly satisfied with some points of it.

I know that the script has some issues (like code duplication and the hack mentioned above), but I don't know the idiomatic Pythonic way of writing it and I would appreciate any kind of criticism to improve it in any way to make the code elegant.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Author: Rogério Theodoro de Brito <[email protected]>
License: GPL-2+
Copyright: 2010-2012 Rogério Theodoro de Brito
drop-alternatives is a simple Python script for those who hate emails in
HTML and who prefer their inbox to have as many messages in pure text as
feasible. This script is generally meant to be run as a filter with procmail
or some other mail delivery agent.
It tries to be moderately conservative and only act when things are
moderately safe:
* If the message is `multipart` and has a `text/plain` and a `text/html`
 part, keep the `text/plain` part only.
* In all other cases keep the message intact.
"""
import email
import email.message
def compose_message(orig, body):
 """
 Create new message with headers from `orig` and body from `body`.
 * `orig`: The original message.
 * `body`: The body that we want the new message to have.
 * Returns a new message.
 `compose_message` creates a new message with most of the fields from
 `orig`, with fields from `body` (if any) and with the payload of
 `body`. The fields excluded from `orig` are the following:
 * `content-length`
 * `content-type`
 * `lines`
 * `status`
 """
 wanted = email.message.Message()
 wanted.set_payload(body.get_payload())
 unwanted_fields = ["content-length", "content-type", "lines", "status"]
 # The dictionaries `orig` and `body` have only headers as their items.
 for field in unwanted_fields:
 del orig[field]
 for k, v in orig.items():
 wanted[k] = v
 for k, v in body.items():
 wanted[k] = v
 return wanted
def sanitize(msg):
 """
 Given an RFC-2822 message `msg`, generate its 'sanitized' version.
 * `msg`: The message to be sanitized.
 * Returns a sanitized version of `msg`.
 `sanitize` tries to be moderately conservative and only act when things
 are moderately safe:
 * If the message is multipart and has a `text/plain` and a `text/html`
 part, keep the `text/plain` part only.
 * In all other cases keep the message intact.
 """
 if not msg.is_multipart():
 return msg
 # 'composition' is a bitmask containing the kind of the parts
 TEXTPLAIN = 1 # text/plain
 TEXTHTML = 2 # text/html
 MISCPARTS = 4 # anything else
 composition = 0
 text_taken = False
 for part in msg.walk():
 if (part.get_content_maintype() == "multipart" or
 part.get_content_type() == "message/external-body" or
 part.get_payload() == ""):
 continue
 elif part.get_content_type() == "text/plain":
 if not text_taken:
 text_taken = True
 body = part
 composition |= TEXTPLAIN
 else:
 # if we are seeing a second text/plain part, stop throwing
 # things
 composition |= MISCPARTS
 break
 elif part.get_content_type() == "text/html":
 composition |= TEXTHTML
 else:
 composition |= MISCPARTS
 if composition == (TEXTPLAIN + TEXTHTML) or composition == TEXTPLAIN:
 return compose_message(msg, body)
 else:
 return msg
if __name__ == "__main__":
 import sys
 res = sanitize(email.message_from_file(sys.stdin))
 print res.as_string(unixfrom=False),
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jun 23, 2012 at 9:04
\$\endgroup\$

2 Answers 2

3
\$\begingroup\$

I think the code is quite fine as it is. Some small notes:

if composition == (TEXTPLAIN + TEXTHTML) or composition == TEXTPLAIN:

Just to make sure: this code tests that TEXTPLAIN is set and that MISCPARTS is not set. I would make that explicit, otherwise this code is hard to understand (in particular, it’s easy to miss that MISCPARTS mustn’t be set):

if (composition & TEXTPLAIN) != 0 and (composition & MISCPARTS) == 0:

If you’re not happy with bit operations, you can a set instead.

composition = set()
# ...
if TEXTPLAIN in composition and MISCPARTS not in composition:

... and define TEXTPLAIN etc. as simple consecutive constants rather than bitmasks.

This is more pythonic certainly, but I actually find the use of bit masks entirely appropriate here.

answered Jun 23, 2012 at 12:22
\$\endgroup\$
1
  • \$\begingroup\$ Thanks, @Konrad. I thought that the bitfields would be condemned in python, but as you mention right now, they even feel a bit natural. BTW, in C I would use an enum for the constants. Is there any Pythonic idiom with the same "flavour" as C's enums? \$\endgroup\$ Commented Jun 24, 2012 at 15:58
2
\$\begingroup\$

Your code is fine as it is, and there isn't much I can recommend. However, I believe that this may profit from being made into a class.

#!/usr/bin/env python
import email
import email.message
class MyMail:
 unwanted_fields = ["content-length", "content-type", "lines", "status"]
 def __init__(self, fp):
 self.res = self.sanitize(email.message_from_file(fp))
 def display(self):
 print self.res.as_string(unixfrom=False)
 def compose_message(self, orig, body):
 wanted = email.message.Message()
 wanted.set_payload(body.get_payload())
 # The dictionaries `orig` and `body` have only headers as their items.
 for field in self.unwanted_fields: del orig[field]
 for k, v in orig.items() + body.items(): wanted[k] = v
 return wanted

Your code does not really care about text/html mime, other than to skip it. Other than that, it cares that text/plain is seen only once. This seemed to be an overkill for bit fiddling. Removing that,

 def sanitize(self, msg):
 if not msg.is_multipart(): return msg
 compose = False
 text_taken = False
 for part in msg.walk():
 if (part.get_content_maintype() == "multipart" or
 part.get_content_type() == "message/external-body" or
 part.get_content_type() == "text/html" or
 part.get_payload() == ""):
 continue
 elif part.get_content_type() == "text/plain" and not text_taken:
 body = part
 compose = True
 text_taken = True
 # if we are seeing a second text/plain part, stop throwing
 # things
 else: return msg
 return self.compose_message(msg, body) if compose else msg
if __name__ == "__main__":
 import sys
 s = MyMail(sys.stdin)
 s.display()
answered Jun 24, 2012 at 4:19
\$\endgroup\$
2
  • 1
    \$\begingroup\$ I’m not sure I agree. While this looks OK and conventional, there’s this great talk arguing that you should Stop Writing Classes. It makes a lot of good points. Well, it could be argued that the presence of display makes a class here worthwhile. Anyway, +1. \$\endgroup\$ Commented Jun 24, 2012 at 8:46
  • \$\begingroup\$ Thanks for analysing my code and concluding that it is not utter terrible. :) I think that, as with @Konrad, I will still keep the code outside of a class (I may change my mind in the future), but I will adopt your absense of bitfields. Your solution is more elegant, IMVHO. \$\endgroup\$ Commented Jun 24, 2012 at 16:10

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.