This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: mbox parser incorrect behaviour
Type: behavior
Stage:
Components: email, Library (Lib)
Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, petri.lehtinen, r.david.murray, sdaoden, wally1980
Priority: normal
Keywords:

Created on 2011-03-31 12:04 by wally1980, last changed 2022-04-11 14:57 by admin.

Messages (8)
msg132657 - Author: valera (wally1980) Date: 2011-03-31 12:04
The mailbox.mbox parser splits mbox files on the "^From " pattern, which is wrong; in fairness, it should split the mbox on "\nFrom ".
Illustration:
------
From bla-blah@localhost
Header1
Header2
body1
body2
From blah-blah2@localhost
Header1
body1
From your dear friend
body3
------
This mbox would be split into 3 messages instead of 2.
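The difference between the two rules can be illustrated with a small sketch. It does not call mailbox.mbox; it only counts candidate separator lines in a string. The sample data and both function names are hypothetical, and the blank lines in the sample are an assumption (the illustration above shows none), added so that the "preceded by an empty line" rule has something to match.

```python
import re

# Assumed sample: the illustration above, with blank lines added after the
# headers and after the first message. Those blank lines are an assumption.
SAMPLE = (
    "From bla-blah@localhost\n"
    "Header1\n"
    "Header2\n"
    "\n"
    "body1\n"
    "body2\n"
    "\n"
    "From blah-blah2@localhost\n"
    "Header1\n"
    "\n"
    "body1\n"
    "From your dear friend\n"
    "body3\n"
)


def count_any_from(text):
    """Count every line that starts with 'From ' (the reported behaviour)."""
    return len(re.findall(r"^From ", text, flags=re.MULTILINE))


def count_blank_preceded(text):
    """Count 'From ' lines at the start of the file or after an empty line."""
    return len(re.findall(r"(?:\A|\n\n)From ", text))


print(count_any_from(SAMPLE), count_blank_preceded(SAMPLE))
```

Under these assumptions the first rule finds three separators and the second finds two, which matches the report.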
msg132671 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-31 14:13
All the references I could find talk about triggering the match without the preceding newline. That is, it is not certain that a blank line will precede the 'From ' header, and the typical quoting rules for mbox format call for any 'From ' at the start of a line (whether preceded by a blank line or not) to be quoted. This might have something to do with the fact that otherwise you have to special case the first line of the mbox, but I don't really know.
What tool are you using that is producing the unquoted 'From ' lines in your mbox? I know there are variants on the mbox format, so if one of them has the format you propose, this would become a feature request to support that variant mbox format.
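As a rough sketch of the typical quoting rule described above, a writer would prefix '>' to any body line that starts with 'From ', whether or not a blank line precedes it. The function name is hypothetical, and this is not the code mailbox.mbox itself uses.

```python
def quote_from_lines(body_lines):
    """Prefix '>' to every body line that starts with 'From '.

    Assumption: the input holds message body lines only, without the
    real separator line and without trailing newlines.
    """
    return [">" + line if line.startswith("From ") else line
            for line in body_lines]


print(quote_from_lines(["body1", "From your dear friend", "body3"]))
```

With the body quoted this way, a reader that treats every line starting with 'From ' as a separator no longer splits the message in the wrong place.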
msg132687 - Author: valera (wally1980) Date: 2011-03-31 16:48
Hello, David!
This is an email from the netcraft mailing list. The host which accepted
it is running sendmail with some antivirus software on top
(mimedefang + spamassassin, from what I know).
It could be that something is broken in that chain. I spotted the error
when I was writing a script for mailbox --> maildir conversion
while migrating this server.
So I had to subclass mailbox.mbox and fix it as I needed; I'll investigate
further what led to such behaviour.
Nevertheless, here is a snippet from RFC 4155:
In order to improve interoperability among messaging systems, this
 memo defines a "default" mbox database format, which MUST be
 supported by all implementations that claim to be compliant with this
 specification.
 The "default" mbox database format uses a linear sequence of Internet
 messages, with each message being immediately prefaced by a separator
 line, and being terminated by an empty line.
---
So I think assuming that there should be an empty line before the
"From " separator line is fine (for the second email and onwards) and
would help to deal with all kinds of mbox mailboxes; the fix is rather
trivial.
Best regards,
Valery Masiutsin
msg138245 - Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-06-13 13:56
Hello Valery Masiutsin, I recently stumbled over this while searching
for the link to the standard I've stored in another issue.
(Without being logged in, say.)
The de-facto standard (http://qmail.org/man/man5/mbox.html) says:
HOW A MESSAGE IS READ
 A reader scans through an mbox file looking for From_ lines.
 Any From_ line marks the beginning of a message. The reader
 should not attempt to take advantage of the fact that every
 From_ line (past the beginning of the file) is preceded by a
 blank line.
This is, however, the recent version. The "mbox" manpage of my up-to-date
Mac OS X 10.6.7 does not state this, for example. It's from 2002.
However, all known MBOX standards, i.e. MBOXO, MBOXRD, MBOXCL, require
proper quoting of non-From_ "From " lines (by prefixing them with '>').
So your example should not fail in Python.
(But hey - are you sure *that* has been produced by Perl?)
You're right, however, that Python seems to only support the old MBOXO
way of un-escaping only plain "From " to/from ">From ", which is not
even mentioned anymore in the current standard - that only describes
MBOXRD ("(>*From )" -> ">"+match.group(1)).
(Lucky me: I own Mac OS X, otherwise I wouldn't even know.)
Thus you're in trouble if the unescaping is performed before the split.
This is another issue, though: "MBOX parser uses MBOXO algorithm".
;> - Ciao, Steffen
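The MBOXO and MBOXRD escaping rules mentioned in this message can be compared in a minimal sketch. All four function names are hypothetical, and the code works on single lines without trailing newlines; it is not what mailbox.mbox does internally.

```python
import re

# MBOXRD: "(>*From )" -> ">" + match.group(1), as quoted in the message above.
_ESCAPE_RD = re.compile(r"^(>*From )")
_UNESCAPE_RD = re.compile(r"^>(>*From )")


def escape_mboxo(line):
    """MBOXO: quote only plain 'From ' lines."""
    return ">" + line if line.startswith("From ") else line


def unescape_mboxo(line):
    """MBOXO: strip one '>' from '>From ' lines only."""
    return line[1:] if line.startswith(">From ") else line


def escape_mboxrd(line):
    """MBOXRD: add one '>' to 'From ' lines with any number of leading '>'."""
    return _ESCAPE_RD.sub(r">\1", line)


def unescape_mboxrd(line):
    """MBOXRD: remove exactly one leading '>' from quoted 'From ' lines."""
    return _UNESCAPE_RD.sub(r"\1", line)


for original in ["From a", ">From a", "plain text"]:
    print(repr(original),
          repr(unescape_mboxo(escape_mboxo(original))),
          repr(unescape_mboxrd(escape_mboxrd(original))))
```

In this sketch the MBOXRD pair round-trips every line, while the MBOXO pair turns a body line that was already ">From a" into "From a", which is the ambiguity the newer rule avoids.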
msg163812 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-24 17:41
It seems to me that "^From " is the correct way to match the start of each message. This is also what the qmail manual page (linked in the previous message) says. So closing as invalid.
msg163872 - Author: valera (wally1980) Date: 2012-06-24 23:03
Hello Petri,
The qmail manpage does not sound like a valid reference to me. I've pointed
to the relevant RFC (which dictates correct behaviour) as a reference; the Python
mbox parser does not conform to it.
Best regards,
Valery Masiutsin
msg163902 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 06:15
Actually, you're right. Sorry for overlooking the RFC. But that said, the RFC itself refers to the same manpage as a reference that's "mostly authoritative for those variations that are otherwise only documented in anecdotal form". So I guess it's quite a good reference after all :)
In Appendix A, RFC 4155 defines a set of rules for a "default" mbox format that maximizes interoperability between different mbox implementations.
The important things in the RFC concerning this issue are:
* There MUST be an empty line after each message.
* The RFC does not specify any escape syntax for message body lines starting with "From ". It says: "Recipient systems are expected to parse full separator lines as they are documented above."
Because the RFC states that there must be an empty line after each message, and it aims for maximum interoperability, I think we can assume that there always is an empty line there. But looking for "\n\nFrom " is not enough for finding the starting points of messages. We should actually parse the whole separator line which consists of "From ", an email address (addr-spec in RFC 2822), a timestamp (in UNIX ctime format without timezone), and a newline character.
I think this should be the default mode for reading mbox files. See #13698 for adding support for other formats.
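A minimal sketch of what parsing the whole separator line could look like follows. The regular expression, the function name and the sample address are assumptions made for illustration; in particular, `\S+@\S+` is a deliberate simplification of addr-spec from RFC 2822, and the timestamp is checked against the 24-character ctime layout with `datetime.strptime` (which assumes English day and month names).

```python
import re
from datetime import datetime

# "From " + simplified address + one space + 24-character ctime timestamp.
# Assumption: a real addr-spec parser would be stricter than \S+@\S+.
SEPARATOR_RE = re.compile(r"^From (?P<addr>\S+@\S+) (?P<date>.{24})$")


def parse_separator(line):
    """Return (address, datetime) for a full separator line, else None."""
    match = SEPARATOR_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    try:
        # UNIX ctime layout without timezone, e.g. "Thu Mar 31 12:04:25 2011".
        when = datetime.strptime(match.group("date"), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    return match.group("addr"), when


print(parse_separator("From valera@localhost Thu Mar 31 12:04:25 2011\n"))
print(parse_separator("From your dear friend\n"))
```

With a check like this, the body line "From your dear friend" from the original report is rejected as a separator even without any quoting.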
msg164636 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-07-04 04:24
Some thoughts on doing "clever tricks" to enhance mbox parsing:
 http://www.jwz.org/doc/content-length.html 
History
Date                 User            Action  Args
2022-04-11 14:57:15  admin           set     github: 55937
2020-11-10 18:18:35  iritkatriel     set     versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2012-07-04 04:24:49  petri.lehtinen  set     messages: + msg164636
2012-06-25 06:15:54  petri.lehtinen  set     status: closed -> open
                                             components: + email
                                             nosy: + barry
                                             messages: + msg163902
                                             resolution: not a bug ->
                                             stage: resolved ->
2012-06-24 23:03:53  wally1980       set     messages: + msg163872
2012-06-24 17:41:08  petri.lehtinen  set     status: open -> closed
                                             nosy: + petri.lehtinen
                                             messages: + msg163812
                                             resolution: not a bug
                                             stage: test needed -> resolved
2011-06-13 13:56:24  sdaoden         set     nosy: + sdaoden
                                             messages: + msg138245
2011-06-01 06:30:44  terry.reedy     set     stage: test needed
                                             type: behavior
                                             versions: - Python 2.6, Python 2.5, Python 3.1
2011-03-31 16:48:39  wally1980       set     messages: + msg132687
2011-03-31 14:13:49  r.david.murray  set     nosy: + r.david.murray
                                             messages: + msg132671
2011-03-31 12:04:25  wally1980       create
