Created on 2011-03-31 12:04 by wally1980, last changed 2022-04-11 14:57 by admin.
| Messages (8) | |||
|---|---|---|---|
| msg132657 - (view) | Author: valera (wally1980) | Date: 2011-03-31 12:04 | |
The mailbox.mbox parser splits mbox files on the "^From " pattern, which is wrong; it should instead split on "\nFrom ". Illustration: ------ From bla-blah@localhost Header1 Header2 body1 body2 From blah-blah2@localhost Header1 body1 From your dear friend body3 ------ This mbox would be split into 3 messages instead of 2. |
|||
| msg132671 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2011-03-31 14:13 | |
All the references I could find talk about triggering the match without the preceding newline. That is, it is not certain that a blank line will precede the 'From ' header, and the typical quoting rules for mbox format call for any 'From ' at the start of a line (whether preceded by a blank line or not) to be quoted. This might have something to do with the fact that otherwise you have to special-case the first line of the mbox, but I don't really know. What tool are you using that is producing the unquoted 'From ' lines in your mbox? I know there are variants on the mbox format, so if one of them has the format you propose, this would become a feature request to support that variant mbox format. |
|||
| msg132687 - (view) | Author: valera (wally1980) | Date: 2011-03-31 16:48 | |
On 2011-03-31 14:13:50 +0000 "R. David Murray" <report@bugs.python.org> wrote: > > R. David Murray <rdmurray@bitdance.com> added the comment: > > All the references I could find talk about triggering the match > without the preceding newline. That is, it is not certain that a > blank line will precede the 'From ' header, and the typical quoting > rules for mbox format call for any 'From ' at the start of a line > (whether preceded by a blank line or not) to be quoted. This might > have something to do with the fact that otherwise you have to special > case the first line of the mbox, but I don't really know. > > What tool are you using that is producing the unquoted 'From ' lines > in your mbox? I know there are variants on the mbox format, so if > one of them has the format you propose, this would become a feature > request to support that variant mbox format. > > ---------- > nosy: +r.david.murray > Hello, David! This is an email from the Netcraft mailing list. The host that accepted it runs sendmail with some antivirus software on top (MIMEDefang + SpamAssassin, from what I know). It could be that something is broken in that chain. I spotted the error while writing a script for mailbox-to-maildir conversion during the migration of this server, so I had to subclass mailbox.mbox and fix it as I needed. I'll investigate further what led to this behaviour. Nevertheless, here is a snippet from RFC 4155: In order to improve interoperability among messaging systems, this memo defines a "default" mbox database format, which MUST be supported by all implementations that claim to be compliant with this specification. The "default" mbox database format uses a linear sequence of Internet messages, with each message being immediately prefaced by a separator line, and being terminated by an empty line. 
--- So I think assuming that there is an empty line before the "From " separator line is fine (for the second and later messages) and would help deal with all kinds of mbox mailboxes; the fix is rather trivial. Best regards, Valery Masiutsin |
|||
| msg138245 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-06-13 13:56 | |
Hello Valery Masiutsin, I recently stumbled over this while searching for the link to the standard that I had stored in another issue. (Without being logged in, that is.) The de facto standard (http://qmail.org/man/man5/mbox.html) says: HOW A MESSAGE IS READ A reader scans through an mbox file looking for From_ lines. Any From_ line marks the beginning of a message. The reader should not attempt to take advantage of the fact that every From_ line (past the beginning of the file) is preceded by a blank line. This is, however, the recent version. The "mbox" manpage of my up-to-date Mac OS X 10.6.7, for example, does not state this; it is from 2002. However, all known MBOX standards, i.e. MBOXO, MBOXRD, MBOXCL, require proper quoting of non-From_ "From " lines (by prefixing them with '>'). So your example should not fail in Python. (But hey - are you sure *that* has been produced by Perl?) You're right, however, that Python seems to support only the old MBOXO way of un-escaping only plain "From " to/from ">From ", which is not even mentioned anymore in the current standard; that only describes MBOXRD ("(>*From )" -> ">"+match.group(1)). (Lucky me: I own Mac OS X, otherwise I wouldn't even know.) Thus you're in trouble if the unescaping is performed before the split. This is another issue, though: "MBOX parser uses MBOXO algorithm". ;> - Ciao, Steffen |
|||
| msg163812 - (view) | Author: Petri Lehtinen (petri.lehtinen) * (Python committer) | Date: 2012-06-24 17:41 | |
It seems to me that "^From " is the correct way to match the start of each message. This is also what the qmail manual page (linked in the previous message) says. So closing as invalid. |
|||
| msg163872 - (view) | Author: valera (wally1980) | Date: 2012-06-24 23:03 | |
Hello Petri, the qmail manpage does not sound like a valid reference to me. I've pointed to the relevant RFC (which dictates the correct behaviour) as a reference; the Python mbox parser does not conform to it. Best regards, Valery Masiutsin On Sun, Jun 24, 2012 at 6:41 PM, Petri Lehtinen <report@bugs.python.org> wrote: > > Petri Lehtinen <petri@digip.org> added the comment: > > It seems to me that "^From " is the correct way to match the start of each > message. This is also what the qmail manual page (linked in the previous > message) says. So closing as invalid. > > ---------- > nosy: +petri.lehtinen > resolution: -> invalid > stage: test needed -> committed/rejected > status: open -> closed > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue11728> > _______________________________________ > |
|||
| msg163902 - (view) | Author: Petri Lehtinen (petri.lehtinen) * (Python committer) | Date: 2012-06-25 06:15 | |
Actually, you're right. Sorry for overlooking the RFC. But that said, the RFC itself refers to the same manpage as a reference that's "mostly authoritative for those variations that are otherwise only documented in anecdotal form". So I guess it's quite a good reference after all :) In Appendix A, RFC 4155 defines a set of rules for a "default" mbox format that maximizes interoperability between different mbox implementations. The important things in the RFC concerning this issue are: * There MUST be an empty line after each message. * The RFC does not specify any escape syntax for message body lines starting with "From ". It says: "Recipient systems are expected to parse full separator lines as they are documented above." Because the RFC states that there must be an empty line after each message, and it aims for maximum interoperability, I think we can assume that there always is an empty line there. But looking for "\n\nFrom " is not enough for finding the starting points of messages. We should actually parse the whole separator line which consists of "From ", an email address (addr-spec in RFC 2822), a timestamp (in UNIX ctime format without timezone), and a newline character. I think this should be the default mode for reading mbox files. See #13698 for adding support for other formats. |
|||
| msg164636 - (view) | Author: Petri Lehtinen (petri.lehtinen) * (Python committer) | Date: 2012-07-04 04:24 | |
Some thoughts on doing "clever tricks" to enhance mbox parsing: http://www.jwz.org/doc/content-length.html |
|||
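
The thread above weighs three ways of finding message starts: any line beginning with "From ", a "From " that follows an empty line, and a fully parsed separator line (address plus ctime-style timestamp). The following is a minimal sketch for comparing them, not the `mailbox.mbox` implementation. The sample text, its timestamps, the placement of its blank lines, the pattern names, and the `count_messages` helper are all assumptions made for illustration, and the address pattern is only a rough stand-in for addr-spec.

```python
import re

# Hypothetical sample modelled on the illustration in the first message.
# The ctime timestamps and the blank-line placement are assumptions, added
# so that all three strategies have something to match.
SAMPLE = (
    "From bla-blah@localhost Thu Mar 31 12:04:00 2011\n"
    "Header1: a\n"
    "Header2: b\n"
    "\n"
    "body1\n"
    "body2\n"
    "\n"
    "From blah-blah2@localhost Thu Mar 31 12:05:00 2011\n"
    "Header1: c\n"
    "\n"
    "body1\n"
    "From your dear friend\n"
    "body3\n"
)

# Strategy 1: any line that starts with "From " opens a new message.
ANY_LINE_START = re.compile(r"^From ", re.MULTILINE)

# Strategy 2: "From " counts only at the start of the file or directly
# after an empty line.
AFTER_BLANK_LINE = re.compile(r"(?:\A|\n\n)From ")

# Strategy 3: match the whole separator line. The address part is a rough
# approximation, not a full RFC 2822 addr-spec parser.
FULL_SEPARATOR = re.compile(
    r"^From \S+@\S+ "
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$",
    re.MULTILINE,
)


def count_messages(pattern, text):
    """Count how many message starts the pattern finds in text."""
    return len(pattern.findall(text))


counts = {
    "any line start": count_messages(ANY_LINE_START, SAMPLE),
    "after blank line": count_messages(AFTER_BLANK_LINE, SAMPLE),
    "full separator line": count_messages(FULL_SEPARATOR, SAMPLE),
}
for name, n in counts.items():
    print(f"{name}: {n}")
```

On this sample the first strategy reports 3 message starts, because the unquoted body line "From your dear friend" is counted, while the other two report 2. The blank-line rule alone would still miscount a body paragraph that begins with "From " after an empty line, which is the reason msg163902 gives for parsing the whole separator line.
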
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:15 | admin | set | github: 55937 |
| 2020-11-10 18:18:35 | iritkatriel | set | versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.2, Python 3.3, Python 3.4 |
| 2012-07-04 04:24:49 | petri.lehtinen | set | messages: + msg164636 |
| 2012-06-25 06:15:54 | petri.lehtinen | set | status: closed -> open components: + email nosy: + barry messages: + msg163902 resolution: not a bug -> stage: resolved -> |
| 2012-06-24 23:03:53 | wally1980 | set | messages: + msg163872 |
| 2012-06-24 17:41:08 | petri.lehtinen | set | status: open -> closed nosy: + petri.lehtinen messages: + msg163812 resolution: not a bug stage: test needed -> resolved |
| 2011-06-13 13:56:24 | sdaoden | set | nosy: + sdaoden messages: + msg138245 |
| 2011-06-01 06:30:44 | terry.reedy | set | stage: test needed type: behavior versions: - Python 2.6, Python 2.5, Python 3.1 |
| 2011-03-31 16:48:39 | wally1980 | set | messages: + msg132687 |
| 2011-03-31 14:13:49 | r.david.murray | set | nosy: + r.david.murray messages: + msg132671 |
| 2011-03-31 12:04:25 | wally1980 | create | |