Issue 21448: Email Parser use 100% CPU

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65647

classification

Title:	Email Parser use 100% CPU
Type:	performance	Stage:	patch review
Components:	email	Versions:	Python 3.5

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	barry, christian.heimes, jader.fabiano, pitrou, python-dev, r.david.murray, rhettinger, serhiy.storchaka, tshepang
Priority:	normal	Keywords:	patch

Created on 2014年05月06日 19:20 by jader.fabiano, last changed 2022年04月11日 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
email_parser_long_lines.patch	serhiy.storchaka, 2014年08月02日 13:26	review
fix_email_parse.diff	rhettinger, 2014年08月02日 21:56	Clean-up code in push()	review
fix_email_parse2.diff	rhettinger, 2014年08月03日 09:48	Revise patch to add splitlines.	review
test_parser.diff	rhettinger, 2014年08月03日 10:20	Extra test	review
fix_prepending2.diff	rhettinger, 2014年08月03日 10:40	Speed-up insertion by using a deque	review
email_parser_long_lines_2.patch	serhiy.storchaka, 2014年08月10日 17:56	Raymond's patch + tests	review

Messages (30)
msg218008 - (view)	Author: jader fabiano (jader.fabiano)	Date: 2014年05月06日 19:20
Use email.parser to catch the mime's header,when a mime has attachments the process is consuming 100% of the CPU. And It can take until four minutes to finish the header parser.
msg218010 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2014年05月06日 19:31
Can you provide more details on how to reproduce the problem, please? For example, a sample message and the sequence of python calls you use to parse it.
msg218011 - (view)	Author: jader fabiano (jader.fabiano)	Date: 2014年05月06日 19:51
I am openning a file and I am passing the File Descriptor to this function Parse().parse( fp ): This file has two attachments Example: self.fileDescriptor( file, 'rb') headers = Parser().parse(self.fileDescriptor ) #Here the process starts to consume 100% of the CPU and It takes around four minutes to go the next line. print 'Headers OK' The File's size is 12M Thanks. 2014年05月06日 16:31 GMT-03:00 R. David Murray <report@bugs.python.org>: > > R. David Murray added the comment: > > Can you provide more details on how to reproduce the problem, please? For > example, a sample message and the sequence of python calls you use to parse > it. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue21448> > _______________________________________ >
msg218012 - (view)	Author: jader fabiano (jader.fabiano)	Date: 2014年05月06日 19:53
Sorry! Correct line self.fileDescriptor = open( file, 'rb') 2014年05月06日 16:51 GMT-03:00 jader fabiano <report@bugs.python.org>: > > jader fabiano added the comment: > > I am openning a file and I am passing the File Descriptor to this function > Parse().parse( fp ): > This file has two attachments > Example: > self.fileDescriptor( file, 'rb') > headers = Parser().parse(self.fileDescriptor ) > #Here the process starts to consume 100% of the CPU and It takes around > four minutes to go the next line. > print 'Headers OK' > > The File's size is 12M > > Thanks. > > 2014年05月06日 16:31 GMT-03:00 R. David Murray <report@bugs.python.org>: > > > > > R. David Murray added the comment: > > > > Can you provide more details on how to reproduce the problem, please? > For > > example, a sample message and the sequence of python calls you use to > parse > > it. > > > > ---------- > > > > _______________________________________ > > Python tracker <report@bugs.python.org> > > <http://bugs.python.org/issue21448> > > _______________________________________ > > > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue21448> > _______________________________________ >
msg218013 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2014年05月06日 20:03
We'll need the data file as well. This is going to be a data-dependent issue. With a 12MB body, I'm guessing there's some decoding pathology involved, which may or may not have been already fixed in python3. To confirm this you could use HeaderParser instead of Parser, which won't try to decode the body.
msg218014 - (view)	Author: jader fabiano (jader.fabiano)	Date: 2014年05月06日 20:15
No, The file has 12Mb, because It has attachments. I am going to show an example. You can use a file thus: Date: Tue, May 10:27:17 6 -0300 (BRT) From: email@email.com.br MIME-Version: 1.0 To: example@example.com Subject:example Content-Type: multipart/mixed; boundary=24f59adc-d522-11e3-a531-00265a0f1361 --24f59adc-d522-11e3-a531-00265a0f1361 Content-Type: multipart/alternative; boundary=24f59a28-d522-11e3-a531-00265a0f1361 --24f59a28-d522-11e3-a531-00265a0f1361^M Content-Type: text/html; charset="iso-8859-1" ^M Content-Transfer-Encoding: 7bit <br/><font color="#00000" face="verdana" size="3">Test example</b> --24f59a28-d522-11e3-a531-00265a0f1361-- --24f59adc-d522-11e3-a531-00265a0f1361 Content-Type: application/pdf; name=Example.pdf Content-Disposition: attachment; filename=Example.pdf Content-Transfer-Encoding: base64 attachment content in base64...... --24f59adc-d522-11e3-a531-00265a0f1361-- 2014年05月06日 17:03 GMT-03:00 R. David Murray <report@bugs.python.org>: > > R. David Murray added the comment: > > We'll need the data file as well. This is going to be a data-dependent > issue. With a 12MB body, I'm guessing there's some decoding pathology > involved, which may or may not have been already fixed in python3. > > To confirm this you could use HeaderParser instead of Parser, which won't > try to decode the body. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue21448> > _______________________________________ >
msg218015 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2014年05月06日 20:23
Sorry, I was using RFC-speak. A message is divided into 'headers' and 'body', and all of the attachments are part of the body in RFC terms. But think of it as 'initial headers' and 'everything else'. Please either attach the full file, and/or try your test using HeaderParser and report the results. However, it occurs to me that the attachments aren't decoded until you retrieve them, so whatever is going on it must be something other than a decoding issue. Nevertheless, Parser actually parses the whole message, attachments included, so we'll need the actual message in order to reproduce this (unless you can reproduce it with a smaller message).
msg218016 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2014年05月06日 20:25
Also to clarify: HeaderParser will also read the entire message, it just won't look for MIME attachments in the 'everything else', it will just treat the 'everything else' as arbitrary data and record it as the payload of the top level Message object.
msg218123 - (view)	Author: jader fabiano (jader.fabiano)	Date: 2014年05月08日 20:38
Hi. I undestood this problem that It was happening, I was writting the mime wrong in the attachments. I read a file with size 4M and I've converted to Base64, so I've written in the mime the content. But i wasn't put the lines with 76 ccharacters plus ""/r/n". I was writing the every in the only line. I think this did the Email Parser uses 100% of the CPU and It delay mora time. I packed up and I was sending email very fast. Thanks 2014年05月06日 17:25 GMT-03:00 R. David Murray <report@bugs.python.org>: > > R. David Murray added the comment: > > Also to clarify: HeaderParser will also read the entire message, it just > won't look for MIME attachments in the 'everything else', it will just > treat the 'everything else' as arbitrary data and record it as the payload > of the top level Message object. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue21448> > _______________________________________ >
msg224516 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月01日 19:55
Therefore the bug is that email parser is dramatically slow for abnormal long lines. It has quadratic complexity from line size. Minimal example: import email.parser import time data = 'From: example@example.com\n\n' + 'x' * 10000000 start = time.time() email.parser.Parser().parsestr(data) print(time.time() - start)
msg224562 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月02日 13:26
Parser reads from input file small chunks (8192 churacters) and feed FeedParser which pushes data into BufferedSubFile. In BufferedSubFile.push() chunks of incomplete data are accumulated in a buffer and repeatedly scanned for newlines. Every push() has linear complexity from the size of accumulated buffer, and total complexity is quadratic. Here is a patch which fixes problem with parsing long lines. Feel free to add comments if they are needed (there is an abundance of comments in the module).
msg224591 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月02日 21:56
I think the push() code can be a little cleaner. Attaching a revised patch that simplifies push() a bit.
msg224610 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月03日 06:31
fix_email_parse.diff is not work when one chunk ends with '\r' and next chunk doesn't start with '\n'.
msg224623 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月03日 09:48
Attaching revised patch. I forgot to reapply splitlines.
msg224624 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月03日 10:20
Attaching a more extensive test
msg224631 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月03日 12:56
fix_email_parse2.diff slightly changes behavior. See my comments on Rietveld. As for fix_prepending2.diff, could you please provide any benchmark results? And there is yet one bug in current code. str.splitlines() splits a string not only breaking it at '\r', '\n' or '\r\n', but at any Unicode line break character (e.g. '\x85', '\u2028', etc). And when a chunk ends with such line break character, it will not break a line. Definitely something should fixed: either lines should be broken only at '\r', '\n' or '\r\n', or other line break characters should be handled correctly when they happen at the end of the chunk. What would you say about this, David?
msg224647 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月03日 17:08
> As for fix_prepending2.diff, could you please provide > any benchmark results> No. Inserting at the beginning of a list is always O(n) and inserting at the beginning of a deque is always O(1).
msg224649 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月03日 18:28
Yes, but if n is limited, O(n) becomes O(1). In our case n is the number of fed but not read lines. I suppose the worst case is a number of empty lines, in this case n=8192. I tried following microbenchmark and did not noticed significant difference. $ ./python -m timeit -s "from email.parser import Parser; d = 'From: example@example.com\n\n' + '\n' * 100000" -- "Parser().parsestr(d)"
msg224667 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月03日 23:03
A deque is typically the right data structure when you need to append, pop, and extend on both the left and right side. It is designed specifically for that task. Also, it nicely cleans-up the code by removing the backwards line list and the list reversal prior to insertion on the left (that's what we had to do to achieve decent performance before the introduction of deques in Python 2.4, now you hardly ever see code like "self._lines[:0] = lines[::-1]"). I think fix_prepending2 would be a nice improvement for Py3.5. For the main patches that directly address the OP's performance issue, feel free to apply either my or yours. They both work. Either way, please add test_parser.diff since the original test didn't cover all the cases and because it didn't make clear the relationship between push() and splitlines().
msg224863 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2014年08月05日 17:31
Serhiy: there was an issue with /r/n going across a chunk boundary that was fixed a while back, so there should be a test for that (I hope). As for how to handle line breaks, backward compatibility applies: we have to continue to do what we did before, and it doesn't look like this patch changes that. That is, it sounds like you are saying there is a pre-existing bug that we may want to address? In which case it should presumably be a separate issue.
msg224866 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2014年08月05日 18:00
Should this be categorized as a security issue? You could easily DoS a server with that (email.parser is used by http.client to parse HTTP headers, it seems).
msg224894 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月06日 01:25
> Should this be categorized as a security issue? > You could easily DoS a server with that > (email.parser is used by http.client to parse HTTP > headers, it seems). I think it makes sense to treat this as a security issue. I don't have a preference about whether to use Serhiy's email_parser_long_lines.patch or my fix_email_parse2.diff but we should include the extra tests in test_parser.diff.
msg224910 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月06日 05:43
I found a bug in my patch. Following code from email.parser import Parser BLOCKSIZE = 8192 s = 'From: <e@example.com>\nFoo: ' s += 'x' * ((-len(s) - 1) % BLOCKSIZE) + '\rBar: ' s += 'y' * ((-len(s) - 1) % BLOCKSIZE) + '\x85Baz: ' s += 'z' * ((-len(s) - 1) % BLOCKSIZE) + '\n\n' print(Parser().parsestr(s).keys()) outputs ['From', 'Foo', 'Bar', 'Baz'] on current code and ['From', 'Foo', 'Bar'] with my patch. Neither current code, nor Reimonds patch are not affected by similar bugs. It is possible to fix my patch, but then it will become too complicated and slower. I have one doubt about one special case with Raymond's patch, but looking at current code on highter level, this doesn't matter. Current code in FeedParser in any case is not very efficient and smoothes out any implementation details in BufferedSubFile. That is why fix_prepending2.diff has no visible effect on email parsing. I'll provided additional tests which cover current issue and a bug in my patch. > That is, it sounds like you are saying there is a pre-existing bug that we may want to address? In which case it should presumably be a separate issue. I can't create an example. May be higher level code is tolerant to it. I'll created separate issue if found an example. > Should this be categorized as a security issue? Yes, but not very important. You need send tens or hundreds of megabytes to hang a server more than a second.
msg225155 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月10日 17:56
Here is a patch which combines fixed Raymond's patch and FeedParser tests. These tests cover this issue, a bug in my patch, and (surprisingly) a bug in Raymond's patch. I didn't include Raymond's test because looks as it doesn't catch any bug. If there are no objections, I'll commit this patch.
msg225158 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2014年08月10日 18:37
The test_parser.diff file catches the bug in fix_email_parse.diff and it provides some assurance that push() functions as an incremental version of str.splitlines(). I would like to have this test included. It does some good and does no harm.
msg225228 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014年08月12日 11:04
New changeset ba90bd01c5f1 by Serhiy Storchaka in branch '2.7': Issue #21448: Fixed FeedParser feed() to avoid O(N2) behavior when parsing long line. http://hg.python.org/cpython/rev/ba90bd01c5f1 New changeset 1b1f92e39462 by Serhiy Storchaka in branch '3.4': Issue #21448: Fixed FeedParser feed() to avoid O(N2) behavior when parsing long line. http://hg.python.org/cpython/rev/1b1f92e39462 New changeset f296d7d82675 by Serhiy Storchaka in branch 'default': Issue #21448: Fixed FeedParser feed() to avoid O(N**2) behavior when parsing long line. http://hg.python.org/cpython/rev/f296d7d82675
msg225229 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2014年08月12日 11:15
> The test_parser.diff file catches the bug in fix_email_parse.diff I don't see this. But well, it does no harm. Please commit fix_prepending2.diff yourself.
msg225233 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014年08月12日 17:26
New changeset 71cb8f605f77 by Serhiy Storchaka in branch '2.7': Decreased memory requirements of new tests added in issue21448. http://hg.python.org/cpython/rev/71cb8f605f77 New changeset c19d3465965f by Serhiy Storchaka in branch '3.4': Decreased memory requirements of new tests added in issue21448. http://hg.python.org/cpython/rev/c19d3465965f New changeset f07b17de3b0d by Serhiy Storchaka in branch 'default': Decreased memory requirements of new tests added in issue21448. http://hg.python.org/cpython/rev/f07b17de3b0d
msg240670 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2015年04月13日 17:22
Raymond, are you gong to apply the deque patch (maybe after doing performance measurement) or should we close this?
msg243873 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2015年05月23日 00:23
New changeset 830bcf4fb29b by Raymond Hettinger in branch 'default': Issue #21448: Improve performance of the email feedparser https://hg.python.org/cpython/rev/830bcf4fb29b

History
Date	User	Action	Args
2022年04月11日 14:58:03	admin	set	github: 65647
2015年05月23日 00:24:13	rhettinger	set	status: open -> closed assignee: rhettinger -> resolution: fixed
2015年05月23日 00:23:48	python-dev	set	messages: + msg243873
2015年04月13日 17:22:32	r.david.murray	set	messages: + msg240670
2014年08月12日 17:26:13	python-dev	set	messages: + msg225233
2014年08月12日 11:15:13	serhiy.storchaka	set	assignee: serhiy.storchaka -> rhettinger messages: + msg225229 versions: - Python 2.7, Python 3.4
2014年08月12日 11:04:14	python-dev	set	nosy: + python-dev messages: + msg225228
2014年08月10日 18:37:06	rhettinger	set	messages: + msg225158
2014年08月10日 17:56:28	serhiy.storchaka	set	files: + email_parser_long_lines_2.patch messages: + msg225155
2014年08月06日 05:43:38	serhiy.storchaka	set	messages: + msg224910
2014年08月06日 01:25:59	rhettinger	set	messages: + msg224894
2014年08月05日 18:00:33	pitrou	set	nosy: + christian.heimes, pitrou messages: + msg224866
2014年08月05日 17:31:59	r.david.murray	set	messages: + msg224863
2014年08月03日 23:03:32	rhettinger	set	messages: + msg224667
2014年08月03日 18:28:13	serhiy.storchaka	set	messages: + msg224649
2014年08月03日 17:08:30	rhettinger	set	messages: + msg224647
2014年08月03日 12:56:30	serhiy.storchaka	set	messages: + msg224631
2014年08月03日 10:40:17	rhettinger	set	files: - fix_prepending.diff
2014年08月03日 10:40:06	rhettinger	set	files: + fix_prepending2.diff
2014年08月03日 10:33:18	rhettinger	set	files: + fix_prepending.diff
2014年08月03日 10:20:59	rhettinger	set	files: + test_parser.diff messages: + msg224624
2014年08月03日 09:48:59	rhettinger	set	files: + fix_email_parse2.diff messages: + msg224623
2014年08月03日 06:31:03	serhiy.storchaka	set	messages: + msg224610
2014年08月03日 03:42:27	rhettinger	set	messages: - msg224577
2014年08月02日 21:56:55	rhettinger	set	files: + fix_email_parse.diff assignee: rhettinger -> serhiy.storchaka messages: + msg224591
2014年08月02日 17:46:25	rhettinger	set	assignee: rhettinger messages: + msg224577 nosy: + rhettinger
2014年08月02日 13:27:41	serhiy.storchaka	set	stage: patch review
2014年08月02日 13:26:55	serhiy.storchaka	set	files: + email_parser_long_lines.patch keywords: + patch messages: + msg224562
2014年08月01日 19:55:31	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg224516 versions: + Python 3.4, Python 3.5
2014年05月09日 20:29:21	tshepang	set	nosy: + tshepang
2014年05月08日 20:38:38	jader.fabiano	set	messages: + msg218123
2014年05月06日 20:25:31	r.david.murray	set	messages: + msg218016
2014年05月06日 20:23:14	r.david.murray	set	messages: + msg218015
2014年05月06日 20:15:43	jader.fabiano	set	messages: + msg218014
2014年05月06日 20:03:51	r.david.murray	set	messages: + msg218013
2014年05月06日 19:53:49	jader.fabiano	set	messages: + msg218012
2014年05月06日 19:51:14	jader.fabiano	set	messages: + msg218011
2014年05月06日 19:31:23	r.david.murray	set	messages: + msg218010
2014年05月06日 19:20:13	jader.fabiano	create

homepage