Issue 11271: concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/55480

classification

Title:	concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.5

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	bquinlan, dan.oreilly, josh.r, neologix, pitrou, python-dev, sbt, tbrink
Priority:	normal	Keywords:	patch

Created on 2011年02月21日 15:04 by tbrink, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
map_comparison.py	tbrink, 2011年02月21日 15:04	Demonstration and workaround
new_processpoolexecutor.py	tbrink, 2011年02月21日 16:41	Improved the workaround
map_chunksize.patch	dan.oreilly, 2014年07月24日 16:22	Adds chunksize parameter to ProcessPoolExecutor.map.	review
map_chunksize_with_test.patch	dan.oreilly, 2014年07月31日 23:01	review
test_mult.py	dan.oreilly, 2014年07月31日 23:04	benchmark script
map_chunksize_with_docs.patch	dan.oreilly, 2014年08月07日 23:25	Updated patch with doc changes.	review
map_chunksize_docs_update.patch	dan.oreilly, 2014年08月12日 00:19	More tests, small change to how we "yield from" the result chunks	review

Messages (19)
msg128963 - (view)	Author: Tobias Brink (tbrink)	Date: 2011年02月21日 15:04
I tested the new concurrent.futures.ProcessPoolExecutor.map() in 3.2 with the is_prime() function from the documentation example. This was significantly slower than using multiprocessing.Pool.map(). Quick look at the source showed that multiprocessing sends the iterable in chunks to the worker process while futures sends always only one entry of the iterable to the worker. Functions like is_prime() which finish relatively fast make the communication overhead (at least I guess that is the culprit) very big in comparison. Attached is a file which demonstrates the problem and a quick workaround. The workaround uses the chunk idea from multiprocessing. The problem is that it requires the iterables passed to map() to have a length and be indexable with a slice. I believe this limitation could be worked around.
msg128970 - (view)	Author: Tobias Brink (tbrink)	Date: 2011年02月21日 16:41
Playing around a bit I wrote the attached implementation which works with all iterables.
msg137351 - (view)	Author: Brian Quinlan (bquinlan) * (Python committer)	Date: 2011年05月31日 09:42
On my crappy computer, ProcessPoolExecutor.map adds <3ms of added execution time per call using your test framework. What is your use case where that is too much? That being said, I don't have any objections to making improvements. If you want to pursue this, could you attach a working map_comprison.py?
msg137359 - (view)	Author: Tobias Brink (tbrink)	Date: 2011年05月31日 13:25
I can confirm an overhead of 2 ms to 3 ms using a relatively recent Intel Core i5 CPU. I (personally) believe these 3 ms to be a pretty big overhead on modern computers and I also believe that it would be relatively simple to reduce it. I don't have much time at the moment but I will try to produce a complete proof of concept patch for the futures module in the next weeks. I'd be happy to get some comments when it is finished.
msg137692 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2011年06月05日 13:14
Using your test script fixed (on Python 3.3), I get the following numbers: Starting multiproc...done in 2.1014609336853027 s. Starting futures...done in 20.209479093551636 s. Starting futures "fixed"...done in 2.026125907897949 s. So there's a 0.2ms overhead per remote function call here (20/(100100000-100000000)). Can't your chunks() function use itertools.islice()? Also, the chunksize can't be anything else than 1 by default, since your approach is increasing latency of returning results.
msg155114 - (view)	Author: Brian Quinlan (bquinlan) * (Python committer)	Date: 2012年03月07日 20:09
I'm closing this since tbrink didn't respond to pitrou's comments.
msg223802 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年07月24日 01:59
I'm seeing an even larger difference between multiprocessing.Pool and ProcessPoolExecutor on my machine, with Python 3.4: Starting multiproc...done in 2.160644769668579 s. Starting futures...done in 67.953957319259644 s. Starting futures "fixed"...done in 2.134932041168213 s. I've updated the initial patch to address the comments Antoine made; the chunksize now defaults to 1, and itertools is used to chunk the input iterables, rather than building a list. Attached is an updated benchmark script: Starting multiproc...done in 2.2295100688934326 s. Starting futures...done in 68.5991039276123 s. Starting futures "fixed" (no chunking)...done in 69.18992304801941 s. Starting futures "fixed" (with chunking)...done in 2.352942705154419 s. The new implementation of map has essentially identical performance to the original with chunksize=1, but it performs much better with a larger chunksize provided.
msg223803 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年07月24日 02:00
Here's a patch that adds the new map implementation from the benchmark script to concurrent.futures.process.
msg223855 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年07月24日 16:24
I've added new versions of the patch/demonstration that ensures we actually submit all the futures before trying to yield from the returned iterator. Previously we were just returning a generator object when map was called, without actually submitting any futures.
msg223966 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2014年07月25日 16:44
Thank you for posting this, I'm reopening the issue.
msg224369 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2014年07月31日 01:42
Thank you for the patch. I've posted some review comments. Two further remarks: - this makes the ProcessPool API slightly different from the ThreadPool one. I wonder if this is acceptable or not. Thoughts? - the patch would need unit tests for the additional functionality
msg224372 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年07月31日 02:00
re: Diverging ThreadPoolExecutor and ProcessPoolExecutor APIs. I thought about this as well. It would of course be possible to make ThreadPoolExecutor's API match, but it would serve no useful purpose that I can think of. I guess we could make the ThreadPoolExecutor API accept the chunksize argument just so the APIs match, but then not actually use it (and document that it's not used, of course). That would make it easier to switch between the two APIs, at least. On the other hand, the two APIs are slightly different already: ThreadPoolExecutor requires max_workers be supplied, while ProcessPoolExecutor will default to multiprocessing.cpu_count(). So this wouldn't completely be without precedent. Anyway, I will review your comments on the patch, add unit tests, and re-submit. Thanks!
msg224451 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年07月31日 23:01
I've attached an updated patch based on your review comments; there's now a unit test with a non-default chunksize, the chunking algorithm is more readable, and the same code path is used no matter what chunksize is provided. I've also updated the test script to match the implementation changes.
msg225045 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年08月07日 23:25
Here's an updated patch that adds documentation and Antoine's requested code changes.
msg225143 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2014年08月10日 15:46
Thanks for the update! This looks basically good, I'll wait a bit to see if other people have comments, otherwise I'll commit.
msg225214 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年08月12日 00:19
A couple of small updates based on comments from Charles-François Natali: * Use itertools.chain.from_iterable to "yield from" the result chunks instead of a for loop. * Add more tests with varying chunksizes.
msg228472 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014年10月04日 18:21
New changeset f87c2c4f03da by Antoine Pitrou in branch 'default': Issue #11271: concurrent.futures.Executor.map() now takes a chunksize https://hg.python.org/cpython/rev/f87c2c4f03da
msg228473 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2014年10月04日 18:28
Sorry for the delay. I simplified the patch a tiny bit ("yield from" wasn't needed, it was sufficient to result the itertools.chain.from_iterable() result), and committed it. Thank you!
msg228514 - (view)	Author: Dan O'Reilly (dan.oreilly) *	Date: 2014年10月05日 00:28
Hey, my first committed patch :) Thanks for helping to push this through, Antoine!

History
Date	User	Action	Args
2022年04月11日 14:57:13	admin	set	github: 55480
2014年10月05日 00:28:50	dan.oreilly	set	messages: + msg228514
2014年10月04日 18:28:06	pitrou	set	status: open -> closed resolution: fixed messages: + msg228473 stage: patch review -> resolved
2014年10月04日 18:21:31	python-dev	set	nosy: + python-dev messages: + msg228472
2014年08月12日 00:19:17	dan.oreilly	set	files: + map_chunksize_docs_update.patch messages: + msg225214
2014年08月10日 15:46:56	pitrou	set	messages: + msg225143
2014年08月10日 15:42:33	pitrou	set	nosy: + neologix stage: patch review versions: + Python 3.5, - Python 3.3
2014年08月07日 23:25:13	dan.oreilly	set	files: + map_chunksize_with_docs.patch messages: + msg225045
2014年08月02日 01:14:09	josh.r	set	nosy: + josh.r
2014年07月31日 23:04:01	dan.oreilly	set	files: + test_mult.py
2014年07月31日 23:02:17	dan.oreilly	set	files: - test_mult.py
2014年07月31日 23:01:51	dan.oreilly	set	files: + map_chunksize_with_test.patch messages: + msg224451
2014年07月31日 02:00:37	dan.oreilly	set	messages: + msg224372
2014年07月31日 01:42:56	pitrou	set	messages: + msg224369
2014年07月25日 16:44:34	pitrou	set	status: closed -> open nosy: + sbt messages: + msg223966 resolution: out of date -> (no value)
2014年07月24日 16:24:38	dan.oreilly	set	messages: + msg223855
2014年07月24日 16:22:17	dan.oreilly	set	files: + map_chunksize.patch
2014年07月24日 16:21:41	dan.oreilly	set	files: - map_chunksize.patch
2014年07月24日 16:20:26	dan.oreilly	set	files: + test_mult.py
2014年07月24日 16:19:25	dan.oreilly	set	files: - test_mult.py
2014年07月24日 16:18:29	dan.oreilly	set	files: + map_chunksize.patch
2014年07月24日 16:17:46	dan.oreilly	set	files: - map_chunksize.patch
2014年07月24日 02:01:00	dan.oreilly	set	files: + map_chunksize.patch keywords: + patch messages: + msg223803
2014年07月24日 01:59:07	dan.oreilly	set	files: + test_mult.py nosy: + dan.oreilly messages: + msg223802
2012年10月21日 09:19:50	bquinlan	set	status: open -> closed
2012年03月07日 20:09:12	bquinlan	set	resolution: out of date messages: + msg155114
2011年06月05日 13:14:44	pitrou	set	nosy: + pitrou title: concurrent.futures.ProcessPoolExecutor.map() slower than multiprocessing.Pool.map() for fast function argument -> concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks messages: + msg137692 versions: + Python 3.3, - Python 3.2
2011年05月31日 13:25:46	tbrink	set	messages: + msg137359
2011年05月31日 09:42:24	bquinlan	set	messages: + msg137351
2011年05月30日 23:55:10	ned.deily	set	nosy: + bquinlan
2011年02月21日 16:41:22	tbrink	set	files: + new_processpoolexecutor.py messages: + msg128970
2011年02月21日 15:04:48	tbrink	create

homepage