Issue 23992: multiprocessing: MapResult shouldn't fail fast upon exception

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68180

classification

Title:	multiprocessing: MapResult shouldn't fail fast upon exception
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4, Python 3.5, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	davin, neologix, pitrou, python-dev, sbt, vstinner
Priority:	normal	Keywords:	needs review, patch

Created on 2015-04-18 09:00 by neologix, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded
mp_map_fail_fast_27.diff	neologix, 2015-04-22 18:50
mp_map_fail_fast_default.diff	neologix, 2015-04-22 18:50

Messages (8)
msg241404	Author: Charles-François Natali (neologix) (Python committer)	Date: 2015-04-18 09:00
hanger.py
"""
from time import sleep

def hang(i):
    sleep(i)
    raise ValueError("x" * 1024**2)
"""

The following code will deadlock on pool.close():
"""
from multiprocessing import Pool
from time import sleep
from hanger import hang

with Pool() as pool:
    try:
        pool.map(hang, [0,1])
    finally:
        sleep(0.5)
        pool.close()
        pool.join()
"""

The problem is that when one of the tasks comprising a map result fails with an exception, the corresponding MapResult is removed from the result cache:

    def _set(self, i, success_result):
        success, result = success_result
        if success:
            [snip]
        else:
            self._success = False
            self._value = result
            if self._error_callback:
                self._error_callback(self._value)
            <===
            del self._cache[self._job]
            self._event.set()
            ===>

This means that when the pool is closed, the result handler thread terminates right away, because it doesn't see any task left to wait for. As a consequence, it doesn't drain the result queue, and if some worker process is trying to write a large result to it (hence the large ValueError, to fill the socket/pipe buffer), that worker will hang, and the pool won't shut down (unless you call terminate()).

Although I can see the advantage of fail-fast behavior, I don't think it's correct, because it breaks the invariant that results won't be deleted from the cache until they're actually done.

Also, the current fail-fast behavior breaks the semantics that the call only returns when it has completed. Returning while some jobs that are part of the map are still running is potentially very bad, e.g. if the user retries the same call, assuming that all the jobs are done. Retrying jobs that are idempotent but not safe for parallel execution would break with the current code.

The fix is trivial: use the same logic as in the success case, and only signal failure when all jobs are done. I'll provide a patch if it seems sensible :-)
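The proposed fix can be illustrated with a toy model. This is only a sketch of the idea in the message above, not the committed patch: ToyMapResult is a hypothetical stand-in for multiprocessing.pool.MapResult, assumes one job per chunk, and keeps just enough state to show that the cache entry and the ready event are touched only once every job has reported in.

```python
import threading


class ToyMapResult:
    """Hypothetical, simplified stand-in for MapResult (not the real class)."""

    def __init__(self, cache, job, length, error_callback=None):
        self._cache = cache
        self._job = job
        self._number_left = length          # jobs that have not reported yet
        self._value = [None] * length
        self._success = True
        self._error_callback = error_callback
        self._event = threading.Event()
        cache[job] = self                   # stay visible until fully done

    def _set(self, i, success_result):
        success, result = success_result
        self._number_left -= 1
        if success and self._success:
            self._value[i] = result
        elif not success and self._success:
            # Remember only the first failure; later results are ignored.
            self._success = False
            self._value = result
        if self._number_left == 0:
            # Signal completion (success or failure) only when all jobs are done.
            if not self._success and self._error_callback:
                self._error_callback(self._value)
            del self._cache[self._job]
            self._event.set()

    def ready(self):
        return self._event.is_set()


cache = {}
errors = []
res = ToyMapResult(cache, job=0, length=3, error_callback=errors.append)

res._set(0, (False, ValueError("boom")))   # first job fails
after_failure = (res.ready(), 0 in cache)  # not ready, still cached

res._set(1, (True, "a"))
res._set(2, (True, "b"))                   # last job reports in
after_all = (res.ready(), 0 in cache)      # ready, removed from cache
```

Because the entry stays in the cache after the first failure, a result handler that waits for the cache to empty would keep draining the queue until the remaining jobs have reported.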
msg241459	Author: Davin Potts (davin) (Python committer)	Date: 2015-04-18 20:58
This is a nice example demonstrating what I agree is a problem with the current implementation of close. A practical concern with what I believe is being proposed in your trivial fix: if the workers are engaged in very long-running tasks (and perhaps slowly writing their overly large results to the results queue) then we would have to wait for quite a long time for these other workers to reach their natural completion. That said, I believe close should in fact behave just that way and have us subsequently wait for the others to be completed. It is not close's job to attempt to address the general concern I bring up. This change could be felt by people who have written their code to expect the result handler's immediate shutdown if there are no other visible results -- it is difficult to imagine what the impact would be. This is my long-winded way of saying it seems very sensible and welcome to me if you took the time to prepare a patch.
msg241823	Author: Charles-François Natali (neologix) (Python committer)	Date: 2015-04-22 18:40
Patches for 2.7 and default.
msg246045	Author: Charles-François Natali (neologix) (Python committer)	Date: 2015-07-01 19:23
Barring any objections, I'll commit within the next few days.
msg250138	Author: Davin Potts (davin) (Python committer)	Date: 2015-09-07 23:12
@neologix: Budgeting time this week to have a proper look -- sorry I haven't gotten back to it sooner.
msg250523	Author: Davin Potts (davin) (Python committer)	Date: 2015-09-12 16:05
The patches make good sense to me -- I have no comments to add in a review. I spent more time than I care to admit concerned with the idea that error_callback (exposed by map_async which map sits on top of) should perhaps be called not just once at the end but each time an exception occurs. Motivated by past jobs which failed overall to yield any results because one out of a million of the inputs triggered an error, I thought the idea very appealing and experimented with implementing it (with happy results). Googling for it though, I found plenty of examples of people asking questions about how callback and error_callback are intended to work -- though the documentation is not explicit on this particular point, most of those search results correctly document in the wild that error_callback is called only once at the end just like callback. I think it best to leave that functionality just as you have it now. Thanks for creating the patch -- looks great to me.
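The behavior discussed here (error_callback invoked once, at the end) can be observed with a small script. This is a sketch that assumes an interpreter that already includes the fix from this issue; it uses multiprocessing.pool.ThreadPool rather than a process pool purely to keep the example self-contained, and task, finished and callback_calls are illustrative names.

```python
import threading
import time
from multiprocessing.pool import ThreadPool

finished = []          # indices of tasks that ran to completion
callback_calls = []    # exceptions passed to error_callback
lock = threading.Lock()


def task(i):
    if i == 0:
        raise ValueError("task 0 failed")   # fails immediately
    time.sleep(0.2)                         # the others take a while
    with lock:
        finished.append(i)
    return i


with ThreadPool(2) as pool:
    # chunksize=1 so that every input is its own job.
    result = pool.map_async(task, [0, 1, 2, 3], chunksize=1,
                            error_callback=callback_calls.append)
    try:
        result.get(timeout=30)
        raised = False
    except ValueError:
        raised = True
    # Snapshot taken as soon as get() returns control to the caller.
    finished_at_return = sorted(finished)

print(raised, finished_at_return, len(callback_calls))
```

With the non-fail-fast behavior, get() should raise only after tasks 1 to 3 have finished, and error_callback should have been called a single time with the first exception.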
msg250537	Author: Davin Potts (davin) (Python committer)	Date: 2015-09-12 21:56
As an aside: issue24948 seems to show there are others who would find the immediate-multiple-error_callback idea attractive.
msg260055	Author: Roundup Robot (python-dev) (Python triager)	Date: 2016-02-10 22:58
New changeset 1ba0deb52223 by Charles-François Natali in branch 'default': Issue #23992: multiprocessing: make MapResult not fail-fast upon exception. https://hg.python.org/cpython/rev/1ba0deb52223

History
Date	User	Action	Args
2022-04-11 14:58:15	admin	set	github: 68180
2016-02-12 22:56:42	neologix	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2016-02-10 22:58:42	python-dev	set	nosy: + python-dev messages: + msg260055
2015-09-12 21:56:16	davin	set	messages: + msg250537
2015-09-12 16:05:48	davin	set	stage: needs patch -> patch review
2015-09-12 16:05:40	davin	set	messages: + msg250523
2015-09-07 23:12:55	davin	set	messages: + msg250138
2015-07-01 19:23:49	neologix	set	messages: + msg246045
2015-06-13 15:11:24	neologix	set	keywords: + needs review nosy: + vstinner
2015-04-22 18:50:42	neologix	set	files: + mp_map_fail_fast_27.diff, mp_map_fail_fast_default.diff
2015-04-22 18:43:12	neologix	set	files: - mp_map_fail_fast_default.diff
2015-04-22 18:43:03	neologix	set	files: - mp_map_fail_fast_27.diff
2015-04-22 18:40:34	neologix	set	files: + mp_map_fail_fast_27.diff, mp_map_fail_fast_default.diff keywords: + patch messages: + msg241823
2015-04-18 20:58:00	davin	set	messages: + msg241459 stage: needs patch
2015-04-18 19:11:08	ned.deily	set	nosy: + sbt, davin
2015-04-18 09:00:21	neologix	create