Created on 2014-09-11 22:33 by dan.oreilly, last changed 2022-04-11 14:58 by admin.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| multiproc_broken_pool.diff | dan.oreilly, 2014-09-11 22:33 | Abort running task and close down a pool if a worker is unexpectedly terminated. |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 10441 | open | oesteban, 2018-11-09 20:51 |
| PR 16103 | open | davin, 2019-09-13 13:38 |
| Messages (11) | |||
|---|---|---|---|
| msg226805 | Author: Dan O'Reilly (dan.oreilly) * | Date: 2014-09-11 22:33 | |
This is essentially a dupe of issue9205, but it was suggested I open a new issue, since that one ended up being used to fix this same problem in concurrent.futures, and was subsequently closed.

Right now, should a worker process in a Pool unexpectedly get terminated while a blocking Pool method is running (e.g. apply, map), the method will hang forever. This isn't a normal occurrence, but it does occasionally happen (either because someone sends a SIGTERM, or because of a bug in the interpreter or a C-extension). It would be preferable for multiprocessing to follow the lead of concurrent.futures.ProcessPoolExecutor when this happens, and abort all running tasks and close down the Pool.

Attached is a patch that implements this behavior. Should a process in a Pool unexpectedly exit (meaning, *not* because of hitting the maxtasksperchild limit), the Pool will be closed/terminated and all cached/running tasks will raise a BrokenProcessPool exception. These changes also prevent the Pool from going into a bad state if the "initializer" function raises an exception (previously, the pool would end up infinitely starting new processes, which would immediately die because of the exception).

One concern with the patch: the way timings are altered with these changes, the Pool seems to be particularly susceptible to issue6721 in certain cases. If processes in the Pool are being restarted due to maxtasksperchild just as the worker is being closed or joined, there is a chance the worker will be forked while some of the debug logging inside of Pool is running (and holding locks on either sys.stdout or sys.stderr). When this happens, the worker deadlocks on startup, which will hang the whole program. I believe the current implementation is susceptible to this as well, but I could reproduce it much more consistently with this patch. I think it's rare enough in practice that it shouldn't prevent the patch from being accepted, but thought I should point it out.

(I do think issue6721 should be addressed, or at the very least internal I/O locks should always be reset after forking.) |
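A minimal sketch of the hang described in this message, bounded by a timeout so it terminates. It assumes a POSIX platform (the "fork" start method is requested explicitly) and an interpreter in which this issue is still unfixed. The helper name `call_with_deadline` is hypothetical, and `os._exit` merely stands in for a worker being killed from outside.

```python
import multiprocessing
import os


def call_with_deadline(timeout=2.0):
    """Run a task that kills its own worker and report what the parent sees."""
    # Assumption: POSIX platform, so the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        # os._exit() ends the worker abruptly, without sending any result back.
        pending = pool.apply_async(os._exit, (1,))
        try:
            pending.get(timeout=timeout)
        except multiprocessing.TimeoutError:
            # The result never arrives; without a timeout this call would block forever.
            return "timed out"
    return "completed"


if __name__ == "__main__":
    print(call_with_deadline())
```

With the blocking variants (`apply`, `map`) there is no timeout to fall back on, which is why the caller hangs.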
|||
| msg294968 | Author: Francis Bolduc (Francis Bolduc) | Date: 2017-06-01 20:53 | |
This problem also happens simply by calling sys.exit from one of the child processes. The following script exhibits the problem:

    import multiprocessing
    import sys

    def test(value):
        if value:
            sys.exit(123)

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)
        cases = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
        pool.map(test, cases)
|
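One plausible reason sys.exit takes the worker down instead of being reported as a task failure: SystemExit derives from BaseException, so a handler written as `except Exception` does not intercept it. The sketch below assumes the worker guards the task call with such a handler; that is an assumption about Pool internals, not something stated in this thread, and `run_task` and `exits` are hypothetical names.

```python
import sys


def run_task(func, *args):
    """Hypothetical stand-in for a worker loop body guarded by `except Exception`."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        # Ordinary errors are captured and could be sent back to the parent.
        return ("error", exc)


def exits(value):
    """Mirror of the test() function in the script above."""
    if value:
        sys.exit(123)
    return value


if __name__ == "__main__":
    print(run_task(exits, 0))
    try:
        run_task(exits, 1)
    except SystemExit as exc:
        # SystemExit bypasses the `except Exception` handler entirely.
        print("escaped with exit code", exc.code)
```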
|||
| msg315684 | Author: Oscar Esteban (oesteban) * | Date: 2018-04-24 02:18 | |
We use multiprocessing to parallelize many memory-hungry tasks that either run Python code or call subprocess.run. At times the OOM Killer kicks in. When one of the workers is killed, the queue never returns an error for the task being run by the worker. Are there any plans to merge the patch proposed in this issue? |
|||
| msg315687 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2018-04-24 05:47 | |
Oscar, the patch posted here needs updating for the latest git master. If you want to avoid this issue, you can also use concurrent.futures where the issue is fixed. |
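A minimal sketch of the concurrent.futures workaround mentioned here, assuming a POSIX platform (the "fork" start method is requested explicitly). The helper name `run_and_report` is hypothetical, and `os._exit` again stands in for a worker killed by a signal or the OOM Killer.

```python
import concurrent.futures
import multiprocessing
import os
from concurrent.futures.process import BrokenProcessPool


def run_and_report():
    """Submit a task that kills its worker and report how the executor reacts."""
    # Assumption: POSIX platform, so the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    with concurrent.futures.ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        future = executor.submit(os._exit, 1)
        try:
            # The timeout is only a safety net; the error normally surfaces promptly.
            future.result(timeout=30)
        except BrokenProcessPool:
            # The executor notices the dead worker and fails the pending future.
            return "broken pool detected"
    return "completed"


if __name__ == "__main__":
    print(run_and_report())
```

Once the executor is marked broken it rejects further submissions, so a caller that wants to continue has to create a new executor.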
|||
| msg329381 | Author: Oscar Esteban (oesteban) * | Date: 2018-11-06 20:02 | |
Hi Antoine, I may take a stab at it. Before I start, should I branch from master or from 3.7.1 (as 3.7 is still accepting bugfixes)? Best, Oscar |
|||
| msg329383 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2018-11-06 20:28 | |
You should start from master. Bugfixes can be backported afterwards if appropriate. Thanks! |
|||
| msg329770 | Author: Oscar Esteban (oesteban) * | Date: 2018-11-12 22:57 | |
I tried to reuse as much as I could from the patch, but it didn't solve the issue at first.

I have changed which thread is responsible for identifying a killed worker and prescribing a solution. In the proposed patch, the thread handling results (i.e. tasks queued by one worker as done) was responsible. In the PR, the responsibility is reassigned to the thread handling workers (since, basically, one or more workers suddenly die).

The patch defined a new BROKEN state that was assigned to the results handler thread. I transferred this behavior to the worker handler thread. I'm guessing, though, that the BROKEN state should be assigned to the Pool object instead, to be fully semantic. That would require passing the reference to the object around and would unnecessarily complicate the implementation. Happy to reconsider though.

I added three tests: one that was present with the patch, a variation of it adding some wait before killing the worker, and the one that Francis Bolduc posted here (https://bugs.python.org/issue22393#msg294968).

Please let me know whether any conversation about this bug should take place in GitHub, with the PR, instead of here. Thanks a lot for the guidance, Antoine. |
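The idea of letting the code that watches the workers decide when things are broken can be sketched as follows. This is a toy illustration, not the code from the patch or from PR 10441: the `supervise` and `demo` helpers and the RUN/BROKEN constants are hypothetical, treating any non-zero exit code as "unexpected" is an assumption of this sketch, and the "fork" start method assumes a POSIX platform.

```python
import multiprocessing
import os
import time

RUN, BROKEN = "RUN", "BROKEN"  # hypothetical state names for this sketch


def supervise(workers, poll_interval=0.05):
    """Poll worker exit codes; report BROKEN once any worker dies abnormally."""
    while True:
        exit_codes = [w.exitcode for w in workers]  # None while still running
        if any(code not in (None, 0) for code in exit_codes):
            return BROKEN
        if all(code == 0 for code in exit_codes):
            return RUN  # every worker finished cleanly
        time.sleep(poll_interval)


def demo():
    """Start one healthy worker and one that dies abruptly, then supervise both."""
    # Assumption: POSIX platform, so the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    workers = [
        ctx.Process(target=time.sleep, args=(0.2,)),  # exits normally
        ctx.Process(target=os._exit, args=(3,)),      # dies with exit code 3
    ]
    for worker in workers:
        worker.start()
    try:
        return supervise(workers)
    finally:
        for worker in workers:
            worker.join()


if __name__ == "__main__":
    print(demo())
```

A real pool would also have to distinguish workers retired on purpose (for example after reaching maxtasksperchild) from workers that died, which this sketch does not attempt.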
|||
| msg333895 | Author: Chris Markiewicz (cjmarkie) * | Date: 2019-01-17 19:39 | |
Just a bump to note that the PR (10441) is ready for another round of review. |
|||
| msg351754 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2019-09-10 22:50 | |
I just marked bpo-38084 as duplicate of this issue. I manually merged the nosy lists. |
|||
| msg390775 | Author: Marko (kormang) | Date: 2021-04-11 11:57 | |
I've created issue43805. I think it would be better to have a universal solution rather than specific ones, like in issue9205. I haven't checked the PRs, though. |
|||
| msg390780 | Author: Marko (kormang) | Date: 2021-04-11 12:18 | |
Somewhat related: issue43806, with asyncio.StreamReader. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:07 | admin | set | github: 66587 |
| 2021-10-18 03:11:08 | myles.steinhauser | set | nosy: + myles.steinhauser |
| 2021-08-30 08:39:06 | rkm | set | nosy: + rkm |
| 2021-08-18 14:02:24 | shnizzedy | set | nosy: + shnizzedy |
| 2021-04-12 09:57:23 | vstinner | set | nosy: - vstinner |
| 2021-04-11 12:18:48 | kormang | set | messages: + msg390780 |
| 2021-04-11 11:57:23 | kormang | set | nosy: + kormang; messages: + msg390775 |
| 2019-09-13 13:38:06 | davin | set | pull_requests: + pull_request15722 |
| 2019-09-10 22:50:04 | vstinner | set | nosy: + vstinner; messages: + msg351754 |
| 2019-09-10 22:49:35 | vstinner | link | issue38084 superseder |
| 2019-01-17 19:39:29 | cjmarkie | set | nosy: + cjmarkie; messages: + msg333895 |
| 2018-11-12 22:57:52 | oesteban | set | messages: + msg329770 |
| 2018-11-09 20:51:31 | oesteban | set | stage: needs patch -> patch review; pull_requests: + pull_request9713 |
| 2018-11-06 20:28:40 | pitrou | set | messages: + msg329383 |
| 2018-11-06 20:02:10 | oesteban | set | messages: + msg329381 |
| 2018-04-24 05:47:01 | pitrou | set | stage: needs patch; messages: + msg315687; versions: + Python 3.8, - Python 3.5 |
| 2018-04-24 02:18:15 | oesteban | set | nosy: + oesteban; messages: + msg315684 |
| 2017-06-01 20:53:39 | Francis Bolduc | set | nosy: + Francis Bolduc; messages: + msg294968 |
| 2015-12-27 17:11:24 | davin | link | issue25908 dependencies |
| 2015-10-11 17:21:58 | davin | set | nosy: + davin |
| 2015-09-16 17:01:10 | berker.peksag | link | issue24927 superseder |
| 2015-09-16 12:22:35 | brianboonstra | set | nosy: + brianboonstra |
| 2014-09-12 16:57:22 | cvrebert | set | nosy: + cvrebert |
| 2014-09-11 22:33:06 | dan.oreilly | create | |