This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2019-09-10 08:58 by steve.dower, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Files

| File name | Uploaded | Description |
|---|---|---|
| mp_exit.py | vstinner, 2019-09-10 15:30 | |

Messages (10)

msg351594 - Author: Steve Dower (steve.dower) * (Python committer) - Date: 2019-09-10 08:58

Imitation repro:

```python
import os
from multiprocessing import Pool

def f(x):
    os._exit(0)  # simulate a crash: the worker exits without returning a result
    return "success"

if __name__ == '__main__':
    with Pool(1) as p:
        print(p.map(f, [1]))  # hangs forever instead of reporting the crash
```

Obviously a process may crash for various other reasons besides os._exit(). I believe this is the cause of issue37245.

msg351690 - Author: STINNER Victor (vstinner) * (Python committer) - Date: 2019-09-10 14:52

> multiprocessing cannot recover from crashed worker

This issue has been seen on the macOS job of Azure Pipelines: bpo-37245. I don't know if other platforms are affected.

msg351691 - Author: Steve Dower (steve.dower) * (Python committer) - Date: 2019-09-10 14:54

Windows is definitely affected, and you can run the repro in my first post to check other platforms.

msg351702 - Author: STINNER Victor (vstinner) * (Python committer) - Date: 2019-09-10 15:30

I converted the example into the attached file mp_exit.py and added a call to faulthandler to see what is going on. Output with the master branch of Python:

```
vstinner@apu$ ~/python/master/python ~/mp_exit.py
Timeout (0:00:05)!
Thread 0x00007ff40139a700 (most recent call first):
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 379 in _recv
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 414 in _recv_bytes
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 250 in recv
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 576 in _handle_results
  File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
  File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
  File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007ff401b9b700 (most recent call first):
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 528 in _handle_tasks
  File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
  File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
  File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007ff40239c700 (most recent call first):
  File "/home/vstinner/python/master/Lib/selectors.py", line 415 in select
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 930 in wait
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 499 in _wait_for_updates
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 519 in _handle_workers
  File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
  File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
  File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007ff4102cf740 (most recent call first):
  File "/home/vstinner/python/master/Lib/threading.py", line 303 in wait
  File "/home/vstinner/python/master/Lib/threading.py", line 565 in wait
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 759 in wait
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 762 in get
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 364 in map
  File "/home/vstinner/mp_exit.py", line 12 in <module>
```

In the main process, the Pool._handle_results() thread is blocked in os.read(), which never completes, even though the child process died and so the other end of the pipe should be closed.

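A minimal POSIX-only sketch (it relies on os.fork()) of the general pipe rule behind this: a reader only sees end-of-file once every copy of the write end, in every process, has been closed. A dead child is therefore not enough if another process, possibly the parent itself, still holds the write end. Whether that is exactly what keeps Pool's result pipe open here is an assumption, not something the traceback shows; the function name eof_after_child_exit is hypothetical.

```python
import os
import select


def eof_after_child_exit():
    """Hypothetical demo: a dead writer alone does not make a pipe hit EOF.

    POSIX only (uses os.fork). Returns a tuple
    (read_would_block_while_parent_holds_writer, data_after_parent_closes_writer).
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: exit abruptly without writing anything, like a crashed worker.
        os._exit(0)
    # Parent: reap the child so it is definitely gone.
    os.waitpid(pid, 0)

    # The parent still holds its own copy of the write end, so the read end
    # is not readable and os.read(r, ...) would block forever.
    readable, _, _ = select.select([r], [], [], 0.2)
    would_block = not readable

    # After the last copy of the write end is closed, the reader sees EOF.
    os.close(w)
    readable, _, _ = select.select([r], [], [], 0.2)
    data = os.read(r, 1024) if readable else None
    os.close(r)
    return would_block, data


if __name__ == "__main__":
    print(eof_after_child_exit())  # (True, b'')
```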
msg351703 - Author: STINNER Victor (vstinner) * (Python committer) - Date: 2019-09-10 15:32

> Windows is definitely affected, and you can run the repro in my first post to check other platforms.

Oh right, I can also reproduce the issue on Linux. But I don't understand why test_multiprocessing_spawn passes on every platform except macOS on Azure Pipelines. Aaaaah, multiprocessing mysteries...

msg351705 - Author: Davin Potts (davin) * (Python committer) - Date: 2019-09-10 15:38

Sharing for the sake of documenting a few things going on in this particular example:

* When a PoolWorker process exits in this way (os._exit(anything)), the PoolWorker never gets the chance to send a signal of failure (normally sent via the outqueue) to the MainProcess.
* In the current logic of the MainProcess, Pool._maintain_pool() detects the termination of that PoolWorker process and starts a new PoolWorker process to replace it, maintaining the desired size of the Pool.
* The infinite hang observed in this example comes from the original p.map() call performing a wait with no timeout for a result to appear on the outqueue. This wait is performed in MapResult.get(), which does expose a timeout parameter, though it cannot be controlled through Pool.map().

It is not at all a correct, general solution, but exposing control over this timeout and setting it to 1.0 seconds lets Steve's repro snippet run to completion (no infinite hang; it raises a multiprocessing.context.TimeoutError).

msg351710 - Author: Davin Potts (davin) * (Python committer) - Date: 2019-09-10 15:50

Thanks to Pablo's good work implementing the use of multiprocessing's Process.sentinel, the logic for handling PoolWorkers that die has been centralized in Pool._maintain_pool(). If _maintain_pool() can also identify which job died with the dead PoolWorker, then it should be possible to put a corresponding message on the outqueue to indicate that an exception occurred, while the pool otherwise continues its work.

The question of whether Pool.map() should expose a timeout parameter deserves a separate discussion and should not be considered a path forward on this issue, as it would require users to always specify, and somehow know beforehand, how long it should take for results to be returned from workers. Exposing the timeout control may have practical benefits elsewhere, but not here.

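The proposal above (tie each job to the worker running it, so a dead worker becomes a reported error rather than an endless wait) can be illustrated outside Pool. This is not a patch to Pool._maintain_pool(); it is a hypothetical stand-alone helper, run_jobs, that starts one process per job, has every worker report on one shared pipe (roughly like Pool's outqueue), and uses each Process.sentinel with multiprocessing.connection.wait() to notice exits. The names run_jobs, WorkerDied and flaky are illustrative only, and the 'fork' start method assumes a POSIX platform.

```python
import multiprocessing as mp
import os
from multiprocessing.connection import wait


class WorkerDied(Exception):
    """Hypothetical error recorded for a job whose worker exited early."""


def _worker(index, func, arg, writer, lock):
    value = func(arg)
    # Serialize writes so messages from different workers never interleave.
    # (A worker dying while holding the lock would still block the others;
    # this sketch does not handle that case.)
    with lock:
        writer.send((index, value))


def run_jobs(func, args):
    """Run func(arg) for each arg in its own process; never hang on a crash.

    Returns a list holding each job's result, or a WorkerDied instance for a
    job whose process exited without reporting one.
    """
    ctx = mp.get_context("fork")  # assumption: POSIX platform such as Linux
    reader, writer = ctx.Pipe(duplex=False)  # shared result channel
    lock = ctx.Lock()
    pending = {}  # process sentinel -> (job index, process)
    for i, arg in enumerate(args):
        p = ctx.Process(target=_worker, args=(i, func, arg, writer, lock))
        p.start()
        pending[p.sentinel] = (i, p)
    writer.close()  # only the workers hold the write end now

    results = {}
    reader_open = True

    def drain():
        # Read every message already in the pipe; EOF means all workers exited.
        nonlocal reader_open
        try:
            while reader_open and reader.poll():
                index, value = reader.recv()
                results[index] = value
        except EOFError:
            reader_open = False

    while pending:
        ready = wait(list(pending) + ([reader] if reader_open else []))
        drain()  # collect results before judging any exited worker
        for obj in ready:
            if obj in pending:
                index, p = pending.pop(obj)
                p.join()
                if index not in results:
                    results[index] = WorkerDied(
                        f"job {index}: exit code {p.exitcode}")
    reader.close()
    return [results[i] for i in range(len(args))]


def flaky(x):
    # Example task: job 2 crashes the way os._exit() did in the repro.
    if x == 2:
        os._exit(3)
    return x * 10


if __name__ == "__main__":
    print(run_jobs(flaky, [1, 2, 3]))
```

Here the healthy jobs return 10 and 30, and the crashed one is reported as a WorkerDied instance instead of blocking the caller. One process per job keeps the bookkeeping trivial; a real pool reuses workers, so it would also have to track which task each worker was executing when it died.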
msg351746 - Author: (ppperry) - Date: 2019-09-10 22:24

Is this not a duplicate of issue22393?

msg351752 - Author: Davin Potts (davin) * (Python committer) - Date: 2019-09-10 22:45

Agreed with @ppperry that this is a duplicate of issue22393. The proposed patch in issue22393 is, for the moment, out of sync with more recent changes. That patch's approach would result in the loss of all partial results from a Pool.map() call, but it may be faster to update and review.

msg351753 - Author: STINNER Victor (vstinner) * (Python committer) - Date: 2019-09-10 22:49

> Agreed with @ppperry that this is a duplicate of issue22393.

Ok, in that case I'm closing this issue as a duplicate of bpo-22393. There is no need to duplicate the discussion here :-)

History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:59:20 | admin | set | github: 82265 |
| 2019-09-10 22:49:35 | vstinner | set | status: open -> closed; superseder: multiprocessing.Pool shouldn't hang forever if a worker process dies unexpectedly; messages: + msg351753; resolution: duplicate; stage: resolved |
| 2019-09-10 22:45:53 | davin | set | messages: + msg351752 |
| 2019-09-10 22:24:25 | ppperry | set | nosy: + ppperry; messages: + msg351746 |
| 2019-09-10 15:50:24 | davin | set | messages: + msg351710 |
| 2019-09-10 15:38:20 | davin | set | messages: + msg351705 |
| 2019-09-10 15:32:12 | vstinner | set | messages: + msg351703 |
| 2019-09-10 15:30:58 | vstinner | set | files: + mp_exit.py; messages: + msg351702 |
| 2019-09-10 14:54:30 | steve.dower | set | messages: + msg351691 |
| 2019-09-10 14:52:19 | vstinner | set | messages: + msg351690 |
| 2019-09-10 14:51:36 | vstinner | set | nosy: + vstinner, pablogsal |
| 2019-09-10 08:58:29 | steve.dower | create | |