This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2012年01月18日 18:00 by glaubich, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| subprocess_deadlock.py | glaubich, 2012年01月18日 18:00 | |||
| reinit_tls.diff | neologix, 2012年01月19日 14:45 | |||
| reinit_tls_test.diff | neologix, 2012年01月28日 14:40 | |||
| reinit_tls_27.diff | neologix, 2012年02月01日 22:12 | |||
| Messages (9) | |||
|---|---|---|---|
| msg151556 - (view) | Author: Christoph Glaubitz (glaubich) | Date: 2012年01月18日 18:00 | |
Starting several threads, each just starting a subprocess.Popen and use communicate cause a deadlock in subprocess. (see attached file) I can only reproduce this with python 2.7.2, not with any other versions. 2.6.5, 2.6.7 and 3.2.2 are working fine. Also 2.4.6 is working, with a problem fixed by http://bugs.python.org/issue1731717. The attached script does not dead lock all the time, but at least every second call. I am on a linux x86_64. The python's are compiled by hand, except of 2.6.5. |
|||
| msg151621 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2012年01月19日 13:36 | |
Here's the backtrace: """ #0 0x0000003bfb20c9b1 in sem_wait () from /lib64/libpthread.so.0 #1 0x000000000051a7c3 in PyThread_acquire_lock (lock=0x17db0750, waitflag=1) at Python/thread_pthread.h:321 #2 0x000000000051a9b4 in find_key (key=1, value=0x0) at Python/thread.c:268 #3 0x000000000051abdc in PyThread_get_key_value (key=1) at Python/thread.c:360 #4 0x00000000005025b1 in PyGILState_GetThisThreadState () at Python/pystate.c:598 #5 0x00000000005024f5 in _PyGILState_Reinit () at Python/pystate.c:547 #6 0x0000000000521fc7 in PyOS_AfterFork () at ./Modules/signalmodule.c:979 #7 0x000000000052461d in posix_fork (self=0x0, noargs=0x0) at ./Modules/posixmodule.c:3695 """ It's stuck in _PyGILState_Reinit(), when calling PyGILState_GetThisThreadState(). That's because in 2.7, TLS is emulated (see Python/thread.c), and it uses a global mutex. If this mutex is locked at the time of fork(), then the next call to TLS primitives (even PyGILState_GetThisThreadState()) will deadlock. Now, this particular bug is fixed in 2.7 since #13156, which backed-out _PyGILState_Reinit() because it was only relevant for native TLS implementations. The code is still present in 3.2 and and default, but this problem doesn't affect native TLS implementations. Just to be extra safe, we PyThread_ReInitTLS() - which resets this global mutex on emulated implementations, and is just a no-op on pthread and windows - should be moved earlier in PyOS_AfterFork(), to avoid this type of deadlock (I mean, PyGILState_GetThisThreadState() deadlock after fork() is bad). Patch attached. |
|||
| msg151623 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年01月19日 14:16 | |
I think you forgot to attach the patch :) |
|||
| msg151630 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2012年01月19日 14:45 | |
Here's the patch. It's probably possible to add a test for this, however I don't have access to my development machine, so I can't write it now. |
|||
| msg151683 - (view) | Author: Martijn van Oosterhout (Martijn.van.Oosterhout) | Date: 2012年01月20日 10:51 | |
What a concidence, we had this problem as well and just today tracked it down, and turns out someone filed this bug yesterday. I note the referenced bug (#13156) says that it's been fixed in the release branch but not yet released. I don't suppose there will be a release of 2.7 in the near future? |
|||
| msg152162 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2012年01月28日 14:40 | |
And here's the test. To sum up: - reinit_tls.diff makes sure to call PyThread_ReInitTLS() at the beginning of PyOS_AfterFork(), so that the TLS API is usable within PyOS_AfterFork() (e.g. in _after_fork()). It would be applied to 3.2 and default, even though they are not affected (at least on platform withs a native TLS implementation) - reinit_tls_test.diff would be applied to 2.7, 3.2 and default: it should not trigger a failure on any of the active branches (2.7, 3.2 and default), it's just to prevent potential future regressions. Notes regarding the test: - I replaced subprocess with bare fork(), since it's way cheaper than spawning an entire subprocess - I reduced the number of threads compared to Christoph's reproducer, because I feel uncomfortable forking up to 1024 processes in parallel (even though in practise they won't be that much, because child processes will terminate before the other threads call fork()) - even with 2.7.2, it won't deadlock 100%, since the race window is really short (but having a failure on one of the buildbots once would be enough to detect the problem, should it ever resurface) |
|||
| msg152445 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2012年02月01日 22:12 | |
I thought about this a bit more, and I realized that a slight variation of this bug also affects 2.7, and also older versions (i.e. before _PyGILState_Reinit() was introduced), because any code that gets called from PyOS_AfterFork() and uses the TLS API suffers from the same problem. I had the new test deadlock once, in theading._after_fork(): when trying to acquire an (unlocked) lock, it got stuck in Py_END_ALLOW_THREADS, which calls indirectly PyGILState_GetThisThreadState(), which calls find_key(). Although less likely, this bug has been present for a *long* time, it's surprising it had never been noticed before. So the fix - calling PyThread_ReInitTLS() right at the beginning of PyOS_AfterFork() - should also be applied to 2.7. I'll commit it tomorrow. |
|||
| msg152474 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012年02月02日 19:39 | |
New changeset c3649173d093 by Charles-François Natali in branch '2.7': Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix http://hg.python.org/cpython/rev/c3649173d093 New changeset 7b24dd587a7b by Charles-François Natali in branch '3.2': Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix http://hg.python.org/cpython/rev/7b24dd587a7b New changeset a0100852b6fe by Charles-François Natali in branch 'default': Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix http://hg.python.org/cpython/rev/a0100852b6fe |
|||
| msg152480 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2012年02月02日 21:56 | |
Committed. Christoph, thanks for the report. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:25 | admin | set | github: 58025 |
| 2012年02月15日 21:44:07 | doughellmann | set | nosy:
+ doughellmann |
| 2012年02月02日 21:56:43 | neologix | set | status: open -> closed resolution: fixed messages: + msg152480 stage: commit review -> resolved |
| 2012年02月02日 19:39:18 | python-dev | set | nosy:
+ python-dev messages: + msg152474 |
| 2012年02月02日 16:15:40 | Stephen.White | set | nosy:
+ Stephen.White |
| 2012年02月01日 22:12:07 | neologix | set | files:
+ reinit_tls_27.diff messages: + msg152445 |
| 2012年01月31日 17:35:07 | neologix | set | stage: patch review -> commit review |
| 2012年01月29日 13:00:08 | rosslagerwall | set | nosy:
+ rosslagerwall |
| 2012年01月28日 14:40:29 | neologix | set | files:
+ reinit_tls_test.diff messages: + msg152162 versions: + Python 2.7 |
| 2012年01月20日 10:51:04 | Martijn.van.Oosterhout | set | nosy:
+ Martijn.van.Oosterhout messages: + msg151683 |
| 2012年01月19日 14:45:39 | neologix | set | files:
+ reinit_tls.diff messages: + msg151630 |
| 2012年01月19日 14:16:31 | pitrou | set | nosy:
+ pitrou messages: + msg151623 |
| 2012年01月19日 13:36:59 | neologix | set | keywords:
+ patch, needs review stage: patch review messages: + msg151621 versions: + Python 3.2, Python 3.3, - Python 2.7 |
| 2012年01月19日 12:54:46 | pitrou | set | nosy:
+ neologix |
| 2012年01月18日 18:00:35 | glaubich | create | |