homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: deadlock in subprocess while running several threads using Popen
Type: behavior Stage: resolved
Components: Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Martijn.van.Oosterhout, Stephen.White, doughellmann, glaubich, neologix, pitrou, python-dev, rosslagerwall
Priority: normal Keywords: needs review, patch

Created on 2012年01月18日 18:00 by glaubich, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
subprocess_deadlock.py glaubich, 2012年01月18日 18:00
reinit_tls.diff neologix, 2012年01月19日 14:45
reinit_tls_test.diff neologix, 2012年01月28日 14:40
reinit_tls_27.diff neologix, 2012年02月01日 22:12
Messages (9)
msg151556 - (view) Author: Christoph Glaubitz (glaubich) Date: 2012年01月18日 18:00
Starting several threads, each just starting a subprocess.Popen and use communicate cause a deadlock in subprocess. (see attached file)
I can only reproduce this with python 2.7.2, not with any other versions. 2.6.5, 2.6.7 and 3.2.2 are working fine. Also 2.4.6 is working, with a problem fixed by http://bugs.python.org/issue1731717.
The attached script does not dead lock all the time, but at least every second call.
I am on a linux x86_64. The python's are compiled by hand, except of 2.6.5.
msg151621 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月19日 13:36
Here's the backtrace:
"""
#0 0x0000003bfb20c9b1 in sem_wait () from /lib64/libpthread.so.0
#1 0x000000000051a7c3 in PyThread_acquire_lock (lock=0x17db0750, waitflag=1)
 at Python/thread_pthread.h:321
#2 0x000000000051a9b4 in find_key (key=1, value=0x0) at Python/thread.c:268
#3 0x000000000051abdc in PyThread_get_key_value (key=1) at Python/thread.c:360
#4 0x00000000005025b1 in PyGILState_GetThisThreadState () at Python/pystate.c:598
#5 0x00000000005024f5 in _PyGILState_Reinit () at Python/pystate.c:547
#6 0x0000000000521fc7 in PyOS_AfterFork () at ./Modules/signalmodule.c:979
#7 0x000000000052461d in posix_fork (self=0x0, noargs=0x0) at ./Modules/posixmodule.c:3695
"""
It's stuck in _PyGILState_Reinit(), when calling PyGILState_GetThisThreadState().
That's because in 2.7, TLS is emulated (see Python/thread.c), and it uses a global mutex.
If this mutex is locked at the time of fork(), then the next call to TLS primitives (even PyGILState_GetThisThreadState()) will deadlock.
Now, this particular bug is fixed in 2.7 since #13156, which backed-out _PyGILState_Reinit() because it was only relevant for native TLS implementations.
The code is still present in 3.2 and and default, but this problem doesn't affect native TLS implementations.
Just to be extra safe, we PyThread_ReInitTLS() - which resets this global mutex on emulated implementations, and is just a no-op on pthread and windows - should be moved earlier in PyOS_AfterFork(), to avoid this type of deadlock (I mean, PyGILState_GetThisThreadState() deadlock after fork() is bad).
Patch attached.
msg151623 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年01月19日 14:16
I think you forgot to attach the patch :)
msg151630 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月19日 14:45
Here's the patch.
It's probably possible to add a test for this, however I don't have access to my development machine, so I can't write it now.
msg151683 - (view) Author: Martijn van Oosterhout (Martijn.van.Oosterhout) Date: 2012年01月20日 10:51
What a concidence, we had this problem as well and just today tracked it down, and turns out someone filed this bug yesterday.
I note the referenced bug (#13156) says that it's been fixed in the release branch but not yet released. I don't suppose there will be a release of 2.7 in the near future?
msg152162 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月28日 14:40
And here's the test.
To sum up:
- reinit_tls.diff makes sure to call PyThread_ReInitTLS() at the beginning of PyOS_AfterFork(), so that the TLS API is usable within PyOS_AfterFork() (e.g. in _after_fork()). It would be applied to 3.2 and default, even though they are not affected (at least on platform withs a native TLS implementation)
- reinit_tls_test.diff would be applied to 2.7, 3.2 and default: it should not trigger a failure on any of the active branches (2.7, 3.2 and default), it's just to prevent potential future regressions.
Notes regarding the test:
- I replaced subprocess with bare fork(), since it's way cheaper than spawning an entire subprocess
- I reduced the number of threads compared to Christoph's reproducer, because I feel uncomfortable forking up to 1024 processes in parallel (even though in practise they won't be that much, because child processes will terminate before the other threads call fork())
- even with 2.7.2, it won't deadlock 100%, since the race window is really short (but having a failure on one of the buildbots once would be enough to detect the problem, should it ever resurface)
msg152445 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年02月01日 22:12
I thought about this a bit more, and I realized that a slight variation
of this bug also affects 2.7, and also older versions (i.e. before
_PyGILState_Reinit() was introduced), because any code that gets called
from PyOS_AfterFork() and uses the TLS API suffers from the same
problem.
I had the new test deadlock once, in theading._after_fork(): when trying
to acquire an (unlocked) lock, it got stuck in Py_END_ALLOW_THREADS,
which calls indirectly PyGILState_GetThisThreadState(), which calls
find_key().
Although less likely, this bug has been present for a *long* time, it's
surprising it had never been noticed before.
So the fix - calling PyThread_ReInitTLS() right at the beginning of PyOS_AfterFork() - should also be applied to 2.7.
I'll commit it tomorrow.
msg152474 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年02月02日 19:39
New changeset c3649173d093 by Charles-François Natali in branch '2.7':
Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix
http://hg.python.org/cpython/rev/c3649173d093
New changeset 7b24dd587a7b by Charles-François Natali in branch '3.2':
Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix
http://hg.python.org/cpython/rev/7b24dd587a7b
New changeset a0100852b6fe by Charles-François Natali in branch 'default':
Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix
http://hg.python.org/cpython/rev/a0100852b6fe 
msg152480 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年02月02日 21:56
Committed.
Christoph, thanks for the report.
History
Date User Action Args
2022年04月11日 14:57:25adminsetgithub: 58025
2012年02月15日 21:44:07doughellmannsetnosy: + doughellmann
2012年02月02日 21:56:43neologixsetstatus: open -> closed
resolution: fixed
messages: + msg152480

stage: commit review -> resolved
2012年02月02日 19:39:18python-devsetnosy: + python-dev
messages: + msg152474
2012年02月02日 16:15:40Stephen.Whitesetnosy: + Stephen.White
2012年02月01日 22:12:07neologixsetfiles: + reinit_tls_27.diff

messages: + msg152445
2012年01月31日 17:35:07neologixsetstage: patch review -> commit review
2012年01月29日 13:00:08rosslagerwallsetnosy: + rosslagerwall
2012年01月28日 14:40:29neologixsetfiles: + reinit_tls_test.diff

messages: + msg152162
versions: + Python 2.7
2012年01月20日 10:51:04Martijn.van.Oosterhoutsetnosy: + Martijn.van.Oosterhout
messages: + msg151683
2012年01月19日 14:45:39neologixsetfiles: + reinit_tls.diff

messages: + msg151630
2012年01月19日 14:16:31pitrousetnosy: + pitrou
messages: + msg151623
2012年01月19日 13:36:59neologixsetkeywords: + patch, needs review

stage: patch review
messages: + msg151621
versions: + Python 3.2, Python 3.3, - Python 2.7
2012年01月19日 12:54:46pitrousetnosy: + neologix
2012年01月18日 18:00:35glaubichcreate

AltStyle によって変換されたページ (->オリジナル) /