homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Multiprocess Process does not always exit when run from a thread.
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.1, Python 3.2
process
Status: closed Resolution:
Dependencies: Superseder: Locks in the standard library should be sanitized on fork
View: 6721
Assigned To: jnoller Nosy List: asksol, avian, jnoller, neologix, pajs@fodder.org.uk, pitrou
Priority: normal Keywords:

Created on 2009年10月14日 12:14 by pajs@fodder.org.uk, last changed 2022年04月11日 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
testing.py pajs@fodder.org.uk, 2009年10月14日 12:14
dtrace.txt pajs@fodder.org.uk, 2009年10月14日 14:17
testing.py pajs@fodder.org.uk, 2009年10月14日 15:47
results1.txt pajs@fodder.org.uk, 2009年10月15日 10:00
testing.py avian, 2011年06月25日 14:17 Cleaned up test case
Messages (14)
msg93978 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 12:14
I have an example code that works fine on Python 2.6.3, but when run in
Python 3.1.1 - after a very short period of time, will go wrong.
Summary:
We have a queue, and when the queue has something in it (a list), we
start a thread to deal with that entry.
The thread then runs a Process for every entry in the list we popped off
the queue. The Process target takes a Pipe, and it sends "True" back
down the pipe, and then exits. However, this Process never goes defunct,
and thus the Thread that started the Process never reaps it. When doing
a truss on the Process, it sits there in:
lwp_park(0x00000000, 0) (sleeping...)
msg93981 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 12:51
Further information: it doesn't fail everytime in Python 3.1 - usually 1
in 4, or 1 in 5 times. It never fails with Python 2.6.3
Example output from the script when its failing (python 3.1):
Starting data1
Starting data2
Started subproc: PID: 20209 : args: data1
poll
Started subproc: PID: 20210 : args: data2
poll
Child has sent:True
poll
Child has sent:True
poll
isalive: False
isalive: True
Reaping PID: 20210
isalive: True
poll
Finishing: data2
isalive: True
poll
isalive: True
poll
isalive: True
poll
isalive: True
poll
<repeat isalive and poll lines until>:
isalive: True
poll
isalive: True
terminating
Child had to be killed due to a timeout
Finishing: data1
True
Working output (consistantly works on Python 2.6.3):
Starting data1
Starting data2
Started subproc: PID: 20252 : args: data1
poll
Started subproc: PID: 20253 : args: data2
poll
Child has sent:True
Child has sent:True
poll
Child has sent:True
poll
isalive: False
isalive: False
isalive: False
Reaping PID: 20252
Reaping PID: 20253
Finishing: data1
Finishing: data2
True
msg93987 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009年10月14日 14:03
Acccording to http://blogs.sun.com/chrisg/entry/lwp_park_and_lwp_unpark,
the lwp_park() call could point to a mutex which is waiting to be acquired.
msg93988 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 14:17
Well, if it helps, here is the output of the dtrace script from starting
of a loop with the failure, and stopping during the failure.
msg93990 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009年10月14日 14:19
Does it not happen if you call your checkAlive() function directly from
your main() function?
msg93995 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009年10月14日 15:12
As for the dtrace output: I'm not a Solaris expert unfortunately, I was
just trying to suggest a possible direction for diagnosing this problem.
msg93996 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 15:18
If you mean, in main() instead of doing:
while True:
 q.put(["data1", "data2"])
 t = Process(target=popJobs, args=(q, ))
 t.start()
 t.join()
and doing:
while True:
 q.put(["data1", "data2"])
 popJobs(q)
instead. Then, the bug does indeed occur the same way. It did take more
iterations before it occured however. 
If however, you meant:
while True:
 fail=failureObject()
 for x in ["data1", "data2"]:
 checkAlive(fail, x)
 print(fail.status())
Then the bug never occurs.
msg93997 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 15:21
Same thing occurs, as you would expect, when I do:
while True:
 fail=failureObject()
 tlist = []
 for x in ["data1", "data2"]:
 t = threading.Thread(target=checkAlive, args = (fail, x), name=x)
 t.start()
 tlist.append(t)
 for x in tlist:
 t.join()
msg93998 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 15:47
Further oddness:
When running the script (i've reduced its size further now, see attached):
I get the following output:
Reaping PID: 23215
True
Started subproc: PID: 23216 : args: data1
Started subproc: PID: 23216 : args: data1
Started subproc: PID: 23217 : args: data2
poll
isalive: True
isalive: True
poll
Notice the Started subprod of the SAME PID and the Same args twice, yet
this print only occurs once in the code, and I can't see how this should
happen?
msg94000 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009年10月14日 15:54
> Notice the Started subprod of the SAME PID and the Same args twice, yet
> this print only occurs once in the code, and I can't see how this should
> happen?
This is a thread-safety issue in sys.stdout/stderr, it will be fixed in
3.1.2 (see issue6750).
msg94002 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月14日 16:04
Sorry for the spam on the updates :) - but, its the same thread printing
this out too. I changed the print line to:
curThread = threading.current_thread()
print("Started subproc: PID: %d : args: %s Thread ID: %s" %(newJob.pid,
str(args), str(curThread.ident)))
and got the output:
Reaping PID: 23707
True
Started subproc: PID: 23709 : args: data1 Thread ID: 12
Started subproc: PID: 23709 : args: data1 Thread ID: 12
poll
isalive: True
Started subproc: PID: 23710 : args: data2 Thread ID: 13
poll
isalive: True
Again, I can't see how this could pring it out twice? Could this be
related to the problem we are having? Altgough I do seem to get the line
twice on 2.6.3 as well - but nothing like as often..
msg94081 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009年10月15日 10:00
Tested on Linux, and repeated the issue - just to confirm this isn't a
Solaris specific issue.
Tested using dummy_threading - bug does not appear when that is used.
Added _verbose=True to threads, and log_to_stderr(SUBDEBUG) set for Process.
Attached is the output of the script with those set. 
Other odd things to notice, and some point, more than 2 children
sometimes exist.. 
isalive PID: 7646 : True
isalive PID: 7652 : True
isalive PID: 7653 : True
isalive PID: 7646 : True
isalive PID: 7652 : True
isalive PID: 7653 : True
isalive PID: 7646 : True
isalive PID: 7653 : True
The code starts 2 threads, which start 2 processes. Both threads are
joined before new ones are started.. So, in theroy, only 2 should be
able to exist at once.. ?
msg139068 - (view) Author: Tomaž Šolc (avian) Date: 2011年06月25日 14:17
I am pretty sure this is another instance of issue 6721.
multiprocessing is forking a new process and if that happens while the other thread is holding the lock for stdout, the process will deadlock because the lock state is replicated in the subprocess while the thread that is supposed unlock it isn't (even if you remove all prints, the deadlock will happen on a stdout flush).
Attaching a gdb to a hung subprocess confirms that it is waiting for a lock in the bufferedio module.
I've reproduced this on Python trunk running on Linux (using pthreads) and can confirm that the patch attached to issue 6721 fixes it.
I'm also attaching a shorter test case that omits unused code from the original example.
msg139679 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011年07月03日 09:41
> I am pretty sure this is another instance of issue 6721.
Indeed.
Closing as duplicate.
History
Date User Action Args
2022年04月11日 14:56:53adminsetgithub: 51372
2011年07月03日 09:41:58neologixsetstatus: open -> closed
superseder: Locks in the standard library should be sanitized on fork
messages: + msg139679
2011年06月30日 18:37:01pitrousetnosy: + neologix
2011年06月25日 14:17:17aviansetfiles: + testing.py
nosy: + avian
messages: + msg139068

2010年08月27日 14:11:51asksolsetnosy: + asksol
2009年10月15日 10:00:18pajs@fodder.org.uksetfiles: + results1.txt

messages: + msg94081
2009年10月14日 16:04:27pajs@fodder.org.uksetmessages: + msg94002
2009年10月14日 15:54:44pitrousetmessages: + msg94000
2009年10月14日 15:47:23pajs@fodder.org.uksetfiles: + testing.py

messages: + msg93998
2009年10月14日 15:21:23pajs@fodder.org.uksetmessages: + msg93997
2009年10月14日 15:18:26pajs@fodder.org.uksetmessages: + msg93996
2009年10月14日 15:12:41pitrousetmessages: + msg93995
2009年10月14日 14:19:56pitrousetmessages: + msg93990
2009年10月14日 14:17:17pajs@fodder.org.uksetfiles: + dtrace.txt

messages: + msg93988
2009年10月14日 14:03:42pitrousetnosy: + pitrou
messages: + msg93987
2009年10月14日 13:50:10pitrousetnosy: + jnoller
versions: + Python 3.2
priority: normal
assignee: jnoller
components: + Library (Lib)
type: behavior
2009年10月14日 12:51:19pajs@fodder.org.uksetmessages: + msg93981
2009年10月14日 12:14:51pajs@fodder.org.ukcreate

AltStyle によって変換されたページ (->オリジナル) /