homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: test_asyncio hanging for 1 hour (AIX version, hangs in test_subprocess_interactive)
Type: behavior Stage: resolved
Components: Tests Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: David.Edelsohn, db3l, flox, gvanrossum, larry, ncoghlan, neologix, pitrou, python-dev, sbt, skrah, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2013年10月19日 04:20 by David.Edelsohn, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue19293.patch David.Edelsohn, 2013年10月19日 15:40 Skip some signal tests in test_events.py review
asyncio_signal.diff neologix, 2013年10月19日 19:53 review
sockpair.diff gvanrossum, 2013年10月21日 17:08
issue19293.patch2 David.Edelsohn, 2013年10月22日 02:47 review
issue19293.patch3 David.Edelsohn, 2013年10月22日 03:55 review
aixfix2.diff gvanrossum, 2013年10月22日 04:27 review
Messages (72)
msg200364 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 04:20
test_asyncio times out after one hour on AIX and leaves a process consuming 100% of a thread.
[145/382/3] test_asyncio
Timeout (1:00:00)!
Thread 0x00000001:
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/selectors.py", line 265 in select
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 576 in _run_once
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 153 in run_forever
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 172 in run_until_complete
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/test_asyncio/test_events.py", line 1012 in test_subprocess_interactive
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/case.py", line 571 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/case.py", line 610 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/runner.py", line 168 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/support/__init__.py", line 1661 in _run_suite
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/support/__init__.py", line 1695 in run_unittest
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/test_asyncio/__init__.py", line 26 in test_main
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 1276 in runtest_inner
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 965 in runtest
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 761 in main
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/__main__.py", line 3 in <module>
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/runpy.py", line 73 in _run_code
 File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/runpy.py", line 160 in _run_module_as_main
make: 1254-004 The error code from the last command is 1.
msg200372 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 05:07
Is that traceback upside down compared to normal Python tracebacks?
msg200374 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 05:12
Assuming it is, it is hanging in the poll() syscall, here:
http://hg.python.org/cpython/file/f3a2dab1623b/Lib/selectors.py#265
Could you see if it works better with select()? Comment out lines 402, 403 in that file.
msg200426 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 15:40
Commenting out poll in selectors.py does not help.
AIX has its own epoll/kqueue efficient I/O event polling feature "pollset", which I will open as a separate enhancement request issue.
Some of the subprocess tests in test_asyncio/test_events.py related to signals are hanging on AIX: test_subprocess_interactive, test_subprocess_kill and test_subprocess_send_signal all hang. This probably is related to Issue #18238 and the tests skipped in test_signal.py.
The attached patch skips those tests on AIX.
msg200433 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013年10月19日 16:10
New changeset 41a841fac7fb by Guido van Rossum in branch 'default':
Disable some subprocess tests that hang on AIX.
http://hg.python.org/cpython/rev/41a841fac7fb 
msg200435 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 16:18
Thanks for the investigation and the patch.
As to the deeper issue, I'm not sure I understand what's going on -- is this a bug in Python or in AIX? Is there a better work-around than disabling the test? The issue 18238 that you mention seems to be closed by disabling tests too, but that seems unsatisfactory.
Is the real issue that select() and poll() on AIX cannot be interrupted by signals? That sounds really sad.
msg200437 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月19日 16:25
Does AIX have an equivalent to Linux's strace?
If yes, having a sample output would be helpful.
Otherwise, maybe just adding manual debug traces to PollSelector.select() would help to see which events are reported.
msg200438 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013年10月19日 16:26
I think the failure is not only on AIX. The FreeBSD bots had similar
failures, and here is one on PPC Linux:
http://buildbot.python.org/all/builders/PPC64%20PowerLinux%203.x/builds/881/steps/test/logs/stdio
Timeout (1:00:00)!
Thread 0x0000008074e46670:
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/selectors.py", line 313 in select
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 576 in _run_once
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 153 in run_forever
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 172 in run_until_complete
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/test_asyncio/test_events.py", line 1078 in test_subprocess_close_after_finish
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/case.py", line 571 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/case.py", line 610 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/runner.py", line 168 in run
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/support/__init__.py", line 1661 in _run_suite
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/support/__init__.py", line 1695 in run_unittest
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/test_asyncio/__init__.py", line 26 in test_main
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 1276 in runtest_inner
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 965 in runtest
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 761 in main
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/__main__.py", line 3 in <module>
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/runpy.py", line 73 in _run_code
 File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/runpy.py", line 160 in _run_module_as_main
make: *** [buildbottest] Error 1
msg200440 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 16:40
[Off-topic: Heck, another one of those upside-down tracebacks. WHY???]
I'll need a lot of help from folks on the affected platforms in debugging this, since these tests work fine when I run them on the platforms to which I have access (Ubuntu and OS X).
I would also like to point out http://code.google.com/p/tulip/issues/detail?id=67 (and the thread referenced in my last message there) as well as http://code.google.com/p/tulip/issues/detail?id=68. Perhaps there are clues there.
I hope that we'll be able to sort this out before the beta release.
msg200449 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 17:36
My understanding of the issue is POSIX and The Open Group specification are ambiguous about the handling of Unix socket API calls when interrupted by a signal. Linux implements a more liberal, accommodating behavior and the Python Async I/O implementation is assuming those semantics. AIX inherited a lot from OSF, Mac OS X/Darwin inherited a lot from FreeBSD, both of which adopted more "unforgiving behavior".
The following web page seems to summarize the issue:
http://www.madore.org/~david/computers/connect-intr.html
I don't know if the description still is accurate for BSDs, but AIX likely implemented more pedantic, unforgiving semantics.
The standard seems to allow the behavior seen in AIX and FreeBSD, regardless of whether it is less useful / pleasant. From a pedantic standpoint, I think this is a bug in the Python implementation of Async I/O because it assumes behavior (implemented by Linux) not required by an implementation that conforms to the POSIX and Open Group specifications.
msg200452 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 17:47
Also, the Python buildbot for AIX is running on a POWER AIX 7.1 system at OSU OSL that is a member of the "GCC Compile Farm" cluster on which FOSS developers can obtain free accounts. If you want access to an AIX system for testing, it's available.
msg200456 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月19日 18:03
Wait, I think there's a misunderstanding here.
POSIX does indeed leave undefined the behavior of syscalls when a
signal is received: whether or not it fails with EINTR depends on the
operating system, the SA_RESTART flag, etc.
But the tests we're talking about send signals whose default actions
are to terminate the process (e.g. SIGTERM, SIGHUP, etc): I cannot
beleive AIX doesn't terminate the processes in these situations, I
think it's a bug in asyncio which prevent the termination from being
detected (David, you can just use "ps" to check that the processes are
indeed terminated).
msg200461 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 18:09
Check that which processes are terminated? The test eventually timed out after 1 hour and the process running echo.py continued to run.
msg200463 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 18:11
Actually, I think we're talking about the actual select() syscal not
returning when SIGCHLD is received and handled. Running the C-level handler
isn't enough, the syscall must return with EINTR so the Python handler can
run (which then writes a byte to the self-pipe to ensure select() returns,
but that seems too late). Maybe twiddling with SA_RESTART can fix this?
msg200466 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 18:17
poll() vs select() does not make a difference. Using the default poll(), if I run python in GDB, test_process_interactive hangs in poll(). The top of the traceback looks like:
#0 0xd02389d4 in __fd_poll () from /usr/lib/libc.a(shr.o)
#1 0xd717e604 in poll (__listptr=0x2023e150, __nfds=1, __timeout=-1)
 at /usr/include/sys/poll.h:107
#2 0xd71800f8 in poll_poll (self=0x30696d10, args=0x304723b4)
 at /home/dje/src/cpython/Modules/selectmodule.c:543
msg200470 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月19日 18:24
> Guido van Rossum added the comment:
>
> Actually, I think we're talking about the actual select() syscal not
> returning when SIGCHLD is received and handled. Running the C-level handler
> isn't enough, the syscall must return with EINTR so the Python handler can
> run (which then writes a byte to the self-pipe to ensure select() returns,
> but that seems too late). Maybe twiddling with SA_RESTART can fix this?
Hum, I just check asyncio's code, and it's not doing what I thought it
was doing.
SA_RESTART won't help: PyOS_setsig() already doesn't set it, and also
calls siginterrupt() to ensure the syscall will be interrupted, but as
can been seen here, that's not portable.
The correct way to do this, IMO, would be to simply change
BaseSelectorEventLoop._read_from_self() to read the signal number from
the wakeup FD (Python's signal handler writes the signal number to
this FD): then, we could call directly call the relevant signal
handler. This way, this would succeed even if select()/poll()/whatever
doesn't fail with EINTR (it'll also be a bit simpler/faster).
I thought that's how it was done in tulip!
msg200475 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013年10月19日 18:44
> [Off-topic: Heck, another one of those upside-down tracebacks. WHY???]
These are tracebacks emitted by the faulthandler module. They don't use the normal traceback printing machinery because it is impossible to do so in the conditions where those tracebacks are printed (e.g. from a signal handler).
msg200487 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月19日 19:53
Here's a patch changing asyncio to not rely on a Python-level signal handler at all: it reads the signal number from the self-pipe.
It should therefore work even if the select()/poll()/etc syscall doesn't fail with EINTR, and incidentally require only one wakup per signal.
Also, it gets rid of BaseEventLoop._add_callback_signalsafe().
I just had to update a test which was timing-dependent (basically, before when a signal was received you had to add the handler to the ready list and issue a self-write to wakeup the main loop, now the handler is added to the ready list and the loop is automatically woken up).
This should fix the PPC Linux and FreeBSD failures.
I'm not so sure about AIX, since the error looks different.
msg200494 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月19日 20:35
The patch cleans up a lot of the other failures for asyncio on AIX, but does not fix the hang.
msg200505 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月19日 22:09
> The patch cleans up a lot of the other failures for asyncio on AIX
Great.
> but does not fix the hang.
I was expecting this :-)
Note that the hang is really strange, especially since the first
run_until_complete() succeeds.
Without more information, it'll be really hard to debug.
If you don't have an strace equivalent, then the only option is to add
ad-hoc debug statements to see:
- if the data gets written to the pipe
- if data is read from the pipe
There's something strange in the stack you posted above: there seems
to be a single FD in the poll() set, whereas there should be at least
3 (stdin, stdout and stderr).
msg200510 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 22:50
I guess it's a shame that the two issues got merged, since now it looks like they may have to be unmerged again. I sent a few comments to neologix' patch that should be easy to address; for AIX I'm afraid we're still looking for more info from David.
msg200513 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013年10月19日 23:13
The exact same hanging seems to happen under Ubuntu 8.04 LTS:
http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8803/steps/test/logs/stdio
[176/383] test_asyncio
Timeout (1:00:00)!
Thread 0x404e48c0:
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/selectors.py", line 313 in select
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 576 in _run_once
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 153 in run_forever
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 172 in run_until_complete
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_asyncio/test_events.py", line 1081 in test_subprocess_close_after_finish
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 571 in run
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 610 in __call__
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 1661 in _run_suite
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 1695 in run_unittest
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_asyncio/__init__.py", line 31 in test_main
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 1276 in runtest_inner
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 965 in runtest
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 761 in main
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/__main__.py", line 3 in <module>
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 73 in _run_code
 File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 160 in _run_module_as_main
msg200514 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013年10月19日 23:14
There's also a test that hangs for 1 hour on the Ubuntu x86 buildbot.
Sample failure:
http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8803/ 
msg200516 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 23:26
If CF's patch fixes the Ubuntu hang, commit it! We can iterate on my nits post-alpha.
msg200518 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月19日 23:34
> Guido van Rossum added the comment:
>
> Actually, I think we're talking about the actual select() syscal not
> returning when SIGCHLD is received and handled. Running the C-level handler
> isn't enough, the syscall must return with EINTR so the Python handler can
> run (which then writes a byte to the self-pipe to ensure select() returns,
> but that seems too late).
I just realized the following: there's no need for EINTR to break out
of select(), that's what the self-pipe is for: since the write-end is
registered as wakeup FD, and the read-end is registered in the
selector, when the C-level signal handler is executed, it writes to
the wakeup FD, therefore the select() syscall should return with the
read-end ready, or more likely fail with EINTR. And then, the
Python-level signal handler should be executed (from the main loop),
scheduling a callback, and re-trigerring a wakup through the
self-pipe.
If this doesn't work as-is, I only see two explanations:
- the C-level signal handler doesn't get called, which seems impossible
- the python-level signal handler doesn't get executed before the main
loop re-enters select(), but since the new GIL signal handlers are
called at each eval loop
In short, I'm puzzled. I can tell why my patch should work, but I'm
not sure why the current approach doesn't.
msg200520 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 23:42
Hm... Maybe the wakeup_fd machinery is broken on some systems?
Has anybody been able to confirm that CF's patch fixes the hang on any non-AIX system? If so, maybe we should commit first, understand later...
msg200521 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013年10月19日 23:43
I don't have access to an Ubuntu 8.04 box, but if David Bolen can test the patch it would be great.
msg200525 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月19日 23:55
Separate question for David: if CF's patch were committed, would we still need your patch to disable three subprocess-related tests on AIX? (That can be dealt with after the alpha release though -- in fact I propose not to wait for AIX if we can get the other platforms to stop hanging.)
msg200530 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013年10月20日 01:29
RHEL 6 too:
http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2911/steps/test/logs/stdio 
msg200531 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月20日 01:31
AIX has an equivalent to strace (called truss). I have recorded all AIX system calls and signals for test_process_interactive, which hangs, following all children created by fock. The uncompressed file is 82MB or 939KB compressed. The highlights are:
test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... 
4588380: 64422135: close(3) = 0
4588380: 64422135: statx("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", 0x2FF20870, 128, 010) = 0
4588380: 64422135: kopen("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", O_RDONLY|O_LARGEFILE) = 3
4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
4588380: 64422135: kfcntl(3, F_SETFD, 0x00000001) = 0
4588380: 64422135: fstatx(0x00000003, 0x2FF229E8, 0x00000080, 0x00000008) = 0x00000000
4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
4588380: 64422135: lseek(0x00000003, 0x00000000, 0x00000001) = 0x00000000
4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110
4588380: 64422135: klseek(3, 4294967295, -108, 0x00000001) = 0
4588380: 64422135: klseek(3, 0, 0, 0x00000000) = 0
4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110
4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 0
4588380: 64422135: close(3) = 0
4588380: 64422135: kread(0, " P y t h o n � � � � �".., 1024) = 7
4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000007) = 0x00000007
4588380: 0x00000000: " P y t h o n � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, -1) = 1
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
59244954: 33161535: __libc_sbrk(0x00000000) = 0x20691220
4588380: 64422135: kwrite(0x00000001, 0x200CD140, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD190, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: kread(9, " P y t h o n � � � � �".., 262144) = 7
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD1E0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, 0) = 0
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200D66C0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, 0) = 0
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, 0) = 04588380: 64422135: kread(0, " � � � � � � � � � � � �"..
, 1024)
 = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
59244954: 33161535: _poll(0x2067A970, 3, 0) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD140, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD190, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD1E0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200D66C0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000000) = 0x00000000
4588380: 0x00000000: " � � � � � � � � � � � �"..
And it sits in this read/write loop with some intermittent _poll timeouts until I send SIGINT.
msg200532 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月20日 01:35
Separate question for David: if CF's patch were committed, would we still need your patch to disable three subprocess-related tests on AIX? (That can be dealt with after the alpha release though -- in fact I propose not to wait for AIX if we can get the other platforms to stop hanging.)
CF's patch does not fix the hangs on AIX and my patch still is needed. And I never was suggesting that correct functionality on AIX is a blocker.
msg200533 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013年10月20日 01:40
> test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... 
> 4588380: 64422135: close(3) = 0
> 4588380: 64422135: statx("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", 0x2FF20870, 128, 010) = 0
> 4588380: 64422135: kopen("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", O_RDONLY|O_LARGEFILE) = 3
> 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
> 4588380: 64422135: kfcntl(3, F_SETFD, 0x00000001) = 0
> 4588380: 64422135: fstatx(0x00000003, 0x2FF229E8, 0x00000080, 0x00000008) = 0x00000000
> 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
> 4588380: 64422135: lseek(0x00000003, 0x00000000, 0x00000001) = 0x00000000
> 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
> 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110
> 4588380: 64422135: klseek(3, 4294967295, -108, 0x00000001) = 0
> 4588380: 64422135: klseek(3, 0, 0, 0x00000000) = 0
> 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110
> 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 0
> 4588380: 64422135: close(3) = 0
> 4588380: 64422135: kread(0, " P y t h o n � � � � �".., 1024) = 7
> 4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000007) = 0x00000007
> 4588380: 0x00000000: " P y t h o n � � � � �"..
> 59244954: 33161535: _poll(0x2067A970, 3, -1) = 1
> 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
> 4588380: 64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000
> 4588380: 0x00000000: " � � � � � � � � � � � �"..
It seems echo.py should at least exit on EOF, which would prevent the
hangs (right now it's stuck in a busy loop calling os.read(0) which
always returns immediately).
(same for echo3.py)
msg200534 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月20日 02:45
I added
 if not buf: break
after
 buf = os.read()
to echo.py, echo2.py and echo3.py. They now exit and do not spin, but the test still hangs.
msg200541 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月20日 03:56
I wonder if subprocess pipes have a version of the race condition I fixed
for sockets yesterday:
http://code.google.com/p/tulip/source/detail?r=2d392599fa067038b5fc9eeaf051011307aeadee
The pipe code seems derived from the socket code long ago.
msg200543 - (view) Author: David Bolen (db3l) * Date: 2013年10月20日 04:20
I've been trying to test this on the Ubuntu 8.04 buildbot, but so far, using the latest hg default head, haven't gotten test_asyncio to hang even before the patch. It also looks like the last 4 regular buildbot builds were fine too. I even tried rolling back to one of the revisions that did fail a buildbot build, but it went ok as well.
I'm guessing it's just a timing or race condition but without some way to reproduce reliably, I can't say much about the patch.
msg200544 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月20日 04:30
Let's just release the alpha with this as it stands; I will look at the race condition later, and also at CF's patch (which may be a good idea regardless of what it fixes).
I did add the breaks to echo and echo3 since the busy-waiting just makes tracing this that much harder.
msg200547 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月20日 06:37
I'm trying to let go of the AIX hang. Here's a brain dump of what I've figured out so far.
* There were a lot of red herrings in the early discussion. This hang doesn't seem to have anything to do with nonblocking connect() or sockets, nor even signals.
* Summary of what the test (test_subprocess_interactive) tries to do: it starts an echo subprocess, writes a string to it, reads the string back, writes another string to it, reads that back, and then closes the transport.
* The test hangs after seeing the first string echoed back but not the second, and in between somehow the stdin pipe is broken.
* If I read David's truss log correctly, the following things have happened:
- the parent wrote 'Python ' to the pipe for the subprocess's stdin (this is not shown in the extract but it must have happened because we see the string arrive in the subprocess)
- the echo.py subprocess started and began to read from stdin
- the subprocess read 'Python ' from its stdin
- the subprocess wrote 'Python ' back to its stdout
- poll() in the parent woke up
- the parent allocated some memory and read 'Python ' from the pipe for the subprocess's stdout
At this point apparently the pipe for the subprocess stdin got closed so the subprocess received an EOF (over and over due to the missing test+break).
We also know that the parent now hangs in the last run_until_complete() call, which means that it has at least attempted to write 'The Winner' -- but there is no evidence of this in the truss extract so it is possible that that string is still in the transport's write buffer. It is also possible that David simply missed it in the endless stream of ineffective calls due to the looping bug.
I'm actually curious why it seems that poll() keeps returning 0 in the parent -- shouldn't it have an infinite timeout, since there's nothing left to do?
Another theory is one or more *connection_lost() methods on the protocol are actually being called but the test stubbornly keeps waiting until proto.got_data[1] becomes set.
I'd be very interested in the truss output with the fix to echo.py in place (which is now in the repo).
msg200551 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月20日 08:26
> David Edelsohn added the comment:
>
> AIX has an equivalent to strace (called truss). I have recorded all AIX system calls and signals for test_process_interactive, which hangs, following all children created by fock. The uncompressed file is 82MB or 939KB compressed. The highlights are:
Apparently, the stdout pipe was closed by the parent process: it would
be nice to see where/how (it should be FD 7 according to strace on my
Linux box).
msg200552 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月20日 08:42
> Apparently, the stdout pipe was closed by the parent process
Could it be that selecting for *read* on the *write* end of a pipe is
always ready? In _UnixWritePipeTransport there's a read handler that
immediately closes the pipe as soon as it called. I vaguely remember a
discussion on python-tulip that this might be Linux-specific behavior. (The
reason is that otherwise you can't find out whether the other end was
closed unless you attempt to write to the pipe.)
To test this theory, it should be sufficient to comment out line 280
self._loop.add_reader(self._fileno, self._read_ready)
from unix_events.py. This will make test_subprocess_kill() hang but it
should not affect test_subprocess_interactive. (So it is not a fix, just a
way to confirm the theory.)
msg200553 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月20日 08:52
> Guido van Rossum added the comment:
>
>> Apparently, the stdout pipe was closed by the parent process
>
> Could it be that selecting for *read* on the *write* end of a pipe is
> always ready?
That's exactly what I was thinking when I read the code below: that's
definitely a possibility on AIX.
> In _UnixWritePipeTransport there's a read handler that
> immediately closes the pipe as soon as it called. I vaguely remember a
> discussion on python-tulip that this might be Linux-specific behavior. (The
> reason is that otherwise you can't find out whether the other end was
> closed unless you attempt to write to the pipe.)
Normally, if the pipe is closed, it should be reported by a POLLHUP/write ready.
Then you only consider it closed when write fails with EPIPE.
msg200582 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月20日 13:44
> To test this theory, it should be sufficient to comment out
> self._loop.add_reader(self._fileno, self._read_ready)
When I comment out this line, test_subprocess_interactive succeeds on AIX.
msg200589 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月20日 14:08
For completeness, the highlights of the new truss trace output after the echo.py change and only tracing the main process to avoid confusion from the interleaved output:
test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ...
socketpair(1, 1, 0, 0x2FF112F0) = 0
kfcntl(5, F_SETFD, 0x00000001) = 0
kfcntl(6, F_SETFD, 0x00000001) = 0
kfcntl(5, F_GETFL, 0x00000000) = 2
kioctl(5, -2147195266, 0x2FF12610, 0x00000000) = 0
kioctl(5, -2147195267, 0x2FF12610, 0x00000000) = 0
kfcntl(5, F_SETFL, 0x00000006) = 0
kfcntl(6, F_GETFL, 0x00000000) = 2
kioctl(6, -2147195266, 0x2FF12610, 0x00000000) = 0
kioctl(6, -2147195267, 0x2FF12610, 0x00000000) = 0
kfcntl(6, F_SETFL, 0x00000006) = 0
_poll(0x201A9B00, 1, 0) = 0
fstatx(6, 0x2FF0F0F0, 128, 010) = 0
sigprocmask(2, 0xF0296960, 0x2FF0F0A0) = 0
_sigaction(20, 0x2FF0F0F8, 0x2FF0F108) = 0
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x14260153, 0x00000003, 0x00000110, 0x00000000, 0x00000000) = 0x00000000
pipe(0x2FF0E2F0) = 0
kfcntl(7, F_SETFD, 0x00000001) = 0
kfcntl(8, F_SETFD, 0x00000001) = 0
pipe(0x2FF0E2F0) = 0
kfcntl(9, F_SETFD, 0x00000001) = 0
kfcntl(10, F_SETFD, 0x00000001) = 0
pipe(0x2FF0E2F0) = 0
kfcntl(11, F_SETFD, 0x00000001) = 0
kfcntl(12, F_SETFD, 0x00000001) = 0
fstatx(8, 0x2FF0E748, 128, 010) = 0
fstatx(8, 0x2FF0E748, 128, 010) = 0
kioctl(8, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
fstatx(9, 0x2FF0E748, 128, 010) = 0
fstatx(9, 0x2FF0E748, 128, 010) = 0
kioctl(9, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
fstatx(11, 0x2FF0E748, 128, 010) = 0
fstatx(11, 0x2FF0E748, 128, 010) = 0
kioctl(11, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
pipe(0x2FF0E1C0) = 0
kfcntl(13, F_SETFD, 0x00000001) = 0
kfcntl(14, F_SETFD, 0x00000001) = 0
sigprocmask(0, 0xF0296960, 0xF0296968) = 0
kfork() = 51773770
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0xD0528F00, 0x00000000, 0x14260153, 0x14260153, 0x00000000) = 0x00000000
_getpid() = 65077750
close(14) = 0
close(7) = 0
close(10) = 0
close(12) = 0
kread(13, " � � � � � � � � � � � �".., 50000) = 0
close(13) = 0
fstatx(8, 0x2FF0DB58, 128, 010) = 0
kfcntl(8, F_GETFL, 0x00000000) = 1
kioctl(8, -2147195266, 0x2FF0CDD0, 0x00000000) = 0
kioctl(8, -2147195267, 0x2FF0CDD0, 0x00000000) = 0
kfcntl(8, F_SETFL, 0x00000005) = 0
_poll(0x204DBB90, 2, 0) = 0
_poll(0x204DBB90, 2, 0) = 0
kfcntl(9, F_GETFL, 0x00000000) = 0
kioctl(9, -2147195266, 0x2FF0C360, 0x00000000) = 0
kioctl(9, -2147195267, 0x2FF0C360, 0x00000000) = 0
kfcntl(9, F_SETFL, 0x00000004) = 0
_poll(0x204DBB90, 3, 0) = 0
_poll(0x204DBB90, 3, 0) = 0
kfcntl(11, F_GETFL, 0x00000000) = 0
kioctl(11, -2147195266, 0x2FF0C360, 0x00000000) = 0
kioctl(11, -2147195267, 0x2FF0C360, 0x00000000) = 0
kfcntl(11, F_SETFL, 0x00000004) = 0
_poll(0x2050B8C0, 4, 0) = 0
_poll(0x2050B8C0, 4, 0) = 0
_poll(0x2050B8C0, 4, 0) = 0
_poll(0x2050B8C0, 4, 0) = 0
kwrite(8, " P y t h o n ", 7) = 7
_poll(0x2050B8C0, 4, 0) = 1
_poll(0x2050B8C0, 3, 0) = 0
close(8) = 0
_poll(0x2050B8C0, 3, 0) = 0
_poll(0x2050B8C0, 3, -1) = 1
__libc_sbrk(0x00000000) = 0x20541130
kread(9, " P y t h o n � � � � �".., 262144) = 7
_poll(0x2050B8C0, 3, 0) = 0
_poll(0x2050B8C0, 3, 0) = 0
_poll(0x2050B8C0, 3, 0) = 0
_poll(0x2050B8C0, 3, 0) = 0
_poll(0x2050B8C0, 3, -1) = 2
kread(9, " � � � � � � � � � � � �".., 262144) = 0
kread(11, " � � � � � � � � � � � �".., 262144) = 0
_poll(0x2050B8C0, 1, 0) = 0
close(9) = 0
close(11) = 0
 Received signal #20, SIGCHLD [caught]
_getpid() = 65077750
kwrite(6, "14", 1) = 1
ksetcontext_sigreturn(0x2FF11D70, 0x200868EC, 0x00000008, 0xD0528CA4, 0x00000000, 0x14260153, 0x2FF22FF8, 0x00000000)
_poll(0x2050B8C0, 1, 0) = 1
_erecv(5, 0x3075C820, 1, 0, 0x00000000) = 1
_poll(0x2050B8C0, 1, 0) = 0
kwaitpid(0x2FF12134, 0, 5, 0x00000000, 0x00000000) = 51773770
_poll(0x2050B8C0, 1, 0) = 0
_poll(0x2050B8C0, 1, -1) (sleeping...)
msg200597 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月20日 16:55
> > Could it be that selecting for *read* on the *write* end of a pipe is
> > always ready?
>
> That's exactly what I was thinking when I read the code below: that's
> definitely a possibility on AIX.
David confirmed that it is the _read_ready() that closes the pipe
prematurely. (Thanks David!)
> > In _UnixWritePipeTransport there's a read handler that
> > immediately closes the pipe as soon as it called. I vaguely remember a
> > discussion on python-tulip that this might be Linux-specific behavior. (The
> > reason is that otherwise you can't find out whether the other end was
> > closed unless you attempt to write to the pipe.)
>
> Normally, if the pipe is closed, it should be reported by a POLLHUP/write ready.
> Then you only consider it closed when write fails with EPIPE.
That's no good though unless the app wants to write -- the write end
also reports write ready when it is empty (or nearly so), so this
won't help an app that has done all the writing it wants to (at least
for a while) but still wants to be notified when the other end has
closed it. Now, there are other indications, such as the pipe in the
other direction giving an EOF, or the process exiting, but sometimes
neither of those happens, and then we're stuck.
I guess we'll have to write platform-dependent code and make this an
optional feature. (Essentially, on platforms like AIX, for a
write-pipe, connection_lost() won't be called unless you try to write
some more bytes to it.)
I do believe that so far this problem only occurs on AIX so I am
tempted to make it an explicit test for AIX -- if it's AIX, don't
register the _read_ready handler. We'll also have to skip or adjust
one or two tests that will fail if this feature is missing.
msg200598 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月20日 17:36
> I guess we'll have to write platform-dependent code and make this an
> optional feature. (Essentially, on platforms like AIX, for a
> write-pipe, connection_lost() won't be called unless you try to write
> some more bytes to it.)
>
> I do believe that so far this problem only occurs on AIX so I am
> tempted to make it an explicit test for AIX -- if it's AIX, don't
> register the _read_ready handler. We'll also have to skip or adjust
> one or two tests that will fail if this feature is missing.
Hmm...
Not calling connection_lost() when the read-end of the pipe is closed
is OK, since there's no portable way to detect this on e.g. AIX, but
AFAICT the current process-termination logic is buggy:
When SIGCHLD is received, _sig_chld() is executed:
 def _sig_chld(self):
 [...]
 transp = self._subprocesses.get(pid)
 if transp is not None:
 transp._process_exited(returncode)
Then, here's _process_exited():
 def _process_exited(self, returncode):
 assert returncode is not None, returncode
 assert self._returncode is None, self._returncode
 self._returncode = returncode
 self._loop._subprocess_closed(self)
 self._call(self._protocol.process_exited)
 self._try_finish()
And here's _try_finish():
 def _try_finish(self):
 assert not self._finished
 if self._returncode is None:
 return
 if all(p is not None and p.disconnected
 for p in self._pipes.values()):
 self._finished = True
 self._loop.call_soon(self._call_connection_lost, None)
Thus, _UnixSubprocessTransport protocol's connection_lost is only
called if the all() expression is true:
and it's true only if all the subprocess pipes have been disconnected
(or if we didn't setup any pipe).
Unfortunately, this might very well never happen: imagine that the
subprocess forks a process: this grand-child process inherits the
child process's pipes, so when the child process exits, we won't
receive any notification, since this grand-child process still has
open FDs pointing to the original child's stdin/stdout/stderr.
The following simple test will hang until the background 'sleep' exits:
"""
diff -r 47618b00405b Lib/test/test_asyncio/test_events.py
--- a/Lib/test/test_asyncio/test_events.py Sat Oct 19 10:45:48 2013 +0300
+++ b/Lib/test/test_asyncio/test_events.py Sun Oct 20 19:32:37 2013 +0200
@@ -1059,6 +1059,23 @@
 @unittest.skipIf(sys.platform == 'win32',
 "Don't support subprocess for Windows yet")
+ def test_subprocess_inherit_fds(self):
+ proto = None
+
+ @tasks.coroutine
+ def connect():
+ nonlocal proto
+ transp, proto = yield from self.loop.subprocess_shell(
+ functools.partial(MySubprocessProtocol, self.loop),
+ 'sleep 60 &')
+ self.assertIsInstance(proto, MySubprocessProtocol)
+
+ self.loop.run_until_complete(connect())
+ self.loop.run_until_complete(proto.completed)
+ self.assertEqual(0, proto.returncode)
+
+ @unittest.skipIf(sys.platform == 'win32',
+ "Don't support subprocess for Windows yet")
 def test_subprocess_close_after_finish(self):
 proto = None
 transp = None
"""
If waitpid() returns a process's PID, then the process is done,
there's no reason to further wait for pipe's disconnection: they can
be used as a hint that the process terminated, but there's definitely
not required...
msg200599 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013年10月20日 17:37
> I guess we'll have to write platform-dependent code and make this an
> optional feature. (Essentially, on platforms like AIX, for a
> write-pipe, connection_lost() won't be called unless you try to write
> some more bytes to it.)
If we are not capturing stdout/stderr then we could "leak" the write end of a pipe to the child. When the read end becomes readable we can call the process protocol's connection_lost().
Or we could just call connection_lost() when reaping the pid.
msg200615 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月20日 19:01
No, there's a use case for reading after the child exited, if there is a
grandchild still writing.
--Guido van Rossum (sent from Android phone)
On Oct 20, 2013 10:37 AM, "Richard Oudkerk" <report@bugs.python.org> wrote:
>
> Richard Oudkerk added the comment:
>
> > I guess we'll have to write platform-dependent code and make this an
> > optional feature. (Essentially, on platforms like AIX, for a
> > write-pipe, connection_lost() won't be called unless you try to write
> > some more bytes to it.)
>
> If we are not capturing stdout/stderr then we could "leak" the write end
> of a pipe to the child. When the read end becomes readable we can call the
> process protocol's connection_lost().
>
> Or we could just call connection_lost() when reaping the pid.
>
> ----------
> nosy: +sbt
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue19293>
> _______________________________________
>
msg200666 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013年10月21日 00:19
[Off-topic: Heck, another one of those upside-down tracebacks. WHY???]
faulthandler is really a very low-level module. I wrote it to analyze deadlocks, memory corruptions and other sad cases. faulthandler is designed to provide as much information as possible even if Python internals are corrupted. So it only reads data structures, without processing them too much. For example, frames are stored from the most recent to the oldest. faulthandler also truncates long traceback to 100 frames (write "..." if there are more), to stop the loop if there is a cycle in the chained list of frames.
The documentation also mentions other limitations:
http://docs.python.org/dev/library/faulthandler.html 
msg200671 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013年10月21日 00:28
> Hm... Maybe the wakeup_fd machinery is broken on some systems?
Hey! I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX?
msg200672 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013年10月21日 00:33
Just wondering: why is the timeout-to-failure set to a whole hour? Based on all the messages I've seen this weekend, we could save the buildbots countless years of time by reducing it to - say - 59 minutes ;-)
msg200689 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月21日 03:06
> I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX?
@haypo: test_signal skips some tests on AIX. See Issue #18238 .
msg200713 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013年10月21日 08:09
2013年10月21日 David Edelsohn <report@bugs.python.org>:
>> I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX?
>
> @haypo: test_signal skips some tests on AIX. See Issue #18238 .
Ah yes, the test on select+signal is skipped, but the basic feature
(writing a signal number in a pipe using wakeup fd) works on AIX.
msg200720 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013年10月21日 08:37
I'm completely lost: how many issues is this thread conflating?
For example, David, I think you said that my patch regarding signal
handling did fix some tests.
Is that still the case?
If yes, could you open a specific issue for the remaining failing tests on AIX?
AFAICT, the original hang is due to AIX always reporting the write-end
of a pipe as read-ready: we should only keep discussing this problem
here.
The above FreeBSD and Ubuntu failures seem completely unrelated, they
should get there own issues.
msg200745 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013年10月21日 10:23
Would it make sense to use socketpair() instead of pipe() on AIX? We could check for the "bug" directly rather than checking specifically for AIX.
msg200791 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月21日 14:50
OK, I've created issue 19334 to track test_asyncio hangs on other platforms. Let's please keep this issue restricted to AIX. I am aware of the following issue:
- select for read on the write end of a pipe returns immediately
Any other AIX issues also please open separate issues (and assign to me to get my immediate attention).
I wonder if we can at least dynamically detect that this does not work in the _read_ready() handler or even in the setup? ISTM that reading from the write end should immediately fail with EBADF. (It does on my system, but I don't want to generalize. :-)
msg200794 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月21日 14:55
@Richard: socketpair() seems to work fine on AIX (we use it internally). Maybe we can always use that for subprocess management? But will the subprocess module support this?
msg200806 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013年10月21日 16:19
The following uses socketpair() instead of pipe() for stdin, and works for me on Linux:
diff -r 7d94e4a68b91 asyncio/unix_events.py
--- a/asyncio/unix_events.py Sun Oct 20 20:25:04 2013 -0700
+++ b/asyncio/unix_events.py Mon Oct 21 17:15:19 2013 +0100
@@ -272,8 +272,6 @@
 self._loop = loop
 self._pipe = pipe
 self._fileno = pipe.fileno()
- if not stat.S_ISFIFO(os.fstat(self._fileno).st_mode):
- raise ValueError("Pipe transport is for pipes only.")
 _set_nonblocking(self._fileno)
 self._protocol = protocol
 self._buffer = []
@@ -442,9 +440,16 @@
 self._finished = False
 self._returncode = None
+ if stdin == subprocess.PIPE:
+ stdin_w, stdin_r = socket.socketpair()
+ else:
+ stdin_w = stdin_r = None
 self._proc = subprocess.Popen(
- args, shell=shell, stdin=stdin, stdout=stdout, stderr=stderr,
+ args, shell=shell, stdin=stdin_r, stdout=stdout, stderr=stderr,
 universal_newlines=False, bufsize=bufsize, **kwargs)
+ if stdin_r is not None:
+ stdin_r.close()
+ self._proc.stdin = open(stdin_w.detach(), 'rb', buffering=bufsize)
 self._extra['subprocess'] = self._proc
 def close(self):
msg200808 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月21日 16:28
Brilliant. It works for me too on OS X.
David, can you check this on AIX?
msg200811 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月21日 16:34
Attached is a version of Richard's patch that I like better (sockpair.diff).
- instead of dropping the ValueError, check for ISFIFO and ISSOCK
- pass the read end of the socket pair to Popen()
msg200814 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月21日 17:08
New, improved version of sockpair.diff.
msg200860 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月22日 02:47
The testsuite is hanging in test_asyncio again and test_subprocess_close_client_stream needs to be skipped.
msg200861 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月22日 02:48
Other than the skipped tests, test_asyncio now passes on AIX except for:
FAIL: test_write_pipe (test.test_asyncio.test_events.PollEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/home/dje/src/cpython/Lib/test/test_asyncio/test_events.py", line 875, in test_write_pipe
 self.assertEqual(b'2345', data)
AssertionError: b'2345' != b''
msg200867 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月22日 03:31
Thanks; I've applied and pushed your fix for the hang.
I think it's time to commit the sockpair.diff change; the test_write_pipe() failure looks related -- as you recall the problem doesn't really cause a hang, it causes a premature closing of a write pipe, the hangs were just due to the test structure (unfortunate, really).
I'll let you know when.
msg200868 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013年10月22日 03:37
New changeset 2a0bda8d283d by Guido van Rossum in branch 'default':
Switch subprocess stdin to a socketpair, attempting to fix issue #19293 (AIX hang).
http://hg.python.org/cpython/rev/2a0bda8d283d 
msg200869 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月22日 03:38
Let's see how this fares. If it passes on the AIX buildbot, maybe see which of the tests we skipped on AIX can be re-enabled again?
msg200872 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月22日 03:55
With the latest changeset applied to default, the test_subprocess tests no longer hang, but test_write_pipe_disconnect_on_close now hangs.
The attached patch changes test_event.py to not skip test_subprocess_* but now skips test_write_pipe_disconnect_on_close.
msg200873 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月22日 03:58
I think I know what that is -- the test should use a socketpair now. I'll see how quickly I can give you a patch for that. The other one is now committed.
msg200874 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月22日 04:09
David, try this fix (aixfix.diff).
msg200876 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月22日 04:23
test_asyncio no longer hangs on AIX with the aixfix.diff patch. Thanks!
msg200879 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月22日 04:27
Third try's a charm. Corrected aix fix.
msg200880 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013年10月22日 04:29
New changeset c2e018c54689 by Guido van Rossum in branch 'default':
Fix asyncio issue #19293 (hangs on AIX).
http://hg.python.org/cpython/rev/c2e018c54689 
msg200881 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013年10月22日 04:30
"And there was much rejoicing."
msg200882 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013年10月22日 04:31
You have to try harder to break it again. aixfix2.diff still works. ;-)
History
Date User Action Args
2022年04月11日 14:57:52adminsetgithub: 63492
2013年10月22日 04:31:54David.Edelsohnsetmessages: + msg200882
2013年10月22日 04:30:21gvanrossumsetstatus: open -> closed
resolution: fixed
messages: + msg200881

stage: resolved
2013年10月22日 04:29:19python-devsetmessages: + msg200880
2013年10月22日 04:27:17gvanrossumsetfiles: - aixfix.diff
2013年10月22日 04:27:01gvanrossumsetfiles: + aixfix2.diff

messages: + msg200879
2013年10月22日 04:25:47gvanrossumsetmessages: - msg200878
2013年10月22日 04:24:21gvanrossumsetmessages: + msg200878
2013年10月22日 04:23:40gvanrossumsetmessages: - msg200877
2013年10月22日 04:23:02gvanrossumsetmessages: + msg200877
2013年10月22日 04:23:01David.Edelsohnsetmessages: + msg200876
2013年10月22日 04:09:16gvanrossumsetfiles: + aixfix.diff

messages: + msg200874
2013年10月22日 03:58:42gvanrossumsetmessages: + msg200873
2013年10月22日 03:55:12David.Edelsohnsetfiles: + issue19293.patch3

messages: + msg200872
2013年10月22日 03:38:53gvanrossumsetmessages: + msg200869
2013年10月22日 03:37:23python-devsetmessages: + msg200868
2013年10月22日 03:31:45gvanrossumsetmessages: + msg200867
2013年10月22日 02:48:00David.Edelsohnsetmessages: + msg200861
2013年10月22日 02:47:20David.Edelsohnsetfiles: + issue19293.patch2

messages: + msg200860
2013年10月21日 17:08:17gvanrossumsetfiles: - sockpair.diff
2013年10月21日 17:08:04gvanrossumsetfiles: + sockpair.diff

messages: + msg200814
2013年10月21日 16:34:47gvanrossumsetfiles: + sockpair.diff

messages: + msg200811
2013年10月21日 16:28:23gvanrossumsetmessages: + msg200808
2013年10月21日 16:19:05sbtsetmessages: + msg200806
2013年10月21日 15:09:12floxsetnosy: + flox
2013年10月21日 14:55:34gvanrossumsetmessages: + msg200794
2013年10月21日 14:50:52gvanrossumsetmessages: + msg200791
title: test_asyncio hanging for 1 hour -> test_asyncio hanging for 1 hour (AIX version, hangs in test_subprocess_interactive)
2013年10月21日 10:23:03sbtsetmessages: + msg200745
2013年10月21日 08:37:01neologixsetmessages: + msg200720
2013年10月21日 08:09:05vstinnersetmessages: + msg200713
2013年10月21日 03:06:05David.Edelsohnsetmessages: + msg200689
2013年10月21日 00:33:29tim.peterssetnosy: + tim.peters
messages: + msg200672
2013年10月21日 00:28:37vstinnersetmessages: + msg200671
2013年10月21日 00:19:32vstinnersetnosy: + vstinner
messages: + msg200666
2013年10月20日 19:01:42gvanrossumsetmessages: + msg200615
2013年10月20日 17:37:10sbtsetnosy: + sbt
messages: + msg200599
2013年10月20日 17:36:37neologixsetmessages: + msg200598
2013年10月20日 16:55:54gvanrossumsetmessages: + msg200597
2013年10月20日 14:08:36David.Edelsohnsetmessages: + msg200589
2013年10月20日 13:44:20David.Edelsohnsetmessages: + msg200582
2013年10月20日 08:52:08neologixsetmessages: + msg200553
2013年10月20日 08:42:09gvanrossumsetmessages: + msg200552
2013年10月20日 08:26:43neologixsetmessages: + msg200551
2013年10月20日 06:37:40gvanrossumsetmessages: + msg200547
2013年10月20日 04:30:40gvanrossumsetmessages: + msg200544
2013年10月20日 04:20:58db3lsetmessages: + msg200543
2013年10月20日 03:56:17gvanrossumsetmessages: + msg200541
2013年10月20日 02:45:40David.Edelsohnsetmessages: + msg200534
2013年10月20日 01:40:52pitrousetmessages: + msg200533
2013年10月20日 01:35:59David.Edelsohnsetmessages: + msg200532
2013年10月20日 01:31:12David.Edelsohnsetmessages: + msg200531
2013年10月20日 01:29:55pitrousetnosy: + ncoghlan
messages: + msg200530
2013年10月19日 23:55:54gvanrossumsetmessages: + msg200525
2013年10月19日 23:43:55pitrousetnosy: + db3l
messages: + msg200521
2013年10月19日 23:42:31gvanrossumsetmessages: + msg200520
2013年10月19日 23:34:07neologixsetmessages: + msg200518
2013年10月19日 23:26:56gvanrossumsetmessages: + msg200516
2013年10月19日 23:14:20larrysetnosy: + larry
messages: + msg200514
2013年10月19日 23:13:51pitrousetmessages: + msg200513
2013年10月19日 22:50:22gvanrossumsetmessages: + msg200510
2013年10月19日 22:09:28neologixsetmessages: + msg200505
2013年10月19日 20:35:54David.Edelsohnsetmessages: + msg200494
2013年10月19日 19:53:57neologixsetfiles: + asyncio_signal.diff

messages: + msg200487
2013年10月19日 18:44:36pitrousetnosy: + pitrou
messages: + msg200475
2013年10月19日 18:24:17neologixsetmessages: + msg200470
2013年10月19日 18:17:26David.Edelsohnsetmessages: + msg200466
2013年10月19日 18:11:35gvanrossumsetmessages: + msg200463
2013年10月19日 18:09:35David.Edelsohnsetmessages: + msg200461
2013年10月19日 18:03:28neologixsetmessages: + msg200456
2013年10月19日 17:47:40David.Edelsohnsetmessages: + msg200452
2013年10月19日 17:36:21David.Edelsohnsetmessages: + msg200449
2013年10月19日 16:41:06gvanrossumsetassignee: gvanrossum
2013年10月19日 16:40:25gvanrossumsetmessages: + msg200440
2013年10月19日 16:31:06skrahsettitle: test_asyncio failures on AIX -> test_asyncio hanging for 1 hour
2013年10月19日 16:29:42skrahlinkissue19285 superseder
2013年10月19日 16:26:33skrahsetnosy: + skrah
messages: + msg200438
2013年10月19日 16:25:38neologixsetnosy: + neologix
messages: + msg200437
2013年10月19日 16:18:25gvanrossumsetmessages: + msg200435
2013年10月19日 16:10:46python-devsetnosy: + python-dev
messages: + msg200433
2013年10月19日 15:40:45David.Edelsohnsetfiles: + issue19293.patch
keywords: + patch
messages: + msg200426

components: + Tests
2013年10月19日 05:12:28gvanrossumsetmessages: + msg200374
2013年10月19日 05:07:22gvanrossumsetnosy: + gvanrossum
messages: + msg200372
2013年10月19日 04:20:34David.Edelsohncreate

AltStyle によって変換されたページ (->オリジナル) /