This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2011年07月28日 13:58 by gabriele.trombetti, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| issue_12650_default.diff | neologix, 2011年07月28日 21:30 | review | ||
| issue_12650_27.diff | neologix, 2011年07月28日 21:30 | review | ||
| issue_12650_32.diff | neologix, 2011年07月28日 21:30 | review | ||
| Messages (13) | |||
|---|---|---|---|
| msg141295 - (view) | Author: (gabriele.trombetti) | Date: 2011年07月28日 13:58 | |
There seems to be a file descriptor (fd) leak in subprocess upon call to kill() and then destroying the current subprocess object (e.g. by scope exit) if the pipe functionality is being used.
This is a reproducer:
(Linux 2.6.25, Python 2.7.1 (r271:86832))
import subprocess, time
def leaktest():
subp = subprocess.Popen("echo hi; sleep 200; echo bye", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
#you can do something here
subp.terminate()
#subp.wait() #Fixes the bug
#time.sleep(0.001) #fixes the bug
#time.sleep(0.0000001) #doesn't fix the bug
return True
Launch the function multiple times interactively, each time the number of open file descriptors for the python process will increase by 2. You can see that by "ls -l /proc/<pid>/fd"
This seems to be a race condition because adding a time.sleep(0.001) before the return fixes the problem. Probably some kernel delay is responsible for the race.
This bug is significant for daemons because the daemon will die once the number of open file descriptors hits the ulimit, usually 1024, so please fix.
Note: until the bug is fixed, a simple workaround (not documented in module docs though) is to call Popen.wait() after Popen.kill()
Thank you
|
|||
| msg141307 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2011年07月28日 16:02 | |
There's indeed a leak in your code, because you don't close the stdout and stderr file descriptors. Try with: subp.terminate() subp.stdout.close() subp.stderr.close() return True And you'll be just fine. The reason why this works with a short delay is subtle however, and there's actually a bug: - when a subprocess.Popen is created, its stdout and stderr are wrapped with os.fdopen() - file objects returned by fdopen() are automatically closed when the object is deallocated (i.e. when the file object goes out of scope in cPython) - so in theory, when your subprocess.Popen goes out of scope, the corresponding FDs should be closed, and you shoudn't have a leak (note that this assumes cPython, and will maybe not work with other implementations - you shouldn't rely on this) - the problem is that when subprocess.Popen's __del__ method is called before the process has exited (i.e. if you return from leaktest() before the process has exited), the Popen object is added to a list (subprocess._active), so that it can be waited (to avoid zombies) - Popen objects from this list are collected (i.e. the process is waited for, and if terminated it's removed from the list) synchronously when a new Popen() object is created (_cleanup funtion) The problem is that there's a bug in the collection code: def _cleanup(): for inst in _active[:]: res = inst._internal_poll(_deadstate=sys.maxint) if res is not None and res >= 0: try: _active.remove(inst) except ValueError: # This can happen if two threads create a new Popen instance. # It's harmless that it was already removed, so ignore. pass res = inst._internal_poll(_deadstate=sys.maxint) if res is not None and res >= 0: If the process exit code is negative (in your case, -SIGTERM) then the Popen object is not removed from the list. Two consequences: - the object is not deallocated, the corresponding FDs neither, and you hit RLIMIT_NOFILE - even if stdout and stderr are closed manually, the object itself is not deallocated, so you have an ever-growing _active list, and a memory leak (and walking the list takes longer and longer) Changing if res is not None and res >= 0: to if res is not None: fixes this. I'll look at this in more detail and provide a patch. |
|||
| msg141317 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2011年07月28日 21:30 | |
Alright. I tested this on default, and couldn't reproduce the FD leak. It turned out to be due to another bug, affecting only the code path which calls pure C _posixsubprocess (which is the only implementation left in 3.3, but 3.2 still has the old pure-Python version). The code just forgets to set Popen._child_created to true after fork(), so when Popen.__del__() gets called before the process has exited, the object is not added to the _active list, and gets deallocated immediately. While this accidentaly "fixes" the FD leak, this has another - worse - side effet: the process remains a zombie. I'm thus attaching three patches, with tests: - one for 2.7, which fixes the original problem (i.e. remove the process from _active once exited, even if it got killed by a signal) - one for default, which also sets _child_created to True after fork() - another one for 3.2, which does the same thing as the one for default (but the code is a little different because 3.2 has both pure-Python and C implementation) Reviews welcome! |
|||
| msg141329 - (view) | Author: Ross Lagerwall (rosslagerwall) (Python committer) | Date: 2011年07月29日 05:43 | |
The patches look good and seem to fix the issue. |
|||
| msg142337 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年08月18日 15:18 | |
New changeset 9ee802642d86 by Charles-François Natali in branch '2.7': Issue #12650: Fix a race condition where a subprocess.Popen could leak http://hg.python.org/cpython/rev/9ee802642d86 |
|||
| msg142348 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年08月18日 15:54 | |
New changeset 7d358379c5fb by Charles-François Natali in branch '2.7': Issue #12650: only run the tests on Unix. http://hg.python.org/cpython/rev/7d358379c5fb |
|||
| msg142366 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年08月18日 16:48 | |
New changeset 8ca7f109ce79 by Charles-François Natali in branch '3.2': Issue #12650: Fix a race condition where a subprocess.Popen could leak http://hg.python.org/cpython/rev/8ca7f109ce79 |
|||
| msg142373 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年08月18日 17:12 | |
New changeset 148d75d106f1 by Charles-François Natali in branch 'default': Issue #12650: Fix a race condition where a subprocess.Popen could leak http://hg.python.org/cpython/rev/148d75d106f1 |
|||
| msg142391 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2011年08月18日 18:21 | |
Patch committed. This should also fix the original problem. Gabriele, thanks for the report. |
|||
| msg142402 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2011年08月18日 21:43 | |
Reopening. The new tests are failing on solaris and Debian parallel buildbots: http://www.python.org/dev/buildbot/all/builders/x86 debian parallel 3.x/builds/2734/steps/test/logs/stdio http://www.python.org/dev/buildbot/all/builders/sparc solaris10 gcc 3.x/builds/3528/steps/test/logs/stdio """ ====================================================================== FAIL: test_zombie_fast_process_del (test.test_subprocess.POSIXProcessTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home2/buildbot2/slave/3.x.loewis-parallel/build/Lib/test/test_subprocess.py", line 1519, in test_zombie_fast_process_del self.assertRaises(OSError, os.waitpid, pid, 0) AssertionError: OSError not raised by waitpid """ I've no idea what's going on here... This means that the child process didn't get reaped by _cleanup(). It could also maybe be due to the first child's PID being reused immediately by the second. |
|||
| msg142404 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年08月18日 22:25 | |
New changeset 122f952add3c by Charles-François Natali in branch '3.2': Issue #12650: fix failures on some buildbots, when a subprocess takes a long http://hg.python.org/cpython/rev/122f952add3c New changeset a32ae2749cd1 by Charles-François Natali in branch 'default': Issue #12650: fix failures on some buildbots, when a subprocess takes a long http://hg.python.org/cpython/rev/a32ae2749cd1 New changeset 5333ed548257 by Charles-François Natali in branch '2.7': Issue #12650: fix failures on some buildbots, when a subprocess takes a long http://hg.python.org/cpython/rev/5333ed548257 |
|||
| msg142405 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2011年08月18日 22:32 | |
> This means that the child process didn't get reaped by _cleanup(). The most likely cause is that when _cleanup() calls waitpid(pid, WNOHANG), the child process hasn't exited yet. Since I had already set a rather high timeout to avoid this problem and we're only interested in checking that the child is correctly added to the _active list, I've removed the part of test_zombie_fast_process_del which checked that the child got collected when the next Popen instance is allocated (I added this check in test_leak_fast_process_del_killed, which doesn't have this timeout problem because the process gets killed). I'll watch the buildbots. |
|||
| msg142422 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2011年08月19日 10:27 | |
The test now passes on the buildbots, closing. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:20 | admin | set | github: 56859 |
| 2011年08月19日 10:27:03 | neologix | set | status: open -> closed messages: + msg142422 |
| 2011年08月18日 22:32:06 | neologix | set | messages: + msg142405 |
| 2011年08月18日 22:25:42 | python-dev | set | messages: + msg142404 |
| 2011年08月18日 21:43:04 | neologix | set | status: closed -> open messages: + msg142402 |
| 2011年08月18日 18:21:59 | neologix | set | status: open -> closed resolution: fixed messages: + msg142391 stage: resolved |
| 2011年08月18日 17:12:56 | python-dev | set | messages: + msg142373 |
| 2011年08月18日 16:48:47 | python-dev | set | messages: + msg142366 |
| 2011年08月18日 15:54:50 | python-dev | set | messages: + msg142348 |
| 2011年08月18日 15:18:11 | python-dev | set | nosy:
+ python-dev messages: + msg142337 |
| 2011年07月29日 05:43:06 | rosslagerwall | set | messages: + msg141329 |
| 2011年07月28日 21:30:25 | neologix | set | files:
+ issue_12650_default.diff, issue_12650_27.diff, issue_12650_32.diff keywords: + patch messages: + msg141317 |
| 2011年07月28日 16:22:37 | rosslagerwall | set | nosy:
+ rosslagerwall |
| 2011年07月28日 16:08:56 | vstinner | set | nosy:
+ vstinner |
| 2011年07月28日 16:02:01 | neologix | set | nosy:
+ neologix messages: + msg141307 |
| 2011年07月28日 14:13:45 | r.david.murray | set | nosy:
+ gregory.p.smith |
| 2011年07月28日 13:58:59 | gabriele.trombetti | create | |