This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2008-07-19 20:00 by donmez, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files
| File name | Uploaded | Description |
|---|---|---|
| multibug3.py | mark.dickinson, 2008-08-07 21:10 | script demonstrating incref bug |
| mp_nohang_jnoller.patch | jnoller, 2008-08-11 18:02 | |
| issue3419.patch | mark.dickinson, 2008-08-17 13:29 | Possible fix for incref bug |
Messages (23)
| msg70053 | Author: Ismail Donmez (donmez) * | Date: 2008-07-19 20:00 |
Tested on Mac OS X 10.5.4, running test_multiprocessing in a tight loop:
[~/Sources/py3k]> while true;do ./python ./Lib/test/regrtest.py
test_multiprocessing;done
test_multiprocessing
1 test OK.
test_multiprocessing
1 test OK.
test_multiprocessing
1 test OK.
test_multiprocessing
Process Process-48:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 720, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py",
line 232, in _bootstrap
self.run()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py",
line 88, in run
self._target(*self._args, **self._kwargs)
File "/Users/cartman/Sources/py3k/Lib/test/test_multiprocessing.py",
line 601, in f
cond.acquire()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 949, in acquire
return self._callmethod('acquire', (blocking,))
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 724, in _callmethod
self._connect()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 711, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py",
line 133, in Client
c = SocketClient(address)
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py",
line 254, in SocketClient
s.connect(address)
socket.error: [Errno 61] Connection refused
Exception in thread Thread-58:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 720, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/threading.py", line 492, in
_bootstrap_inner
self.run()
File "/Users/cartman/Sources/py3k/Lib/threading.py", line 447, in run
self._target(*self._args, **self._kwargs)
File "/Users/cartman/Sources/py3k/Lib/test/test_multiprocessing.py",
line 601, in f
cond.acquire()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 949, in acquire
return self._callmethod('acquire', (blocking,))
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 724, in _callmethod
self._connect()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py",
line 711, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py",
line 133, in Client
c = SocketClient(address)
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py",
line 254, in SocketClient
s.connect(address)
socket.error: [Errno 61] Connection refused
|
|||
| msg70056 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-07-19 21:01 |
When running test_multiprocessing in a loop, I see:
test test_multiprocessing failed -- Traceback (most recent call last):
File "/temp/python/trunk/Lib/test/test_multiprocessing.py", line 1157, in test_remote
queue = manager2.get_queue()
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 635, in temp
authkey=self._authkey, exposed=exp
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 887, in AutoProxy
incref=incref)
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 696, in __init__
self._incref()
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 743, in _incref
dispatch(conn, None, 'incref', (self._id,))
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 79, in dispatch
raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 181, in handle_request
result = func(c, *args, **kwds)
File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 397, in incref
self.id_to_refcount[ident] += 1
KeyError: '1033760'
---------------------------------------------------------------------------
|
|||
| msg70060 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-07-19 22:13 |
This also happens on trunk.
|
|||
| msg70539 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-08-01 01:22 |
Ben's incref error != the connection refused error. Here's some information from Scott Leerssen, who is using the original .52 code:

Hi Jesse,

I am having some trouble with the processing module (0.52) and thought you might be of some help since you were so involved with it in the PEP process. I have a queue of Process objects from which a runner thread pulls and starts processes. The producers for the queue are many, many threads. I'm seeing some very odd behavior when lots of processes try to start while others are being reaped at the same time. I see tracebacks that look like the one below, and also occasionally get "hangs" where the process module "seems" to take over the GIL and never give it back.

I did find a race condition bug in process.py where a _cleaner function is run in start(), potentially joining on processes that other threads may also be joining at the same time, which leads to KeyError exceptions coming from _cleaner. I mutexed all my starts and joins to get around that for now, but am stumped on the issue below. Does it look familiar to you?

Thanks, and feel free to shrug and reply "beats me".

Scott

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,609 SYSMESG Process Process-24:
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,612 SYSMESG
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,613 SYSMESG Traceback (most recent call last):
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,613 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/process.py", line 224, in _bootstrap
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,615 SYSMESG _runAfterForkers()
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,616 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/process.py", line 300, in _runAfterForkers
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,617 SYSMESG func(obj)
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,617 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/managers.py", line 806, in _afterFork
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,620 SYSMESG self._incref()
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,621 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/managers.py", line 760, in _incref
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,621 SYSMESG dispatch(connection, None, 'incref', (self._id,))
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,622 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/managers.py", line 92, in dispatch
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,623 SYSMESG raise result
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,624 SYSMESG RemoteError:
---------------------------------------------------------------------------
Remote Traceback (most recent call last):
File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/managers.py", line 190, in handleRequest
result = func(c, *args, **kwds)
File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-packages/processing/managers.py", line 404, in incref
self.id_to_refcount[ident] += 1
KeyError: 2880341100L
|
|||
| msg70570 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-01 16:40 |
I'm also seeing the incref error occasionally, on OS X 10.5.4:

======================================================================
ERROR: test_remote (__main__.WithManagerTestRemoteManager)
----------------------------------------------------------------------
Traceback (most recent call last):
File "Lib/test/test_multiprocessing.py", line 1157, in test_remote
queue = manager2.get_queue()
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 635, in temp
authkey=self._authkey, exposed=exp
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 887, in AutoProxy
incref=incref)
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 696, in __init__
self._incref()
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 743, in _incref
dispatch(conn, None, 'incref', (self._id,))
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 79, in dispatch
raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 181, in handle_request
result = func(c, *args, **kwds)
File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 397, in incref
self.id_to_refcount[ident] += 1
KeyError: '6fdb78'
---------------------------------------------------------------------------
|
|||
| msg70603 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-01 22:32 |
For the connection refused error, the following script fails for me (again, on OS X 10.5.4) with the message:

errno: [Errno 61] Connection refused

on most runs. (If xrange(10) is replaced by a smaller range like xrange(4) then the script usually runs without error.) Looks like something in semaphore.c needs fixing, but I've no idea what.

from multiprocessing import Process, Manager

sleeping = Manager().Semaphore(0)

def f():
    sleeping.release()

for _ in xrange(10):
    Process(target=f).start()

for _ in xrange(10):
    sleeping.acquire()
|
|||
| msg70608 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-01 23:50 |
> Looks like something in semaphore.c needs fixing, but I've no idea what.

I take that back. It's nothing to do with semaphore.c. I'll keep trying.
|
|||
| msg70609 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-08-02 00:01 |
Are you looking at the conn refused or the incref error? |
|||
| msg70631 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-02 13:44 |
> Are you looking at the conn refused or the incref error?

The connection refused error.

The attached patch fixes the problem, for me. On my machine, the connection refused error code was 61 rather than 10061. With this patch, I'm no longer seeing any hangs in test_multiprocessing.py (at least, not in the last 500 runs :-)). (Though I am still seeing the incref error occasionally.)

If anyone's prepared to answer a stupid question: I'm curious why failed socket connections occur at all. Is connecting to a socket generally considered an unreliable operation, or is there some aspect of the multiprocessing module that makes it potentially unreliable?
|
|||
| msg70632 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-08-02 14:23 |
On Aug 2, 2008, at 9:44 AM, Mark Dickinson <report@bugs.python.org> wrote:
>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
>> Are you looking at the conn refused or the incref error?
>
> The connection refused error.
>
> The attached patch fixes the problem, for me. On my machine, the
> connection refused error code was 61 rather than 10061. With this patch,
> I'm no longer seeing any hangs in test_multiprocessing.py (at least, not in
> the last 500 runs :-)). (Though I am still seeing the incref error
> occasionally.)
>
> If anyone's prepared to answer a stupid question: I'm curious why failed
> socket connections occur at all. Is connecting to a socket generally
> considered an unreliable operation, or is there some aspect of the
> multiprocessing module that makes it potentially unreliable?

I believe the conn refused error is another race: the child processes are connecting to a manager which is shutting down/gone.

Thanks again Mark - when I get a chance today I'll review/test/apply the patch.
|
|||
| msg70803 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-06 20:54 |
> I believe the conn refused error is another race, the child processes
> are connecting to a manager which is shutting down/gone

After some research, I think it actually has to do with the value of the 'backlog' parameter when creating a Listener instance: if there are too many queued requests to a socket then further connection requests are refused; the 'backlog' parameter appears to set the maximum size of the queue. ...or something like that. I'm a novice here... :-)

Anyway, in the __init__ method of the Server class, in managers.py, the Server creates a Listener instance with the line

self.listener = Listener(address=address, backlog=5)

When I change backlog to 50 instead of 5, I don't see the connection refused error any more. I'm not suggesting that the backlog value should be changed---it probably shouldn't. But this does at least explain why connections can be refused, and why they should be retried in that case.

I do still think that the patch, or something like it, should be applied.
|
|||
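For context, the 'backlog' knob discussed above is exposed on multiprocessing.connection.Listener. A minimal, self-contained sketch of the parameter in use — the address and backlog value here are arbitrary examples, not taken from any patch in this thread:

```python
# Minimal sketch: 'backlog' bounds the OS queue of connections that have
# completed their handshake but have not yet been accept()ed.
# managers.Server used backlog=5, as quoted in the message above.
from multiprocessing.connection import Listener, Client

address = ('localhost', 6001)             # arbitrary example address
listener = Listener(address, backlog=50)  # larger pending-connection queue

client = Client(address)      # connect completes against the listen queue
server_end = listener.accept()
client.send('ping')
print(server_end.recv())      # -> 'ping'

client.close()
server_end.close()
listener.close()
```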
| msg70809 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-07 01:33 |
Here's what's going on with the incref error:
Look in the Server class, in managers.py: consider a shared object with
id 'id'. When a reference to the shared object is created, its id is
added to the id_to_refcount dictionary:
{id: None}
and *shortly afterwards*, the refcount is incremented to get:
{id: 1}
When the object is deleted or goes out of scope later on, the refcount
is decref'd, and id is removed entirely from id_to_refcount.
The problem occurs when there are two processes simultaneously asking
for access to the same object. If a second process happens to decref
'id' in between the first process creating the reference and
incrementing it, then the incref from the first process will fail. This
is exactly what's happening (some of the time) in test_remote in the
test_suite. The failing sequence of operations is:
initial state:           id not in id_to_refcount
(process 1): create  ->  id_to_refcount[id] is None
(process 1): incref  ->  id_to_refcount[id] == 1
(process 2): create  ->  id_to_refcount[id] == 1
(process 1): decref  ->  id not in id_to_refcount
(process 2): incref  ->  KeyError!
(process 2): decref  ->  (never get here...)
I'm not really familiar enough with the whole multiprocessing module to
know what the right fix for this is.
|
|||
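To make the failing schedule above concrete, here is a toy, single-process model of the same unguarded create/incref/decref logic. All names are invented for illustration; this is a simplified stand-in, not the multiprocessing code itself:

```python
# Toy model of the Server refcounting race: 'create' inserts an id before
# any reference is held, so an interleaved decref can remove the key and
# make the later incref fail with exactly the KeyError seen above.
id_to_refcount = {}

def create(ident):
    if ident not in id_to_refcount:
        id_to_refcount[ident] = 0   # stands in for the real code's None

def incref(ident):
    id_to_refcount[ident] += 1      # KeyError if the id was already removed

def decref(ident):
    id_to_refcount[ident] -= 1
    if id_to_refcount[ident] == 0:
        del id_to_refcount[ident]   # shared object disposed of

# The failing schedule from the message above:
create('id')    # process 1: {'id': 0}
incref('id')    # process 1: {'id': 1}
create('id')    # process 2: id already present, still {'id': 1}
decref('id')    # process 1: count hits 0, key deleted
try:
    incref('id')                    # process 2
except KeyError as exc:
    print('KeyError:', exc)         # same failure as in the tracebacks
```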
| msg70857 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-07 21:10 |
Hmmm. That last message wasn't too coherent; I really shouldn't try to post at 2:30am.

Summary: the refcounting logic in the Server class is flawed. In Server.create(), the initial refcount of a newly-created shared object is set to None. This is dangerous: the moment another thread gets a look-in, that refcount can be incremented to 1, then decremented to 0, at which point the shared object gets disposed of (by the code in Server.decref). And all this can happen before the Proxy object for the shared object gets initialized, at which point the KeyError occurs. (Not much better.)

Can anyone suggest a way to fix this? I can't see any easy fix. Nevertheless, I consider this a serious flaw that should be fixed before 2.6 and 3.0 are released.

I've attached a minimal file that produces the incref error (on about 1 run out of every 5) for me.
|
|||
| msg70859 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-07 21:26 |
Adding Richard Oudkerk to the nosy list in case he can shed any light on this. Should this be considered a release-blocker? |
|||
| msg70860 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-08-07 21:36 |
On Aug 7, 2008, at 5:26 PM, Mark Dickinson <report@bugs.python.org> wrote:
>
> Adding Richard Oudkerk to the nosy list in case he can shed any light on
> this.
>
> Should this be considered a release-blocker?

I believe Richard is on summer break - I've dropped him a line on the open issues I need his input on.

As for whether this should block the release: it should be resolved before final, but should not block the next beta.
|
|||
| msg71026 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-08-11 18:02 |
Hey Mark - I took a look at the mp_nohang patch you added - and you're right, the connection refused error numbers are different from platform to platform. So rather than use a static list, I switched the code to use the errno module. Can you apply this and confirm it works for you? |
|||
| msg71030 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-11 18:53 |
> So rather than use a static list, I switched the code to use the
> errno module.

Yup. That's definitely a better solution. Your patch fixed the problem for me. Thanks!
|
|||
| msg71031 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-08-11 19:00 |
Thanks Mark, I've checked in the connection refused fix in r65645 on trunk |
|||
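The committed diff itself isn't reproduced in this thread; as a rough sketch of the errno-based idea Jesse describes (an invented helper, not the actual r65645 change), the client side amounts to:

```python
# Sketch: retry when a connection is refused (e.g. the listener's backlog
# is momentarily full) instead of failing immediately. Using the symbolic
# errno.ECONNREFUSED avoids hard-coding the platform-specific numbers
# (61 on OS X/BSD, 10061 on Windows) that a static list would need.
import errno
import socket
import time

def connect_with_retry(address, retries=20, delay=0.01):
    for attempt in range(retries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect(address)
            return s
        except socket.error as e:
            s.close()
            if e.errno != errno.ECONNREFUSED or attempt == retries - 1:
                raise               # a different error, or out of retries
            time.sleep(delay)       # give the server time to drain its queue
```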
| msg71068 | Author: Ismail Donmez (donmez) * | Date: 2008-08-12 19:12 |
With trunk, when running test_multiprocessing in a tight loop I saw another problem (the output of the three child processes is interleaved):

test_multiprocessing
Process Process-61:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 229, in _bootstrap
Process Process-60:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 229, in _bootstrap
Process Process-62:
Traceback (most recent call last):
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 229, in _bootstrap
util._run_after_forkers()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 138, in _run_after_forkers
util._run_after_forkers()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 138, in _run_after_forkers
util._run_after_forkers()
File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 138, in _run_after_forkers
items = list(_afterfork_registry.items())
items = list(_afterfork_registry.items())
File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items
File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items
items = list(_afterfork_registry.items())
File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items
for key, wr in self.data.items():
RuntimeError: dictionary changed size during iteration
for key, wr in self.data.items():
RuntimeError: dictionary changed size during iteration
for key, wr in self.data.items():
RuntimeError: dictionary changed size during iteration

The original problem itself seems to be fixed, so cheers for that!
|
|||
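For the record, that RuntimeError is the generic symptom of a dict being mutated while it is being iterated; in the report above the mutation comes from the after-fork registry being updated concurrently, but a single-threaded toy version shows the same error:

```python
# Toy reproduction of "dictionary changed size during iteration".
d = {1: 'a', 2: 'b'}
try:
    for key in d:
        d[3] = 'c'    # inserting a key while iterating changes the size
except RuntimeError as exc:
    print(exc)        # -> dictionary changed size during iteration
```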
| msg71273 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-17 13:29 |
Here's a patch that fixes the incref problem for me. This *definitely* needs review from someone more familiar with the multiprocessing module than I am; I'm not at all confident that it won't break something else.

The original problem: the Server.create method creates shared objects with an effective reference count of 0, making them vulnerable to premature garbage collection.

The solution in the patch: create shared objects with a reference count of 1, and make it the responsibility of the caller of Server.create() to do a decref later (once a ProxyObject for the created shared object has been initialized). In effect, Server.create creates and returns a reference, and the caller then owns that reference and is responsible for disposing of it.
|
|||
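Schematically, the ownership protocol Mark describes looks like the toy model below (invented names, same simplified stand-in as the race sketch earlier; this is not the actual issue3419.patch):

```python
# Toy model of the fix: an object is born with one reference, owned by
# the caller of create(), so its count can never touch zero in the
# window before the proxy has taken its own reference.
id_to_refcount = {}

def create(ident):
    # Each create() hands the caller an owned reference.
    id_to_refcount[ident] = id_to_refcount.get(ident, 0) + 1

def incref(ident):
    id_to_refcount[ident] += 1

def decref(ident):
    id_to_refcount[ident] -= 1
    if id_to_refcount[ident] == 0:
        del id_to_refcount[ident]   # dispose of the shared object

# Proxy initialization under the new protocol:
create('id')    # server returns an owned reference:        {'id': 1}
incref('id')    # the proxy takes its own reference:        {'id': 2}
decref('id')    # the caller drops the creation reference:  {'id': 1}
# The schedule from msg70809 now keeps the count >= 1 throughout.
```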
| msg71275 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2008-08-17 13:42 |
[jnoller]
> As for if this should block the release, it should be resolved before
> final, but should not block the next beta.

So the priority should be 'deferred blocker', no?

[cartman]
> With trunk when running test_multiprocessing in a tight loop I saw
> another problem:

I think this really belongs in a separate issue. See issue 3578.
|
|||
| msg72358 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-09-02 18:15 |
It looks like Mark's patch was accidentally committed by Ben as part of r66115 (while reverting r66114). I can confirm this patch resolves the incref issue as-is. I've run test_multiprocessing in a loop for about an hour now with no failures.

Marking closed/fixed.
|
|||
| msg72361 | Author: Jesse Noller (jnoller) * (Python committer) | Date: 2008-09-02 19:12 |
r66161 on py3k merges the fix for this forward.
|
|||
History
| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:56:36 | admin | set | nosy: + barry; github: 47669 |
| 2008-09-02 19:12:48 | jnoller | set | messages: + msg72361 |
| 2008-09-02 18:15:48 | jnoller | set | status: open -> closed; resolution: fixed; messages: + msg72358 |
| 2008-08-21 19:55:49 | mark.dickinson | set | keywords: + needs review |
| 2008-08-21 03:06:09 | barry | set | priority: deferred blocker -> release blocker |
| 2008-08-17 13:42:42 | mark.dickinson | set | priority: deferred blocker; messages: + msg71275 |
| 2008-08-17 13:29:25 | mark.dickinson | set | files: + issue3419.patch; messages: + msg71273 |
| 2008-08-12 19:12:38 | donmez | set | messages: + msg71068 |
| 2008-08-11 19:00:58 | jnoller | set | files: - mp_nohang.patch |
| 2008-08-11 19:00:38 | jnoller | set | messages: + msg71031 |
| 2008-08-11 18:53:28 | mark.dickinson | set | messages: + msg71030 |
| 2008-08-11 18:02:53 | jnoller | set | files: + mp_nohang_jnoller.patch; messages: + msg71026 |
| 2008-08-07 21:36:07 | jnoller | set | messages: + msg70860 |
| 2008-08-07 21:26:09 | mark.dickinson | set | nosy: + roudkerk; messages: + msg70859 |
| 2008-08-07 21:10:56 | mark.dickinson | set | files: + multibug3.py; messages: + msg70857 |
| 2008-08-07 01:33:32 | mark.dickinson | set | messages: + msg70809 |
| 2008-08-06 20:54:12 | mark.dickinson | set | messages: + msg70803 |
| 2008-08-02 14:23:56 | jnoller | set | messages: + msg70632 |
| 2008-08-02 13:44:20 | mark.dickinson | set | files: + mp_nohang.patch; keywords: + patch; messages: + msg70631 |
| 2008-08-02 00:01:32 | jnoller | set | messages: + msg70609 |
| 2008-08-01 23:50:49 | mark.dickinson | set | messages: + msg70608 |
| 2008-08-01 22:32:19 | mark.dickinson | set | messages: + msg70603 |
| 2008-08-01 16:40:53 | mark.dickinson | set | nosy: + mark.dickinson; messages: + msg70570 |
| 2008-08-01 01:22:33 | jnoller | set | messages: + msg70539 |
| 2008-07-19 22:13:23 | jnoller | set | messages: + msg70060; versions: + Python 2.6 |
| 2008-07-19 21:01:31 | benjamin.peterson | set | assignee: jnoller; messages: + msg70056; nosy: + benjamin.peterson, jnoller |
| 2008-07-19 20:00:42 | donmez | create | |