|
|
|
Rewrite the scheduler, for simplicity, and better error handling.
During a recent code audit, a number of issues with the relation hook scheduler
were discovered.
Say we get notified of three membership events from the same
child watch, two joins and a depart. if the first join hook
errors, we need to keep both it and the subsequent membership
events around.
Previously on error we'd execute the next few hooks,
regardless of the error, which is problematic. The error handler will stop
the scheduler, but we'll still end up processing the events since the
executioner doesn't check if its running till its done processing a clock.
On error we need to stop processing immediately, we also need
to keep all events for the around, including the failed
one, to properly support retry (albeit still subject to reduction via
additional events.)
On a resolved --retry, we should continue the event processing stream where
we left off (ie start execution against the failed hook.)
The branch includes expanded unit test coverage of various failure scenarios,
and concurrent activity scenarios, as well as two new implementations of the
scheduler (one buried in the history which attempts to fix the previous impl).
The latest schduler implementation, does away with the notion of clocks and
instead uses a simple list (wrapped in a deferred queue)
as the primary data structure. The additional member_versions, context_members
data structures are used only at ingest/put time, the consumer relies soley on
the queue. This simplifies the scheduler implementation, and avoided some
issues with the previous implementation that had to bookeep across multiple data
structures in the face of concurrent activity.
The alias expansion of join to modified was also removed from the lifecycle
and placed directly in the scheduler. This makes it more reliable in the face of an
error of either join/modified hook, and subjects the additional synthetic event
to reduction.
https://code.launchpad.net/~hazmat/juju/scheduler-peek-list/+merge/98104
(do not edit description out of merge proposal)
Patch Set 1 #
Total comments: 14
Total messages: 3
|
hazmat
Please take a look.
|
13 years, 9 months ago (2012年03月18日 17:17:05 UTC) #1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Please take a look.
LGTM, very nice; but it could use clearer docs/comments in one or two places. https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py File juju/hooks/scheduler.py (right): https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode35 juju/hooks/scheduler.py:35: def put(self, change, offset=1): It took me a while to figure out the connection between offset and ._running; it would be nice to document this a little better. https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode37 juju/hooks/scheduler.py:37: LIFO except for a None value which is FIFO other way round, surely? https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode51 juju/hooks/scheduler.py:51: # XXX / Needed. I'm really not at all clear what's going on here. https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode315 juju/hooks/scheduler.py:315: #assert 0, "when does this happen" Hmm, I'm not sure; did you ever hit that assert 0 when it was active? https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode331 juju/hooks/scheduler.py:331: #if unit_name in self._context_members: Commented https://codereview.appspot.com/5847059/diff/1/juju/hooks/tests/test_scheduler.py File juju/hooks/tests/test_scheduler.py (right): https://codereview.appspot.com/5847059/diff/1/juju/hooks/tests/test_scheduler... juju/hooks/tests/test_scheduler.py:30: "hook.scheduler", level=logging.DEBUG) # log_file=sys.stdout) commented https://codereview.appspot.com/5847059/diff/1/juju/unit/lifecycle.py File juju/unit/lifecycle.py (left): https://codereview.appspot.com/5847059/diff/1/juju/unit/lifecycle.py#oldcode672 juju/unit/lifecycle.py:672: break nice :) https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.py File juju/unit/tests/test_lifecycle.py (right): https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.... juju/unit/tests/test_lifecycle.py:519: #self.capture_output() commented code https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.... juju/unit/tests/test_lifecycle.py:534: #"app-relation-changed", commented code https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.... juju/unit/tests/test_lifecycle.py:930: self.capture_logging("charm.upgrade") unused?
Thanks for the review! https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py File juju/hooks/scheduler.py (right): https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode37 juju/hooks/scheduler.py:37: LIFO except for a None value which is FIFO On 2012年03月22日 10:13:16, fwereade wrote: > other way round, surely? indeed it is. https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode51 juju/hooks/scheduler.py:51: # XXX / Needed. On 2012年03月22日 10:13:16, fwereade wrote: > I'm really not at all clear what's going on here. change is None is the stop marker. if there's no one waiting, then the queue is currently being processed, and the stop will be detected at next iteration. ie. the stop value isn't needed because we're not sleeping on input. https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode315 juju/hooks/scheduler.py:315: #assert 0, "when does this happen" On 2012年03月22日 10:13:16, fwereade wrote: > Hmm, I'm not sure; did you ever hit that assert 0 when it was active? I did on tests that skip time (unit disconnected for a while), its definitely valid, i'll remove the assert. https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.py File juju/unit/tests/test_lifecycle.py (right): https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.... juju/unit/tests/test_lifecycle.py:930: self.capture_logging("charm.upgrade") On 2012年03月22日 10:13:16, fwereade wrote: > unused? if run in isolation without the test gets one of those no handler attached messages, the capture is merely to avoid that.