Issue 5847059: Rewrite the scheduler, for simplicity, and better error handling.

Keyboard Shortcuts

File
u :	up to issue
m :	publish + mail comments
M :	edit review message
j / k :	jump to file after / before current file
J / K :	jump to next file with a comment after / before current file
Side-by-side diff
i :	toggle intra-line diffs
e :	expand all comments
c :	collapse all comments
s :	toggle showing all comments
n / p :	next / previous diff chunk or comment
N / P :	next / previous comment
<Up> / <Down> :	next / previous line
<Enter> :	respond to / edit current comment
d :	mark current comment as done

Issue
u :	up to list of issues
m :	publish + mail comments
j / k :	jump to patch after / before current patch
o / <Enter> :	open current patch in side-by-side view
i :	open current patch in unified diff view
Issue List
j / k :	jump to issue after / before current issue
o / <Enter> :	open current issue
# :	close issue
Comment/message editing
<Ctrl> + s or <Ctrl> + Enter :	save comment
<Esc> :	cancel edit

Rietveld Code Review Tool

Help | Bug tracker | Discussion group | Source code | Sign in

(75)

Issues Repositories Search

Open Issues | Closed Issues | All Issues | Sign in with your Google Account to create issues and add comments

Issue 5847059: Rewrite the scheduler, for simplicity, and better error handling.

Can't Edit
Can't Publish+Mail
Start Review

Created:
13 years, 9 months ago by hazmat

Modified:
13 years, 9 months ago

Reviewers:
fwereade , mp+98104

Visibility:
Public.

Rewrite the scheduler, for simplicity, and better error handling. During a recent code audit, a number of issues with the relation hook scheduler were discovered. Say we get notified of three membership events from the same child watch, two joins and a depart. if the first join hook errors, we need to keep both it and the subsequent membership events around. Previously on error we'd execute the next few hooks, regardless of the error, which is problematic. The error handler will stop the scheduler, but we'll still end up processing the events since the executioner doesn't check if its running till its done processing a clock. On error we need to stop processing immediately, we also need to keep all events for the around, including the failed one, to properly support retry (albeit still subject to reduction via additional events.) On a resolved --retry, we should continue the event processing stream where we left off (ie start execution against the failed hook.) The branch includes expanded unit test coverage of various failure scenarios, and concurrent activity scenarios, as well as two new implementations of the scheduler (one buried in the history which attempts to fix the previous impl). The latest schduler implementation, does away with the notion of clocks and instead uses a simple list (wrapped in a deferred queue) as the primary data structure. The additional member_versions, context_members data structures are used only at ingest/put time, the consumer relies soley on the queue. This simplifies the scheduler implementation, and avoided some issues with the previous implementation that had to bookeep across multiple data structures in the face of concurrent activity. The alias expansion of join to modified was also removed from the lifecycle and placed directly in the scheduler. This makes it more reliable in the face of an error of either join/modified hook, and subjects the additional synthetic event to reduction. https://code.launchpad.net/~hazmat/juju/scheduler-peek-list/+merge/98104 (do not edit description out of merge proposal)

Patch Set 1 #

Total comments: 14

Created: 13 years, 9 months ago

Download [raw] [tar.bz2]

Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+627 lines, -265 lines)			Patch
M	.bzrignore	View	1 chunk	+4 lines, -0 lines	0 comments	Download
A	[revision details]	View	1 chunk	+2 lines, -0 lines	0 comments	Download
M	juju/control/status.py	View	1 chunk	+2 lines, -1 line	0 comments	Download
M	juju/hooks/scheduler.py	View	4 chunks	+239 lines, -173 lines	8 comments	Download
M	juju/hooks/tests/test_scheduler.py	View	24 chunks	+363 lines, -77 lines	1 comment	Download
M	juju/unit/lifecycle.py	View	2 chunks	+4 lines, -10 lines	1 comment	Download
M	juju/unit/tests/test_lifecycle.py	View	6 chunks	+10 lines, -3 lines	4 comments	Download
M	juju/unit/tests/test_workflow.py	View	2 chunks	+2 lines, -1 line	0 comments	Download
M	juju/unit/workflow.py	View	1 chunk	+1 line, -0 lines	0 comments	Download

Total messages: 3

hazmat

Please take a look.

13 years, 9 months ago (2012年03月18日 17:17:05 UTC) #1

Please take a look.

fwereade

LGTM, very nice; but it could use clearer docs/comments in one or two places. https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py ...

13 years, 9 months ago (2012年03月22日 10:13:16 UTC) #2

LGTM, very nice; but it could use clearer docs/comments in one or two places.
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py
File juju/hooks/scheduler.py (right):
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode35
juju/hooks/scheduler.py:35: def put(self, change, offset=1):
It took me a while to figure out the connection between offset and ._running; it
would be nice to document this a little better.
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode37
juju/hooks/scheduler.py:37: LIFO except for a None value which is FIFO
other way round, surely?
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode51
juju/hooks/scheduler.py:51: # XXX / Needed.
I'm really not at all clear what's going on here.
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode315
juju/hooks/scheduler.py:315: #assert 0, "when does this happen"
Hmm, I'm not sure; did you ever hit that assert 0 when it was active?
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode331
juju/hooks/scheduler.py:331: #if unit_name in self._context_members:
Commented
https://codereview.appspot.com/5847059/diff/1/juju/hooks/tests/test_scheduler.py
File juju/hooks/tests/test_scheduler.py (right):
https://codereview.appspot.com/5847059/diff/1/juju/hooks/tests/test_scheduler...
juju/hooks/tests/test_scheduler.py:30: "hook.scheduler", level=logging.DEBUG) #
log_file=sys.stdout)
commented
https://codereview.appspot.com/5847059/diff/1/juju/unit/lifecycle.py
File juju/unit/lifecycle.py (left):
https://codereview.appspot.com/5847059/diff/1/juju/unit/lifecycle.py#oldcode672
juju/unit/lifecycle.py:672: break
nice :)
https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.py
File juju/unit/tests/test_lifecycle.py (right):
https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle....
juju/unit/tests/test_lifecycle.py:519: #self.capture_output()
commented code
https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle....
juju/unit/tests/test_lifecycle.py:534: #"app-relation-changed",
commented code
https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle....
juju/unit/tests/test_lifecycle.py:930: self.capture_logging("charm.upgrade")
unused?

hazmat

Thanks for the review! https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py File juju/hooks/scheduler.py (right): https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode37 juju/hooks/scheduler.py:37: LIFO except for a None ...

13 years, 9 months ago (2012年03月22日 13:45:50 UTC) #3

Thanks for the review!
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py
File juju/hooks/scheduler.py (right):
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode37
juju/hooks/scheduler.py:37: LIFO except for a None value which is FIFO
On 2012年03月22日 10:13:16, fwereade wrote:
> other way round, surely?
indeed it is.
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode51
juju/hooks/scheduler.py:51: # XXX / Needed.
On 2012年03月22日 10:13:16, fwereade wrote:
> I'm really not at all clear what's going on here.
change is None is the stop marker. if there's no one waiting, then the queue is
currently being processed, and the stop will be detected at next iteration. ie.
the stop value isn't needed because we're not sleeping on input.
https://codereview.appspot.com/5847059/diff/1/juju/hooks/scheduler.py#newcode315
juju/hooks/scheduler.py:315: #assert 0, "when does this happen"
On 2012年03月22日 10:13:16, fwereade wrote:
> Hmm, I'm not sure; did you ever hit that assert 0 when it was active?
I did on tests that skip time (unit disconnected for a while), its definitely
valid, i'll remove the assert.
https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle.py
File juju/unit/tests/test_lifecycle.py (right):
https://codereview.appspot.com/5847059/diff/1/juju/unit/tests/test_lifecycle....
juju/unit/tests/test_lifecycle.py:930: self.capture_logging("charm.upgrade")
On 2012年03月22日 10:13:16, fwereade wrote:
> unused?
if run in isolation without the test gets one of those no handler attached
messages, the capture is merely to avoid that.