When de-registering a workspace cluster, mark any leftover running instances as stopped/failed #12912
```diff
@@ -120,6 +120,7 @@ export class WorkspaceManagerBridge implements Disposable {
     }
 
     public stop() {
+        this.markAllRunningWorkspaceInstancesAsStopped();
         this.dispose();
     }
 
```
Review discussion on the new `this.markAllRunningWorkspaceInstancesAsStopped();` call in `stop()`:

I'm not sure I like this approach. So far, stopping a bridge was an operation that had absolutely no effect on workspaces. This made it a very cheap operation and allowed for great operational flexibility. E.g., if you had to fix up a DB entry, you could always remove/re-add an entry, with very limited downsides (a delay of workspace updates for a handful of seconds). Or, if you wanted a reconnect, you could remove and re-add a DB entry. Now, it has a very destructive side effect. 💭 I wonder what happens if we stop a […]

What do you think of having a periodic clean-up where we check and stop instances for which no ws-manager exists?

Or, a more general version: for all currently not-stopped instances, check them against ws-manager. This would catch a broader set of problems, but we already need to solve this problem anyway.

This is something we need anyway, but with different time constraints: this PR is about unblocking Team Workspace in specific cases. That is a separate PR, but the same issue.

That's a layer of abstraction above this PR/issue. tl;dr: we're already doing it; this is about the implementation details. Happy to provide context outside of this PR. 🙃

Many thanks for the great feedback here! 💯

But why remove/re-add an entry when you can […]

Can't you kill the […]

I wasn't sure, so I tested it: […]

My running workspace stayed alive and well all the time. Only when I ran […]

Sync feedback from @geropl: Marking all instances as stopped in this […]

On the one hand (i.e. "my" view), […] So this means that […]

On the other hand (i.e. @geropl's view), maybe we need to be clearer or more careful about the fact that calling […]

My personal conclusion here would be to leave the PR as is, and simply add very explicit comments to stress the fact that calling […]

What you describe makes sense to me; it just seems that the lifecycle you are attaching the effect of stopping all instances to is not well understood (I agree we should better understand it, and remove code that is not called in practice). Attaching the clean-up to the disconnect lifecycle also requires it to always be called, so that we don't leak/forget about workspaces. It seems more general and less complicated to clean up instances for which we don't know a manager on a regular schedule, instead of on disconnect events.
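The periodic clean-up the thread converges on could look roughly like the following. This is a minimal sketch only: `WorkspaceDB`, `ClusterDB`, `findAllNotStoppedInstances`, and `markStopped` are hypothetical placeholder interfaces invented for illustration, not the actual gitpod-io/gitpod APIs.

```typescript
// Sketch only: a periodic sweep that stops instances whose cluster has no
// registered ws-manager. All interfaces below are placeholders for illustration.
interface InstanceInfo {
    instanceId: string;
    workspaceId: string;
    clusterName: string;
}

interface WorkspaceDB {
    findAllNotStoppedInstances(): Promise<InstanceInfo[]>;
    markStopped(instanceId: string, stoppedAt: Date): Promise<void>;
}

interface ClusterDB {
    findRegisteredClusterNames(): Promise<string[]>;
}

// Stop every not-stopped instance whose cluster is no longer registered.
async function sweepOrphanedInstances(workspaceDB: WorkspaceDB, clusterDB: ClusterDB): Promise<void> {
    const knownClusters = new Set(await clusterDB.findRegisteredClusterNames());
    const instances = await workspaceDB.findAllNotStoppedInstances();
    const now = new Date();
    await Promise.all(
        instances
            .filter((i) => !knownClusters.has(i.clusterName))
            .map((i) => workspaceDB.markStopped(i.instanceId, now)),
    );
}

// Run the sweep on a fixed schedule, independent of bridge connect/disconnect events.
function startSweeper(
    workspaceDB: WorkspaceDB,
    clusterDB: ClusterDB,
    intervalMs = 60_000,
): ReturnType<typeof setInterval> {
    return setInterval(() => {
        sweepOrphanedInstances(workspaceDB, clusterDB).catch((err) =>
            console.error("periodic clean-up failed", err),
        );
    }, intervalMs);
}
```

Running the sweep on a fixed schedule keeps the clean-up independent of bridge connect/disconnect events, which is the property the last comment argues for.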
```diff
@@ -651,6 +652,35 @@ export class WorkspaceManagerBridge implements Disposable {
         }
     }
 
+    protected async markAllRunningWorkspaceInstancesAsStopped(): Promise<void> {
+        const span = TraceContext.startSpan("markAllRunningWorkspaceInstancesAsStopped");
+        try {
+            const now = new Date();
+            const runningInstances = await this.workspaceDB
+                .trace({ span })
+                .findRunningInstancesWithWorkspaces(this.cluster.name, undefined, true);
+            await Promise.all(
+                runningInstances.map(async (info) => {
+                    const logContext: LogContext = {
+                        userId: info.workspace.ownerId,
+                        workspaceId: info.workspace.id,
+                        instanceId: info.latestInstance.id,
+                    };
+                    log.error(logContext, "Marking still-running instance as stopped in database.", {
+                        creationTime: info.workspace.creationTime,
+                        currentPhase: info.latestInstance.status.phase,
+                    });
+                    await this.markWorkspaceInstanceAsStopped({ span }, info, now);
+                }),
+            );
+        } catch (err) {
+            TraceContext.setError({ span }, err);
+            throw err;
+        } finally {
+            span.finish();
+        }
+    }
+
     public dispose() {
         this.disposables.dispose();
     }
```
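For context on when `stop()` would run at all, here is a hedged sketch of one possible shape for the surrounding lifecycle: a controller that keeps one bridge per registered cluster and stops bridges whose DB entry disappears. `Bridge`, `ClusterRegistry`, `BridgeController`, and `createBridge` are illustrative names only; nothing here except the `stop()`/`dispose()` pair is taken from the diff above.

```typescript
// Sketch only: reconcile one bridge per registered cluster; illustrative names.
interface Bridge {
    stop(): void; // with this PR: marks leftover running instances as stopped, then disposes
    dispose(): void;
}

interface ClusterRegistry {
    findRegisteredClusterNames(): Promise<string[]>;
}

class BridgeController {
    private readonly bridges = new Map<string, Bridge>();

    constructor(
        private readonly clusters: ClusterRegistry,
        private readonly createBridge: (clusterName: string) => Bridge,
    ) {}

    // Called periodically or whenever the set of registered clusters changes.
    public async reconcile(): Promise<void> {
        const registered = new Set(await this.clusters.findRegisteredClusterNames());

        // Start a bridge for every newly registered cluster.
        for (const name of registered) {
            if (!this.bridges.has(name)) {
                this.bridges.set(name, this.createBridge(name));
            }
        }

        // Stop bridges whose cluster entry was de-registered; the destructive
        // side effect discussed in the thread happens here.
        for (const [name, bridge] of this.bridges) {
            if (!registered.has(name)) {
                bridge.stop();
                this.bridges.delete(name);
            }
        }
    }
}
```

With a shape like this, the cheap operational trick from the first review comment (remove and re-add a cluster DB entry) would now also trigger the destructive `stop()` path, which is exactly the trade-off the thread debates.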