When de-registering a workspace cluster, mark any leftover running instances as stopped/failed #12912
```diff
@@ -120,6 +120,7 @@ export class WorkspaceManagerBridge implements Disposable {
     }
 
     public stop() {
+        this.markAllRunningWorkspaceInstancesAsStopped();
         this.dispose();
     }
 
```
Review discussion on the new `this.markAllRunningWorkspaceInstancesAsStopped();` call in `stop()`:

I'm not sure I like this approach. So far, stopping a bridge was an operation that had absolutely no effect on workspaces. This made it a very cheap operation and allowed for great operational flexibility. E.g., if you had to fix up a DB entry, you could always remove/re-add an entry, with very limited downsides (a delay of workspace updates for a handful of seconds). Or, if you wanted a reconnect, you could remove and re-add a DB entry. Now, it has a very destructive side effect. 💭 I wonder what happens if we stop a […]

What do you think of having a periodic clean-up where we check and stop instances for which no ws-manager exists?

Or, a more general version: for all currently not-stopped instances, check them against ws-manager. This would catch a broader set of problems, but we already need to solve this problem anyway.

This is something we need anyway, but with different time constraints: this PR is about unblocking Team Workspace in specific cases. That is a separate PR, but the same issue.

That's a layer of abstraction above this PR/issue. tl;dr: we're already doing it; this is about the implementation details. Happy to provide context outside of this PR. 🙃

Many thanks for the great feedback here! 💯

But why remove/re-add an entry when you can […]

Can't you kill the […]

I wasn't sure, so I tested it: […]

My running workspace stayed alive and well all the time. Only when I ran […]

Sync feedback from @geropl: Marking all instances as stopped in this […]

On the one hand (i.e. "my" view), […] So this means that […]

On the other hand (i.e. @geropl's view), maybe we need to be clearer or more careful about the fact that calling […]

My personal conclusion here would be to leave the PR as is, and simply add very explicit comments to stress the fact that calling […]

What you describe makes sense to me; it just seems that the lifecycle you are attaching the effect of stopping all instances to is not well understood (I agree we should better understand it, and remove code that is not called in practice). Attaching the clean-up to the disconnect lifecycle also requires it to always be called, so that we don't leak/forget about workspaces. It seems more general and less complicated to clean up instances for which we don't know a manager on a regular schedule, instead of on disconnect events.
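The periodic clean-up the thread converges on could look roughly like the following. This is a minimal sketch only: `WorkspaceDB`, `ClusterDB`, `findAllNotStoppedInstances`, and `markStopped` are hypothetical placeholder interfaces invented for illustration, not the actual gitpod-io/gitpod APIs.

```typescript
// Sketch only: a periodic sweep that stops instances whose cluster has no
// registered ws-manager. All interfaces below are placeholders for illustration.
interface InstanceInfo {
    instanceId: string;
    workspaceId: string;
    clusterName: string;
}

interface WorkspaceDB {
    findAllNotStoppedInstances(): Promise<InstanceInfo[]>;
    markStopped(instanceId: string, stoppedAt: Date): Promise<void>;
}

interface ClusterDB {
    findRegisteredClusterNames(): Promise<string[]>;
}

// Stop every not-stopped instance whose cluster is no longer registered.
async function sweepOrphanedInstances(workspaceDB: WorkspaceDB, clusterDB: ClusterDB): Promise<void> {
    const knownClusters = new Set(await clusterDB.findRegisteredClusterNames());
    const instances = await workspaceDB.findAllNotStoppedInstances();
    const now = new Date();
    await Promise.all(
        instances
            .filter((i) => !knownClusters.has(i.clusterName))
            .map((i) => workspaceDB.markStopped(i.instanceId, now)),
    );
}

// Run the sweep on a fixed schedule, independent of bridge connect/disconnect events.
function startSweeper(
    workspaceDB: WorkspaceDB,
    clusterDB: ClusterDB,
    intervalMs = 60_000,
): ReturnType<typeof setInterval> {
    return setInterval(() => {
        sweepOrphanedInstances(workspaceDB, clusterDB).catch((err) =>
            console.error("periodic clean-up failed", err),
        );
    }, intervalMs);
}
```

Running the sweep on a fixed schedule keeps the clean-up independent of bridge connect/disconnect events, which is the property the last comment argues for.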
```diff
@@ -651,6 +652,35 @@ export class WorkspaceManagerBridge implements Disposable {
         }
     }
 
+    protected async markAllRunningWorkspaceInstancesAsStopped(): Promise<void> {
+        const span = TraceContext.startSpan("markAllRunningWorkspaceInstancesAsStopped");
+        try {
+            const now = new Date();
+            const runningInstances = await this.workspaceDB
+                .trace({ span })
+                .findRunningInstancesWithWorkspaces(this.cluster.name, undefined, true);
+            await Promise.all(
+                runningInstances.map(async (info) => {
+                    const logContext: LogContext = {
+                        userId: info.workspace.ownerId,
+                        workspaceId: info.workspace.id,
+                        instanceId: info.latestInstance.id,
+                    };
+                    log.error(logContext, "Marking still-running instance as stopped in database.", {
+                        creationTime: info.workspace.creationTime,
+                        currentPhase: info.latestInstance.status.phase,
+                    });
+                    await this.markWorkspaceInstanceAsStopped({ span }, info, now);
+                }),
+            );
+        } catch (err) {
+            TraceContext.setError({ span }, err);
+            throw err;
+        } finally {
+            span.finish();
+        }
+    }
+
     public dispose() {
         this.disposables.dispose();
     }
```
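For context on when `stop()` would run at all, here is a hedged sketch of one possible shape for the surrounding lifecycle: a controller that keeps one bridge per registered cluster and stops bridges whose DB entry disappears. `Bridge`, `ClusterRegistry`, `BridgeController`, and `createBridge` are illustrative names only; nothing here except the `stop()`/`dispose()` pair is taken from the diff above.

```typescript
// Sketch only: reconcile one bridge per registered cluster; illustrative names.
interface Bridge {
    stop(): void; // with this PR: marks leftover running instances as stopped, then disposes
    dispose(): void;
}

interface ClusterRegistry {
    findRegisteredClusterNames(): Promise<string[]>;
}

class BridgeController {
    private readonly bridges = new Map<string, Bridge>();

    constructor(
        private readonly clusters: ClusterRegistry,
        private readonly createBridge: (clusterName: string) => Bridge,
    ) {}

    // Called periodically or whenever the set of registered clusters changes.
    public async reconcile(): Promise<void> {
        const registered = new Set(await this.clusters.findRegisteredClusterNames());

        // Start a bridge for every newly registered cluster.
        for (const name of registered) {
            if (!this.bridges.has(name)) {
                this.bridges.set(name, this.createBridge(name));
            }
        }

        // Stop bridges whose cluster entry was de-registered; the destructive
        // side effect discussed in the thread happens here.
        for (const [name, bridge] of this.bridges) {
            if (!registered.has(name)) {
                bridge.stop();
                this.bridges.delete(name);
            }
        }
    }
}
```

With a shape like this, the cheap operational trick from the first review comment (remove and re-add a cluster DB entry) would now also trigger the destructive `stop()` path, which is exactly the trade-off the thread debates.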