TQ: Want emergency API for stuck trust quorum reconfiguration #9826

Open
Assignees
@andrewjstone

Description

IMPORTANT: This is complicated, and we can totally change how the parameters work in the future. But this is how the protocol works now.

Trust quorum is designed to tolerate some number of failed nodes and still complete reconfiguration.
There are a few parameters, deterministically computed from the size of the rack cluster, that allow the trust quorum to decide when it has received enough prepare acknowledgements to record the configuration as "committing" in Nexus. At that point a cold boot would still allow the reconfiguration to complete and rack unlock to occur in the new configuration.

The parameters are as follows:

  • N is the total number of sleds in a rack.
  • K is the number of sleds required to recompute the rack secret.
  • Z is the commit_crash_tolerance parameter: a safety factor such that once K+Z sleds have prepared, we record the configuration as committing (see the sketch just after this list).
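
A minimal sketch of that commit decision, assuming hypothetical type and field names; how K and Z are actually derived from rack size is defined in the trust quorum and Nexus code and may differ from this:

```rust
/// Hypothetical parameters for a single trust quorum configuration.
/// These names are only illustrative.
struct TqParams {
    n: usize, // total number of sleds in the rack
    k: usize, // sleds required to recompute the rack secret
    z: usize, // commit_crash_tolerance safety factor
}

impl TqParams {
    /// Nexus records the configuration as "committing" once at least
    /// K + Z sleds have acked their prepares.
    fn can_record_committing(&self, acked_prepares: usize) -> bool {
        acked_prepares >= self.k + self.z
    }
}
```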

This means that if up to Z of the K+Z acked sleds permanently fail, we can still unlock the rack. Additionally, there are another X = N - (K+Z) sleds that we can treat as "offline". Those are capable of learning their shares even if Z of the original sleds die. As each of them learns its share, one permanently dead sled of the original group no longer counts against Z; in essence we've healed. So we optimize for committing quickly and safely, and then gain even more safety as offline sleds come back online.
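
A sketch of the counting behind that healing behaviour, again with invented names and treating it purely as arithmetic:

```rust
/// X sleds were not needed to reach the K + Z commit threshold and can be
/// treated as "offline" at commit time. Illustrative only.
fn offline_sleds(n: usize, k: usize, z: usize) -> usize {
    n - (k + z)
}

/// How many of the originally acked sleds can still fail permanently while
/// preserving the ability to unlock the rack. Each offline sled that later
/// learns its key share "heals" one permanent loss from the original group.
fn remaining_fault_budget(z: usize, dead_acked: usize, shares_learned: usize) -> usize {
    (z + shares_learned).saturating_sub(dead_acked)
}
```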

However, there is a background task in Nexus that continually tries to inform the sleds that have not yet acked their commits, even though they have prepared and hold a key share. Nexus stops trying once all sleds have acked their commits, but in the case of a permanent failure that will never happen. No worries though! As implemented, if enough sleds have acked (with up to Z sleds remaining unacked), an operator can expunge the failed sleds (one at a time) in a new trust quorum reconfiguration. Doing so commits the old configuration in a state called CommittedPartially. In the common case of one sled permanently failing, it can simply be expunged. Then, in the new configuration, all sleds will eventually ack, the Nexus background task will go idle, and the state of that configuration will be Committed.
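
Roughly how that stopping condition behaves, as a hedged sketch rather than the actual structure of the Nexus task:

```rust
use std::collections::BTreeSet;

/// Placeholder; the real identifier is a typed UUID.
type SledId = u32;

/// Hypothetical view of a configuration as tracked by the Nexus
/// background task.
struct ConfigStatus {
    members: Vec<SledId>,
    acked_commits: BTreeSet<SledId>,
}

impl ConfigStatus {
    /// Sleds the task still needs to notify on each pass.
    fn unacked(&self) -> Vec<SledId> {
        self.members
            .iter()
            .copied()
            .filter(|s| !self.acked_commits.contains(s))
            .collect()
    }

    /// The task only goes idle once every member has acked its commit;
    /// with a permanently failed member this never becomes true.
    fn is_idle(&self) -> bool {
        self.unacked().is_empty()
    }
}
```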

However, if more than Z sleds permanently fail (Z=2 in a 16-sled cluster, 3 in larger clusters), we will not be able to commit the prior configuration. It will remain in the Committing state in Nexus. Now, in general, this is a good thing: we shouldn't have large numbers of sleds permanently failing simultaneously, as that can lead to data loss and all sorts of other problems, and we don't want those sleds expunged prematurely or by accident.
We'd like to be able to recover some of them if possible and have them commit, even if the next thing we do is copy all the data off and expunge them.
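
For orientation, a hypothetical rendering of the configuration states named above; the actual representation in Nexus may use different names or carry more data:

```rust
/// Hypothetical rendering of the configuration lifecycle described above.
enum TqConfigState {
    /// At least K + Z prepare acks were received and the commit decision
    /// was recorded, but not all members have acked their commits yet.
    Committing,
    /// Superseded by a reconfiguration that expunged members which never
    /// acked (at most Z of them).
    CommittedPartially,
    /// Every member acked its commit; the background task is idle.
    Committed,
}
```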

However, if more than Z sleds are truly unrecoverable, and the failure happens between the prepare and commit phases (a window of typically milliseconds), we can end up in a stuck state: we'd have to manually intervene and fix the database to allow those sleds to be expunged. Note also that if enough sleds are offline during the prepare phase, this is not an issue; we can immediately abort the configuration.
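
Putting the cases together, a hedged classification of how a reconfiguration can end up, using the hypothetical parameters from the sketches above:

```rust
/// Hypothetical classification based on the description in this issue;
/// not the actual protocol logic.
enum Outcome {
    /// The K + Z prepare threshold was never reached; the configuration
    /// can simply be aborted.
    Abortable,
    /// Committing, and at most Z acked sleds are permanently dead: the
    /// dead sleds can be expunged one at a time via a new reconfiguration.
    RecoverableByExpunge,
    /// Committing, but more than Z acked sleds are permanently dead:
    /// stuck without manual intervention (the subject of this issue).
    Stuck,
}

fn classify(k: usize, z: usize, acked_prepares: usize, dead_acked: usize) -> Outcome {
    if acked_prepares < k + z {
        Outcome::Abortable
    } else if dead_acked <= z {
        Outcome::RecoverableByExpunge
    } else {
        Outcome::Stuck
    }
}
```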

Rather than having to mess with the database manually in this emergency stuck state, we should provide support with a mechanism (omdb?) to unstick the trust quorum and expunge the bad sleds.
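
Purely as a strawman for discussion (none of these names exist today), the emergency operation might carry something like the following, whether it ends up surfaced through omdb or an internal Nexus API:

```rust
/// Placeholder; the real identifier is a typed UUID.
type SledId = u32;

/// Invented strawman for the emergency operation this issue asks for.
/// It would let an operator mark a stuck configuration as partially
/// committed and expunge more than Z unrecoverable sleds, bypassing the
/// normal one-at-a-time reconfiguration path.
struct EmergencyUnstickRequest {
    /// The stuck trust quorum configuration, identified by epoch.
    epoch: u64,
    /// Sleds the operator asserts are permanently unrecoverable.
    expunge: Vec<SledId>,
    /// Force the operator to acknowledge that data on the expunged
    /// sleds may be lost.
    confirm_data_loss: bool,
}
```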
