Helix fails to connect with Kerberos enabled ZK #3102

Open
arshadmohammad wants to merge 1 commit into apache:master from arshadmohammad:zk_sasl_master

arshadmohammad commented Feb 21, 2026

Issues

Description

Refer to #3101 for details on the issue.

Tests

  • Verified the changes through unit test cases
  • Verified the change through the Quickstart sample app:
    I enabled ZooKeeper Kerberos authentication in the Quickstart sample app and verified the fix.
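
The description does not say how Kerberos was enabled for the Quickstart run. One common way to turn on SASL for a ZooKeeper client is through JVM system properties set before the client is created. The sketch below is only an illustration of that approach: the class name and the JAAS file path are placeholders, and the JAAS file itself (with a `Client` section) is assumed to exist.

```java
/** Illustrative helper; the class name and the JAAS path are placeholders, not part of this PR. */
final class ZkSaslClientSetupSketch {

  /** Sets the system properties a ZooKeeper client reads when deciding whether to use SASL. */
  static void configure(String jaasConfigPath) {
    // Standard JVM property pointing at the JAAS login configuration file.
    System.setProperty("java.security.auth.login.config", jaasConfigPath);
    // ZooKeeper client properties: enable SASL and name the JAAS section to use.
    System.setProperty("zookeeper.sasl.client", "true");
    System.setProperty("zookeeper.sasl.clientconfig", "Client");
  }

  public static void main(String[] args) {
    // Placeholder path; must be called before any ZooKeeper client is constructed.
    configure("/path/to/jaas.conf");
    System.out.println("JAAS section: " + System.getProperty("zookeeper.sasl.clientconfig"));
  }
}
```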

Quickstart Output Before Fix:
Creating cluster: HELIX_QUICKSTART
Adding 2 participants to the cluster
Added participant: localhost_12000
Added participant: localhost_12001
Configuring StateModel: MyStateModel with 1 Leader and 1 Standby
Adding a resource MyResource: with 6 partitions and 2 replicas
Starting Participants
ERROR ZKHelixManager zkClient is not connected after waiting 10000ms., clusterName: HELIX_QUICKSTART, zkAddress: sl73tskrapd1044.visa.com:2181
ERROR ZKHelixManager fail to createClient. retry 1
org.apache.helix.HelixException: HelixManager is not connected within retry timeout for cluster HELIX_QUICKSTART
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:417)
at org.apache.helix.manager.zk.ZKHelixManager.getConfigAccessor(ZKHelixManager.java:688)
at org.apache.helix.manager.zk.ParticipantManager.<init>(ParticipantManager.java:118)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1441)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1391)
at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:783)
at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:818)
at org.apache.helix.examples.Quickstart$MyProcess.start(Quickstart.java:247)
at org.apache.helix.examples.Quickstart.startNodes(Quickstart.java:146)
at org.apache.helix.examples.Quickstart.main(Quickstart.java:164)
ERROR ZKHelixManager fail to createClient. retry 2
org.apache.helix.zookeeper.zkclient.exception.ZkTimeoutException: Waiting to be connected to ZK server has timed out.
at org.apache.helix.zookeeper.zkclient.ZkClient.waitForEstablishedSession(ZkClient.java:2082)
at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:776)
at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:818)
at org.apache.helix.examples.Quickstart$MyProcess.start(Quickstart.java:247)
at org.apache.helix.examples.Quickstart.startNodes(Quickstart.java:146)
at org.apache.helix.examples.Quickstart.main(Quickstart.java:164)

Quickstart Output After Fix:

Creating cluster: HELIX_QUICKSTART
Adding 2 participants to the cluster
Added participant: localhost_12000
Added participant: localhost_12001
Configuring StateModel: MyStateModel with 1 Leader and 1 Standby
Adding a resource MyResource: with 6 partitions and 2 replicas
Starting Participants
Started Participant: localhost_12000
Started Participant: localhost_12001
Starting Helix Controller
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_1
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_4
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_3
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_5

  • The following tests are written for this issue:

org.apache.helix.zookeeper.impl.client.TestRawZkClient
#testWaitForKeeperStateWithSaslAuthenticated
#testWaitForKeeperStateWithConnectedReadOnly
#testWaitForKeeperStateWithOtherStates
#testWaitForKeeperStateExactMatchStillWorks
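
Judging only from the test names and the code excerpt under review in this PR, the four cases appear to cover one rule: a wait for SyncConnected is also satisfied by SaslAuthenticated or ConnectedReadOnly, while any other wait still needs an exact match. The sketch below restates that rule in a self-contained form. It is not the Helix ZkClient code; the class, the method, and the local `State` enum (a stand-in for ZooKeeper's `KeeperState`) are assumptions made so the example runs without ZooKeeper on the classpath.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative stand-in for the state check; not the actual Helix implementation. */
final class KeeperStateMatchSketch {

  /** Local mirror of a few KeeperState values, used only for this sketch. */
  enum State { Disconnected, SyncConnected, AuthFailed, ConnectedReadOnly, SaslAuthenticated, Expired }

  /** States accepted in place of SyncConnected, following the excerpt under review. */
  private static final Set<State> CONNECTED_EQUIVALENTS =
      EnumSet.of(State.SaslAuthenticated, State.ConnectedReadOnly);

  /** Returns true if the current state satisfies a wait for the requested state. */
  static boolean matches(State waitingFor, State current) {
    if (waitingFor == current) {
      return true; // exact match keeps working as before
    }
    // Only a wait for SyncConnected is relaxed; every other wait needs an exact match.
    return waitingFor == State.SyncConnected && CONNECTED_EQUIVALENTS.contains(current);
  }

  public static void main(String[] args) {
    System.out.println(matches(State.SyncConnected, State.SaslAuthenticated)); // true
    System.out.println(matches(State.SyncConnected, State.ConnectedReadOnly)); // true
    System.out.println(matches(State.SyncConnected, State.Disconnected));      // false
    System.out.println(matches(State.SyncConnected, State.SyncConnected));     // true
  }
}
```

Note that the relaxation is one-directional in this sketch: a wait for SaslAuthenticated is not satisfied by SyncConnected.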

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

arshadmohammad (Author) commented:

@junkaixue, could you please review this PR?

Comment on lines 2107 to 2113
    // as they represent valid connected states
    if (keeperState == KeeperState.SyncConnected && (
        _currentState == KeeperState.SaslAuthenticated
            || _currentState == KeeperState.ConnectedReadOnly)) {
      LOG.debug("zkclient {} Accepting state {} as equivalent to SyncConnected", _uid,
          _currentState);
      return true;
junkaixue (Contributor) commented Feb 24, 2026 (edited)

Thanks for the contribution.

Some edge cases:
- What if both SaslAuthenticated and ConnectedReadOnly could be relevant? The current logic handles SyncConnected waiting only, which is the main use case.
- The fix is specific to waiting for SyncConnected, which is appropriate since that's the primary "connected" state.

I would suggest making this configurable so that, at the least, it does not affect the original logic. I am not quite sure whether this is the expected logic, but you could check the native Kerberos-enabled flag, or add a Java config such as ENABLE_KERBERIOS_FOR_ZK set to true, and apply this logic only then. At least that would be backward compatible and carry less risk for existing logic.
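
One way the opt-in suggested here could look is a check that relaxes the SyncConnected wait only when a flag is set, so the default behavior stays an exact match. This is a rough sketch under stated assumptions, not the updated fix: the property name `helix.zk.acceptSaslAuthenticatedAsConnected`, the class, and the local `State` enum are all hypothetical and chosen only for illustration.

```java
/** Illustrative sketch of a flag-gated state check; names are hypothetical, not Helix APIs. */
final class SaslStateGateSketch {

  // Hypothetical system property name; not an existing Helix or ZooKeeper setting.
  static final String FLAG = "helix.zk.acceptSaslAuthenticatedAsConnected";

  /** Local stand-in for a few ZooKeeper KeeperState values. */
  enum State { Disconnected, SyncConnected, ConnectedReadOnly, SaslAuthenticated }

  /** Reads the opt-in flag; false unless the property is set to "true". */
  static boolean relaxedFromSystemProperty() {
    return Boolean.getBoolean(FLAG);
  }

  /** With relaxed == false this is a plain exact-match check, i.e. the original behavior. */
  static boolean matches(State waitingFor, State current, boolean relaxed) {
    if (waitingFor == current) {
      return true;
    }
    return relaxed
        && waitingFor == State.SyncConnected
        && (current == State.SaslAuthenticated || current == State.ConnectedReadOnly);
  }

  public static void main(String[] args) {
    boolean relaxed = relaxedFromSystemProperty();
    System.out.println("relaxed=" + relaxed + ", match="
        + matches(State.SyncConnected, State.SaslAuthenticated, relaxed));
  }
}
```

Running with `-Dhelix.zk.acceptSaslAuthenticatedAsConnected=true` would enable the relaxed matching in this sketch; without the property, existing callers see no change.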

arshadmohammad (Author) commented Feb 25, 2026

Thank you for your suggestions. I will update the fix.


Reviewers

junkaixue left review comments

Development

Successfully merging this pull request may close these issues.

Helix fails to connect with Kerberos enabled ZK
