Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

KAFKA-19925: Fix transaction timeout handling during broker upgrades #21161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
FrancisGodinho wants to merge 2 commits into apache:trunk
base: trunk
Choose a base branch
Loading
from FrancisGodinho:KAFKA-19925-transactions-upgrade-test

Conversation

@FrancisGodinho
Copy link

@FrancisGodinho FrancisGodinho commented Dec 16, 2025

Problem

During broker upgrades, the sendOffsetsToTransaction call would sometimes hang. Logs showed that it continuously returned errorCode=51 which is CONCURRENT_TRANSACTION. The test would eventually hit its timeout and fail. This happened for every single version upgrade and occurred in around 30% of the runs.

Resolution

The problem above left the producer in a broken state and even after 5-10 minutes of waiting, it didn't resolve itself (even if we waited a few minutes past the transaction.max.ms time). I tried multiple solutions including waiting extended periods of time and re-trying the sendOffsetsToTransaction multiple times whenever timeout occurred.

Unfortunately, the producer was just permanently stuck and always receiving the errorCode=51. In this case, the recommended resolution in the Kafka docs is to close the previous producer and create a new producer. https://kafka.apache.org/documentation/#usingtransactions
image

Using the old transaction.id would continue to lead to a stuck state, so this fix creates a brand new producer with a new ID and then rewinds the consumer offset to ensure EOD.

Testing and Validation

Previously, I was able to run the test for a single version upgrade and have it fail within the first 5-10 runs. After the fix, I was able to run it 40 times continuously with 0 failures. I also ran the full test (all versions) ~5 times with 9/9 cases passing.

Pankraz76 reacted with thumbs up emoji
@github-actions github-actions bot added triage PRs from the community tools small Small PRs labels Dec 16, 2025
self.perform_upgrade(from_kafka_version)

copier_timeout_sec = 180
copier_timeout_sec = 360
Copy link
Author

@FrancisGodinho FrancisGodinho Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: due to timeouts and re-creation of producer, this copier_timeout needed to be increased. I experimented a bit and found that 360s was a consistently reliable value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I described in https://issues.apache.org/jira/browse/KAFKA-20000, the performance regression is caused by the backoff logic. Therefore, I suggest fixing the underlying issue instead of increasing the timeout.

Copy link

@Pankraz76 Pankraz76 Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly asking, if this is something to consider? If so, would add some test for this adjustment.

Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pankraz76 thanks for the effort. As Justine suggested, hardcoding the timeout is a bit coarse-grained. Please refer to KAFKA-20000 for more discussion.

Pankraz76 reacted with thumbs up emoji
Copy link
Author

@chia7712 can you take a look when you get a chance please?

@github-actions github-actions bot removed the triage PRs from the community label Dec 16, 2025
Copy link
Member

@FrancisGodinho thanks for you patch. I have identified some underlying issues in e2e and TV2. Addressing them should allow us to achieve more stable transaction behavior. Please check https://issues.apache.org/jira/browse/KAFKA-19999 and https://issues.apache.org/jira/browse/KAFKA-20000 for more details.

FrancisGodinho reacted with thumbs up emoji

Copy link

@Pankraz76 Pankraz76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

...eCopier.java
Co-authored-by: Vincent Potuček <8830888+Pankraz76@users.noreply.github.com>
Copy link
Author

@Pankraz76 thanks for the comments, can you re-review please?

Copy link

@Pankraz76 Pankraz76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue is very well documented, thanks for effort given.

self.perform_upgrade(from_kafka_version)

copier_timeout_sec = 180
copier_timeout_sec = 360
Copy link

@Pankraz76 Pankraz76 Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
copier_timeout_sec = 360
copier_timeout_sec = 360

sorry again this something for SCA. Taking away the off-topics upfront.

spotless and rewrite both ready to fix on their own.

Comment on lines +395 to +400
// in case the producer gets stuck here, create a new one and continue the loop
try { producer.close(Duration.ofSeconds(0)); } catch (Exception ignore) {}
parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber++);
producer = createProducer(parsedArgs);
producer.initTransactions();
resetToLastCommittedPositions(consumer);
Copy link

@Pankraz76 Pankraz76 Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// in case the producer gets stuck here, create a new one and continue the loop
try { producer.close(Duration.ofSeconds(0)); } catch (Exception ignore) {}
parsedArgs.getAttrs().put("transactionalId", parsedArgs.getString("transactionalId") + producerNumber++);
producer = createProducer(parsedArgs);
producer.initTransactions();
resetToLastCommittedPositions(consumer);
circutBreaker(); // in case the producer gets stuck here, create a new one and continue the loop

could give dedicated to this concern apply single responsibility principle, giving more focus to each own. Here its just about breaking the circut, how this is actually done seems to be some kind of (randomly) changing impl. detail.

Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...#21161
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
Pankraz76 pushed a commit to Pankraz76/kafka that referenced this pull request Dec 18, 2025
...pache#21161 #KAFKA-20000
Signed-off-by: Vincent Potucek <vpotucek@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@chia7712 chia7712 chia7712 left review comments

+1 more reviewer

@Pankraz76 Pankraz76 Pankraz76 approved these changes

Reviewers whose approvals may not affect merge requirements

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /