KAFKA-19925: Fix transaction timeout handling during broker upgrades #21161
Note: because of the timeouts and the re-creation of the producer, copier_timeout needed to be increased. I experimented a bit and found that 360s was a consistently reliable value.
As I described in https://issues.apache.org/jira/browse/KAFKA-20000, the performance regression is caused by the backoff logic. Therefore, I suggest fixing the underlying issue instead of increasing the timeout.
Kindly asking: is this something to consider? If so, I would add a test for this adjustment.
Thanks.
@Pankraz76 thanks for the effort. As Justine suggested, hardcoding the timeout is a bit coarse-grained. Please refer to KAFKA-20000 for more discussion.
FrancisGodinho
commented
Dec 16, 2025
@chia7712 can you take a look when you get a chance please?
chia7712
commented
Dec 16, 2025
@FrancisGodinho thanks for your patch. I have identified some underlying issues in e2e and TV2. Addressing them should allow us to achieve more stable transaction behavior. Please check https://issues.apache.org/jira/browse/KAFKA-19999 and https://issues.apache.org/jira/browse/KAFKA-20000 for more details.
Pankraz76
left a comment
+1
tools/src/main/java/org/apache/kafka/tools/TransactionalMessageCopier.java
...eCopier.java Co-authored-by: Vincent Potuček <8830888+Pankraz76@users.noreply.github.com>
FrancisGodinho
commented
Dec 18, 2025
@Pankraz76 thanks for the comments, can you re-review please?
Pankraz76
left a comment
The issue is very well documented; thanks for the effort.
Sorry, again this is something for SCA, which would take the off-topic issues away upfront.
Spotless and Rewrite are both ready to fix this on their own.
This concern could get its own dedicated place, applying the single responsibility principle and giving each concern more focus. Here it's just about breaking the circuit; how this is actually done seems to be some kind of (randomly) changing implementation detail.
...pache#21161 Signed-off-by: Vincent Potucek <vpotucek@me.com>
...#21161 Signed-off-by: Vincent Potucek <vpotucek@me.com>
...ache#21161 apache#21168 Signed-off-by: Vincent Potucek <vpotucek@me.com>
...pache#21161 #KAFKA-20000 Signed-off-by: Vincent Potucek <vpotucek@me.com>
Problem
During broker upgrades, the `sendOffsetsToTransaction` call would sometimes hang. Logs showed that it continuously returned `errorCode=51`, which is `CONCURRENT_TRANSACTIONS`. The test would eventually hit its timeout and fail. This happened for every version upgrade and occurred in around 30% of runs.
Resolution
The problem above left the producer in a broken state that did not resolve itself even after 5-10 minutes of waiting (including a few minutes past the transaction.max.ms time). I tried multiple solutions, including waiting for extended periods and retrying `sendOffsetsToTransaction` multiple times whenever a timeout occurred.
Unfortunately, the producer was permanently stuck and always received `errorCode=51`. In this case, the recommended resolution in the Kafka docs is to close the previous producer and create a new one: https://kafka.apache.org/documentation/#usingtransactions
Reusing the old transaction.id would continue to lead to a stuck state, so this fix creates a brand new producer with a new ID and then rewinds the consumer offset to ensure EOD.
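The recovery flow described above (close the stuck producer, create one with a new ID, rewind the consumer) can be sketched roughly as follows. This is a minimal illustration only and not the code in this PR: `TxnProducer`, `RewindableConsumer`, `StuckTransactionException`, `commitOrRecover`, and the in-memory fakes are all hypothetical stand-ins, used so the sketch runs without a broker. It assumes the caller reprocesses the batch whenever a replacement producer is returned.

```java
import java.util.function.IntFunction;

final class ProducerRecoverySketch {

    /** Hypothetical signal that the commit path stayed stuck until its timeout. */
    static final class StuckTransactionException extends Exception {
        StuckTransactionException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the producer operations used here (not the Kafka API). */
    interface TxnProducer {
        String transactionalId();
        void sendOffsetsAndCommit(long nextOffset) throws StuckTransactionException;
        void close();
    }

    /** Hypothetical stand-in for a consumer that can be rewound. */
    interface RewindableConsumer {
        long lastCommittedOffset();
        void seek(long offset);
    }

    /**
     * Tries to commit. If the producer is stuck, closes it, builds a replacement
     * with a new ID, and rewinds the consumer to the last committed offset.
     * Returns the producer to keep using; a different instance than the one
     * passed in means the current batch must be reprocessed.
     */
    static TxnProducer commitOrRecover(TxnProducer producer,
                                       RewindableConsumer consumer,
                                       long nextOffset,
                                       int generation,
                                       IntFunction<TxnProducer> producerFactory) {
        try {
            producer.sendOffsetsAndCommit(nextOffset);
            return producer;
        } catch (StuckTransactionException e) {
            producer.close();                                  // abandon the stuck producer
            TxnProducer replacement = producerFactory.apply(generation + 1);
            consumer.seek(consumer.lastCommittedOffset());     // uncommitted records are read again
            return replacement;
        }
    }

    /** In-memory fake: a producer that is either permanently stuck or always succeeds. */
    static final class FakeProducer implements TxnProducer {
        final String id;
        final boolean stuck;
        boolean closed;
        long committedOffset = -1L;

        FakeProducer(String id, boolean stuck) { this.id = id; this.stuck = stuck; }

        @Override public String transactionalId() { return id; }

        @Override public void sendOffsetsAndCommit(long nextOffset) throws StuckTransactionException {
            if (stuck) throw new StuckTransactionException("commit timed out for " + id);
            committedOffset = nextOffset;
        }

        @Override public void close() { closed = true; }
    }

    /** In-memory fake: tracks a committed offset and a current read position. */
    static final class FakeConsumer implements RewindableConsumer {
        final long committed;
        long position;

        FakeConsumer(long committed, long position) { this.committed = committed; this.position = position; }

        @Override public long lastCommittedOffset() { return committed; }
        @Override public void seek(long offset) { position = offset; }
    }

    public static void main(String[] args) {
        FakeConsumer consumer = new FakeConsumer(100L, 150L);
        FakeProducer stuck = new FakeProducer("copier-0", true);
        TxnProducer next = commitOrRecover(stuck, consumer, 150L, 0,
                gen -> new FakeProducer("copier-" + gen, false));
        // prints "copier-1 position=100 oldClosed=true"
        System.out.println(next.transactionalId() + " position=" + consumer.position
                + " oldClosed=" + stuck.closed);
    }
}
```

Deriving the new ID from a generation counter is just one possible choice; the property the sketch relies on is only that the replacement does not reuse the stuck ID.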
Testing and Validation
Previously, I was able to run the test for a single version upgrade and have it fail within the first 5-10 runs. After the fix, I was able to run it 40 times continuously with 0 failures. I also ran the full test (all versions) ~5 times with 9/9 cases passing.