How 3 Lines of Code Caused a 10x Kafka Throughput Drop

if (isUnderMinIsr) {
  trace(s"Not increasing HWM because partition is under min ISR")
  return false
}

Before v3.9.0, min.insync.replicas only affected producers using acks=all. It set the minimum number of in-sync replicas that had to acknowledge a write before the producer considered it successful. It had nothing to do with consumers.
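To make the producer-side rule concrete, here is a minimal sketch in Scala. It is not Kafka's implementation: the object name, the Acks type, and leaderAcceptsWrite are hypothetical names chosen for illustration. It only models the decision of whether a write is accepted for a given acks setting and ISR size.

```scala
object ProducerAckSketch {
  // Hypothetical model of the producer's acks setting.
  sealed trait Acks
  case object AcksNone extends Acks   // acks=0
  case object AcksLeader extends Acks // acks=1
  case object AcksAll extends Acks    // acks=all

  // Simplified rule: only acks=all writes are checked against
  // min.insync.replicas. Other acks settings ignore it entirely.
  def leaderAcceptsWrite(acks: Acks, isrSize: Int, minInSyncReplicas: Int): Boolean =
    acks match {
      case AcksAll => isrSize >= minInSyncReplicas
      case _       => true
    }
}
```

With min.insync.replicas=2 and only the leader left in the ISR, leaderAcceptsWrite(AcksAll, 1, 2) is false while leaderAcceptsWrite(AcksLeader, 1, 2) is true. Nothing in this rule involves consumers.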

After v3.9.0, the same setting also blocks consumer reads. If slow followers drop out of the ISR and the ISR shrinks below min.insync.replicas, the leader stops advancing the high watermark until enough followers catch up and rejoin. Consumers can only read up to the high watermark, so they stall until it moves again.
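The before-and-after behaviour described above can be sketched as a small Scala model. This is a simplified illustration, not the broker's code: PartitionState, maybeAdvance, and the enforceMinIsr flag are hypothetical names, and the real leader tracks considerably more state than a map of offsets.

```scala
object HighWatermarkSketch {
  // Hypothetical, simplified view of one partition from its leader.
  final case class PartitionState(
      isrLogEndOffsets: Map[Int, Long], // broker id -> log end offset, ISR members only (leader included)
      minInSyncReplicas: Int,
      highWatermark: Long
  ) {
    def isUnderMinIsr: Boolean = isrLogEndOffsets.size < minInSyncReplicas
  }

  // Returns the state after one attempt to advance the high watermark.
  // enforceMinIsr = false models the older behaviour described in the text;
  // enforceMinIsr = true models the behaviour with the new guard.
  def maybeAdvance(state: PartitionState, enforceMinIsr: Boolean): PartitionState =
    if (state.isrLogEndOffsets.isEmpty) state
    else if (enforceMinIsr && state.isUnderMinIsr) state // the guard: watermark stays put
    else {
      // The watermark can move up to the smallest log end offset in the ISR.
      val candidate = state.isrLogEndOffsets.values.min
      if (candidate > state.highWatermark) state.copy(highWatermark = candidate)
      else state
    }
}
```

Take a partition with min.insync.replicas=2, a high watermark of 40, and an ISR containing only the leader at offset 100. Without the guard, maybeAdvance moves the watermark to 100. With the guard, it stays at 40 and consumers see nothing new until a follower rejoins the ISR.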

Why this is a feature, not a bug

Kafka prioritizes durability over throughput. Holding the high watermark until at least min.insync.replicas replicas are in sync prevents consumers from reading data that has not been sufficiently replicated. If the leader crashes after a consumer reads an under-replicated message, that message can be lost even though the consumer has already processed it.

The trade-off is real. The change arguably deserved a major version bump, because a 10x throughput drop in a minor release can break production pipelines.

The fix

If you hit this, your options are straightforward:

Lower min.insync.replicas if your durability requirements allow it.
Ensure followers have enough resources to keep up with the leader.
Monitor ISR size and follower lag as critical metrics.
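As a rough illustration of what monitoring ISR size against min.insync.replicas might look like, here is a small Scala sketch that classifies a partition by how close it is to the stall condition. The names IsrHealth, classify, and spareReplicas are hypothetical, and the ISR size is assumed to come from whatever metrics or admin tooling you already use.

```scala
object IsrHealthSketch {
  sealed trait IsrHealth
  case object Healthy extends IsrHealth     // ISR size is above min.insync.replicas
  case object AtMinIsr extends IsrHealth    // one more dropout would stall the watermark
  case object UnderMinIsr extends IsrHealth // watermark is no longer advancing

  def classify(isrSize: Int, minInSyncReplicas: Int): IsrHealth =
    if (isrSize < minInSyncReplicas) UnderMinIsr
    else if (isrSize == minInSyncReplicas) AtMinIsr
    else Healthy

  // How many ISR members can drop out before the partition falls under min ISR.
  def spareReplicas(isrSize: Int, minInSyncReplicas: Int): Int =
    math.max(0, isrSize - minInSyncReplicas)
}
```

With three in-sync replicas and min.insync.replicas=2, classify returns Healthy and spareReplicas returns 1. Alerting on AtMinIsr rather than only on UnderMinIsr gives some warning before consumers stall.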

Three lines of code. A massive performance impact. A reminder that distributed systems are full of sharp edges.

For the full timeline, mailing list discussion, and the exact PR diff: How a Minor Release Caused a 10x Throughput Drop in Kafka.