JAVA-5788 Improve ByteBufferBsonOutput::writeCharacters #1629
Conversation
@jyemin I see I have to open an issue there, but I need a bit of help to find out whether there is already a JMH benchmark that covers this; if not, I can provide a new one myself.
This is a flamegraph from the same test running in JVM mode (JDK 21), which tells a similar story.
@jyemin I have just verified that, since we are using the Netty driver as well, the core path there uses (direct) Netty buffers too, so I will expand this optimization to include them, checking which optimization path is more appropriate there, e.g. (a sketch of the first two checks follows the list):
- branchless ASCII check with 8 chars (using 2 longs storing 4 UTF-16 chars each)
- branchless zero check leveraging ASCII bytes
- batch buffer writes
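For illustration only, here is a minimal sketch of the kind of SWAR checks the first two bullets describe; the class and method names are mine, not the PR's code:

```java
// Branchless checks over 8 UTF-16 chars packed into two longs (sketch only).
final class AsciiSwar {

    // Any char >= 0x80 sets at least one bit under this mask.
    private static final long NON_ASCII_MASK = 0xFF80_FF80_FF80_FF80L;
    // Classic SWAR "has a zero lane" constants for 16-bit lanes.
    private static final long LANE_LSB = 0x0001_0001_0001_0001L;
    private static final long LANE_MSB = 0x8000_8000_8000_8000L;

    // Packs chars s[i..i+3] into one long, 16 bits per char.
    private static long pack4(CharSequence s, int i) {
        return (long) s.charAt(i)
                | ((long) s.charAt(i + 1) << 16)
                | ((long) s.charAt(i + 2) << 32)
                | ((long) s.charAt(i + 3) << 48);
    }

    // True if the 8 chars starting at i are all ASCII and none is '\0'
    // (NUL has to be rejected because it terminates a BSON CString).
    static boolean allAsciiNonZero(CharSequence s, int i) {
        long lo = pack4(s, i);
        long hi = pack4(s, i + 4);
        boolean ascii = ((lo | hi) & NON_ASCII_MASK) == 0;
        boolean hasZero = (((lo - LANE_LSB) & ~lo & LANE_MSB)
                | ((hi - LANE_LSB) & ~hi & LANE_MSB)) != 0;
        return ascii & !hasZero;
    }
}
```

When the check passes, the 8 chars can be narrowed to bytes and written in one batch, which is where the third bullet (batch buffer writes) comes in.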
I have added an extra commit to improve the case for the Netty buffers as well.
I can't modify the original JIRA to explain and include this; how can I do it?
I have performed some JMH benchmarks, and the Netty approach, despite being way better than the original code, is still much slower than the plain byte-per-byte array case; I will experiment a bit more with Netty's internalNioBuffer API, although, since we create a swapped Netty wrapper (not actually needed), it might not work.
I've added ff360c0 to show what using the internal NIO buffer from Netty means; a rough sketch of the idea follows.
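This is only a simplified sketch of that approach, not the commit's code; it assumes the content has already been verified to be ASCII-only and that the buffer is backed by a single NIO buffer:

```java
import io.netty.buffer.ByteBuf;
import java.nio.ByteBuffer;

// Write ASCII chars through the buffer's internal NIO view and bump the
// writer index once, instead of issuing a relative writeByte per char.
static void writeAsciiViaInternalNio(ByteBuf dst, String s) {
    int len = s.length();
    dst.ensureWritable(len);
    // Internal-use-only API: a ByteBuffer view positioned over the writable region.
    // (Composite buffers with more than one underlying buffer would need a fallback.)
    ByteBuffer nio = dst.internalNioBuffer(dst.writerIndex(), len);
    for (int i = 0; i < len; i++) {
        nio.put((byte) s.charAt(i)); // assumes ASCII-only content
    }
    dst.writerIndex(dst.writerIndex() + len);
}
```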
I'll post some JMH benchmark results soon, although native image (GraalVM CE) won't benefit a lot from this, since NIO ByteBuffers have a wide enough OOP hierarchy that single-byte access can still be too costly, which is something the "batchy" version was saving on.
The new approach is pretty much on par with the array-based one (actually slightly slower), but miles better than the batchy unrolled one.
I am waiting for some feedback before proceeding with this (including benchmarks).
@jyemin any chance I can help move this forward? Anything I can help with? 🙏
This has a very large performance impact for both Netty and non-Netty users...
Hi @franz1981, we're actually working on a larger performance epic, and one of the areas we are investigating is the writeCharacters method. @vbabanin will follow up soon.
Thanks @jyemin, and I'm happy to help (as is anyone on my team). Since I'm a Netty committer, whatever I can help with regarding my knowledge there, I'll do it 🙏
Hi @franz1981, apologies for the late response. Thank you for the PR, I like the ideas! We've been working on similar improvements as part of our performance optimization investigations.
We added additional tests to guard against regressions and included enhancements for multi-byte encoding to reduce overhead from method calls and bounds checks. I'll be proceeding with that implementation in #1651 with the ByteBuf abstraction.
That said, your Netty-specific optimizations go beyond what we’ve implemented so far, and optimizing Netty integration in general is something we could explore.
We don’t yet have a Netty baseline to evaluate their impact, but I’ve added them to our driver-benchmark suite, currently under review in #1647. Once that’s merged and we can measure those changes, I’ll revisit the Netty-specific optimizations in this PR and follow up here.
One concern I have is that internalNioBuffer is documented as "Internal use only: Exposes the internal NIO buffer."
We aim to avoid using non-public APIs, as there’s no guarantee they’ll remain stable, and we should preserve compatibility for our users.
Regarding JMH: we don’t have any JMH benchmarks set up for this area. Instead, we use a custom benchmarking framework to measure performance. You could consider using BenchmarkSuite
from there and running it locally to verify the performance of the Netty
improvements. To enable Netty
, configure NettyTransportSettings
via MongoClientSettings
on MongoClient
created in Benchmark
class.
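For reference, a minimal sketch of how the Netty transport could be wired into the benchmark's client, assuming a recent driver version that exposes TransportSettings.nettyBuilder(); the class name and connection string below are placeholders:

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.connection.TransportSettings;

public final class NettyBenchmarkClientFactory {
    // Builds a client that uses the Netty transport instead of the default one.
    static MongoClient create() {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb://localhost:27017"))
                .transportSettings(TransportSettings.nettyBuilder().build())
                .build();
        return MongoClients.create(settings);
    }
}
```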
Fair enough regarding the internal NIO API, although (I am a Netty committer) I know for sure it won't go away in 4.2 either (which means another N years before Netty 5, which has a much different Buffer API regardless).
I still suggest using the ByteBuf::get(index) API, since it further cuts the cost of moving the reader/writer offsets and improves performance (see the sketch below).
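To make that concrete, here is a rough sketch using Netty's absolute-index setByte together with a single writerIndex update; the helper name and the ASCII-only assumption are mine:

```java
import io.netty.buffer.ByteBuf;

// Absolute setByte(index, value) does not move writerIndex,
// so the index is bumped once at the end instead of per byte.
static void writeAsciiAbsolute(ByteBuf dst, String s) {
    int len = s.length();
    dst.ensureWritable(len);
    int start = dst.writerIndex();
    for (int i = 0; i < len; i++) {
        dst.setByte(start + i, s.charAt(i)); // assumes ASCII-only content
    }
    dst.writerIndex(start + len);
}
```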
That said, I sent #1630 some time ago, which improves another "hot spot" found in production, so please check whether it fits.
Fixes JAVA-5788
While running a few benchmarks against https://github.com/quarkusio/quarkus-super-heroes/blob/main/README.md using native image (GraalVM CE), we found the writeCharacters heap-based code path to be bloated, as shown in this flamegraph:
[flamegraph image]
This patch manually inlines the fast paths for both Latin characters and heap buffers, assuming they have enough available space (a sketch of this kind of fast path is at the end of this description).
This should help JVM mode (i.e. JDK HotSpot) immensely as well, so if there's any JMH benchmark I can re-use or run to verify it, that would be great.
NOTE: I could make the assumption that triggers the optimization more "elastic" and use the heap buffer's (byte[]) remaining capacity, performing checkpoints to refill it, but that would make the logic way more complex, even if more flexible.
Let me know what you prefer 🙏
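As a rough illustration of the inlined fast path mentioned above (the method signature and names are illustrative, not the driver's actual code):

```java
// Sketch only: when the current chunk is heap-backed and has room for the
// worst case, write ASCII bytes straight into the backing byte[] and let the
// caller fall back to the generic (bounds-checked, multi-byte) path otherwise.
static int writeCStringFast(String s, byte[] array, int offset, int limit) {
    int len = s.length();
    if (limit - offset < len + 1) {
        return -1; // not enough guaranteed space: take the slow path
    }
    int pos = offset;
    for (int i = 0; i < len; i++) {
        char c = s.charAt(i);
        if (c == 0 || c >= 0x80) {
            return -1; // non-ASCII or embedded NUL: restart on the slow path from `offset`
        }
        array[pos++] = (byte) c;
    }
    array[pos++] = 0; // BSON CString terminator
    return pos;       // new position after the write
}
```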