Add a new backend: Bolt #10929
-
Hi everyone,
Bolt is a Velox fork developed within ByteDance. Through integration with our production environment, Bolt has addressed numerous stability challenges (including off-heap OOM and core dump issues) while also delivering significant performance optimizations, such as those achieved through LLVM-based JIT compilation.
We are now planning to open source Bolt and enable the broader community to use it through Apache Gluten.
In this thread, I’d like to start a discussion with you on how we can better leverage Bolt within the Gluten ecosystem. Any questions and use cases are very welcome!
More detailed technical information will be shared soon. Looking forward to your thoughts!
Replies: 13 comments 23 replies
-
Thank you for starting the discussion, @WangGuangxin.
For a short-term experiment, can we simply replace VELOX_REPO to enable Bolt? Do we need to change any Gluten code?
In the long term, we may consider adding a new backend.
-
Bolt has improved it by treating Shuffle as an operator in Bolt and offloading it from Gluten to Bolt for parallel processing.
@frankobe would you explain more about "treat shuffle as an operator"? Does Bolt still fetch shuffle data from Spark's Java iterator?
When a shuffle writer or shuffle reader is adjacent to other operators that run in Gluten, it is offloaded into Bolt and treated as a Bolt operator.
For the shuffle reader, Bolt fetches the shuffle data directly from Spark's InputStream, reads raw bytes from it, does decompression and deserialization inside the shuffle reader operator, and then outputs the result as a normal operator instead of using Iterator[ColumnarBatch].
For the shuffle writer, Bolt also constructs a shuffle writer operator. It receives data directly from the upstream operator instead of wrapping it into an Iterator[ColumnarBatch].
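To make the reader side concrete, here is a minimal Python sketch of the idea, not Bolt's actual code. It assumes a hypothetical wire format (a 4-byte length prefix followed by a zlib-compressed, pickled batch) and hypothetical class names; the point is only that the source operator owns the raw stream and does decompression and deserialization itself, so the downstream operator pulls decoded batches from it like from any other operator.

```python
import io
import pickle
import struct
import zlib


def encode_block(rows):
    """Serialize and compress one batch (hypothetical format: 4-byte length + payload)."""
    payload = zlib.compress(pickle.dumps(rows))
    return struct.pack(">I", len(payload)) + payload


class ShuffleReadOperator:
    """Source operator: reads raw bytes and decodes them inside the operator."""

    def __init__(self, stream):
        self.stream = stream

    def get_output(self):
        """Return the next decoded batch, or None when the stream is exhausted."""
        header = self.stream.read(4)
        if len(header) < 4:
            return None
        (size,) = struct.unpack(">I", header)
        return pickle.loads(zlib.decompress(self.stream.read(size)))


class FilterOperator:
    """Downstream operator: pulls batches directly from its upstream operator."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def get_output(self):
        batch = self.source.get_output()
        if batch is None:
            return None
        return [row for row in batch if self.predicate(row)]


# Stand-in for the raw shuffle InputStream: two encoded batches back to back.
stream = io.BytesIO(encode_block([1, 2, 3]) + encode_block([4, 5, 6]))
pipeline = FilterOperator(ShuffleReadOperator(stream), lambda row: row % 2 == 0)

results = []
while True:
    batch = pipeline.get_output()
    if batch is None:
        break
    results.append(batch)
print(results)  # [[2], [4, 6]]
```

No iterator of already-decoded batches is handed across a boundary here; the decode cost sits inside the operator pipeline, which is what allows it to be scheduled like other operator work.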
-
@zhangxffff thanks for the explanation!
For the shuffle reader
I'm curious about the "Morsel Driven" execution mentioned above. I guess Bolt uses parallel mode to execute tasks (Gluten uses serial mode).
Are fetching raw shuffle data, decompression, and deserialization all done in one operator? How many drivers does the shuffle read operator generate? If there is only one driver for the shuffle read op, there seems to be no major difference from Gluten; if there are two, are fetching, decompression, and deserialization all done in one driver?
-
After being offloaded as a Bolt operator, shuffle read supports multiple drivers. Each driver's shuffle reader gets different InputStreams (Gluten gets an iterator of InputStreams from ShuffleBlockFetcherIterator, which contains multiple InputStreams), so the drivers can fetch, decompress, and deserialize data in parallel.
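A rough Python sketch of that driver model, under stated assumptions: the function and variable names are hypothetical, threads stand in for Bolt drivers, and zlib-compressed byte strings stand in for shuffle blocks. Each driver takes the next unclaimed stream from one shared iterator, so no two drivers ever decode the same stream.

```python
import io
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def run_drivers(streams, num_drivers):
    """Decode all streams using num_drivers workers that share one stream iterator."""
    stream_iter = iter(streams)
    lock = threading.Lock()

    def next_stream():
        # Hand out each stream to exactly one driver.
        with lock:
            return next(stream_iter, None)

    def driver(driver_id):
        decoded = []
        while True:
            stream = next_stream()
            if stream is None:
                return decoded
            # Fetch, decompress, and deserialize inside the driver.
            decoded.append(zlib.decompress(stream.read()).decode("utf-8"))

    with ThreadPoolExecutor(max_workers=num_drivers) as pool:
        per_driver = list(pool.map(driver, range(num_drivers)))
    return [item for items in per_driver for item in items]


# Stand-ins for the InputStreams yielded by the block fetcher.
streams = [io.BytesIO(zlib.compress(f"block-{i}".encode("utf-8"))) for i in range(8)]
rows = run_drivers(streams, num_drivers=3)
print(sorted(rows))
```

The output order depends on which driver claims which stream, which is why the example sorts before printing.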
-
Bolt can dynamically over-issue memory after reaching the off-heap configuration value based on the actual physical memory usage, thereby reducing the amount of spilled data, improving performance, and reducing the risk of OOM.
Looking forward to seeing this open-sourced, especially how it helps avoid OOM. We’ve been struggling with that a lot.
@boneanxs We have implemented all memory management functions in Bolt, so Bolt can dynamically over-issue memory after reaching the off-heap configuration value based on the actual physical memory usage. The specific code is here:
https://github.com/bytedance/bolt/blob/c9b1507b696edbae28d5133d6d512a1eac77e446/bolt/common/memory/sparksql/ExecutionMemoryPool.cpp#L347-L419
We will also refactor the memory module later to avoid shortcomings of the existing mechanism. All planning and discussions will be conducted in the bolt repo's issues.
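For readers who want the gist before opening the linked file, the following Python sketch illustrates the general over-issue idea only. It is not a transcription of ExecutionMemoryPool.cpp; the class name, the three outcomes, and the `read_physical_usage` callback are all assumptions made for illustration.

```python
class OverIssuingPool:
    """Toy pool: grants past the off-heap limit while physical memory has headroom."""

    def __init__(self, off_heap_limit, physical_limit, read_physical_usage):
        self.off_heap_limit = off_heap_limit            # configured off-heap size
        self.physical_limit = physical_limit            # hard cap on physical memory
        self.read_physical_usage = read_physical_usage  # hypothetical RSS probe
        self.reserved = 0

    def try_reserve(self, nbytes):
        # Normal path: the request fits under the configured limit.
        if self.reserved + nbytes <= self.off_heap_limit:
            self.reserved += nbytes
            return "granted"
        # Over the configured limit: consult actual physical usage before spilling.
        if self.read_physical_usage() + nbytes <= self.physical_limit:
            self.reserved += nbytes
            return "over-issued"
        # No physical headroom left: the caller has to spill.
        return "spill"


usage = {"rss": 600}
pool = OverIssuingPool(
    off_heap_limit=1000,
    physical_limit=1500,
    read_physical_usage=lambda: usage["rss"],
)

first = pool.try_reserve(900)   # fits under the configured limit
second = pool.try_reserve(300)  # exceeds the limit, but physical usage is low
usage["rss"] = 1400             # simulate physical memory filling up
third = pool.try_reserve(300)   # no headroom any more
print(first, second, third)     # granted over-issued spill
```

In this toy model the configured limit only decides when the physical-usage check starts; the physical cap is what finally forces a spill.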
-
@WangGuangxin "We have implemented all memory management functions in Bolt, so Bolt can dynamically over-issue memory after reaching the off-heap configuration value based on the actual physical memory usage" IIUC, you mean the Bolt memory pool can allocate beyond the configured limit, based on the currently measured RSS, to avoid a single-query OOM. Does that mean it might increase the risk of server OOMs? If so, is the configured memory limit more of an optimization to avoid frequent RSS limit checks?
For "refactor the memory module later to avoid shortcomings of the existing mechanism", do you have more details? Thank you!
-
@WangGuangxin Great work! What if we add a streaming computation extension to Bolt? We are working on the Gluten-on-Flink project, and it's easier to work with you guys than with the Velox team.
-
It's a good idea, since Flink doesn't have many customer adoptions yet.
-
@WangGuangxin Do you have any plan to release the Bolt source?
-
@WangGuangxin I am interested in this timeline too.
-
@WangGuangxin I'm also curious whether the new Bolt backend is compatible with Apache Celeborn. I noticed you mentioned that Bolt now handles the shuffle instead of Gluten for more parallel processing.
-
@afterincomparableyum Bolt is compatible with Celeborn (which is used internally in ByteDance as well). We are planning a deeper integration with Celeborn C++ client for more flexibility & performance. Happy to collaborate in the community if you are interested
-
I’m happy to collaborate 😀
-
One more question, @frankobe: I’m assuming that the Gluten parameter for dynamic off-heap sizing, spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, works with Bolt too, since Bolt is a fork of Velox?
Or would we not need to specify off-heap memory configs at all, since Bolt has Memory Management Offload?
If all of these questions are going to be answered in the docs for the PR you will submit, I can wait until those docs are released before asking more questions. Thank you!
-
@lgbo-ustc @afterincomparableyum Yes, the Bolt source code, as well as the PR for Gluten integration, will be open-sourced before Dec 6th (if not earlier). I will share the links here once they are ready. It is truly amazing to receive such overwhelming interest, and we would appreciate testing and feedback from the community to make Bolt better.
-
@WangGuangxin and @frankobe Thank you for this proposal.
The community welcomes this proposal and appreciates ByteDance's interest in contributing Bolt to Apache Gluten. We're open to discussing the addition of a new backend.
To move forward, please highlight the key differences between Bolt and Velox, both in architecture and performance, so the community understands the unique value Bolt brings.
Official support for an important feature such as a new backend requires a formal community voting process, as it involves significant design and maintenance effort. Compliance with Apache licensing and IP clearance is essential, given Bolt's origin as a Velox fork.
I would encourage sharing additional data for Bolt, including the source code for review, performance benchmarks, and the roadmap for ongoing maintenance to ensure long-term sustainability.
-
Bolt is open sourced: https://github.com/bytedance/bolt
Gluten Bolt Backend is available on: #11261
Feedback is welcome!
-
@liuneng1994 Thanks for sharing your concerns in thread https://lists.apache.org/thread/9ry0jjydsvztnrosymlwzspdv4hdsvp1. I would like to respond to them by providing more context.
- Limited open-source impact and ecosystem maturity
Bolt has proved its enterprise-grade maturity through large-scale internal adoption across multiple product lines within ByteDance. The ecosystem is strengthening through direct and transparent collaboration not only with the Gluten project, but also with OpenSearch, Flink, Paimon, Celeborn, etc.
Project maturity is a joint goal of the Gluten and Bolt communities, not a prerequisite for merging the code. Having followed the history of Gazelle/Gluten, I witnessed how the Gluten community pioneered the adoption of existing backends at an early, immature stage, and how it fosters Flink and GPU support even now, so I would appreciate a similar standard for the Bolt backend.
- Increased long-term maintenance cost
Bolt is open to accepting commits from Gluten committers. Internally, every Bolt commit is checked against all DBMS integrations, including Spark-on-Gluten, to ensure stability. Currently, Gluten's backends-velox depends on ibm/velox, which requires frequent rebases onto upstream, so the Bolt backend will reduce the long-term maintenance cost by removing the messy rebase burden and providing first-party support for the Gluten integration.
- Significant added pressure on GitHub Actions CI
Similar to the existing maintenance model for backends-clickhouse, GitHub Actions CI for the Bolt backend will run on additional compute resources provided by the Bolt project to ensure enough task bandwidth. To further enhance transparency, run history and logs are publicly accessible. You can check the health of the existing Bolt GitHub Actions runs. On top of the CI resources, as stated in the voting message, the Bolt community will assign 3 dedicated members to maintain the stability of the Gluten and Bolt integration.
- Unclear readiness for community governance and sustainability
Bolt has deeply valued the "Apache Way" of community governance since day 1. We are in the process of submitting a proposal to the ASF Incubator.
-
The comparison with GPU/Flink support is not appropriate.
GPU and Flink support add new, non-overlapping capabilities to the ecosystem, whereas Bolt is essentially a downstream fork of Velox. These two cases are fundamentally different in nature, so drawing a direct analogy between them is misleading.
Gluten still needs to maintain its own use cases, and the proposed community support model raises concerns.
The idea of "assigning" community members sounds more like an internal work arrangement rather than participation driven by open-source interest. If those three members change roles or priorities in the future, this commitment may no longer hold. In that case, the sustainability assumption behind this support model would break. This makes the project feel closer to an internal corporate project, and it risks losing the essence of an open-source community.
The current contributor base of Bolt is highly concentrated within ByteDance.
At the moment, almost all open-source contributions to Bolt come from ByteDance. This situation reminds me of the ByteHouse and ClickHouse relationship: when an open-source compute engine fails to attract a broader and more diverse contributor base, it becomes problematic in the long run. I believe this would not be a positive outcome for the Gluten community either.
-
Gluten still needs to maintain its own use cases, and the proposed community support model raises concerns.
The idea of "assigning" community members sounds more like an internal work arrangement rather than participation driven by open-source interest. If those three members change roles or priorities in the future, this commitment may no longer hold. In that case, the sustainability assumption behind this support model would break. This makes the project feel closer to an internal corporate project, and it risks losing the essence of an open-source community.
That actually came from my suggestion. I was thinking of the current ClickHouse backend model: when we hit a ClickHouse issue during Velox development, especially a CI issue, we mostly ping @zzcclp to solve or review it. Similarly, to ease development of the Velox and ClickHouse backends, I suggested that the Bolt backend should also have some people we can ping for help. In the worst case, if no one maintains the Bolt backend one day, we will have to remove it from Gluten or freeze it. The same actually applies to the Velox and ClickHouse backends: if no one maintained the Velox backend, we would have to remove or freeze it.
Also, please note that there is no fundamental difference between Velox and Bolt; they are both corporate projects now. Velox is even more closed as a corporate project, in that we can't get any real committers in no matter how much we contribute or how hard we try. Bolt, on the other hand, is already working on an Apache Incubator proposal and will try the Linux Foundation if that fails. Bolt welcomes PRs from outside and would like outside contributors to join as initial committers.
-
The current contributor base of Bolt is highly concentrated within ByteDance.
It's more of a chicken-and-egg issue. If Bolt isn't in, no one will want to contribute. Once Bolt joins the game with more maturity and openness, customers may move.
-
(...) Velox is even more closed as a corporate project, in that we can't get any real committers in no matter how much we contribute or how hard we try.
As a quick fact-check, there are 4-5 Velox maintainers involved in the Gluten community today, including one from ByteDance. The list is available here:
https://velox-lib.io/docs/community/components-and-maintainers
-
@yaooqinn Thanks for sharing your sincere concerns in thread https://lists.apache.org/thread/d4oq3oydrzcndyphvfh3gnr6v08jxvp9. The discussion on the Bolt backend was opened on Oct 23, 2025, but I would love to provide more context just in case.
- PMC shall not be used in a podling for formal vote, use PPMC instead.
- "According to the Apache voting process, we need at least 3 +1 votes from PMC members/Committers and more +1 than -1 votes in total." This isn't true; only PPMC members' votes count.
Thanks for the correction. We will follow your recommendations in the next round of voting.
- Based on the relationship between bolt and velox, I think we need a clarification about the legal issues.
Both projects follow the Apache 2.0 license, and Bolt's dependency is declared in NOTICE.txt. We are here to address any specific legal concerns, though there are no known issues to our knowledge.
- If we do reach a consensus later, for such a huge contribution, I insist that Intellectual property clearance - Apache Incubator https://incubator.apache.org/ip-clearance/ also need to be done, before any code get merged in.
The Bolt backend PR only forks and modifies the existing backends-velox (whose IP belongs to the Gluten repository and the ASF), without any Velox-related code.
- I'm wondering that this just increase long-term maintenance cost
Bolt is open to accepting commits from Gluten committers. Internally, every Bolt commit is checked against all DBMS integrations, including Spark-on-Gluten, to ensure stability. Currently, Gluten's backends-velox depends on ibm/velox, which requires frequent rebases onto upstream, so the Bolt backend reduces the long-term maintenance cost by removing the messy rebase burden and providing first-party support for the Gluten integration.
- Also, is bolt vendor-natural?
Assuming you refer to "vendor-neutral", Bolt project is in the process of submitting a proposal to ASF incubator, targeting 26Q1. We deeply value and encourage contributions from Gluten community to build the future of native engine acceleration. The merge of Bolt backend to the main branch is a foundational step to expose Spark-on-Bolt capability for community adoption which in return pushes the project to be "vendor-neutral".
-
Both projects are following Apache 2.0 license, Bolt's dependency is declared in NOTICE.txt. We are here to address any specific legal concerns, though there are no known issues to our knowledge.
In terms of the OSS license, yup, it's legal. But it is socially and technically weird.
What is Bolt's future plan? A permanently independent fork with no intention to stay aligned long-term? An upstream-first fork where new work goes upstream ASAP? Or a downstream-first fork that keeps cherry-picking without feeding anything back?
Bolt backend #11261 only fork and modify the existing backends-velox (whose IP belongs to Gluten repository & ASF) without any Velox-related code
Yes, I also mean it's just a requirement for this patch. As for Velox, that's a problem for you to tackle in the incubator, not here.
Bolt is open to accepts commits from Gluten commits. Internally, every Bolt commit is checked against all DBMS integration including Spark-on-Gluten to ensure the stability. Currently Gluten backends-velox depends on ibm/velox which requires frequent rebases with upstream so Bolt backend reduces the long-term maintenance cost without messy rebase burden and providing first party supports on Gluten integration.
So, if Bolt is going to replace Velox entirely in Gluten, then the reduction in maintenance you've mentioned could actually happen. Do you have a clear plan for that?
As a fork of Velox, why does the integration for Bolt result in a 260k+ patch? It doesn't look like the right way to maintain it. What's the amount of duplication? Have you considered a better approach, such as exposing new APIs?
If Spark adds a new feature or function, Gluten contributors need to apply an identical patch to both Velox and Bolt, right? Otherwise, technical debt accumulates quickly.
CI cost increases, and so does the cost of the PR process.
Assuming you refer to "vendor-neutral", Bolt project is in the process of submitting a proposal to ASF incubator, targeting 26Q1. We deeply value and encourage contributions from Gluten community to build the future of native engine acceleration. The merge of Bolt backend to the main branch is a foundational step to expose Spark-on-Bolt capability for community adoption which in return pushes the project to be "vendor-neutral".
A proposal to the ASF Incubator does not by itself make a project vendor-neutral.
You encourage others to contribute to Bolt, and that's good, but Bolt itself does not follow the upstream-first practice. We're in an awkward position.
-
Some comments based on what I know:
So, bolt is going to replace velox entirely in gluten
It's actually up to customers, just like ClickHouse vs. Velox. Customers will make their choice for various reasons.
As a fork of velox, why the integration for bolt result in a 260k+ patch? It doesn't look a right way to maintain. What's the amount of duplication? Consider a better approach such as new API exposure?
The duplicated code exists because we need to separate the development of the two backends initially. Later we can rebase the code, remove the duplication, align the APIs, and achieve a code layout similar to or even better than Velox vs. ClickHouse. It takes time, and we need a starting point.
-
+1
The essence of open source is "all for one and one for all." The reason Gluten has reached its current level of maturity and widespread adoption is precisely because our countless users have resolved numerous bugs during their own deployments, preventing those who follow from repeating the same mistakes.
However, the Velox community has consistently struggled with excessively long PR review times, or even a complete lack of reviews. The latest example is Issue #11534, which could have saved us from a massive amount of redundant testing recently. While we could cherry-pick these PRs into ibm/velox (or your specific fork), the current rebase effort is already becoming overwhelming. Consequently, we have had to create an issue to track unmerged PRs from the Gluten community so that customers can selectively pick them as needed.
Since I began testing Bolt, I can confirm that everything mentioned by @frankobe is true. Introducing a new, more open backend will only benefit the long-term development of Gluten.
-
Since I began testing Bolt
@FelixYBW would you share your performance test results?
significant 22% performance enhancement over existing gluten backend on the TPC-DS 1T benchmark
The Bolt vote mail mentions a 22% e2e boost, but I cannot find any details on the test setup, such as which data storage was used. What major optimizations contribute to this boost?
-
SF500 on r6id.16xlarge; data is stored on local SSD. The queries are from https://github.com/prestodb/pbench/tree/tpc-queries-for-spark/benchmarks/tpc-ds/queries
See if you can reproduce.
-
I suggest we postpone the discussion of adding Bolt as a new backend until Gluten itself graduates from the incubator. I have discussed this privately with @WangGuangxin as well. The primary reason is to minimize potential risks and uncertainties during Gluten's critical incubation phase.
Regarding the ecosystem, given the issues we've seen with Velox, I strongly hope Bolt can eventually join the Apache Software Foundation or the Linux Foundation.
The recent changes in the companies of Gluten maintainers serve as a perfect example: precisely because Gluten is an Apache project with clear governance, these personnel changes did not introduce significant risks to the project's continuity. I believe Bolt following a similar governance model would be crucial for its long-term health and integration.
-
+1, thanks for the proposal. Bolt definitely looks promising. It would be great to put these details into a proposal doc that outlines the technical integration with Gluten and to have it reviewed in depth by the community. Long PR review times and difficulty getting reviews are a long-standing problem; hopefully these issues can be addressed.
-
Guys, one thing to clarify, and a reminder: ByteDance was among the first companies to join the Gluten community when we started. There are several Gluten committers from ByteDance, and they have also contributed a lot to our community.