Hi Gluten Community,
I am currently exploring the performance of Apache Gluten with the Velox backend specifically for Delta Lake workloads.
While there are several TPC-DS benchmark reports available for Parquet/ORC, I am looking for insights or existing benchmarking results for the following specific setup:
- Scale Factor: 1TB (TPC-DS)
- Data Format: Delta Lake (non-partitioned)
- Backend: Velox
- Storage: GCS
Context:
We are evaluating the overhead of the Delta Log reading process versus the native acceleration provided by Velox. Specifically, we are interested in:
- How non-partitioned Delta tables perform compared to standard Parquet in a Gluten environment.
- Whether anyone has observed specific bottlenecks in metadata handling or scan performance with this configuration.
- Recommended Spark/Gluten configurations to optimize the Delta-Velox scan path for large-scale non-partitioned data.
If anyone has run these benchmarks or has a performance comparison (Native Spark vs. Gluten+Velox) for this setup, I would greatly appreciate it if you could share your findings or any tuning tips!
Thanks!
Replies: 2 comments
-
Delta tables perform slightly worse than plain Hive tables. Delta uses SQL to query its metadata during query processing, but some operators in those metadata queries are not supported. In some cases this causes frequent C2R and R2C (columnar-to-row and row-to-columnar) transitions, and the query performs worse than vanilla Spark. Contributions to fix this are welcome.
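One rough way to see how often this happens is to count the transition nodes in a query's physical plan text. The snippet below is only a sketch: the helper `count_transitions` and the sample plan are hypothetical, and it assumes the transition nodes appear in the plan under names containing `ColumnarToRow` or matching `RowTo...Columnar`, which may differ between Spark and Gluten versions.

```python
import re

# Assumed node-name patterns for columnar/row transitions.
# The exact names printed in a plan depend on the Spark/Gluten version.
C2R_PATTERN = re.compile(r"\w*ColumnarToRow\w*")
R2C_PATTERN = re.compile(r"RowTo\w*Columnar\w*")


def count_transitions(plan_text: str) -> dict:
    """Count columnar-to-row and row-to-columnar nodes in a plan string."""
    return {
        "c2r": len(C2R_PATTERN.findall(plan_text)),
        "r2c": len(R2C_PATTERN.findall(plan_text)),
    }


# Hypothetical plan text, made up for illustration only.
sample_plan = """
VeloxColumnarToRow
+- ^ ProjectExecTransformer
   +- RowToVeloxColumnar
      +- Filter
         +- VeloxColumnarToRow
            +- ^ FileScanTransformer
"""

print(count_transitions(sample_plan))
```

Running the same count on the plan of a Delta query and on the equivalent plain Parquet query (for example, on plan text copied from the Spark UI) would give a first indication of how many extra transitions the Delta path introduces.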
-
Thanks @FelixYBW for your detailed response.
I am happy to contribute in any way. Could you please guide me on how to proceed?