January 30, 2026: Weekly Status Update in Gluten #11530
This weekly update is generated by LLMs. You're welcome to join us on GitHub for in-depth discussions.
Overall Activity Summary
The Gluten project showed strong momentum this week with 42 pull requests merged or opened, covering Velox version updates, Spark 4.x compatibility improvements, infrastructure enhancements, and bug fixes. Key themes include advancing Delta Lake write support, improving CI/CD infrastructure, and expanding test coverage for multiple Spark versions.
Key Ongoing Projects
Velox Backend Enhancements
- Daily Velox Version Updates: @GlutenPerfBot continues maintaining upstream compatibility with regular updates ([GLUTEN-6887][VL] Daily Update Velox Version (2026_01_30) #11529 , [GLUTEN-6887][VL] Daily Update Velox Version (2026_01_27) #11498 )
- Delta Lake Native Write Support: @zhztheplayer is leading efforts to eliminate columnar-to-row (C2R) overhead in Delta writes by adding a native statistics tracker ([GLUTEN-10215][VL] Delta write: Native statistics tracker to eliminate C2R overhead #11419 )
- Broadcast Hash Join Optimization: @JkSelf optimized broadcast hash join (BHJ) in the Velox backend, showing a 1.29x speedup in TPC-DS Q23a ([GLUTEN-7548][VL] Optimize BHJ in velox backend #8931 )
Spark 4.x Compatibility
- Comprehensive Test Suite Addition: @baibaichen added 426 missing Gluten test suites for Spark 4.0 and 4.1 ([UT] Add missing Gluten test suites for Spark 4.0 and 4.1 #11512 )
- Python Version Updates: @ReemaAlzaid upgraded CI to Python 3.10 for Spark 4.1 compatibility ([VL] Update CI Python to 3.10 for Spark 4.1 and enable ArrowEvalPythonExecSuite tests #11481 )
- Unit Test Fixes: @loudongfeng is addressing remaining Spark 4.0/4.1 test failures ([UT] Fix some of UT that marked as TODO for Spark-4.0 #11520 )
Infrastructure Improvements
- Maven Wrapper Migration: @yaooqinn standardized CI workflows to use the ./build/mvn wrapper ([CORE] Use build/mvn wrapper for scheduled jobs and in Dockerfiles #11515 , [VL] Use build/mvn wrapper in velox_backend_enhanced and velox_backend_arm workflows #11496 )
- CentOS 9 Support: @ReemaAlzaid added CentOS 9 CI support with 6 new test jobs ([VL][CI] Migrate Spark 4.1 tests to CentOS 9 #11519 )
- Docker Optimization: @zhouyuan implemented m2 repository caching for faster builds ([VL] Adding cache for m2 repo #11469 )
Priority Items
Critical Bug Fixes
- Decimal Partition Key Reading: @zhouyuan fixed decimal partition key support ([GLUTEN-11402][VL] fix reading decimal partition key #11518 )
- Window to Aggregate Conversion: @lgbo-ustc resolved validation issues in window function conversions (Fix window to aggregate conversion with ordering expression validation #11523 )
- Parquet Write Options: @boneanxs fixed an integer overflow when reading Parquet write options ([VL] Fix stoi issue when get parquet write options #11504 )
Performance Optimizations
- StrictRule Refactoring: @beliefer achieved a 3.52% average performance improvement in TPC-DS benchmarks ([GLUTEN-10559] Simplify StrictRule and remove unnecessary DummyLeafExec #10553 )
- Batch Type Identification: @beliefer removed repeated batch type identification calls for a 1.34% performance gain ([GLUTEN-10649] Avoid repeated calls to identifiyBatchType #10573 )
Notable Discussions
Community Building
- Slack Channel Launch: @zhouyuan announced the new #incubator-gluten Slack channel for real-time community interaction (Gluten Slack Channel #gluten #8429 )
Technical Challenges
- GPU/CPU Mixed Cluster Scheduling: @jinchengchenghh outlined requirements for intelligent task scheduling between GPU and CPU nodes ([VL] GPU and CPU mixed cluster schedule #11524 )
- Flink Integration Performance: @ParyshevSergey raised concerns about Velox4j performance in Flink streaming scenarios ([GLUTEN][FLINK] Nexmark q0 performance #11508 )
Emerging Trends
- Multi-Backend Maturation: Strong focus on both Velox and ClickHouse backend improvements
- Spark Version Parity: Accelerated efforts to support Spark 4.x features and maintain backward compatibility
- Native Format Optimization: Continued push to eliminate C2R transitions for better performance
- Infrastructure Modernization: Systematic updates to CI/CD, dependency management, and build processes
Good First Issues
#11511: CI Migration to CentOS 9
Skills Needed: GitHub Actions, Docker, CI/CD
Why Good: Well-defined scope with existing CentOS 8 implementation as reference. Great introduction to Gluten's testing infrastructure.
#11509: TreeMemoryConsumer Thread Safety
Skills Needed: Java concurrency, memory management
Why Good: Clear problem description with error examples. Excellent for understanding Gluten's memory architecture.
#11501: Docker Java Dependencies Caching
Skills Needed: Docker, Maven, Build optimization
Why Good: Tangible performance impact with clear success metrics. Good entry point into build system improvements.
#11513: Iceberg input_file_name() Fix
Skills Needed: File format handling, debugging
Why Good: Isolated issue with clear expected behavior. Good introduction to file format integration.
#11400: Spark 4.1 Test Suite Completion
Skills Needed: Spark internals, testing
Why Good: Multiple sub-tasks available with varying complexity. Excellent way to learn Spark compatibility requirements.