[Help] Spark 4.x status update and new sub-issues #11925

baibaichen started this conversation in General

Hi all,

I've completed a systematic triage of Spark 4.x related test issues. Here's a summary of the new issues created and how to get involved.

1. Simple suite fixes in #11550 — self-assign welcome

Several disabled suites in #11550 only require simple exclude + testGluten rewrites with no core-layer changes. These are marked with an empty Owner or "RC" in the table. If you have permission, feel free to assign yourself directly. Otherwise, leave a comment on the issue and I'll update the owner for you.

Examples include GlutenExplainSuite, GlutenPlannerSuite, GlutenProjectedOrderingAndPartitioningSuite, GlutenRemoveRedundantProjectsSuite, and GlutenRemoveRedundantSortsSuite.

2. Bug sub-issues under #11550

These require deeper fixes at the Gluten core or C++ layer:

| Issue | Description |
| --- | --- |
| #11911 | Enable Structured Streaming test suites (20 disabled suites) |
| #11912 | JNI and Velox exception handling loses Spark error condition and exception type |
| #11913 | Velox split function returns incorrect results with limit parameter (SPARK-49968) |
| #11914 | Support Parquet struct field compatibility improvements (SPARK-53535) |
| #11915 | Support checksum-based shuffle writers (SPARK-53322) |
| #11916 | Diagnose and enable TODO SQL query test files |
| #11917 | Velox decimal arithmetic does not respect allowPrecisionLoss context (SPARK-53968) |
| #11918 | CastTransformer does not pass per-expression timezone for timestamp formatting |
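
For anyone picking up #11913, it may help to have the expected behaviour of the limit parameter written down. The following is a minimal Python sketch of the limit semantics that the Spark SQL documentation describes for split(str, regex, limit): a positive limit caps the array length and leaves the remainder of the input in the last entry, while a limit of zero or less splits as many times as possible. This is not the Velox or Gluten implementation. The helper name spark_like_split is hypothetical, and Python's re module stands in for Java regex here, so the sketch is only meaningful for simple patterns without capture groups. It does not try to reproduce the specific case reported in SPARK-49968.

```python
import re


def spark_like_split(text, pattern, limit=-1):
    """Approximate Spark's documented split(str, regex, limit) semantics.

    Hypothetical helper for illustration only; assumes `pattern` has no
    capture groups and is valid in both Java and Python regex dialects.
    """
    if limit == 1:
        # At most one element: the whole input, no splitting at all.
        return [text]
    # limit > 1: perform at most limit - 1 splits, so the last entry keeps
    # everything after the last applied match.
    # limit <= 0: maxsplit=0 tells re.split to split as often as possible.
    maxsplit = limit - 1 if limit > 1 else 0
    return re.split(pattern, text, maxsplit=maxsplit)


print(spark_like_split("oneAtwoBthreeC", "[ABC]", 2))   # ['one', 'twoBthreeC']
print(spark_like_split("oneAtwoBthreeC", "[ABC]", -1))  # ['one', 'two', 'three', '']
```

The two printed cases mirror the examples in the Spark function reference, which makes them a reasonable starting point for a regression test comparing the native result with vanilla Spark.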

3. Feature sub-issues under #11910 (Spark 4.x new feature tracking)

These are new Spark 4.x features that Gluten/Velox does not yet support natively:

| Issue | Description | Spark version |
| --- | --- | --- |
| #11919 | Add TimeType (TIME data type) support (SPARK-51162) | 4.1 |
| #11920 | Support dual-mode ColumnarToRow nodes (SPARK-51474) | 4.1 |
| #11921 | Support NullType Parquet read/write (SPARK-54220) | 4.1 |
| #11922 | Support memory shuffle spill by size threshold (SPARK-49386) | 4.1 |

Additionally, existing issues #11371 (Variant) and #10134 (ANSI mode) have been linked under #11910 as well.

Note on Variant and ANSI

I've taken #11371 (Variant) and #10134 (ANSI mode) mainly to coordinate the overall effort, not to work on them exclusively. As we dig deeper, more sub-issues may be created. Contributions are very welcome.

Feel free to pick up any issue that interests you. Questions and discussions are welcome in the respective issue threads.

Thanks,
Chang Chen


Replies: 1 comment


I'd like to take #11921 (Parquet NullType).
