I am trying to write a custom decoder function in Java targeting Spark 4.0:
public class MyDataToCatalyst extends UnaryExpression implements NonSQLExpression, ExpectsInputTypes, Serializable {
//...
}
However, I cannot get this to compile, because the class definition of UnaryExpression is invalid according to Java's type hierarchy rules: the two final methods implementing its Scala trait UnaryLike[Expression] have an invalid return type:
public abstract class UnaryExpression extends Expression implements UnaryLike<Expression> {
public final TreeNode mapChildren(Function1);
public final TreeNode withNewChildrenInternal(IndexedSeq);
//...
}
UnaryExpression extends Expression, which is a TreeNode<Expression>, so both methods should have the return type Expression. The same goes for UnaryLike<Expression>.
Accordingly, the compiler fails with these errors:
'mapChildren(Function1)' in 'org.apache.spark.sql.catalyst.expressions.UnaryExpression' clashes with 'mapChildren(Function1<BaseType, BaseType>)' in 'org.apache.spark.sql.catalyst.trees.TreeNode'; incompatible return type
'withNewChildrenInternal(IndexedSeq)' in 'org.apache.spark.sql.catalyst.expressions.UnaryExpression' clashes with 'withNewChildrenInternal(IndexedSeq<BaseType>)' in 'org.apache.spark.sql.catalyst.trees.TreeNode'; incompatible return type
This actually looks to me like a bug in the Scala compiler, generating a bad class file.
If the methods were not final, I could probably just have overridden them in my subclass to work around the issue; but as it stands, I'm stuck.
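To make the clash concrete, here is a small self-contained Java sketch (with simplified stand-ins, not Spark's real classes) of why the class file ends up with a raw TreeNode return type: the erasure of the F-bounded type parameter T is its bound, TreeNode.

```java
import java.lang.reflect.Method;
import java.util.function.UnaryOperator;

public class ErasureDemo {
    // Simplified stand-in for Spark's TreeNode (assumption: not the real class).
    interface TreeNode<T extends TreeNode<T>> {
        T mapChildren(UnaryOperator<T> f);
    }

    // What the Scala-compiled UnaryExpression effectively declares is the ERASED
    // form below; uncommenting it should reproduce an "incompatible return type"
    // style error from javac, much like the one in the question:
    //
    // static abstract class Expression implements TreeNode<Expression> {}
    // static abstract class UnaryExpression extends Expression {
    //     public final TreeNode mapChildren(UnaryOperator f) { return null; } // clash
    // }

    public static void main(String[] args) throws Exception {
        Method m = TreeNode.class.getMethod("mapChildren", UnaryOperator.class);
        // T erases to its bound TreeNode -- the raw type seen in the class file:
        System.out.println(m.getReturnType().getSimpleName()); // prints "TreeNode"
    }
}
```

In plain Java, the generic `Signature` attribute lets javac reconcile the erased method with `TreeNode<Expression>`; the trouble is that the Scala-emitted final methods present a signature javac cannot reconcile, so it reports a clash.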
-
I don't know my way around Scala well enough, but this looks suspiciously like this issue: github.com/scala/bug/issues/6103 – Carsten, Aug 26, 2025 at 17:22
-
Don't attempt to extend Spark with Java; use Scala directly. Is there a reason why you must use Java? Also be aware that although creating custom expressions is great for performance and optimisation options, you will hit source (and runtime) compatibility issues with each minor release (and sometimes with individual builds). – Chris, Aug 27, 2025 at 5:53
-
The reason is that we're a Java shop, and having to use Scala would be a big hurdle for the team(s), both in terms of know-how and in terms of willingness. The custom expression exists for several reasons: one is that it sits in the innermost loop of a structured streaming pipeline, another is that I'm using a bunch of Spark code that deals with InternalRow. – Carsten, Aug 27, 2025 at 9:24
-
Ah well, I fear it will be a far bigger hurdle trying to use Java even if this language-mismatch issue didn't exist. You only need Scala for the bits that directly interact with or extend Spark, and a lot of that you can copy and paste from other projects, e.g. this simple example. There is nothing stopping those bits of code from calling your Java code. Best of luck. – Chris, Aug 27, 2025 at 10:05
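The split described above can be sketched from the Java side: all of the decoding logic stays in plain Java behind a simple java.util.function.Function, and a few lines of Scala (shown only as a hedged comment below; the names MyDecoder and DECODE are made up for illustration) would extend Spark's UnaryExpression and merely forward to it.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// All real logic lives in plain Java and is unit-testable without Spark.
public final class MyDecoder {
    // Hypothetical example logic: decode a UTF-8 payload to a String.
    public static final Function<byte[], String> DECODE =
        bytes -> new String(bytes, StandardCharsets.UTF_8);

    private MyDecoder() {}

    // The matching Scala shim (a sketch, not verified against a specific Spark
    // build) would be only a few lines and just delegate back to this class:
    //
    //   case class MyDataToCatalyst(child: Expression) extends UnaryExpression {
    //     override def dataType: DataType = StringType
    //     override protected def nullSafeEval(v: Any): Any =
    //       UTF8String.fromString(MyDecoder.DECODE.apply(v.asInstanceOf[Array[Byte]]))
    //     override protected def withNewChildInternal(newChild: Expression): Expression =
    //       copy(child = newChild)
    //   }

    public static void main(String[] args) {
        System.out.println(DECODE.apply("hello".getBytes(StandardCharsets.UTF_8))); // prints "hello"
    }
}
```

This keeps the Scala surface area to the one class that must satisfy the trait hierarchy, while the team maintains everything else in Java.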