JBang for UDFs #1489

mbroecheler started this conversation in Ideas
Aug 13, 2025 · 3 comments · 2 replies

To implement a user-defined function (UDF) for a DataSQRL pipeline, the user currently has to set up an entire Maven or Gradle project and remember to compile it to a jar before compiling the pipeline. For simple UDFs, that is a lot of work and a source of errors.

To support scripting of UDFs, I propose that we add a preprocessor for *.java and *.kt (kotlin) files which looks for

//DEPS com.datasqrl.flinkrunner:stdlib-utils

which is the base java module for all datasqrl UDFs (to support auto-discovery and provide general utility functions).
Any file containing this is a UDF implementation and we compile it to a jar:

jbang export jar --output [name-of-file]-jar.jar [name-of-file].java/kt

We then add the jar to the lib folder so it can be used by the compiler and pipeline.
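
A minimal sketch of what that preprocessor step could look like, assuming a plain text scan for the marker line. The class and method names are hypothetical, the export mode is passed in rather than fixed, and the command is only assembled here, not executed:

```java
import java.util.List;

/** Hypothetical preprocessor helper; not part of DataSQRL. */
class UdfScriptPreprocessor {

    // Marker from the proposal: a script declaring this dependency is treated as a UDF.
    static final String MARKER = "//DEPS com.datasqrl.flinkrunner:stdlib-utils";

    /** True if the file is a .java or .kt file and one of its lines starts with the marker. */
    static boolean isUdfScript(String fileName, String source) {
        boolean scriptFile = fileName.endsWith(".java") || fileName.endsWith(".kt");
        return scriptFile
                && source.lines().map(String::strip).anyMatch(line -> line.startsWith(MARKER));
    }

    /** Output jar name as proposed: [name-of-file]-jar.jar. Assumes the name has an extension. */
    static String jarName(String fileName) {
        return fileName.substring(0, fileName.lastIndexOf('.')) + "-jar.jar";
    }

    /** Assembles the export command so that the jar lands in the given lib folder. */
    static List<String> exportCommand(String mode, String libDir, String fileName) {
        return List.of("jbang", "export", mode, "--output",
                libDir + "/" + jarName(fileName), fileName);
    }

    public static void main(String[] args) {
        String source = MARKER + "\npublic class MyUdf {}\n";
        if (isUdfScript("MyUdf.java", source)) {
            System.out.println(String.join(" ", exportCommand("portable", "lib", "MyUdf.java")));
        }
    }
}
```

The resulting list could be handed to `ProcessBuilder` during pipeline compilation. Files without the marker are left alone.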

This would compile the java and kotlin scripts with the pipeline compilation, eliminating any steps the user has to take and greatly simplifying UDF implementations.

To make this efficient, we need to install JBang in the compiler's Docker image and preload all necessary JDK dependencies as well as the Maven dependency for com.datasqrl.flinkrunner:stdlib-utils, so they are not downloaded on every compile.


Replies: 3 comments 2 replies


What about emphasizing Python for simple UDFs instead? Writing a simple Python function in one file will always be simpler and more lightweight than Java. Also, vectorized UDFs are Python-only in Flink, which is another cool thing.

Of course, this brings the burden of managing a Python version in our environments and configuring it properly for Flink (not an easy thing on its own, been there, done that...). And when someone wants to get more serious and bring in custom dependencies and the like, the user side can become complex and hard to manage as well.

But if we'd like to focus on the simple and lightweight stuff, IMO Python should be a first-class citizen.


mbroecheler Aug 13, 2025
Maintainer Author

We should definitely also look into supporting Python. But the interface to Python isn't great: there is no type checking for input types, and you have to annotate the result type. This is error-prone. On top of that comes all the complexity you mentioned about maintaining a separate runtime, and the subtle differences that lead to operational complexity.

Using Python is primarily appealing to developers who already use Python. We are targeting enterprise developers, and the majority of those know Java and the JVM. I think we can come up with a user experience that is nicer than Python and just as simple (in particular with Kotlin, which has a similar scripting feel to Python), with none of the operational headaches.

That does not mean we should rule out Python. But I don't think Python should be our only answer to scripting UDFs.


After prototyping the solution a bit: it gets more complex when you want to pull in dependencies that are not provided by the Flink cluster.
Building a fat jar is not a great strategy because we end up with version conflicts.

It seems the best option is to do the following:

jbang export portable --output [name-of-file]-jar.jar [name-of-file].java/kt
# Then: also move all jars from the lib folder that are not flink, flink runner, or included jars (e.g. jackson)
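
The jar-moving step could be sketched as a simple filename filter. This is only an illustration: the prefix list standing in for "flink, flink runner, or included jars" is a hypothetical placeholder, and the class and method names are made up for the example:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that decides which exported dependency jars to move. */
class ExportedLibFilter {

    // Assumption: jars with these name prefixes are already provided by the Flink runtime.
    static final List<String> PROVIDED_PREFIXES = List.of("flink-", "jackson-");

    /** True if the jar name starts with one of the provided prefixes. */
    static boolean isProvided(String jarFileName) {
        return PROVIDED_PREFIXES.stream().anyMatch(jarFileName::startsWith);
    }

    /** Keeps only .jar files that are not already provided; these are the ones to move. */
    static List<String> jarsToMove(List<String> exportedFiles) {
        return exportedFiles.stream()
                .filter(name -> name.endsWith(".jar"))
                .filter(name -> !isProvided(name))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // File names are illustrative only.
        System.out.println(jarsToMove(List.of(
                "flink-table-common-1.19.3.jar",
                "jackson-databind-2.15.2.jar",
                "commons-text-1.10.0.jar")));
    }
}
```

A real implementation would likely need a more precise notion of "provided" than a name prefix, for example a list derived from the runner's own dependency set.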

The tricky bit is that this requires version alignment between Flink and the UDF implementations. The upside of this approach is that for multiple UDFs we only use one copy of the dependencies.
For simple to medium-complexity UDFs this should be good enough, since depending on generic versions of Jackson, Flink, etc. should be viable. For complex UDFs the user needs to use Maven or Gradle so they can shade (they can use JBang to bootstrap this).

The goal is to cover use cases with a handful of UDFs that are fairly straightforward, i.e. <200 LOC with only a few generic dependencies.


I played with this a bit, and JBang does not track down transitive dependencies. So adding

//DEPS com.datasqrl.flinkrunner:stdlib-utils

is not enough: it explicitly requires both the Google AutoService and the Flink table-common dependencies to find @AutoService and ScalarFunction and to compile even the most trivial UDF:

//DEPS org.apache.flink:flink-table-common:1.19.3
//DEPS com.google.auto.service:auto-service:1.1.1

Also, there are now a fatjar and a local export option. fatjar copies every transitive dependency, so with just these two it produces a 20+ MB jar that includes Guava and a bunch of other stuff. At the moment there is no way to exclude anything, AFAIK. So any UDF built this way is limited to dependencies that are already available on the classpath, which means stdlib. That is not an issue, I guess, since the original idea also states that this shortcut is targeted at lightweight scripts, but adding all the deps one by one will be necessary anyway.
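
Since every dependency then has to be declared explicitly with a version, one way to catch mismatches early would be to compare the declared //DEPS coordinates against the versions the runtime provides. A rough sketch, assuming whitespace-separated `group:artifact:version` coordinates on each directive line and a hypothetical map of provided versions (nothing here is an existing DataSQRL or JBang API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical check for //DEPS versions that differ from what the runtime provides. */
class DepsVersionCheck {

    /** Collects the coordinates from all lines that start with //DEPS. */
    static List<String> parseDeps(String source) {
        List<String> coords = new ArrayList<>();
        for (String line : source.split("\\R")) {
            String trimmed = line.strip();
            if (!trimmed.startsWith("//DEPS ")) continue;
            for (String coord : trimmed.substring("//DEPS ".length()).trim().split("\\s+")) {
                if (!coord.isEmpty()) coords.add(coord);
            }
        }
        return coords;
    }

    /** Returns coordinates whose group:artifact is provided with a different (or no declared) version. */
    static List<String> misaligned(List<String> coords, Map<String, String> provided) {
        List<String> problems = new ArrayList<>();
        for (String coord : coords) {
            String[] parts = coord.split(":");
            if (parts.length < 2) continue;          // not a group:artifact coordinate
            String expected = provided.get(parts[0] + ":" + parts[1]);
            if (expected == null) continue;          // not provided by the runtime, nothing to align
            String declared = parts.length >= 3 ? parts[2] : null;
            if (!expected.equals(declared)) problems.add(coord);
        }
        return problems;
    }

    public static void main(String[] args) {
        String script = "//DEPS org.apache.flink:flink-table-common:1.19.3\n"
                + "//DEPS com.google.auto.service:auto-service:1.1.1\n";
        // The provided version below is made up to trigger a mismatch.
        Map<String, String> provided = Map.of("org.apache.flink:flink-table-common", "1.19.2");
        System.out.println(misaligned(parseDeps(script), provided));
    }
}
```

Such a check could run in the same preprocessor pass that detects the scripts, turning a version mismatch into a compile-time message instead of a runtime classpath conflict.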


mbroecheler Aug 28, 2025
Maintainer Author

Yes, it is fine to only support dependencies that are on the classpath. The goal is to support simple UDFs. Users who need complex dependencies should go the Maven/Gradle route.
Once we have this working, we can think about what types of dependencies a user may frequently need (Jackson, OkHttp) and provide those in the Flink SQL runner, but that's a secondary step. For now, we just want to make it easy to build custom string manipulation, math, and other procedural code that's mostly standard Java.

As for the transitive dependencies: that's a bummer, since it makes things more verbose and creates explicit dependencies with versions that need to align. That will be a frequent source of errors. Is there a way we could create a module in the Flink SQL runner that includes those dependencies explicitly, since all of it will be provided by the Flink SQL runner anyway?
