I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups. Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
...
ELSE -- IPv4 path
inet_aton(ip_address)
END
All test inputs are IPv6, so I expect the IPv4 branch to never be evaluated.
But when running:
WITH test_ipv6 AS (
SELECT * FROM VALUES
('3FFE:FFFF:7654:FEDA:1245:BA98:3210:4562', 'Example IPv6'),
...
)
SELECT
ip_address,
get_geo_location(ip_address),
get_isp_location(ip_address),
get_geo_country_code(ip_address),
get_isp_country_code(ip_address)
FROM test_ipv6;
I get:
[INVALID_ARRAY_INDEX] The index 4 is out of bounds.
The error originates from my IPv4-only helper UDF:
CREATE OR REPLACE FUNCTION inet_aton(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
SELECT
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) * POW(256, 3) +
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) * POW(256, 2) +
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) * POW(256, 1) +
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) * POW(256, 0)
);
This function should never run for IPv6 inputs, but Databricks appears to evaluate or partially evaluate it anyway.
Does Databricks / Spark SQL guarantee lazy evaluation of CASE or can the optimizer evaluate expressions in branches that are not taken?
For SQL UDFs, does the optimizer treat the UDF body as fully transparent and attempt folding/evaluation regardless of the CASE predicate?
Is there any official guidance or best practice for safely branching IPv4 vs IPv6 logic inside SQL UDFs?
-
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch (docs.databricks.com/aws/en/udf/…). The Databricks docs explicitly ask us to use IF and CASE WHEN for the order of evaluation to be honored, so it should ideally work and not behave this way. Did you check with IF/ELSE instead of CASE to see if it works? – Vindhya G, Dec 9, 2025 at 5:07
-
I appreciate your response, but I also tried rewriting the logic using IF expressions instead of CASE, and I still see the same error. – YJCMS, Dec 9, 2025 at 8:56
-
Just a wild guess: are you sure all rows meet the first condition? Could there be rows that match neither IPv4 nor IPv6? I just want to rule that out before we check anything further. – Vindhya G, Dec 9, 2025 at 10:31
-
Even when I provide a single valid IPv6 address, I still encounter the same error. So it does not appear to be a case of rows failing both conditions and falling into an unintended branch: the error occurs even when the input is strictly a valid IPv6 value, which should trigger only the IPv6 path. – YJCMS, Dec 10, 2025 at 7:40
-
Please post the actual code used; your SELECT doesn't have a CASE in it and will fail. That said, this is likely due to subexpression elimination: if you have element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) in more than one place, it will always be evaluated (see here). – Chris, Dec 10, 2025 at 11:43
1 Answer
Putting this in an answer so it is easier to find:
This is likely due to subexpression elimination: if you have element_at(regexp_extract_all(ip_addr, '(\d+)'), 4) in more than one place, it will always be evaluated (see here).
Spark, and therefore Databricks, is free to optimise the query, so you cannot assume lazy evaluation of CASE branches. All optimised expressions (usually repeated subexpressions) are evaluated before any other code in a projection/query is processed. The subexpression-elimination phase swaps those expressions out for proxies when running in interpreted mode, and in whole-stage codegen they are compiled out to common variables.
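Given that, the practical fix is to make the UDF body safe to evaluate on any input rather than relying on the CASE branch to shield it. A minimal sketch, assuming the same four-octet logic as the original inet_aton and using try_element_at (available in Spark 3.3+ / recent Databricks runtimes), which returns NULL for an out-of-range index instead of raising INVALID_ARRAY_INDEX; inet_aton_safe is a hypothetical name:

```sql
-- Hypothetical rewrite: try_element_at yields NULL (not an error) when the
-- array has fewer than four digit groups, so the expression is harmless even
-- if subexpression elimination evaluates it for IPv6 rows.
CREATE OR REPLACE FUNCTION inet_aton_safe(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) AS BIGINT) * 16777216 +
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) AS BIGINT) * 65536 +
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) AS BIGINT) * 256 +
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) AS BIGINT)
);

-- With fewer than four digit groups the result is NULL rather than an error:
-- SELECT inet_aton_safe('192.168.0.1');  -- 3232235521
-- SELECT inet_aton_safe('::1');          -- NULL
```

An alternative with the same effect is to extract the array once (e.g. in a scalar subquery) and guard the arithmetic with a size(parts) = 4 check, so no indexed access can ever go out of bounds regardless of which branch the optimizer evaluates.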