I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups. Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
...
ELSE -- IPv4 path
inet_aton(ip_address)
END
All test inputs are IPv6, so I expect the IPv4 branch to never be evaluated.
But when running:
WITH test_ipv6 AS (
SELECT * FROM VALUES
('3FFE:FFFF:7654:FEDA:1245:BA98:3210:4562', 'Example IPv6'),
...
)
SELECT
ip_address,
get_geo_location(ip_address),
get_isp_location(ip_address),
get_geo_country_code(ip_address),
get_isp_country_code(ip_address)
FROM test_ipv6;
I get:
[INVALID_ARRAY_INDEX] The index 4 is out of bounds.
The error originates from my IPv4-only helper UDF:
CREATE OR REPLACE FUNCTION inet_aton(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
SELECT
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) * POW(256, 3) +
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) * POW(256, 2) +
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) * POW(256, 1) +
element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) * POW(256, 0)
);
This function should never run for IPv6 inputs, but Databricks appears to evaluate or partially evaluate it anyway.
Does Databricks / Spark SQL guarantee lazy evaluation of CASE or can the optimizer evaluate expressions in branches that are not taken?
For SQL UDFs, does the optimizer treat the UDF body as fully transparent and attempt folding/evaluation regardless of the CASE predicate?
Is there any official guidance or best practice for safely branching IPv4 vs IPv6 logic inside SQL UDFs?
-
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch (docs.databricks.com/aws/en/udf/…). The Databricks docs explicitly ask us to use IF and CASE WHEN for the order of evaluation to be honored, so it should ideally work and not behave this way. Did you check with IF/ELSE instead of CASE to see if it works? – Vindhya G, Dec 9, 2025 at 5:07
-
I appreciate your response, but I also tried rewriting the logic using IF expressions instead of CASE, and I still see the same error. – YJCMS, Dec 9, 2025 at 8:56
-
Just a wild guess: are you sure all rows meet the first condition? Could there be rows that match neither IPv4 nor IPv6? I just want to rule that out before we check anything further. – Vindhya G, Dec 9, 2025 at 10:31
-
Even when I provide a single valid IPv6 address, I still encounter the same error. So it does not appear to be a case of rows failing both conditions and falling into an unintended branch: the error occurs even when the input is strictly a valid IPv6 value, which should trigger only the IPv6 path. – YJCMS, Dec 10, 2025 at 7:40
-
Please post the actual code used; your SELECT doesn't have a CASE in it and will fail. That said, this is likely due to subexpression elimination: if you have element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) in more than one place, it will always be evaluated (see here). – Chris, Dec 10, 2025 at 11:43
1 Answer
Putting this in an answer so it is easier to find:
This is likely due to subexpression elimination: if you have element_at(regexp_extract_all(ip_addr, '(\d+)'), 4) in more than one place, it will always be evaluated (see here).
Spark, and therefore Databricks, is free to optimise the query, so you cannot assume lazy evaluation of CASE branches. All optimised expressions (usually repeated subexpressions) are evaluated before any other code in a projection/query is processed. The subexpression-elimination phase swaps those expressions out for proxies when running in interpreted mode, and in whole-stage codegen they are compiled out to common variables.
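Given that, the practical fix is to make the UDF body safe to evaluate on any input rather than relying on the CASE branch to shield it. A minimal sketch, assuming the same four-octet logic as the original inet_aton and using try_element_at (available in Spark 3.3+ / recent Databricks runtimes), which returns NULL for an out-of-range index instead of raising INVALID_ARRAY_INDEX; inet_aton_safe is a hypothetical name:

```sql
-- Hypothetical rewrite: try_element_at yields NULL (not an error) when the
-- array has fewer than four digit groups, so the expression is harmless even
-- if subexpression elimination evaluates it for IPv6 rows.
CREATE OR REPLACE FUNCTION inet_aton_safe(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) AS BIGINT) * 16777216 +
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) AS BIGINT) * 65536 +
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) AS BIGINT) * 256 +
  CAST(try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) AS BIGINT)
);

-- With fewer than four digit groups the result is NULL rather than an error:
-- SELECT inet_aton_safe('192.168.0.1');  -- 3232235521
-- SELECT inet_aton_safe('::1');          -- NULL
```

An alternative with the same effect is to extract the array once (e.g. in a scalar subquery) and guard the arithmetic with a size(parts) = 4 check, so no indexed access can ever go out of bounds regardless of which branch the optimizer evaluates.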