Stack Overflow
IPv4-only UDF fails when querying IPv6-only data

I'm using Databricks SQL and have several scalar SQL UDFs (get_geo_location, get_isp_location, get_geo_country_code, get_isp_country_code) for GeoIP / ISP lookups. Each UDF branches on IPv4 vs IPv6 using a CASE expression like:

CASE
 WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
 ...
 ELSE -- IPv4 path
 inet_aton(ip_address)
END

All test inputs are IPv6, so I expect the IPv4 branch to never be evaluated.
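A minimal expression of the same shape (the literal predicate and the deliberately out-of-range index here are just placeholders for the real logic) can be used to test whether CASE short-circuits:

```sql
-- If CASE is evaluated lazily, the ELSE branch should never run and this
-- should return 0 rather than raise INVALID_ARRAY_INDEX.
SELECT CASE
         WHEN '::1' LIKE '%:%' THEN 0
         ELSE element_at(array(1, 2, 3), 5)
       END AS result;
```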

But when running:

WITH test_ipv6 AS (
 SELECT * FROM VALUES
 ('3FFE:FFFF:7654:FEDA:1245:BA98:3210:4562', 'Example IPv6'),
 ...
)
SELECT
 ip_address,
 get_geo_location(ip_address),
 get_isp_location(ip_address),
 get_geo_country_code(ip_address),
 get_isp_country_code(ip_address)
FROM test_ipv6;

I get:

[INVALID_ARRAY_INDEX] The index 4 is out of bounds. The array has 3 elements.
Use the SQL function `get()` to tolerate accessing element at invalid index.

The error originates from my IPv4-only helper UDF:

CREATE OR REPLACE FUNCTION inet_aton(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
 SELECT
 element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) * POW(256, 3) +
 element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) * POW(256, 2) +
 element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) * POW(256, 1) +
 element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) * POW(256, 0)
);
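For what it's worth, a defensive variant along the lines the error message itself suggests would swap element_at for get(), which is 0-based and returns NULL for an out-of-range index instead of raising. This is an untested sketch, and inet_aton_safe is a hypothetical name; it also uses integer constants instead of POW to avoid the DOUBLE intermediate:

```sql
-- Untested sketch: get() yields NULL (rather than an error) when the index
-- is out of bounds, so a malformed or IPv6 input produces NULL instead of
-- INVALID_ARRAY_INDEX even if this branch is evaluated anyway.
CREATE OR REPLACE FUNCTION inet_aton_safe(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
  SELECT
    get(regexp_extract_all(ip_addr, '(\\d+)'), 0) * 16777216 +  -- 256^3
    get(regexp_extract_all(ip_addr, '(\\d+)'), 1) * 65536 +     -- 256^2
    get(regexp_extract_all(ip_addr, '(\\d+)'), 2) * 256 +
    get(regexp_extract_all(ip_addr, '(\\d+)'), 3)
);
```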

This function should never run for IPv6 inputs, but Databricks appears to evaluate or partially evaluate it anyway.

Does Databricks / Spark SQL guarantee lazy evaluation of CASE, or can the optimizer evaluate expressions in branches that are not taken?

For SQL UDFs, does the optimizer treat the UDF body as fully transparent and attempt folding/evaluation regardless of the CASE predicate?

Is there any official guidance or best practice for safely branching IPv4 vs IPv6 logic inside SQL UDFs?
