Regex help in Snowflake

Question 1

I am trying to write a regex that can replace '[', ']' and '].' with '_' and in cases where ']' is the last character it should be replaced with '' but I am struggling to come up with a regex for it that works in all cases because Snowflake does not support lookahead & lookbehind.

My question is, has anyone tried/achieved to do this before? or is it impossible and I should just give up??

Eg:

look_behind[0] --> look_behind_0
look_behind[0]_positive[1] --> look_behind_0_positive_1

Question 2

Can you show us your attempts

Question 3

Hi, and welcome to dba.se! You have in your two record sample, the datum look_behind[0]_positive[1] which you want converted to look_behind_0_positive_1. You only have one underscore after the 0 - should that be two underscores or should it be only one? This changes the question greatly!

Question 4

As far as I can see, you have no need to use a regex at all!

Using a search engine, I found that the functions LEFT() (manual for all string functions), RIGHT() and TRANSLATE() have the same signatures (mostly) in Snowflake (1, 2 & 3) as in PostgreSQL. All of the code below is available on the fiddle here.

Populate with your sample data with extra records:

INSERT INTO tab VALUES 
('look_behind[0]'),
('look_behind[0]_positive[1]'),
('No_[brackets]_[at]_end'), -- two added records for testing
('[]][][_sderfrx÷=/_[]ddf]');

Then, we do this:

--
-- Works where the string <underscore><left-bracket> (_[) OR the string 
-- <right-bracket><underscore> (]_) are to be replaced by two underscores.
--
-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- It is worth noting that the order of the replacements is important as
-- the operations are not commutative - i.e. x R y <> y R x (not necessarily!).
--
--
SELECT
 str,
 LENGTH(str) AS len,
 TRANSLATE
 (
 CASE
 WHEN (RIGHT(str, 1) = ']') THEN RIGHT(str, -1)
 ELSE str
 END,
 '[]',
 '__' -- two underscores - one for each bracket (left & right)
 ) AS f_str,
 REGEXP_COUNT
 (
 str,
 '[\[|\]]',
 1, -- default anyway - not required
 'i' -- not relevant for non-alpha characters
 ) AS f_str_chars_cnt
FROM
 tab;

Result:

str len f_str f_str_chars_cnt
look_behind[0] 14 look_behind_0 2
look_behind[0]_positive[1] 26 look_behind_0__positive_1 4
No_[brackets]_[at]_end 22 No__brackets___at__end 4
[]][][_sderfrx÷=/_[]ddf] 24 _______sderfrx÷=/___ddf 9

You could also do this:

--
-- Works where the string <underscore><left-bracket> (_[) OR the string 
-- <right-bracket><underscore> (]_) are to be replaced by one underscore. 
--
-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- It is worth noting that the order of the replacements is important as
-- the operations are not commutative - i.e. x R y <> y R x (not necessarily!).
--
--
SELECT
 str,
 TRANSLATE
 (
 CASE
 WHEN
 RIGHT (REPLACE(REPLACE(str, '_[', '_'), ']_', '_'), 1) = ']' THEN 
 LEFT (REPLACE(REPLACE(str, '_[', '_'), ']_', '_'), -1)
 ELSE
 str
 END,
 '[]',
 '__'
 )
FROM 
 tab;

Result (the reader can count the underscores!):

str translate
look_behind[0] look_behind_0
look_behind[0]_positive[1] look_behind_0_positive_1
No_[brackets]_[at]_end No__brackets___at__end
[]][][_sderfrx÷=/_[]ddf] _______sderfrx÷=/__ddf

Regexes are expensive relative to simple string functions. If you really want a regular expression, you can do the following (REGEXP_REPLACE() also has a similar signature in PostgreSQL as in Snowflake):

-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- Works where the string <underscore><left-bracket> (_[) AND the string 
-- <right-bracket><underscore> (]_) are to be replaced by two underscores (__).
--
-- The order of the replacements for the desired outcome in terms of what character/
-- string is replaced first is left up to the reader.
-- 
SELECT
 REGEXP_REPLACE
 (
 REGEXP_REPLACE
 (
 str, 
 '(.*)]$', -- replace with pattern 
 '1円'
 ),
 '[\[\]]',
 '_', 
 'g' -- "global" flag to replace all occurrences
 ) 
FROM
 tab;

Result: same as first (TRANSLATE()) snippet above.

Performance analysis (see end of fiddle):

The entire results of the fiddle aren't included here, but just to see the times is interesting.

TRANSLATE() - Execution Time: 0.037 ms
REGEXP_REPLACE() - Execution Time: 0.100 ms
REPLACE(...REPLACE() - Execution Time: 0.027 ms

The time taken for the two string function solutions is roughly 33% of the time taken for the regular expression solution. This is why it's always worth checking to see if an ordinary string function will suffice in cases such as this.

Vérace Vérace 31k9 gold badges73 silver badges86 bronze badges · Answer 1 · 2024-04-15 11:25:58Z

As far as I can see, you have no need to use a regex at all!

Using a search engine, I found that the functions LEFT() (manual for all string functions), RIGHT() and TRANSLATE() have the same signatures (mostly) in Snowflake (1, 2 & 3) as in PostgreSQL. All of the code below is available on the fiddle here.

Populate with your sample data with extra records:

INSERT INTO tab VALUES 
('look_behind[0]'),
('look_behind[0]_positive[1]'),
('No_[brackets]_[at]_end'), -- two added records for testing
('[]][][_sderfrx÷=/_[]ddf]');

Then, we do this:

--
-- Works where the string <underscore><left-bracket> (_[) OR the string 
-- <right-bracket><underscore> (]_) are to be replaced by two underscores.
--
-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- It is worth noting that the order of the replacements is important as
-- the operations are not commutative - i.e. x R y <> y R x (not necessarily!).
--
--
SELECT
 str,
 LENGTH(str) AS len,
 TRANSLATE
 (
 CASE
 WHEN (RIGHT(str, 1) = ']') THEN RIGHT(str, -1)
 ELSE str
 END,
 '[]',
 '__' -- two underscores - one for each bracket (left & right)
 ) AS f_str,
 REGEXP_COUNT
 (
 str,
 '[\[|\]]',
 1, -- default anyway - not required
 'i' -- not relevant for non-alpha characters
 ) AS f_str_chars_cnt
FROM
 tab;

Result:

str len f_str f_str_chars_cnt
look_behind[0] 14 look_behind_0 2
look_behind[0]_positive[1] 26 look_behind_0__positive_1 4
No_[brackets]_[at]_end 22 No__brackets___at__end 4
[]][][_sderfrx÷=/_[]ddf] 24 _______sderfrx÷=/___ddf 9

You could also do this:

--
-- Works where the string <underscore><left-bracket> (_[) OR the string 
-- <right-bracket><underscore> (]_) are to be replaced by one underscore. 
--
-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- It is worth noting that the order of the replacements is important as
-- the operations are not commutative - i.e. x R y <> y R x (not necessarily!).
--
--
SELECT
 str,
 TRANSLATE
 (
 CASE
 WHEN
 RIGHT (REPLACE(REPLACE(str, '_[', '_'), ']_', '_'), 1) = ']' THEN 
 LEFT (REPLACE(REPLACE(str, '_[', '_'), ']_', '_'), -1)
 ELSE
 str
 END,
 '[]',
 '__'
 )
FROM 
 tab;

Result (the reader can count the underscores!):

str translate
look_behind[0] look_behind_0
look_behind[0]_positive[1] look_behind_0_positive_1
No_[brackets]_[at]_end No__brackets___at__end
[]][][_sderfrx÷=/_[]ddf] _______sderfrx÷=/__ddf

Regexes are expensive relative to simple string functions. If you really want a regular expression, you can do the following (REGEXP_REPLACE() also has a similar signature in PostgreSQL as in Snowflake):

-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- Works where the string <underscore><left-bracket> (_[) AND the string 
-- <right-bracket><underscore> (]_) are to be replaced by two underscores (__).
--
-- The order of the replacements for the desired outcome in terms of what character/
-- string is replaced first is left up to the reader.
-- 
SELECT
 REGEXP_REPLACE
 (
 REGEXP_REPLACE
 (
 str, 
 '(.*)]$', -- replace with pattern 
 '1円'
 ),
 '[\[\]]',
 '_', 
 'g' -- "global" flag to replace all occurrences
 ) 
FROM
 tab;

Result: same as first (TRANSLATE()) snippet above.

Performance analysis (see end of fiddle):

The entire results of the fiddle aren't included here, but just to see the times is interesting.

TRANSLATE() - Execution Time: 0.037 ms
REGEXP_REPLACE() - Execution Time: 0.100 ms
REPLACE(...REPLACE() - Execution Time: 0.027 ms

The time taken for the two string function solutions is roughly 33% of the time taken for the regular expression solution. This is why it's always worth checking to see if an ordinary string function will suffice in cases such as this.

Stack Exchange Network

Regex help in Snowflake

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Regex help in Snowflake

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions