Longest matching suffix while using aggregate functions

Question 1

Background

Looking to find the longest matching string suffix.

Setup

Consider the following fiddle:

CREATE TABLE noun
 ("label" varchar(10))
;
INSERT INTO noun
 ("label")
VALUES
 ('bar'),
 ('blue bar'),
 ('red bar'),
 ('green bar'),
 ('purple bar'),
 ('handlebar')
;
CREATE TABLE noun_inflection
 ("label_singular" varchar(9), "label_plural" varchar(9))
;
INSERT INTO noun_inflection
 ("label_singular", "label_plural")
VALUES
 ('bar', 'bars'),
 ('handlebar', 'handlebar')
;

And the following query:

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular;

This returns two rows:

LABEL | LABEL_SINGULAR | LABEL_PLURAL
------------+----------------+-------------
handlebar | bar | bars
handlebar | handlebar | handlebar

The first row is correct, but not desired. For this particular purpose, the Levenshtein distance can be used to eliminate the duplicate:

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
order by
 levenshtein( n.label, ni.label_singular )
limit 1;

This re-orders the rows based on the label similarity. In this example, "handlebar" exactly matches "handlebar", and has a distance of 0. Adding the limit 1 restricts the query to a single row.

Problem

The setup works, except PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions. That is, the following doesn't work:

SELECT
 xmlagg( xmlement( ... ) ORDER BY levenshtein( ... ) LIMIT 1 )
FROM
 noun n, noun_inflection ni

The problem remains. The word 'handlebar' matches '%bar' and '%handlebar', and so this returns two rows, which, in turn, injects two xmlelements into the resulting XML document when only one is expected.

Update #1

To clarify:

select
 xmlagg(
 xmlelement(
 name "noun",
 trim( TRAILING label_singular FROM n.label ) || ni.label_plural
 )
 )
from
 noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular;

That should return a single 'handlebar' XML element. Currently, it returns 'handlebars' and 'handlebar':

{ "Value": "<noun>handlebars</noun><noun>handlebar</noun>", "Type": "xml" }

The desired output is:

{ "Value": "<noun>handlebar</noun>", "Type": "xml" }

Update #2

Even though the following code solves the handlebar/handlebars problem, it prevents multiple different nouns from being returned:

select
 xmlagg(
 xmlelement(
 name "noun",
 trim( TRAILING label_singular FROM n.label ) || ni.label_plural
 )
 )
from
 noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
group by n.label, ni.label_singular
order by levenshtein( n.label, ni.label_singular )
limit 1

Update #3

This looks like it will require a stored function. Something along the lines of:

 SELECT
 trim( TRAILING label_singular FROM p_noun ) || ni.label_plural
 FROM
 noun_inflection ni
 WHERE
 p_noun ILIKE '%'||ni.label_singular
 ORDER BY
 levenshtein( p_noun, ni.label_singular )
 LIMIT 1;

Question

How would you match and return only the longest substring?

Question 2

What is wrong with the (maybe too obvious?):

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
order by
 char_length(ni.label_singular) DESC
limit 1;

Question 3

PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions.

Question 4

But you are not aggregating here, you are joining

Question 5

DaveJarvis it is not at all clear how this answer does not address your problem. Why the arbitrary restriction of not using LIMIT?

Question 6

Can't you use a cte or a derived table, like this?: SELECT xmlagg(...) FROM (query by Thomas) AS t;

Question 7

You could add a windows function with row_number() if that syntax works better for you?

Question 8

If you want the longest substring that means there is no other which is longer. A NOT EXISTS predicate will give this.

select
 <whatever>
from <your table> as aa
where <predicates>
and not exists
 (
 select 1
 from <your table> as bb
 where <predicates>
 and len(bb.SomeColumn) > len(aa.SomeColumn)
 );

Of course the len() function can be replaced by levenshtein() as your examples show. The correlated query may cause performance issues. Is your set of probe values sufficiently small to pre-compute the function values for each?

You may be able to use one of the fast-but-wrong queries to reduce the initial sets to a managably small superset of the correct answer(s), which can then be processed by a slow-but-correct algoithm.

Question 9

The only viable solution I could find was to write a function:

 FUNCTION get_noun_inflection( p_noun text, ... params ... )
 -- ... body, declare, variable, etc. 
 SELECT
 CASE
 -- ... conditions ...
 THEN trim( TRAILING ni.label_singular FROM p_noun ) || ni.label_plural
 -- Noun in singular form (no pluralization)
 ELSE p_noun
 END
 FROM
 noun_inflection ni
 INTO
 v_result
 WHERE
 p_noun ILIKE '%'||ni.label_singular
 ORDER BY
 levenshtein( p_noun, ni.label_singular )
 LIMIT 1;
 IF NOT found THEN
 v_result := p_noun;
 END IF;
 RETURN v_result;
 -- ... exception handling, default values, cost, etc.

Then use the function:

select
 xmlagg(
 xmlelement(
 name "noun",
 get_noun_inflection( n.label )
 )
 )
from
 noun n

Thomas Kejser Thomas Kejser 6,2182 gold badges24 silver badges46 bronze badges · Answer 1 · 2014-10-05 09:50:40Z

4

What is wrong with the (maybe too obvious?):

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
order by
 char_length(ni.label_singular) DESC
limit 1;

Share

Improve this answer

answered Oct 5, 2014 at 9:50

Thomas Kejser's user avatar

Thomas Kejser Thomas Kejser

6,2182 gold badges24 silver badges46 bronze badges

5

PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions.

Dave Jarvis
– Dave Jarvis

2014年10月05日 18:06:59 +00:00
Commented Oct 5, 2014 at 18:06
But you are not aggregating here, you are joining

Thomas Kejser
– Thomas Kejser

2014年10月05日 18:21:47 +00:00
Commented Oct 5, 2014 at 18:21
DaveJarvis it is not at all clear how this answer does not address your problem. Why the arbitrary restriction of not using LIMIT?

ypercubeᵀᴹ
– ypercubeᵀᴹ

2014年10月05日 18:28:41 +00:00
Commented Oct 5, 2014 at 18:28
1

Can't you use a cte or a derived table, like this?: SELECT xmlagg(...) FROM (query by Thomas) AS t;

ypercubeᵀᴹ
– ypercubeᵀᴹ

2014年10月05日 18:32:25 +00:00
Commented Oct 5, 2014 at 18:32
You could add a windows function with row_number() if that syntax works better for you?

Thomas Kejser
– Thomas Kejser

2014年10月05日 22:56:07 +00:00
Commented Oct 5, 2014 at 22:56

Add a comment |

score 1 · Answer 2 · 2014-10-07 09:27:44Z

If you want the longest substring that means there is no other which is longer. A NOT EXISTS predicate will give this.

select
 <whatever>
from <your table> as aa
where <predicates>
and not exists
 (
 select 1
 from <your table> as bb
 where <predicates>
 and len(bb.SomeColumn) > len(aa.SomeColumn)
 );

Of course the len() function can be replaced by levenshtein() as your examples show. The correlated query may cause performance issues. Is your set of probe values sufficiently small to pre-compute the function values for each?

You may be able to use one of the fast-but-wrong queries to reduce the initial sets to a managably small superset of the correct answer(s), which can then be processed by a slow-but-correct algoithm.

Dave Jarvis Dave Jarvis 8331 gold badge10 silver badges25 bronze badges · Answer 3 · 2014-10-06 04:19:35Z

The only viable solution I could find was to write a function:

 FUNCTION get_noun_inflection( p_noun text, ... params ... )
 -- ... body, declare, variable, etc. 
 SELECT
 CASE
 -- ... conditions ...
 THEN trim( TRAILING ni.label_singular FROM p_noun ) || ni.label_plural
 -- Noun in singular form (no pluralization)
 ELSE p_noun
 END
 FROM
 noun_inflection ni
 INTO
 v_result
 WHERE
 p_noun ILIKE '%'||ni.label_singular
 ORDER BY
 levenshtein( p_noun, ni.label_singular )
 LIMIT 1;
 IF NOT found THEN
 v_result := p_noun;
 END IF;
 RETURN v_result;
 -- ... exception handling, default values, cost, etc.

Then use the function:

select
 xmlagg(
 xmlelement(
 name "noun",
 get_noun_inflection( n.label )
 )
 )
from
 noun n

Stack Exchange Network

Longest matching suffix while using aggregate functions

Background

Setup

Problem

Update #1

Update #2

Update #3

Question

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Longest matching suffix while using aggregate functions

Background

Setup

Problem

Update #1

Update #2

Update #3

Question

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions