1

Background

Looking to find the longest matching string suffix.

Setup

Consider the following fiddle:

CREATE TABLE noun
 ("label" varchar(10))
;
INSERT INTO noun
 ("label")
VALUES
 ('bar'),
 ('blue bar'),
 ('red bar'),
 ('green bar'),
 ('purple bar'),
 ('handlebar')
;
CREATE TABLE noun_inflection
 ("label_singular" varchar(9), "label_plural" varchar(9))
;
INSERT INTO noun_inflection
 ("label_singular", "label_plural")
VALUES
 ('bar', 'bars'),
 ('handlebar', 'handlebar')
;

And the following query:

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular;

This returns two rows:

LABEL | LABEL_SINGULAR | LABEL_PLURAL
------------+----------------+-------------
handlebar | bar | bars
handlebar | handlebar | handlebar

The first row is correct, but not desired. For this particular purpose, the Levenshtein distance can be used to eliminate the duplicate:

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
order by
 levenshtein( n.label, ni.label_singular )
limit 1;

This re-orders the rows based on the label similarity. In this example, "handlebar" exactly matches "handlebar", and has a distance of 0. Adding the limit 1 restricts the query to a single row.

Problem

The setup works, except PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions. That is, the following doesn't work:

SELECT
 xmlagg( xmlement( ... ) ORDER BY levenshtein( ... ) LIMIT 1 )
FROM
 noun n, noun_inflection ni

The problem remains. The word 'handlebar' matches '%bar' and '%handlebar', and so this returns two rows, which, in turn, injects two xmlelements into the resulting XML document when only one is expected.

Update #1

To clarify:

select
 xmlagg(
 xmlelement(
 name "noun",
 trim( TRAILING label_singular FROM n.label ) || ni.label_plural
 )
 )
from
 noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular;

That should return a single 'handlebar' XML element. Currently, it returns 'handlebars' and 'handlebar':

{ "Value": "<noun>handlebars</noun><noun>handlebar</noun>", "Type": "xml" }

The desired output is:

{ "Value": "<noun>handlebar</noun>", "Type": "xml" }

Update #2

Even though the following code solves the handlebar/handlebars problem, it prevents multiple different nouns from being returned:

select
 xmlagg(
 xmlelement(
 name "noun",
 trim( TRAILING label_singular FROM n.label ) || ni.label_plural
 )
 )
from
 noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
group by n.label, ni.label_singular
order by levenshtein( n.label, ni.label_singular )
limit 1

Update #3

This looks like it will require a stored function. Something along the lines of:

 SELECT
 trim( TRAILING label_singular FROM p_noun ) || ni.label_plural
 FROM
 noun_inflection ni
 WHERE
 p_noun ILIKE '%'||ni.label_singular
 ORDER BY
 levenshtein( p_noun, ni.label_singular )
 LIMIT 1;

Question

How would you match and return only the longest substring?

asked Oct 5, 2014 at 9:42

3 Answers 3

4

What is wrong with the (maybe too obvious?):

select * from noun n, noun_inflection ni
where
 n.label = 'handlebar' and
 n.label ilike '%'||ni.label_singular
order by
 char_length(ni.label_singular) DESC
limit 1;
answered Oct 5, 2014 at 9:50
5
  • PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions. Commented Oct 5, 2014 at 18:06
  • But you are not aggregating here, you are joining Commented Oct 5, 2014 at 18:21
  • DaveJarvis it is not at all clear how this answer does not address your problem. Why the arbitrary restriction of not using LIMIT? Commented Oct 5, 2014 at 18:28
  • 1
    Can't you use a cte or a derived table, like this?: SELECT xmlagg(...) FROM (query by Thomas) AS t; Commented Oct 5, 2014 at 18:32
  • You could add a windows function with row_number() if that syntax works better for you? Commented Oct 5, 2014 at 22:56
1

If you want the longest substring that means there is no other which is longer. A NOT EXISTS predicate will give this.

select
 <whatever>
from <your table> as aa
where <predicates>
and not exists
 (
 select 1
 from <your table> as bb
 where <predicates>
 and len(bb.SomeColumn) > len(aa.SomeColumn)
 );

Of course the len() function can be replaced by levenshtein() as your examples show. The correlated query may cause performance issues. Is your set of probe values sufficiently small to pre-compute the function values for each?

You may be able to use one of the fast-but-wrong queries to reduce the initial sets to a managably small superset of the correct answer(s), which can then be processed by a slow-but-correct algoithm.

answered Oct 7, 2014 at 9:27
0
0

The only viable solution I could find was to write a function:

 FUNCTION get_noun_inflection( p_noun text, ... params ... )
 -- ... body, declare, variable, etc. 
 SELECT
 CASE
 -- ... conditions ...
 THEN trim( TRAILING ni.label_singular FROM p_noun ) || ni.label_plural
 -- Noun in singular form (no pluralization)
 ELSE p_noun
 END
 FROM
 noun_inflection ni
 INTO
 v_result
 WHERE
 p_noun ILIKE '%'||ni.label_singular
 ORDER BY
 levenshtein( p_noun, ni.label_singular )
 LIMIT 1;
 IF NOT found THEN
 v_result := p_noun;
 END IF;
 RETURN v_result;
 -- ... exception handling, default values, cost, etc.

Then use the function:

select
 xmlagg(
 xmlelement(
 name "noun",
 get_noun_inflection( n.label )
 )
 )
from
 noun n
answered Oct 6, 2014 at 4:19

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.