Background
Looking to find the longest matching string suffix.
Setup
Consider the following fiddle:
CREATE TABLE noun
("label" varchar(10))
;
INSERT INTO noun
("label")
VALUES
('bar'),
('blue bar'),
('red bar'),
('green bar'),
('purple bar'),
('handlebar')
;
CREATE TABLE noun_inflection
("label_singular" varchar(9), "label_plural" varchar(9))
;
INSERT INTO noun_inflection
("label_singular", "label_plural")
VALUES
('bar', 'bars'),
('handlebar', 'handlebar')
;
And the following query:
select * from noun n, noun_inflection ni
where
n.label = 'handlebar' and
n.label ilike '%'||ni.label_singular;
This returns two rows:
LABEL | LABEL_SINGULAR | LABEL_PLURAL
------------+----------------+-------------
handlebar | bar | bars
handlebar | handlebar | handlebar
The first row is correct, but not desired. For this particular purpose, the Levenshtein distance can be used to eliminate the duplicate:
select * from noun n, noun_inflection ni
where
n.label = 'handlebar' and
n.label ilike '%'||ni.label_singular
order by
levenshtein( n.label, ni.label_singular )
limit 1;
This re-orders the rows based on the label similarity. In this example, "handlebar" exactly matches "handlebar", and has a distance of 0. Adding the limit 1
restricts the query to a single row.
Problem
The setup works, except PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions. That is, the following doesn't work:
SELECT
xmlagg( xmlement( ... ) ORDER BY levenshtein( ... ) LIMIT 1 )
FROM
noun n, noun_inflection ni
The problem remains. The word 'handlebar'
matches '%bar'
and '%handlebar'
, and so this returns two rows, which, in turn, injects two xmlelements into the resulting XML document when only one is expected.
Update #1
To clarify:
select
xmlagg(
xmlelement(
name "noun",
trim( TRAILING label_singular FROM n.label ) || ni.label_plural
)
)
from
noun n, noun_inflection ni
where
n.label = 'handlebar' and
n.label ilike '%'||ni.label_singular;
That should return a single 'handlebar' XML element. Currently, it returns 'handlebars' and 'handlebar':
{ "Value": "<noun>handlebars</noun><noun>handlebar</noun>", "Type": "xml" }
The desired output is:
{ "Value": "<noun>handlebar</noun>", "Type": "xml" }
Update #2
Even though the following code solves the handlebar/handlebars problem, it prevents multiple different nouns from being returned:
select
xmlagg(
xmlelement(
name "noun",
trim( TRAILING label_singular FROM n.label ) || ni.label_plural
)
)
from
noun n, noun_inflection ni
where
n.label = 'handlebar' and
n.label ilike '%'||ni.label_singular
group by n.label, ni.label_singular
order by levenshtein( n.label, ni.label_singular )
limit 1
Update #3
This looks like it will require a stored function. Something along the lines of:
SELECT
trim( TRAILING label_singular FROM p_noun ) || ni.label_plural
FROM
noun_inflection ni
WHERE
p_noun ILIKE '%'||ni.label_singular
ORDER BY
levenshtein( p_noun, ni.label_singular )
LIMIT 1;
Question
How would you match and return only the longest substring?
3 Answers 3
What is wrong with the (maybe too obvious?):
select * from noun n, noun_inflection ni
where
n.label = 'handlebar' and
n.label ilike '%'||ni.label_singular
order by
char_length(ni.label_singular) DESC
limit 1;
-
PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions.Dave Jarvis– Dave Jarvis2014年10月05日 18:06:59 +00:00Commented Oct 5, 2014 at 18:06
-
But you are not aggregating here, you are joiningThomas Kejser– Thomas Kejser2014年10月05日 18:21:47 +00:00Commented Oct 5, 2014 at 18:21
-
DaveJarvis it is not at all clear how this answer does not address your problem. Why the arbitrary restriction of not using
LIMIT
?ypercubeᵀᴹ– ypercubeᵀᴹ2014年10月05日 18:28:41 +00:00Commented Oct 5, 2014 at 18:28 -
1Can't you use a cte or a derived table, like this?:
SELECT xmlagg(...) FROM (query by Thomas) AS t;
ypercubeᵀᴹ– ypercubeᵀᴹ2014年10月05日 18:32:25 +00:00Commented Oct 5, 2014 at 18:32 -
You could add a windows function with row_number() if that syntax works better for you?Thomas Kejser– Thomas Kejser2014年10月05日 22:56:07 +00:00Commented Oct 5, 2014 at 22:56
If you want the longest substring that means there is no other which is longer. A NOT EXISTS
predicate will give this.
select
<whatever>
from <your table> as aa
where <predicates>
and not exists
(
select 1
from <your table> as bb
where <predicates>
and len(bb.SomeColumn) > len(aa.SomeColumn)
);
Of course the len()
function can be replaced by levenshtein()
as your examples show. The correlated query may cause performance issues. Is your set of probe values sufficiently small to pre-compute the function values for each?
You may be able to use one of the fast-but-wrong queries to reduce the initial sets to a managably small superset of the correct answer(s), which can then be processed by a slow-but-correct algoithm.
The only viable solution I could find was to write a function:
FUNCTION get_noun_inflection( p_noun text, ... params ... )
-- ... body, declare, variable, etc.
SELECT
CASE
-- ... conditions ...
THEN trim( TRAILING ni.label_singular FROM p_noun ) || ni.label_plural
-- Noun in singular form (no pluralization)
ELSE p_noun
END
FROM
noun_inflection ni
INTO
v_result
WHERE
p_noun ILIKE '%'||ni.label_singular
ORDER BY
levenshtein( p_noun, ni.label_singular )
LIMIT 1;
IF NOT found THEN
v_result := p_noun;
END IF;
RETURN v_result;
-- ... exception handling, default values, cost, etc.
Then use the function:
select
xmlagg(
xmlelement(
name "noun",
get_noun_inflection( n.label )
)
)
from
noun n