SELECT rows based on indefinite number of filters

Question 1

How do I create a function that takes an indefinite number of parameters and finds all game_ids in a table where each parameter matches a different row (with the same game_id)?

Example

Table:

create table tags (
  tag_id serial primary key,
  game_id int, -- references games(game_id),
  tag_name text,
  tag_value text
);

Sample data:

 tag_id | game_id | tag_name | tag_value
--------+---------+----------+----------------------
     55 |       6 | Event    | EUR-ASIA Rapid Match
     58 |       6 | Round    | 5
    400 |      38 | Event    | EUR-ASIA Rapid Match
    403 |      38 | Round    | 4

Example request: Let's say I want all game_ids where

Event (a tag_name) = 'EUR-ASIA Rapid Match' (a tag_value)
AND
Round (a tag_name) = '5' (a tag_value)

A hardcoded solution for that exact scenario only might look like this:

with m1 as (
 select game_id from tags
 where tag_name = 'Event'
 and tag_value = 'EUR-ASIA Rapid Match'
), m2 as (
 select game_id from tags
 where tag_name = 'Round'
 and tag_value = '5'
) select * from m1 intersect select * from m2;

Except I want an indefinite number of tag matches. Can I create a function that takes an arbitrary number of tag names/values and returns the set of game_ids matching all of them? The call might look like this (pseudo-code):

select * from get_games_by_tags('{Event,EUR-ASIA Rapid Match}', ...)

Answer 1 (accepted)

This is a special case of relational division. Here is an arsenal of query techniques:

How to filter SQL results in a has-many-through relation

The special difficulty of your case is to filter on the combination of two attributes, but the principle is the same.

You can make this fully dynamic with plain SQL, without string concatenation and dynamic SQL:

Using same column multiple times in WHERE clause
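As a rough sketch of what such a plain-SQL variant could look like (my own illustration, not the code from the linked answer): unnest the filter array, join it to tags, and keep only the games that match as many distinct filters as were passed. The function name f_games_by_tags_static is made up, the game_tag type is the one introduced further down, and the temp table only mirrors the sample data from the question.

```sql
-- Temp table with the question's sample rows; nothing permanent is touched.
CREATE TEMP TABLE tags (
   tag_id    serial PRIMARY KEY
 , game_id   int
 , tag_name  text
 , tag_value text
);
INSERT INTO tags (tag_id, game_id, tag_name, tag_value) VALUES
   (55,  6,  'Event', 'EUR-ASIA Rapid Match')
 , (58,  6,  'Round', '5')
 , (400, 38, 'Event', 'EUR-ASIA Rapid Match')
 , (403, 38, 'Round', '4');

-- Skip this if the type already exists.
CREATE TYPE game_tag AS (tag_name text, tag_value text);

-- Hypothetical function: static SQL only, no EXECUTE.
CREATE OR REPLACE FUNCTION f_games_by_tags_static(VARIADIC _filters game_tag[])
  RETURNS TABLE (game_id int)
  LANGUAGE sql STABLE AS
$func$
WITH f AS (
   -- de-duplicate the filters so the count below is reliable
   SELECT DISTINCT tag_name, tag_value
   FROM   unnest(_filters)
   )
SELECT t.game_id
FROM   f
JOIN   tags t USING (tag_name, tag_value)
GROUP  BY t.game_id
HAVING count(DISTINCT (t.tag_name, t.tag_value)) = (SELECT count(*) FROM f);
$func$;

-- With the sample rows, only game 6 carries both tags.
SELECT * FROM f_games_by_tags_static('(Event,"EUR-ASIA Rapid Match")', '(Round,5)');
```

The query text is the same for any number of filters, so no string concatenation is involved.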

But performance won't come close to the following solution with dynamic SQL.
For best performance, have this (UNIQUE) multicolumn index:

CREATE [UNIQUE] INDEX ON tags (tag_name, tag_value, game_id);

Maybe your PRIMARY KEY on tags already spans these columns. For best performance you need the index columns in the demonstrated order. If the PK does not match, either create an additional index or change the column order of the PK (unless you also need the columns in a different order).
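For illustration, here is how the UNIQUE variant could be created and checked on a throwaway copy of the table. This assumes every (tag_name, tag_value, game_id) combination occurs only once, which the question does not guarantee. The index name is made up.

```sql
-- Temp table with the question's sample rows; nothing permanent is touched.
CREATE TEMP TABLE tags (
   tag_id    serial PRIMARY KEY
 , game_id   int
 , tag_name  text
 , tag_value text
);
INSERT INTO tags (tag_id, game_id, tag_name, tag_value) VALUES
   (55,  6,  'Event', 'EUR-ASIA Rapid Match')
 , (58,  6,  'Round', '5')
 , (400, 38, 'Event', 'EUR-ASIA Rapid Match')
 , (403, 38, 'Round', '4');

-- Hypothetical index name; column order as recommended above.
CREATE UNIQUE INDEX tags_name_value_game_idx
   ON tags (tag_name, tag_value, game_id);

-- Inspect the plan. Whether the planner actually picks the index
-- depends on table size and statistics, so no output is shown here.
EXPLAIN
SELECT game_id
FROM   tags
WHERE  (tag_name, tag_value) = ('Event', 'EUR-ASIA Rapid Match');
```

With game_id as the last index column, such a lookup can in principle be answered from the index alone.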

The basic query technique I chose uses the pattern:

SELECT game_id
FROM tags t
WHERE (tag_name, tag_value) = ('Event', 'EUR-ASIA Rapid Match')
AND EXISTS (SELECT FROM tags WHERE game_id = t.game_id AND (tag_name, tag_value) = ('Round', '5'))
AND EXISTS (SELECT FROM tags WHERE game_id = t.game_id AND (tag_name, tag_value) = ('some_tag', 'some value'))
AND ...

This query is already optimized for performance.
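To see the pattern at work, here is a small sketch that runs it with two filters against the four sample rows from the question. The temp table is only a stand-in for the real tags table.

```sql
-- Temp table with the question's sample rows; nothing permanent is touched.
CREATE TEMP TABLE tags (
   tag_id    serial PRIMARY KEY
 , game_id   int
 , tag_name  text
 , tag_value text
);
INSERT INTO tags (tag_id, game_id, tag_name, tag_value) VALUES
   (55,  6,  'Event', 'EUR-ASIA Rapid Match')
 , (58,  6,  'Round', '5')
 , (400, 38, 'Event', 'EUR-ASIA Rapid Match')
 , (403, 38, 'Round', '4');

-- The first filter drives the outer query, each further filter is one EXISTS.
SELECT t.game_id
FROM   tags t
WHERE  (t.tag_name, t.tag_value) = ('Event', 'EUR-ASIA Rapid Match')
AND    EXISTS (
   SELECT FROM tags t1
   WHERE  t1.game_id = t.game_id
   AND    (t1.tag_name, t1.tag_value) = ('Round', '5')
   );
-- Games 6 and 38 both have the Event tag, but only game 6 also has Round 5.
```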

Function

I use a user-defined row type as input, like you have in your answer (optional, but convenient for the function design). I chose the name game_tag because tag felt too generic:

CREATE TYPE game_tag AS(
 tag_name text
 , tag_value text
);

Note the subtle differences in syntax for these two row values:

'(Event,"EUR-ASIA Rapid Match")'::game_tag
('Event', 'EUR-ASIA Rapid Match')

The first one is a string literal for the registered row type game_tag, the second is a ROW constructor on two string literals building an anonymous row, short for:

ROW('Event', 'EUR-ASIA Rapid Match')

Either works for our purpose and gets index support. Just don't confuse the different syntax requirements. Related:

Invalid input syntax for type numeric: "(0.0000000000000000,8)"
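A quick way to convince yourself that both notations describe the same value is to cast each to game_tag and compare. This is only a sketch. The explicit cast on the ROW constructor is added here to make the comparison unambiguous.

```sql
-- Skip this if the type already exists.
CREATE TYPE game_tag AS (tag_name text, tag_value text);

-- Row literal: one SQL string. Double quotes delimit a field inside it.
-- ROW constructor: two separate SQL strings, each in single quotes.
SELECT '(Event,"EUR-ASIA Rapid Match")'::game_tag
     = ROW('Event', 'EUR-ASIA Rapid Match')::game_tag AS same_value;  -- true

-- Accessing a field of a composite value needs the extra parentheses.
SELECT ('(Event,"EUR-ASIA Rapid Match")'::game_tag).tag_value;
```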

The shortcuts for just 1 or 2 parameters are optional but should further improve performance.

CREATE OR REPLACE FUNCTION f_games_by_tags(VARIADIC _filters game_tag[])
  RETURNS table (game_id int) AS
$func$
BEGIN
   CASE cardinality(_filters)
-- WHEN 0 THEN  -- impossible
   WHEN 1 THEN
      RETURN QUERY
      SELECT t.game_id
      FROM   tags t
      WHERE  (tag_name, tag_value) = _filters[1];
   WHEN 2 THEN
      RETURN QUERY
      SELECT t.game_id
      FROM   tags t
      WHERE  (tag_name, tag_value) = _filters[1]
      AND    EXISTS (
         SELECT FROM tags t1
         WHERE  t1.game_id = t.game_id
         AND    (tag_name, tag_value) = _filters[2]
         );
   ELSE
      RETURN QUERY EXECUTE
         (SELECT 'SELECT game_id FROM tags t WHERE (tag_name, tag_value) = $1[1] AND '
              || string_agg('EXISTS (SELECT FROM tags WHERE game_id = t.game_id AND (tag_name, tag_value) = $1[' || g || '])', ' AND ')
          FROM   generate_series (2, cardinality(_filters)) g)
      USING _filters;
   END CASE;
END
$func$ LANGUAGE plpgsql;

db<>fiddle here

Should be faster by orders of magnitude than what you have in your answer.

Call:

SELECT * FROM f_games_by_tags('(Event,"EUR-ASIA Rapid Match")');
SELECT * FROM f_games_by_tags('(Round,5)', '(Event,"EUR-ASIA Rapid Match")', '(some_tag,"some value")');

You can also pass an actual array to a VARIADIC function by marking the argument with the keyword VARIADIC in the call.
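A minimal sketch of the two call styles, using a made-up helper f_show_filters that simply returns whatever filters it receives, so you can see that the function gets the same array either way:

```sql
-- Skip this if the type already exists.
CREATE TYPE game_tag AS (tag_name text, tag_value text);

-- Hypothetical helper for demonstration only: echoes its filters.
CREATE OR REPLACE FUNCTION f_show_filters(VARIADIC _filters game_tag[])
  RETURNS SETOF game_tag
  LANGUAGE sql IMMUTABLE AS
$func$
SELECT * FROM unnest(_filters);
$func$;

-- Style 1: a list of separate arguments.
SELECT * FROM f_show_filters('(Round,5)', '(Event,"EUR-ASIA Rapid Match")');

-- Style 2: one array argument, marked with the keyword VARIADIC in the call.
SELECT * FROM f_show_filters(
   VARIADIC ARRAY['(Round,5)', '(Event,"EUR-ASIA Rapid Match")']::game_tag[]
);
```

The same works for f_games_by_tags: SELECT * FROM f_games_by_tags(VARIADIC ARRAY['(Round,5)', '(Event,"EUR-ASIA Rapid Match")']::game_tag[]); That is handy when the filter list is built by the client as a single array parameter.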

Comment on Answer 1

Hallelujah! That's an incredibly helpful explanation. I did not know the term "relational division" beforehand either. That helps me search and find way more information as well. Thanks!!!!

Answer 2 (by the asker)

Solution

CREATE TYPE tag AS (
 tag_name text,
 tag_value text
);
CREATE OR REPLACE FUNCTION games_by_tags(VARIADIC tag_filters tag[]) RETURNS table (
 game_id int
) as $$
declare
 cur_tag tag;
BEGIN
 create temp table matched_games as
 select games.game_id from games;
 foreach cur_tag in array tag_filters loop
 create temp table matched_games2 as
 select distinct tags.game_id from tags where tag_name = cur_tag.tag_name and tag_value = cur_tag.tag_value
 intersect
 select * from matched_games;
 delete from matched_games;
 insert into matched_games select * from matched_games2;
 drop table matched_games2;
 end loop;
 return query select matched_games.game_id from matched_games order by matched_games.game_id;
 drop table matched_games;
END;
$$ LANGUAGE plpgsql;

Example Usage

# select * from games_by_tags('(Round,5)'::tag);
 game_id
---------
 5
 6
 8
 40
 69
 100
 101
 102
 132
 155
 176
 216
 255
 258
 270
 282
 295
 300
 317
 318
 329
 345
 361
 362
 385
 422
 426
 450
 488
 490
 520
(31 rows)
# select * from games_by_tags('(Event,"EUR-ASIA Rapid Match")'::tag);
 game_id
---------
 6
 38
 93
 108
 109
 158
 226
 343
 396
 405
 497
 542
 546
 547
(14 rows)
# select * from games_by_tags('(Round,5)'::tag, '(Event,"EUR-ASIA Rapid Match")'::tag);
 game_id
---------
 6
(1 row)

Comment on Answer 2

Creating a temp table, inserting and deleting just for the purpose of a single query imposes a massive overhead and will result in poor performance.

Comment on Answer 2

@ErwinBrandstetter Is making and populating a temporary table slow just on PostgreSQL or on all major SQL DBMSes?

Comment on Answer 2

@DamianYerrick: Not as sure about other RDBMS. But still pretty sure. My alternative runs a single, optimized query, backed by a perfectly matching index. Your function runs many queries, and writing, deleting and creating even temp objects always incurs major (comparatively) costs. Compare the runtime of both functions with EXPLAIN ANALYZE SELECT * FROM games_by_tags( ...) with short and long lists and report the result if you don't mind. Aside: INTERSECT or INTERSECT ALL? See: stackoverflow.com/a/27672973/939860, stackoverflow.com/a/31467595/939860

Comment on Answer 2

@DamianYerrick: Oh, I see I confused you with the OP, sorry. So, not your function, etc.

Answer 3

I see Erwin covered most of it already, so I just toss in a function body for an easy way to do relational division:

create table tags
( tag_id int not null
, game_id int not null
, tag_name varchar(10) not null
, tag_value varchar(20) not null );
insert into tags (tag_id, game_id, tag_name, tag_value)
values (55, 6, 'Event', 'EUR-ASIA Rapid Match')
 , (58, 6, 'Round', '5')
 , (400, 38, 'Event', 'EUR-ASIA Rapid Match')
 , (403, 38, 'Round', '4');
with filter(tag_name, tag_value) as (
 values ('Event', 'EUR-ASIA Rapid Match'), ('Round', '5')
)
select game_id
from filter f
join tags t
 on (f.tag_name, f.tag_value) = (t.tag_name, t.tag_value)
group by game_id
having count(distinct t.tag_name || t.tag_value)
 -- alias the CTE as f here, so the inner count refers to the filter list
 -- and not to the f of the outer query
 = (select count(distinct f.tag_name || f.tag_value)
 from filter f);

If you have control over the filters, i.e. no duplicates exist, a simpler subquery:

select count(1) from filter

will do. In addition, a unique constraint:

ALTER TABLE tags ADD CONSTRAINT ... UNIQUE (game_id, tag_name, tag_value)

will simplify the whole HAVING clause to:

having count(1) = (select count(1) from filter)
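Put together, the simplified variant could look like the sketch below. It assumes both preconditions hold: the filter list has no duplicates and the unique constraint is in place. The temp table stands in for the real one and uses text columns for brevity.

```sql
-- Temp table with the sample rows and the unique constraint from above.
CREATE TEMP TABLE tags (
   tag_id    int  NOT NULL
 , game_id   int  NOT NULL
 , tag_name  text NOT NULL
 , tag_value text NOT NULL
 , UNIQUE (game_id, tag_name, tag_value)
);
INSERT INTO tags (tag_id, game_id, tag_name, tag_value) VALUES
   (55,  6,  'Event', 'EUR-ASIA Rapid Match')
 , (58,  6,  'Round', '5')
 , (400, 38, 'Event', 'EUR-ASIA Rapid Match')
 , (403, 38, 'Round', '4');

WITH filter(tag_name, tag_value) AS (
   VALUES ('Event', 'EUR-ASIA Rapid Match'), ('Round', '5')
)
SELECT t.game_id
FROM   filter f
JOIN   tags t ON (f.tag_name, f.tag_value) = (t.tag_name, t.tag_value)
GROUP  BY t.game_id
-- Each filter can match a game at most once, so plain counts suffice.
HAVING count(*) = (SELECT count(*) FROM filter);
-- Game 6 matches both filters, game 38 only one of them.
```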

score 4 · Accepted Answer · 2018-05-06 03:43:37Z (applies to the answer that opens with "This is a special case of relational-division")

