Look at the following example starting from the top row (id=9) and work your way down, selecting a limit of 4 rows whose sec values we have not yet seen. We "select" id=9 because we don't yet have sec=1. We continue to work our way down like this, but when we get to id=7 we skip it because we already have sec=5 (from the row with id=8). We continue in the same manner, and we finally stop at id=3 because we have accumulated 4 rows (our desired limit).
id | sec
----+-----
9 | 1 <- 1
8 | 5 <- 2
7 | 5 # skip, already have sec=5
6 | 4 <- 3
5 | 1 # skip, already have sec=1
4 | 1 # skip, already have sec=1
3 | 3 <- 4
2 | 2
1 | 1
Of course the SQL algorithm can (and will!) be different from what I described.
Desired result:
id
----
9
8
6
3
(4 rows)
If I wanted to increase the limit to 5 rows, then the row with id=2 would be included in the results. However, if I increased the limit to 6 rows, the row with id=1 would not be added because sec=1 has already been seen.
Note: Though it shouldn't matter, I am on PostgreSQL 9.3.1.
In case you want to quickly build the table to test this out:
CREATE TABLE my_table (id serial primary key, sec integer DEFAULT 0 NOT NULL);
INSERT INTO my_table (sec) VALUES
(1)
, (2)
, (3)
, (1)
, (1)
, (4)
, (5)
, (5)
, (1);
CREATE INDEX index_my_table_on_sec ON my_table (sec);
In Postgres, this is simpler with DISTINCT ON:
SELECT *
FROM  (
   SELECT DISTINCT ON (sec)
          id, sec
   FROM   tbl
   ORDER  BY sec, id DESC
   ) sub
ORDER BY id DESC
LIMIT 4;
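A rough Python equivalent of what that query computes (a sketch of the assumed semantics, not how Postgres executes it): the subquery keeps one row per sec, preferring the highest id, and the outer query re-sorts by id and applies the limit.

```python
def distinct_on_sec(rows, limit):
    # Inner DISTINCT ON (sec) ... ORDER BY sec, id DESC:
    # keep the highest id for each sec value.
    best = {}
    for rid, sec in rows:
        if sec not in best or rid > best[sec]:
            best[sec] = rid
    # Outer ORDER BY id DESC LIMIT n.
    return sorted(best.values(), reverse=True)[:limit]

rows = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 1),
        (6, 4), (7, 5), (8, 5), (9, 1)]
print(distinct_on_sec(rows, 4))  # [9, 8, 6, 3]
```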
Detailed explanation in this related answer on SO:
For a big table and a small LIMIT, neither this nor @a_horse's solution is very efficient. The subquery will plough through the whole table, wasting a lot of time ...
Recursive CTE
I have tried and failed to solve similar problems with a recursive CTE in the past and resorted to a procedural solution with PL/pgSQL. Example:
Finally, here is a working rCTE:
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, '{}'::int[] AS last_arr, ARRAY[sec] AS arr
   FROM   tbl
   ORDER  BY id DESC
   LIMIT  1
   )
   UNION ALL
   (
   SELECT b.id, c.arr
        , CASE WHEN b.sec = ANY (c.arr) THEN c.arr ELSE b.sec || c.arr END
   FROM   cte c
   JOIN   tbl b ON b.id < c.id
   WHERE  array_length(c.arr, 1) < 4
   ORDER  BY id DESC
   LIMIT  1
   )
   )
SELECT id, arr[1] AS sec
FROM   cte
WHERE  last_arr <> arr;
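The trick is the final last_arr <> arr filter: each recursion step advances exactly one row and grows arr only when it meets a new sec, so the rows where the array grew are precisely the ones that belong in the result. A small Python trace of that bookkeeping (my own sketch of the assumed logic, not output from Postgres):

```python
def rcte_rows(rows, limit):
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)
    result = []
    # Non-recursive term: last_arr = '{}', arr = ARRAY[sec].
    (rid, sec), rest = ordered[0], ordered[1:]
    last_arr, arr = [], [sec]
    if last_arr != arr:
        result.append(rid)
    # Recursive term: step to the next smaller id while arr is short enough.
    for rid, sec in rest:
        if len(arr) >= limit:     # WHERE array_length(c.arr, 1) < limit
            break
        last_arr, arr = arr, (arr if sec in arr else [sec] + arr)
        if last_arr != arr:       # row kept only when arr grew
            result.append(rid)
    return result

rows = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 1),
        (6, 4), (7, 5), (8, 5), (9, 1)]
print(rcte_rows(rows, 4))  # [9, 8, 6, 3]
```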
It's not as fast or elegant as I had hoped for and not nearly as fast as the function below, but faster than the query above in my tests.
PL/pgSQL function
Fastest by far:
CREATE OR REPLACE FUNCTION f_first_uniq(_rows int)
  RETURNS TABLE (id int, sec int) AS
$func$
DECLARE
   _arr int[];
BEGIN
   FOR id, sec IN
      SELECT t.id, t.sec FROM tbl t ORDER BY t.id DESC
   LOOP
      IF sec = ANY (_arr) THEN
         -- do nothing
      ELSE
         RETURN NEXT;
         _arr := _arr || sec;
         EXIT WHEN array_length(_arr, 1) >= _rows;
      END IF;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_first_uniq(4);
SQL Fiddle demonstrating all three.
Could be made to work for any table, with table and column names as parameters and dynamic SQL with EXECUTE ...
Why bother?
In a test table with only 30k rows, the function ran 2000x faster than the above query (which already ran ~ 30% faster than a_horse's version). This difference grows with the size of the table. Performance of the function is about constant, while the query's performance gets progressively worse, since it tries to find distinct values in all of the table first. Try this in a table with a million rows ...
SELECT id, sec
FROM  (
   SELECT id, sec,
          row_number() OVER (PARTITION BY sec ORDER BY id DESC) AS rn
   FROM   my_table
   ) t
WHERE  rn = 1
ORDER  BY id DESC
LIMIT  4;
SQLFiddle example: http://sqlfiddle.com/#!15/1ca01/1
- I know we're supposed to avoid "thanks" comments, but I really want to let you know how much I appreciate this answer - it does exactly what I need, and I would not have been able to come up with it myself. So, thank you kindly! Out of curiosity, and as far as you know, is there more than one way to do it or is this pretty much it? – user664833, Apr 5, 2014
- @user664833: Feedback like this (including thanks) in a comment is welcome. Just keep the noise level in questions and answers low. – Erwin Brandstetter, Apr 5, 2014
- @ErwinBrandstetter: Ok. Do you think my question had too much noise, and if so, what could have been avoided or improved? Thanks! – user664833, Apr 5, 2014
- @user664833: Your question is good. You might have started with what you want instead of a detailed algorithm for how to get it. And I'll trim some noise from your code. But you made your case clear and your question is good. – Erwin Brandstetter, Apr 5, 2014
- @a-horse-with-no-name: I am very sorry, and with all due respect, I had to re-assign the answer mark to @ErwinBrandstetter because his DISTINCT ON solution is considerably faster than yours (28-43% based on my benchmarks on a table with 100k rows; his query regularly returns in 450-500ms). Furthermore, the speed of his f_first_uniq function is remarkable (it regularly returns in 0.5-5ms). Nonetheless, I do sincerely appreciate your answer, and would have continued to use your solution had @ErwinBrandstetter not presented his. I thank you kindly once again. – user664833, Apr 8, 2014
... f_first_uniq function. With 1M rows, on average the PARTITION query takes 5.9s, the DISTINCT query takes 3.6s, the RECURSIVE query takes 1.5s, and the FUNCTION query takes less than 1ms.