SELECT DISTINCT on multiple columns

Question 1

Supposing we have a table with four columns (a,b,c,d) of the same data type.

Is it possible to select all distinct values within the data in the columns and return them as a single column or do I have to create a function to achieve this?

Question 2

Update: Tested all 5 queries in SQLfiddle with 100K rows and 2 separate cases: one with few (25) distinct values and another with lots (around 25K) of distinct values.

A very simple query would be to use UNION DISTINCT. ~~I think it would be most efficient if there is a separate index on each of the four columns.~~ It would be efficient with a separate index on each of the four columns if Postgres had implemented the Loose Index Scan optimization, which it hasn't. So this query will not be efficient, as it requires 4 scans of the table (and no index is used):

-- Query 1. (334 ms, 368ms) 
SELECT a AS abcd FROM tablename 
UNION -- means UNION DISTINCT
SELECT b FROM tablename 
UNION 
SELECT c FROM tablename 
UNION 
SELECT d FROM tablename ;

Another would be to first UNION ALL and then use DISTINCT. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:

-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
 ( SELECT a FROM tablename 
 UNION ALL 
 SELECT b FROM tablename 
 UNION ALL
 SELECT c FROM tablename 
 UNION ALL
 SELECT d FROM tablename 
 ) AS x ;
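As a quick sanity check that Query 1 and Query 2 return the same set, here is a minimal sketch on a three-row sample. The CTE name demo and its values are made up for illustration; only the place where de-duplication happens differs between the two forms.

```sql
-- Hypothetical sample data; "demo" stands in for your own table.
WITH demo (a, b, c, d) AS (
   VALUES (1, 2, 2, 3)
        , (3, 4, 1, 1)
        , (4, 4, 4, 4)
)
, q1 AS (   -- plain UNION: duplicates are removed by the set operation itself
   SELECT a AS abcd FROM demo
   UNION SELECT b FROM demo
   UNION SELECT c FROM demo
   UNION SELECT d FROM demo
)
, q2 AS (   -- UNION ALL keeps every value, DISTINCT de-duplicates once at the end
   SELECT DISTINCT a AS abcd
   FROM (
      SELECT a FROM demo
      UNION ALL SELECT b FROM demo
      UNION ALL SELECT c FROM demo
      UNION ALL SELECT d FROM demo
   ) AS x
)
SELECT (SELECT count(*) FROM q1) AS q1_rows
     , (SELECT count(*) FROM q2) AS q2_rows
     , NOT EXISTS (TABLE q1 EXCEPT TABLE q2)
   AND NOT EXISTS (TABLE q2 EXCEPT TABLE q1) AS same_set;
-- q1_rows = 4, q2_rows = 4, same_set = true
```

The result sets are identical, so the choice between the two is purely a matter of which plan runs faster on your data.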

The other answers have provided more options using array functions or the LATERAL syntax. Jack's query (187 ms, 261 ms) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms). Both of them do one sequential scan of the table and do not use any index.

Actually Jack's query results are a bit better than shown above (if we remove the order by) and can be further improved by removing the 4 internal distinct and leaving only the external one.

Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the WITH RECURSIVE hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:

-- Query 3. (2 ms, 368ms)
WITH RECURSIVE 
 da AS (
 SELECT min(a) AS n FROM observations
 UNION ALL
 SELECT (SELECT min(a) FROM observations
 WHERE a > s.n)
 FROM da AS s WHERE s.n IS NOT NULL ),
 db AS (
 SELECT min(b) AS n FROM observations
 UNION ALL
 SELECT (SELECT min(b) FROM observations
 WHERE b > s.n)
 FROM db AS s WHERE s.n IS NOT NULL ),
 dc AS (
 SELECT min(c) AS n FROM observations
 UNION ALL
 SELECT (SELECT min(c) FROM observations
 WHERE c > s.n)
 FROM dc AS s WHERE s.n IS NOT NULL ),
 dd AS (
 SELECT min(d) AS n FROM observations
 UNION ALL
 SELECT (SELECT min(d) FROM observations
 WHERE d > s.n)
 FROM dd AS s WHERE s.n IS NOT NULL )
SELECT n 
FROM 
( TABLE da UNION 
 TABLE db UNION 
 TABLE dc UNION 
 TABLE dd
) AS x 
WHERE n IS NOT NULL ;

SQLfiddle

To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.

Late additions: a variation on the 1st query which, despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:

-- Query 1b. (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations 
UNION 
SELECT DISTINCT b FROM observations 
UNION 
SELECT DISTINCT c FROM observations 
UNION 
SELECT DISTINCT d FROM observations ;

and Jack's improved:

-- Query 4b. (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
 array_agg(b)||
 array_agg(c)||
 array_agg(d) )
from t ;

Question 3

You could use LATERAL, like in this query:

SELECT DISTINCT
 x.n
FROM
 atable
 CROSS JOIN LATERAL (
 VALUES (a), (b), (c), (d)
 ) AS x (n)
;

The LATERAL keyword allows the right side of the join to reference objects from the left side. In this case, the right side is a VALUES constructor that builds a single-column subset out of the column values you want to put into a single column. The main query simply references the new column, also applying DISTINCT to it.
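To make the unpivoting step concrete, here is a small self-contained sketch. The two sample rows are made up, and the CTE merely stands in for a real atable; each input row is expanded into four single-column rows before DISTINCT is applied.

```sql
-- Hypothetical two-row sample standing in for a real table.
WITH atable (a, b, c, d) AS (
   VALUES (1, 2, 2, 3)
        , (3, 4, 1, 1)
)
SELECT DISTINCT
   x.n
FROM
   atable
   CROSS JOIN LATERAL (
      VALUES (a), (b), (c), (d)   -- one output row per source column
   ) AS x (n)
ORDER BY
   x.n;
-- returns 1, 2, 3, 4
```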

Question 4

To be clear, I'd use union as ypercube suggests, but it is also possible with arrays:

select distinct unnest( array_agg(distinct a)||
 array_agg(distinct b)||
 array_agg(distinct c)||
 array_agg(distinct d) )
from t
order by 1;

| unnest |
| :----- |
| 0 |
| 1 |
| 2 |
| 3 |
| 5 |
| 6 |
| 8 |
| 9 |

dbfiddle here

Question 5

Shortest

SELECT DISTINCT n FROM observations, unnest(ARRAY[a,b,c,d]) n;

A less verbose version of Andriy's idea is only slightly longer, but more elegant and faster. For many distinct / few duplicate values:

SELECT DISTINCT n FROM observations, LATERAL (VALUES (a),(b),(c),(d)) t(n);

Fastest

With an index on each involved column!
For few distinct / many duplicate values:

WITH RECURSIVE
 ta AS (
 (SELECT a FROM observations ORDER BY a LIMIT 1)
 UNION ALL
 SELECT o.a FROM ta t, LATERAL (SELECT a FROM observations WHERE a > t.a ORDER BY a LIMIT 1) o
 )
, tb AS (
 (SELECT b FROM observations ORDER BY b LIMIT 1)
 UNION ALL
 SELECT o.b FROM tb t, LATERAL (SELECT b FROM observations WHERE b > t.b ORDER BY b LIMIT 1) o
 )
, tc AS (
 (SELECT c FROM observations ORDER BY c LIMIT 1)
 UNION ALL
 SELECT o.c FROM tc t, LATERAL (SELECT c FROM observations WHERE c > t.c ORDER BY c LIMIT 1) o
 )
, td AS (
 (SELECT d FROM observations ORDER BY d LIMIT 1)
 UNION ALL
 SELECT o.d FROM td t, LATERAL (SELECT d FROM observations WHERE d > t.d ORDER BY d LIMIT 1) o
 )
SELECT a
FROM (
 TABLE ta
 UNION TABLE tb
 UNION TABLE tc
 UNION TABLE td
 ) sub
ORDER BY 1; -- optional

This is another rCTE variant, similar to the one @ypercube already posted, but I use ORDER BY 1 LIMIT 1 instead of min(a) which is typically a bit faster. I also need no additional predicate to exclude NULL values.
And LATERAL instead of a correlated subquery, because it's cleaner (not necessarily faster).
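A one-column sketch of the difference, in case it helps. The table demo_obs and its five rows are made up for illustration. With min(), the scalar subquery yields NULL once no larger value is left, so one NULL row is emitted before the recursion stops. With the LATERAL subquery, an empty result simply produces no joined row, which ends the recursion without any NULL.

```sql
-- Hypothetical demo table; with so few rows the planner will not bother
-- with the index, which only pays off on a large table.
CREATE TEMP TABLE demo_obs (a int);
INSERT INTO demo_obs VALUES (3), (1), (3), (2), (1);
CREATE INDEX ON demo_obs (a);

-- min() variant: needs IS NOT NULL to stop, and emits one trailing NULL row.
WITH RECURSIVE da AS (
   SELECT min(a) AS n FROM demo_obs
   UNION ALL
   SELECT (SELECT min(a) FROM demo_obs WHERE a > s.n)
   FROM   da s
   WHERE  s.n IS NOT NULL
)
SELECT count(*) AS total_rows, count(n) AS non_null_rows FROM da;
-- total_rows = 4, non_null_rows = 3

-- ORDER BY / LIMIT 1 variant with LATERAL: no NULL row, no extra predicate.
WITH RECURSIVE ta AS (
   (SELECT a FROM demo_obs ORDER BY a LIMIT 1)
   UNION ALL
   SELECT o.a
   FROM   ta t
        , LATERAL (SELECT a FROM demo_obs WHERE a > t.a ORDER BY a LIMIT 1) o
)
SELECT count(*) AS total_rows, count(a) AS non_null_rows FROM ta;
-- total_rows = 3, non_null_rows = 3
```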

Detailed explanation in my go-to answer for this technique:

Optimize GROUP BY query to retrieve latest record per user

I added it to ypercube's sqlfiddle
... and now ported that to dbfiddle.uk, as sqlfiddle.com isn't keeping up:

db<>fiddle here

Question 6

Can you test with EXPLAIN (ANALYZE, TIMING OFF) to verify best overall performance? (Best of 5 to exclude caching effects.)
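For anyone who wants to repeat such a measurement, a minimal sketch follows. The table demo_explain is a made-up stand-in for observations, filled with random values; substitute your own table and whichever candidate query you want to time.

```sql
-- Hypothetical stand-in table with 100k rows of random small integers.
CREATE TEMP TABLE demo_explain AS
SELECT (random() * 20)::int AS a
     , (random() * 20)::int AS b
     , (random() * 20)::int AS c
     , (random() * 20)::int AS d
FROM   generate_series(1, 100000);
ANALYZE demo_explain;

-- TIMING OFF skips the per-node clock reads, so the statement's total
-- execution time is reported with less measurement overhead.
EXPLAIN (ANALYZE, TIMING OFF)
SELECT DISTINCT n
FROM   demo_explain, LATERAL (VALUES (a), (b), (c), (d)) t(n);
```

Run the EXPLAIN statement several times and compare the best execution time per candidate query, so that a cold cache does not skew the first run.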

Question 7

Interesting. I thought a comma join would be equivalent to a CROSS JOIN in every respect, i.e. in terms of performance too. Is the difference specific to using LATERAL?

Question 8

Or maybe I misunderstood. When you said "faster" about the less verbose version of my suggestion, did you mean faster than mine or faster than the SELECT DISTINCT with unnest?

Question 9

@AndriyM: The comma is equivalent (except that explicit CROSS JOIN syntax binds more tightly when resolving join sequence). Yes, I mean your idea with VALUES ... is faster than unnest(ARRAY[...]). LATERAL is implicit for set-returning functions in the FROM list.
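A small sketch of the three spellings side by side, on a made-up two-row table (demo_join is a hypothetical name). The first two are the same query, because a function call in the FROM list may reference earlier FROM items without the LATERAL keyword; a VALUES subquery is not a function, so the third form has to spell LATERAL out.

```sql
-- Hypothetical two-row demo table.
CREATE TEMP TABLE demo_join (a int, b int, c int, d int);
INSERT INTO demo_join VALUES (1, 2, 2, 3), (3, 4, 1, 1);

-- 1. Comma join; LATERAL is implied for the function call.
SELECT DISTINCT n
FROM   demo_join, unnest(ARRAY[a, b, c, d]) AS n
ORDER  BY n;

-- 2. Same query with the join type and LATERAL written explicitly.
SELECT DISTINCT n
FROM   demo_join CROSS JOIN LATERAL unnest(ARRAY[a, b, c, d]) AS n
ORDER  BY n;

-- 3. VALUES subquery; here the LATERAL keyword is required.
SELECT DISTINCT n
FROM   demo_join, LATERAL (VALUES (a), (b), (c), (d)) t(n)
ORDER  BY n;

-- Each of the three returns 1, 2, 3, 4 for this sample.
```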

Question 10

Thanks for the improvements! I tried the order/limit-1 variant but there wasn't any noticeable difference. Using LATERAL there is pretty cool, avoiding the multiple IS NOT NULL checks, great. You should suggest this variant to the Postgres guys, to be added to the Loose Index Scan page.

Question 11

You can, but as I wrote and tested the function it felt wrong. It is a waste of resources.
Please just use a union with several selects. The only advantage of the function (if it is one at all) is a single scan of the main table.

In sql fiddle you need to change separator from $ to something else, like /

CREATE TABLE observations (
 id serial
 , a int not null
 , b int not null
 , c int not null
 , d int not null
 , created_at timestamp
 , foo text
);
INSERT INTO observations (a, b, c, d, created_at, foo)
SELECT (random() * 20)::int AS a -- few values for a,b,c,d
 , (15 + random() * 10)::int 
 , (10 + random() * 10)::int 
 , ( 5 + random() * 20)::int 
 , '2014-01-01 0:0'::timestamp 
 + interval '1s' * g AS created_at -- ascending (probably like in real life)
 , 'aöguihaophgaduigha' || g AS foo -- random ballast
FROM generate_series (1, 10000) g; -- 10k rows
CREATE INDEX observations_a_idx ON observations (a);
CREATE INDEX observations_b_idx ON observations (b);
CREATE INDEX observations_c_idx ON observations (c);
CREATE INDEX observations_d_idx ON observations (d);
CREATE OR REPLACE FUNCTION fn_readuniqu()
 RETURNS SETOF text AS $$
DECLARE
 a_array text[];
 b_array text[];
 c_array text[];
 d_array text[];
 r text;
BEGIN
 SELECT INTO a_array, b_array, c_array, d_array array_agg(a), array_agg(b), array_agg(c), array_agg(d)
 FROM observations;
 FOR r IN
 SELECT DISTINCT x
 FROM
 (
 SELECT unnest(a_array) AS x
 UNION
 SELECT unnest(b_array) AS x
 UNION
 SELECT unnest(c_array) AS x
 UNION
 SELECT unnest(d_array) AS x
 ) AS a
 LOOP
 RETURN NEXT r;
 END LOOP;
END;
$$
 LANGUAGE plpgsql STABLE
 COST 100
 ROWS 1000;
SELECT * FROM fn_readuniqu();

Question 12

You're actually right as a function would still use a union. In any case +1 for the effort.

Question 13

Why are you doing this array and cursor magic? @ypercube's solution does the work and it's very easy to wrap into a SQL language function.
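A minimal sketch of what such a wrapper might look like. The table demo_fn_obs and the function name demo_distinct_abcd are made up for illustration; point the body at your own table instead.

```sql
-- Hypothetical demo table and function name; adjust to your schema.
CREATE TABLE demo_fn_obs (a int, b int, c int, d int);
INSERT INTO demo_fn_obs VALUES (1, 2, 2, 3), (3, 4, 1, 1);

CREATE FUNCTION demo_distinct_abcd()
  RETURNS SETOF int
  LANGUAGE sql STABLE AS
$func$
   SELECT a FROM demo_fn_obs
   UNION SELECT b FROM demo_fn_obs
   UNION SELECT c FROM demo_fn_obs
   UNION SELECT d FROM demo_fn_obs;
$func$;

SELECT * FROM demo_distinct_abcd() AS n ORDER BY n;
-- returns 1, 2, 3, 4
```

No arrays, loops, or PL/pgSQL are involved; the function is just the plain UNION query with a name.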

Question 14

Sorry, I couldn't get your function to compile. I probably did something silly. If you manage to get it working here, please provide me with a link and I'll update my answer with results, so we can compare with the other answers.

Question 15

@ypercube Edited solution must work. Remember to change the separator in fiddle. I tested on my local db with table creation and works fine.

ypercubeTM · Accepted Answer · 2015-05-28 16:39:06Z

