PostgreSQL - Create table as select with distinct on specific columns

Question 1

I need a script to create a table . The problem is my select for creating the table contains equal values in some rows. I need to use distinct for these columns. The other columns' values can come from any matching row.

My current result table has data like this:

| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
------------------------------------------------------------------
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityA | Street abc | 5 | 15.4 | 1.8 |
| CityA | Street abc | 5 | 12.4 | 2.8 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityB | Street xyz | 18 | 8.4 | 1.1 |
| CityC | Street klm | 55 | 9.6 | 0.8 |

But I need data like this:

| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
------------------------------------------------------------------
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityC | Street klm | 55 | 9.6 | 0.8 |

For columns city, street and street_num I need to apply distinct. val_x and val_y should be used anyone, for example first of that group with same city, street and street_num.

Can you give me advice how to edit this script?

Question 2

Will there be "duplicates" than need elimination between the different tables (a row from A and another from B)?

Question 3

I got your question wrong, I didn't notice that you were creating table, this is why my answer is irrelevant and I have deleted it. If you add your select query in the place of my_result table(refer to the fiddle) then it might work as you expected. And yes, my sample data was different. The values in my example were rounded up. Here is the fiddle.

Question 4

1. This is ambiguous: city, street and street_num I need to apply distinct. Is that supposed to be DISTINCT ON (city, street, street_num) or DISTINCT ON (city) and DISTINCT ON (street) and DISTINCT ON (street_num) - in this case you need to define priorities or you get arbitrary results. Very different cases. 2. Do val_x and val_y have to come from the same (even if arbitrary) row? 3. We need table definitions showing data types and constraints and your version of Postgres to give the best answer.

Question 5

@ErwinBrandstetter the answer to (2) is yes (was posted in a comment.)

Question 6

@ypercubeTM: Ah, I see. Goes to demonstrate the conventional wisdom: all defining information should go into the question ...

Question 7

"VAL_X" and "VAL_Y" chosen through some aggregate function

You should consider using GROUP BY for the columns whose values you consider that should be "distinct" (as a group), and, for the rest of columns, choose an appropriate aggregate function (for instance, MIN):

CREATE TABLE my_result AS 
SELECT
 city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
 tableA
WHERE
 true /* your condition goes here */ 
GROUP BY
 city, street, streetnum

If you need to put together values from several tables, UNION ALL of them before you GROUP BY:

CREATE TABLE my_result AS 
SELECT
 city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
 (
 SELECT city, street, streetnum, val_x, val_y FROM tableA
 UNION ALL
 SELECT city, street, streetnum, val_x, val_y FROM tableB
 UNION ALL
 SELECT city, street, streetnum, val_x, val_y FROM tableC
 ) AS s0
WHERE
 true /* your condition goes here */ 
GROUP BY
 city, street, streetnum ;

Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW

If you need to make sure your values are always from the same row, the best way is to use a WINDOW in your query: PARTITION BY "CITY", "STREET", "STREET_NUM" and ORDER BY "VAL_X", "VAL_Y", and choose the first row of every partition.

You can do this with two steps:

1) Add the row_num() to every partition:

SELECT 
 *, 
 (row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
FROM 
 table_a
 | CITY | STREET | STREET_NUM | VAL_X | VAL_Y | rn |
 |-------|------------|------------|-------|-------|----|
 | CityA | Street abc | 5 | 11.5 | 0.5 | 1 |
 | CityA | Street abc | 5 | 12.4 | 2.8 | 2 |
 | CityA | Street abc | 5 | 15.4 | 1.8 | 3 |
 | CityB | Street xyz | 18 | 5.4 | 1.9 | 1 |
 | CityB | Street xyz | 18 | 8.4 | 1.1 | 2 |
 | CityC | Street klm | 55 | 9.6 | 0.8 | 1 |

2) At this point, choose only the rows WHERE rn=1 (and ORDER them, if necessary):

SELECT
 "CITY", "STREET", "STREET_NUM", "VAL_X", "VAL_Y"
FROM
 (
 SELECT 
 *, 
 (row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
 FROM 
 table_a
 ) AS table_a_grouped 
WHERE
 rn = 1
ORDER BY 
 "CITY", "STREET", "STREET_NUM"

The result is:

| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
|-------|------------|------------|-------|-------|
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityC | Street klm | 55 | 9.6 | 0.8 |

You can see the example at SQLFiddle

Question 8

but this should got me val_x from 1. record and val_y from last recors. I need this couple from anyone row, but must be from same record

Question 9

You can use a FIRST aggregate function, as explained wiki.postgresql.org/wiki/First/last_(aggregate). You could also use WINDOWs. It would make much more sense if there were a concept of ORDER.

Question 10

Added WINDOW example.

Question 11

Assumptions:

Resulting rows shall be distinct on (city, street, street_num) i.e. the combination of these 3 columns shall be unique. Individual columns can have dupes.
val_x and val_y (and possibly more columns) shall come from the same source row (they make sense together, like coordinates / a geocode).
All columns are defined NOT NULL.
_{NULL values are considered equal in by DISTINCT or DISTINCT ON (per SQL standard). So the query works with NULL values all the same. But if you add ORDER BY to define which row to pick from each set of dupes, pay attention if expressions can be NULL. See linked answer below for details.}

UNION ALL multiple sources, then apply DISTINCT ON (city, street, street_num) on the derived table:

CREATE TABLE my_result AS 
SELECT DISTINCT ON (city, street, street_num)
 city, street, street_num, val_x, val_y -- more columns?
FROM (
 SELECT city, street, street_num, val_x, val_y -- more columns?
 FROM tableA
 WHERE ...
 UNION ALL -- here probably cheaper than UNION
 SELECT city, street, street_num, val_x, val_y -- more columns?
 FROM tableB
 WHERE ...
 ) sub;

Results as desired.

dbfiddle here

It's probably cheaper to use a simple UNION ALL instead of UNION - which would not be wrong but impose a separate sort or hash operation on each SELECT of the UNION query. Since we do that (and more) in the outer query anyway, UNION ALL is probably cheaper. (Test to verify.)

Combining 2 SELECT queries and printing the results in PostgreSQL

I did not add ORDER BY, since you explicitly stated:

The other columns' values can come from any matching row.

You get an arbitrary pick from each set of dupes. The result can change with every invocation.

Works in basically any Postgres version. There are query techniques with LATERAL joins, CTE or window functions that require a more modern version.

Detailed discussion of the technique, indexes and possible alternatives:

Select first row in each GROUP BY group?

Question 12

Yeah, that's exactly what I was thinking. Another possible solution - which I think could be efficient - but should be used only if there are duplicates only in each table and not across tables (as it would yield different results otherwise):

SELECT DISTINCT ON (city, street, street_num) * FROM tableA UNION ALL SELECT DISTINCT ON (city, street, street_num) * FROM tableB ... ;

joanolo joanolo 13.7k8 gold badges39 silver badges67 bronze badges · Accepted Answer · 2017-03-12 14:39:11Z

"VAL_X" and "VAL_Y" chosen through some aggregate function

You should consider using GROUP BY for the columns whose values you consider that should be "distinct" (as a group), and, for the rest of columns, choose an appropriate aggregate function (for instance, MIN):

CREATE TABLE my_result AS 
SELECT
 city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
 tableA
WHERE
 true /* your condition goes here */ 
GROUP BY
 city, street, streetnum

If you need to put together values from several tables, UNION ALL of them before you GROUP BY:

CREATE TABLE my_result AS 
SELECT
 city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
 (
 SELECT city, street, streetnum, val_x, val_y FROM tableA
 UNION ALL
 SELECT city, street, streetnum, val_x, val_y FROM tableB
 UNION ALL
 SELECT city, street, streetnum, val_x, val_y FROM tableC
 ) AS s0
WHERE
 true /* your condition goes here */ 
GROUP BY
 city, street, streetnum ;

Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW

If you need to make sure your values are always from the same row, the best way is to use a WINDOW in your query: PARTITION BY "CITY", "STREET", "STREET_NUM" and ORDER BY "VAL_X", "VAL_Y", and choose the first row of every partition.

You can do this with two steps:

1) Add the row_num() to every partition:

SELECT 
 *, 
 (row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
FROM 
 table_a
 | CITY | STREET | STREET_NUM | VAL_X | VAL_Y | rn |
 |-------|------------|------------|-------|-------|----|
 | CityA | Street abc | 5 | 11.5 | 0.5 | 1 |
 | CityA | Street abc | 5 | 12.4 | 2.8 | 2 |
 | CityA | Street abc | 5 | 15.4 | 1.8 | 3 |
 | CityB | Street xyz | 18 | 5.4 | 1.9 | 1 |
 | CityB | Street xyz | 18 | 8.4 | 1.1 | 2 |
 | CityC | Street klm | 55 | 9.6 | 0.8 | 1 |

2) At this point, choose only the rows WHERE rn=1 (and ORDER them, if necessary):

SELECT
 "CITY", "STREET", "STREET_NUM", "VAL_X", "VAL_Y"
FROM
 (
 SELECT 
 *, 
 (row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
 FROM 
 table_a
 ) AS table_a_grouped 
WHERE
 rn = 1
ORDER BY 
 "CITY", "STREET", "STREET_NUM"

The result is:

| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
|-------|------------|------------|-------|-------|
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityC | Street klm | 55 | 9.6 | 0.8 |

You can see the example at SQLFiddle

but this should got me val_x from 1. record and val_y from last recors. I need this couple from anyone row, but must be from same record
You can use a FIRST aggregate function, as explained wiki.postgresql.org/wiki/First/last_(aggregate). You could also use WINDOWs. It would make much more sense if there were a concept of ORDER.

Stack Exchange Network

PostgreSQL - Create table as select with distinct on specific columns

2 Answers 2

"VAL_X" and "VAL_Y" chosen through some aggregate function

Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

PostgreSQL - Create table as select with distinct on specific columns

2 Answers 2

"VAL_X" and "VAL_Y" chosen through some aggregate function

Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions