I need a script to create a table . The problem is my select for creating the table contains equal values in some rows. I need to use distinct for these columns. The other columns' values can come from any matching row.
My current result table has data like this:
| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
------------------------------------------------------------------
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityA | Street abc | 5 | 15.4 | 1.8 |
| CityA | Street abc | 5 | 12.4 | 2.8 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityB | Street xyz | 18 | 8.4 | 1.1 |
| CityC | Street klm | 55 | 9.6 | 0.8 |
But I need data like this:
| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
------------------------------------------------------------------
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityC | Street klm | 55 | 9.6 | 0.8 |
For columns city, street and street_num I need to apply distinct. val_x and val_y should be used anyone, for example first of that group with same city, street and street_num.
Can you give me advice how to edit this script?
2 Answers 2
"VAL_X" and "VAL_Y" chosen through some aggregate function
You should consider using GROUP BY
for the columns whose values you consider that should be "distinct" (as a group), and, for the rest of columns, choose an appropriate aggregate function (for instance, MIN
):
CREATE TABLE my_result AS
SELECT
city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
tableA
WHERE
true /* your condition goes here */
GROUP BY
city, street, streetnum
If you need to put together values from several tables, UNION ALL
of them before you GROUP BY
:
CREATE TABLE my_result AS
SELECT
city, street, streetnum, min(val_x) AS val_x, min(val_y) AS val_y
FROM
(
SELECT city, street, streetnum, val_x, val_y FROM tableA
UNION ALL
SELECT city, street, streetnum, val_x, val_y FROM tableB
UNION ALL
SELECT city, street, streetnum, val_x, val_y FROM tableC
) AS s0
WHERE
true /* your condition goes here */
GROUP BY
city, street, streetnum ;
Using always "VAL_X" and "VAL_Y" from same row, using a WINDOW
If you need to make sure your values are always from the same row, the best way is to use a WINDOW
in your query: PARTITION BY "CITY", "STREET", "STREET_NUM"
and ORDER BY "VAL_X", "VAL_Y"
, and choose the first row of every partition.
You can do this with two steps:
1) Add the row_num()
to every partition:
SELECT
*,
(row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
FROM
table_a
| CITY | STREET | STREET_NUM | VAL_X | VAL_Y | rn |
|-------|------------|------------|-------|-------|----|
| CityA | Street abc | 5 | 11.5 | 0.5 | 1 |
| CityA | Street abc | 5 | 12.4 | 2.8 | 2 |
| CityA | Street abc | 5 | 15.4 | 1.8 | 3 |
| CityB | Street xyz | 18 | 5.4 | 1.9 | 1 |
| CityB | Street xyz | 18 | 8.4 | 1.1 | 2 |
| CityC | Street klm | 55 | 9.6 | 0.8 | 1 |
2) At this point, choose only the rows WHERE rn=1
(and ORDER
them, if necessary):
SELECT
"CITY", "STREET", "STREET_NUM", "VAL_X", "VAL_Y"
FROM
(
SELECT
*,
(row_number() OVER (PARTITION BY "CITY", "STREET", "STREET_NUM" ORDER BY "VAL_X", "VAL_Y")) AS rn
FROM
table_a
) AS table_a_grouped
WHERE
rn = 1
ORDER BY
"CITY", "STREET", "STREET_NUM"
The result is:
| CITY | STREET | STREET_NUM | VAL_X | VAL_Y |
|-------|------------|------------|-------|-------|
| CityA | Street abc | 5 | 11.5 | 0.5 |
| CityB | Street xyz | 18 | 5.4 | 1.9 |
| CityC | Street klm | 55 | 9.6 | 0.8 |
You can see the example at SQLFiddle
-
but this should got me val_x from 1. record and val_y from last recors. I need this couple from anyone row, but must be from same recordDenis Stephanov– Denis Stephanov2017年03月12日 14:43:15 +00:00Commented Mar 12, 2017 at 14:43
-
You can use a
FIRST
aggregate function, as explained wiki.postgresql.org/wiki/First/last_(aggregate). You could also use WINDOWs. It would make much more sense if there were a concept of ORDER.joanolo– joanolo2017年03月12日 14:46:03 +00:00Commented Mar 12, 2017 at 14:46 -
Assumptions:
Resulting rows shall be distinct on
(city, street, street_num)
i.e. the combination of these 3 columns shall be unique. Individual columns can have dupes.val_x
andval_y
(and possibly more columns) shall come from the same source row (they make sense together, like coordinates / a geocode).All columns are defined
NOT NULL
.
NULL values are considered equal in byDISTINCT
orDISTINCT ON
(per SQL standard). So the query works with NULL values all the same. But if you addORDER BY
to define which row to pick from each set of dupes, pay attention if expressions can be NULL. See linked answer below for details.
UNION ALL
multiple sources, then apply DISTINCT ON (city, street, street_num)
on the derived table:
CREATE TABLE my_result AS
SELECT DISTINCT ON (city, street, street_num)
city, street, street_num, val_x, val_y -- more columns?
FROM (
SELECT city, street, street_num, val_x, val_y -- more columns?
FROM tableA
WHERE ...
UNION ALL -- here probably cheaper than UNION
SELECT city, street, street_num, val_x, val_y -- more columns?
FROM tableB
WHERE ...
) sub;
Results as desired.
dbfiddle here
It's probably cheaper to use a simple UNION ALL
instead of UNION
- which would not be wrong but impose a separate sort or hash operation on each SELECT
of the UNION
query. Since we do that (and more) in the outer query anyway, UNION ALL
is probably cheaper. (Test to verify.)
I did not add ORDER BY
, since you explicitly stated:
The other columns' values can come from any matching row.
You get an arbitrary pick from each set of dupes. The result can change with every invocation.
Works in basically any Postgres version. There are query techniques with LATERAL
joins, CTE or window functions that require a more modern version.
Detailed discussion of the technique, indexes and possible alternatives:
-
1Yeah, that's exactly what I was thinking. Another possible solution - which I think could be efficient - but should be used only if there are duplicates only in each table and not across tables (as it would yield different results otherwise):
SELECT DISTINCT ON (city, street, street_num) * FROM tableA UNION ALL SELECT DISTINCT ON (city, street, street_num) * FROM tableB ... ;
ypercubeᵀᴹ– ypercubeᵀᴹ2017年03月13日 10:42:07 +00:00Commented Mar 13, 2017 at 10:42
my_result
table(refer to the fiddle) then it might work as you expected. And yes, my sample data was different. The values in my example were rounded up. Here is the fiddle.city, street and street_num I need to apply distinct
. Is that supposed to beDISTINCT ON (city, street, street_num)
orDISTINCT ON (city)
andDISTINCT ON (street)
andDISTINCT ON (street_num)
- in this case you need to define priorities or you get arbitrary results. Very different cases. 2. Doval_x
andval_y
have to come from the same (even if arbitrary) row? 3. We need table definitions showing data types and constraints and your version of Postgres to give the best answer.