I'm using Postgres 9.5
and I have a column, named phones
, of type text[]
.
I need to find all rows where this column contains duplicates.
I've found this extremely useful set of functions https://github.com/JDBurnZ/postgresql-anyarray and I could use this particular function https://github.com/JDBurnZ/postgresql-anyarray/blob/master/stable/anyarray_uniq.sql
select * from user_info
where anyarray_uniq(phones) <> phones
I was wondering though if there is a faster way of achieving what I want. Maybe unnesting the array and using the window functionality would be better? Although I can find my way around SQL, I'm new to Postgres' specific best practices, so any help is welcome.
Is this better suited for CodeReview?
2 Answers 2
1) Function anyarray_uniq
can be simplified in several ways to make it faster (note that in the function's body the input parameter can be accessed not only by the name but also by the number: $<n>
):
create or replace function array_deldup1(anyarray) returns anyarray as $body$
declare
result 1ドル%type = '{}';
i int;
begin
for i in array_lower(1,ドル 1)..array_upper(1,ドル 1) loop
if array_position(result, 1ドル[i]) is null then -- function was introduced in 9.5 version
result := result || 1ドル[i];
end if;
end loop;
return result;
end $body$ language plpgsql immutable;
or yet simpler using pure SQL:
create or replace function array_deldup2(anyarray) returns anyarray as $body$
select array_agg(x order by n)
from (
select distinct on (x) x, n
from unnest(1ドル) with ordinality as t(x,n) order by x, n) as t(x,n);
$body$ language sql immutable;
Second one is slower then first but still faster then the original on my tests.
Those functions doing exactly the same thing as anyarray_uniq
(removes duplicates and keeps the order of the elements), but for your purpose the order is irrelevant, so the simplest way (using function) is
create or replace function array_deldup3(anyarray) returns anyarray as $body$
select array_agg(distinct x) from unnest(1ドル) t(x);
-- Or yet another syntax doing the same thing:
-- select array(select distinct unnest(1ドル));
$body$ language sql immutable;
and now because the elements order changed you should to compare the arrays length instead of its content:
select * from user_info
where array_length(array_deldup3(phones), 1) <> array_length(phones, 1)
2) To achieve your goal you are doing ambiguous work by calling the function (it is also slowing down the query performance), calculating the result as array without duplicates and finally comparing two arrays. The actual goal is to compare the whole array length against the count of the distinct values:
select * from user_info
where (select count(x) <> count(distinct x) from unnest(phones) as t(x))
Upd:
3) When you fix your data using one of the functions above
update user_info set phones = array_deldup<n>(phones);
you can avoid those situation by creating constraint on the field:
create or replace function array_havedup(anyarray) returns boolean as $body$
select count(x) <> count(distinct x) from unnest(1ドル) as t(x);
$body$ language sql immutable;
alter table user_info add constraint chk_user_info_phone check (not array_havedup(phones));
Actually you can use this function in the question's query:
select * from user_info where array_havedup(phones); -- Simple, isn't it?
4) Try to follow to the common database designing rules called "database normalization". The example you provided is exactly about the First and Second normal forms.
Let's imagine that you need the phone's additional info like "home/work/mobile", "internal code", "availability time" and so on. Using your current design it can be problematic.
-
@alkis Small update provided. Actually the (4) item should be the main item of the whole answer.Abelisto– Abelisto2016年07月11日 02:24:28 +00:00Commented Jul 11, 2016 at 2:24
Among the fastest ways to run such a query is with a specialized index:
create index on user_info ((anyarray_uniq(phones) <> phones)) where
anyarray_uniq(phones) <> phones;
That way the work of doing the comparisons happens when the records are inserted or updated, not when they are selected. (To get this to work, you will have to mark as anyarray_uniq as IMMUTABLE, but as far as I can tell this is an accurate way to mark it)
-
+1. Nice to know, thank you. In this case though I need something that won't add an overhead to the other operations (insert, update).Alkis Kalogeris– Alkis Kalogeris2016年07月09日 19:03:30 +00:00Commented Jul 9, 2016 at 19:03
... where (select count(x) <> count(distinct x) from unnest(phones) as t(x))
unnest(phones)
set-returning function. Without it you should to writewhere (select count(unnest) <> count(distinct unnest) from unnest(phones))
what could be confusing.phones
is not defined withcollate C
but could be due to containing US-ASCII only, usephones collate "C"
in the query. The comparisons will be much faster. (4x faster thanen_US.utf8
on linux for instance).