Fastest query for selecting arrays that contain duplicates

Question 1

I'm using Postgres 9.5 and I have a column, named phones, of type text[]. I need to find all rows where this column contains duplicates.

I've found this extremely useful set of functions https://github.com/JDBurnZ/postgresql-anyarray and I could use this particular function https://github.com/JDBurnZ/postgresql-anyarray/blob/master/stable/anyarray_uniq.sql

select * from user_info 
where anyarray_uniq(phones) <> phones

I was wondering though if there is a faster way of achieving what I want. Maybe unnesting the array and using the window functionality would be better? Although I can find my way around SQL, I'm new to Postgres' specific best practices, so any help is welcome.

Is this better suited for CodeReview?

Question 2

Try ... where (select count(x) <> count(distinct x) from unnest(phones) as t(x))

Question 3

Can you provide this as an answer with some explanation? The purpose of this post is not to solve a problem, as I already have a solution. The point is to find the optimal, and of course understand the thinking behind it. Thank you for your time

Question 4

@Abelisto hello again, can you please explain the logic behind t(x)?

Question 5

It is just an alias for the set-returning function unnest(phones): t names the result set and x names its single column. Without it you would have to write where (select count(unnest) <> count(distinct unnest) from unnest(phones)), which could be confusing.
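A quick way to see what the alias does is to run unnest on a literal array instead of the phones column (the sample values here are made up):

```sql
-- With the alias: the result set is named t and its only column is named x.
select x from unnest(array['111', '222', '111']) as t(x);

-- Without the alias: the column takes the function's own name, unnest.
select unnest from unnest(array['111', '222', '111']);
```

Both statements return the same three rows; only the column name differs.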

Question 6

Note that if phones is not declared with collate "C" but could be, because it contains only US-ASCII characters, you can use phones collate "C" in the query. The comparisons will be much faster (about 4x faster than en_US.utf8 on Linux, for instance).
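As a rough sketch of how that could look with the count/distinct query from the earlier comment, here the collation is applied to the unnested element rather than to the whole column. This assumes phones really holds US-ASCII only, and the actual speedup will depend on the system and the data:

```sql
-- Sketch: force the "C" collation for the duplicate check.
-- Assumption: every phone is plain US-ASCII, so "C" ordering is acceptable.
select *
from user_info
where (select count(x) <> count(distinct x collate "C")
       from unnest(phones) as t(x));
```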

Question 7

1) The function anyarray_uniq can be simplified in several ways to make it faster (note that inside the function body an input parameter can be referenced not only by name but also by position: $<n>):

create or replace function array_deldup1(anyarray) returns anyarray as $body$
declare
  result $1%type = '{}';
  i int;
begin
  for i in array_lower($1, 1)..array_upper($1, 1) loop
    if array_position(result, $1[i]) is null then -- array_position was introduced in version 9.5
      result := result || $1[i];
    end if;
  end loop;
  return result;
end $body$ language plpgsql immutable;

or yet simpler using pure SQL:

create or replace function array_deldup2(anyarray) returns anyarray as $body$
  select array_agg(x order by n)
  from (
    select distinct on (x) x, n
    from unnest($1) with ordinality as t(x, n)
    order by x, n
  ) as t(x, n);
$body$ language sql immutable;

The second one is slower than the first but still faster than the original in my tests.

These functions do exactly the same thing as anyarray_uniq (remove duplicates and keep the order of the elements), but for your purpose the order is irrelevant, so the simplest way (using a function) is

create or replace function array_deldup3(anyarray) returns anyarray as $body$
  select array_agg(distinct x) from unnest($1) t(x);
  -- Or yet another syntax doing the same thing:
  -- select array(select distinct unnest($1));
$body$ language sql immutable;

and now, because the order of the elements may change, you should compare the array lengths instead of their contents:

select * from user_info
where array_length(array_deldup3(phones), 1) <> array_length(phones, 1)
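To see why comparing contents stops working once the order is not preserved, here is a small standalone illustration on a made-up literal array. It inlines the two deduplication techniques rather than calling any function, so it runs on its own:

```sql
-- Order-preserving deduplication (distinct on + ordinality): gives {b,a}.
select array_agg(x order by n) as keeps_order
from (
  select distinct on (x) x, n
  from unnest(array['b', 'a', 'b']) with ordinality as t(x, n)
  order by x, n
) as s;

-- array_agg(distinct ...): duplicates are gone, but the elements typically
-- come back sorted ({a,b}), so the result differs from the input
-- even where nothing was duplicated.
select array_agg(distinct x) as order_not_kept
from unnest(array['b', 'a', 'b']) as t(x);
```

One thing to keep in mind with the length comparison: array_length returns NULL for an empty array, so rows with empty arrays are simply not selected, which happens to be the desired outcome here.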

2) To achieve your goal you are doing unnecessary work: calling a function (which also slows the query down), building a deduplicated array, and finally comparing two arrays. The actual goal is just to compare the total number of elements against the number of distinct values:

select * from user_info 
where (select count(x) <> count(distinct x) from unnest(phones) as t(x))
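Here is a minimal worked example of that predicate on throwaway data. The CTE name sample and its rows are invented for illustration, so it can be run without touching user_info:

```sql
with sample(id, phones) as (
  values (1, array['111', '222']::text[]),         -- no duplicates
         (2, array['111', '222', '111']::text[]),  -- '111' appears twice
         (3, array[]::text[]),                     -- empty array
         (4, array[null, null]::text[])            -- two NULL elements
)
select id
from sample
where (select count(x) <> count(distinct x) from unnest(phones) as t(x));
-- returns only id 2
```

Row 4 is not reported because both count(x) and count(distinct x) ignore NULLs. If repeated NULL elements should count as duplicates, the predicate would need to be extended.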

Upd:
3) Once you have fixed your data using one of the functions above

update user_info set phones = array_deldup<n>(phones);

you can prevent such situations in the future by creating a constraint on the column:

create or replace function array_havedup(anyarray) returns boolean as $body$
  select count(x) <> count(distinct x) from unnest($1) as t(x);
$body$ language sql immutable;
alter table user_info add constraint chk_user_info_phone check (not array_havedup(phones));

Actually you can use this function in the question's query:

select * from user_info where array_havedup(phones); -- Simple, isn't it?

4) Try to follow the common database design rules known as "database normalization". The example you provided is exactly what the First and Second normal forms are about.

Let's imagine that you need additional info for each phone, such as "home/work/mobile", "internal code", "availability time" and so on. With your current design that can be problematic.
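A normalized layout could look roughly like the sketch below. Everything in it is hypothetical: the table user_phone, its columns, and the assumption that user_info has an integer id column are illustrative, not taken from the question.

```sql
-- Hypothetical normalized design: one row per (user, phone).
create table user_phone (
    user_id int  not null,        -- assumed to match user_info.id
    phone   text not null,
    kind    text,                 -- e.g. 'home', 'work', 'mobile'
    primary key (user_id, phone)  -- the same phone cannot repeat for a user
);

-- Possible one-off migration from the array column; distinct drops the
-- duplicates that the array currently allows.
insert into user_phone (user_id, phone)
select distinct u.id, t.x
from user_info as u, unnest(u.phones) as t(x)
where t.x is not null;
```

With this layout the primary key rules out duplicates up front, and per-phone attributes become ordinary columns.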

Question 8

@alkis Small update provided. Actually, item (4) should be the main point of the whole answer.

Question 9

Among the fastest ways to run such a query is with a specialized index:

create index on user_info ((anyarray_uniq(phones) <> phones)) where
 anyarray_uniq(phones) <> phones;

That way the work of doing the comparisons happens when the records are inserted or updated, not when they are selected. (To get this to work, you will have to mark anyarray_uniq as IMMUTABLE, but as far as I can tell that is an accurate way to mark it.)
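Put together, the steps might look like the following sketch. It assumes anyarray_uniq is installed with the signature anyarray_uniq(anyarray), and the index name is made up:

```sql
-- Assumption: the installed function has the signature anyarray_uniq(anyarray).
alter function anyarray_uniq(anyarray) immutable;

-- Partial index that only contains the rows with duplicates.
create index user_info_dup_phones_idx  -- hypothetical index name
    on user_info ((anyarray_uniq(phones) <> phones))
    where anyarray_uniq(phones) <> phones;

-- The query should repeat the index predicate so that the planner
-- can consider the partial index.
select * from user_info
where anyarray_uniq(phones) <> phones;
```

Since the index holds only the offending rows, it stays small when duplicates are rare, but every insert and update still pays for evaluating the predicate.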

Question 10

+1. Nice to know, thank you. In this case though I need something that won't add an overhead to the other operations (insert, update).

Accepted answer by Abelisto · 2016-07-10 11:03:43Z

