
I have two tables, left2 and right2. Both tables will be large (1-10M rows).

CREATE TABLE left2(id INTEGER, t1 INTEGER, d INTEGER);
ALTER TABLE left2 ADD PRIMARY KEY (id,t1);
CREATE TABLE right2( t1 INTEGER, d INTEGER, arr INTEGER[] );
ALTER TABLE right2 ADD PRIMARY KEY(t1,d);

I will perform this type of query:

SELECT l.d + r.d,
 UNIQ(SORT(array_agg_mult(r.arr)))
FROM left2 l,
 right2 r
WHERE l.t1 = r.t1
GROUP BY l.d + r.d
ORDER BY l.d + r.d;

For the aggregation of arrays, I use this function:

CREATE AGGREGATE array_agg_mult(anyarray) (
  SFUNC    = array_cat,
  STYPE    = anyarray,
  INITCOND = '{}'
);

After concatenating the arrays, I use the UNIQ function of the intarray module. Is there a more efficient way of doing this? Is there any index on the arr field to speed up the merging (with removing duplicates)? Can the aggregate function remove duplicates directly? Original arrays may be considered sorted (and they are unique) if that helps.

The SQL Fiddle is here:

asked Oct 13, 2015 at 17:52
  • Are you going to query millions of rows at once? What are you doing with the result? Or will there be predicates to select a few? Can right2.arr be NULL like your demo schema suggests? Do you need sorted arrays as result? Commented Oct 14, 2015 at 0:14

2 Answers


Correct results?

First off: correctness. You want to produce an array of unique elements? Your current query does not do that. The function uniq() from the intarray module only promises to:

remove adjacent duplicates

As instructed in the manual, you would need:

SELECT l.d + r.d, uniq(sort(array_agg_mult(r.arr)))
FROM ...

This also gives you sorted arrays - assuming you want that (you did not clarify).

I see you have sort() in your fiddle, so this may just be a typo in your question.
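
A quick demonstration of the difference - a minimal sketch, assuming the intarray extension is installed (as it already is in your setup):

SELECT uniq('{1,2,1,2}'::int[]);        -- {1,2,1,2}  (no adjacent duplicates to remove)
SELECT uniq(sort('{1,2,1,2}'::int[]));  -- {1,2}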

Postgres 9.5 or later

Either way, since Postgres 9.5, array_agg() has the capabilities of my array_agg_mult() built in out of the box, and it is much faster, too:

There have also been other performance improvements for array handling.
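
For illustration, a minimal sketch of the built-in behavior (values made up): since Postgres 9.5, array_agg() accepts array input directly and aggregates into an array of one higher dimension (all inputs must share the same dimensionality):

SELECT array_agg(arr)
FROM  (VALUES ('{1,2}'::int[]), ('{3,4}'::int[])) t(arr);
-- {{1,2},{3,4}}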

Query

The main purpose of array_agg_mult() is to aggregate multi-dimensional arrays, but you only produce 1-dimensional arrays anyway. So I would at least try this alternative query:

SELECT l.d + r.d AS d_sum, array_agg(DISTINCT elem) AS result_arr
FROM left2 l
JOIN right2 r USING (t1)
 , unnest(r.arr) elem
GROUP BY 1
ORDER BY 1;

Which also addresses your question:

Can the aggregate function remove duplicates directly?

Yes, it can, with DISTINCT. But that's typically not faster than uniq(), which is optimized for integer arrays, while DISTINCT is generic and works for all qualifying data types.

It also doesn't require the intarray module. However, the result is not necessarily sorted: Postgres uses varying algorithms for DISTINCT, and big sets are typically hashed, which leaves the result unsorted unless you add an explicit ORDER BY. If you need sorted arrays, you could add ORDER BY to the aggregate function directly:

array_agg(DISTINCT elem ORDER BY elem)
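
For completeness, that variant embedded in the full query looks like this (same schema as above):

SELECT l.d + r.d AS d_sum, array_agg(DISTINCT elem ORDER BY elem) AS result_arr
FROM left2 l
JOIN right2 r USING (t1)
 , unnest(r.arr) elem
GROUP BY 1
ORDER BY 1;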

But that's typically slower than feeding pre-sorted data to array_agg(): many small sorts inside the aggregate versus one big sort up front. So I would sort in a subquery and then aggregate:

SELECT d_sum, uniq(array_agg(elem)) AS result_arr
FROM (
 SELECT l.d + r.d AS d_sum, elem
 FROM left2 l
 JOIN right2 r USING (t1)
 , unnest(r.arr) elem
 ORDER BY 1, 2
 ) sub
GROUP BY 1
ORDER BY 1;

This was the fastest variant in my cursory test on Postgres 9.4.

SQL Fiddle based on the one you provided.

Index

I don't see much potential for any index here. The only option would be:

CREATE INDEX ON right2 (t1, arr);

This only makes sense if you get index-only scans out of it - which will only pay off if the underlying table right2 is substantially wider than just these two columns and your setup qualifies for index-only scans. Details in the Postgres Wiki.
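
If you try it, here is a sketch of how you might verify whether the planner actually chooses an index-only scan (using the query from above):

VACUUM ANALYZE right2;   -- the visibility map must be reasonably current for index-only scans
EXPLAIN (ANALYZE, BUFFERS)
SELECT l.d + r.d AS d_sum, array_agg(DISTINCT elem) AS result_arr
FROM left2 l
JOIN right2 r USING (t1)
 , unnest(r.arr) elem
GROUP BY 1
ORDER BY 1;

Look for "Index Only Scan using ... on right2" in the resulting plan.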

answered Oct 14, 2015 at 0:58
  • Thanks +1. I will have to UNNEST later anyway, but want to check if removing duplicates in the arrays and then UNNEST is faster. Commented Oct 14, 2015 at 3:05

I'm really disappointed; this is an easy thing to do in Microsoft Access. You can create a "remove duplicates" query and then look at the SQL to see how it does it. I'd have to fire up a Windows machine to look - the generated queries vary, and the query wizard handles it.

One thing that works, I think, is to load all your data into one table and then do SELECT DISTINCT into a new table. You can also stick in an ORDER BY clause while you're at it. I did it somehow a year ago; that must be it.
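
A rough sketch of that approach in Postgres - the table and column names here are made up for illustration:

CREATE TABLE readings_clean AS
SELECT DISTINCT ts, temperature
FROM readings_raw
ORDER BY ts;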

I'm combining 2 years' worth of temperature data; the sensor sends 2 copies of the same data point every minute as a redundant safeguard. Sometimes one gets trashed, but I only want to keep one. I also have overlaps between files.

If the data is in exactly the same format over the whole run, on a Unix machine you can do something like:

cat *.tab > points.txt
sort -n < points.txt > sorted.txt
uniq sorted.txt unique.txt

But uniq compares lines as strings, so, for example, 18.7000 isn't the same as 18.7. I've changed my software during the 2 years, so I have both formats.

answered Jul 29, 2018 at 2:01
  • Disappointed with Postgres? Does Access even have arrays? Commented Jul 29, 2018 at 13:33
  • I don't know, but it can remove duplicates; it's a common enough problem in data cleansing. SELECT DISTINCT is close enough. You don't always have control over raw data from the real world. Commented Jul 30, 2018 at 0:51
