I have a matrix data1 with 4 columns and would like to remove duplicate rows by looking only at the first 3 columns, so I used the code below:
data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];
select1 = Timing@DeleteDuplicates[data1, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &]
The code works, but as the range of values and the number of rows grow, it slows down very quickly. There must be a quicker way, because if I drop the column that is not being compared before calling DeleteDuplicates, it runs more than 10 times faster:
select2 = Timing@DeleteDuplicates@data1[[All, 1 ;; 3]]
The drawback of select2 is that I lose the fourth column. Of course I could add the lost column back to select2, but the code becomes very clumsy, and I wonder whether there is some other way to get it done.
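As a rough sketch of what I mean by clumsy (hypothetical code, using SelectFirst to look the fourth column up again for every kept key):

select2 = DeleteDuplicates[data1[[All, 1 ;; 3]]];
(* re-scan data1 for each kept key to recover the fourth column; awkward and slow *)
select2full = Function[key, SelectFirst[data1, #[[1 ;; 3]] == key &]] /@ select2;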
Many thanks!
4 Answers
Use DeleteDuplicatesBy with an appropriate function that selects only the first three columns:
moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
(* Yours *)
AbsoluteTiming[
DeleteDuplicates[moredata, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];
]
(* 5.38 s *)
(* Slightly improved, still using DeleteDuplicates *)
AbsoluteTiming[DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &];]
(* 2.44 s *)
(* DeleteDuplicatesBy *)
AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
(* 0.05 s *)
The latter takes 0.05 seconds whereas your original formulation takes roughly 5 seconds, so a 100x speedup.
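As a quick sanity check (a verification worth running, not shown in the timing runs above), the two formulations should return exactly the same rows in the same order, since both keep the first occurrence of each key:

DeleteDuplicatesBy[moredata, #[[;; 3]] &] ===
 DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &]
(* True *)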
This is not meant entirely seriously, but in some performance-critical situations, SparseArrays can perform miracles.
moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
a = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; // RepeatedTiming // First
b = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
With[{p = moredata[[All, 1 ;; 3]] + 1},
SparseArray[p -> Range[Length[p]], Max /@ Transpose[p]]
]
]; // RepeatedTiming // First
Sort@a == Sort@b
0.041
0.0039
True
Note that you can save even more time if you know the maximal dimensions of the SparseArray in advance.
c = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
With[{p = moredata[[All, 1 ;; 3]] + 1},
SparseArray[p -> Range[Length[p]], {3,3,3}]
]
]; // RepeatedTiming // First
0.00335
I'd like to add that my approach exploits several assumptions, so it has some severe limitations:
- All entries in moredata[[All, 1 ;; 3]] are nonnegative integers, so that (after adding 1) they can be used as the sparsity pattern of a SparseArray; see the small illustration below.
- The ordering of the result does not really matter to you.
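A minimal illustration of the first point (with hypothetical keys, not the data from the question): the + 1 merely shifts the 0..2 entries to valid 1-based positions, so for key columns over a different known integer range one would shift and size the array accordingly.

keys = RandomInteger[{-5, 5}, {1000, 3}]; (* hypothetical key columns with entries in -5 .. 5 *)
sa = SparseArray[(keys + 6) -> Range[Length[keys]], {11, 11, 11}]; (* shift by 6 instead of 1 *)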
- What's bla in Range[Length[bla]]? – rcollyer, Jun 15, 2018 at 20:42
- Sorry. My generic variable when I am uncreative. I forgot to rename it. Fixed it. – Henrik Schumacher, Jun 15, 2018 at 20:43
- Thanks. I'll test it now. :) – rcollyer, Jun 15, 2018 at 20:44
- The only caveat with yours is that it is not order preserving. But, still lightning fast. – rcollyer, Jun 15, 2018 at 20:47
- @rcollyer Yeah, there is a price for everything you know... ;) – Henrik Schumacher, Jun 15, 2018 at 20:49
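If the original row order does matter, a sketch along the following lines should recover it: keep the stored row indices and sort them. This assumes that SparseArray uses the first rule given for a repeated position, which the Sort@a == Sort@b check above is consistent with.

keepIdx = With[{p = moredata[[All, 1 ;; 3]] + 1},
   SparseArray[p -> Range[Length[p]], {3, 3, 3}]["NonzeroValues"]
   ];
ordered = moredata[[Sort[keepIdx]]]; (* kept rows, back in their original order *)
ordered === a (* expected True under the first-rule-wins assumption *)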
Here are a couple of alternatives using other functions.
In[64]:= moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
In[86]:= AbsoluteTiming[DeleteDuplicates[moredata,
{#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];]
Out[86]= {4.73677, Null}
(* MarcoB's DeleteDuplicatesBy *)
In[88]:= AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
Out[88]= {0.044251, Null}
(* a Reap/Sow based solution *)
In[80]:= Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]]; // AbsoluteTiming
Out[80]= {0.161927, Null}
(* GroupBy using the third argument to reduce the results *)
In[82]:= Values@GroupBy[moredata, (#[[;; 3]] &), First]; // AbsoluteTiming
Out[82]= {0.041241, Null}
(* GroupBy using Part instead of First *)
In[81]:= Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
Out[81]= {0.038312, Null}
- The GroupBy approach is pretty cool. I would not have thought of that! – MarcoB, Jun 15, 2018 at 20:40
Here is a refinement of @rcollyer's answer using my GatherByList function. The function is short:
GatherByList[list_, representatives_] := Module[{func},
 (* GatherBy[list, func] maps func over list internally; this UpValue intercepts
    that Map and returns the precomputed representatives, so list is gathered
    according to them without ever evaluating func on the elements *)
 func /: Map[func, _] := representatives;
 GatherBy[list, func]
]
And using GatherByList:
moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
(* MarcoB *)
r1 = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; //AbsoluteTiming
(* rcollyer *)
r2 = Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
(* GatherByList *)
r3 = GatherByList[moredata, moredata[[All,;;3]]][[All, 1]]; //AbsoluteTiming
r1 === r2 === r3
{0.047484, Null}
{0.043889, Null}
{0.029294, Null}
True
Not as fast as @Henrik's answer, of course.
- DeleteDuplicatesBy[data1, #[[;; 3]] &] is certainly more readable, and I would expect faster as well.
- The DeleteDuplicatesBy version still edges out a narrow win over the faster of your two GroupBy approaches, at least on my machine :-)