
I have a dataset data1 with 4 columns and would like to remove duplicate rows by comparing only the first 3 columns. So I used the code below:

data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];
select1 = Timing@DeleteDuplicates[data1, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &]

The code works, but as the number of rows and the range of values grow, it slows down very quickly. There must be a quicker way, because if I drop the non-compared column before calling DeleteDuplicates, it runs more than 10 times faster:

select2 = Timing@DeleteDuplicates@data1[[All, 1 ;; 3]]

But the drawback of select2 is that I lose the fourth column. Of course, I could join the lost column back onto select2, but the code becomes very clumsy, so I wonder whether there is a cleaner way to get this done.
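For reference, one sketch of the "add the lost column back" route: map each unique key (the first three columns) to the first full row that carries it. The names firstRow and uniqueKeys are illustrative, not from the original code:

```mathematica
data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];

(* Association keeps the last rule for a repeated key, so reversing the
   rule list makes it retain the *first* occurrence of each key *)
firstRow = Association[Reverse[Thread[data1[[All, 1 ;; 3]] -> data1]]];
uniqueKeys = DeleteDuplicates[data1[[All, 1 ;; 3]]];
select2full = firstRow /@ uniqueKeys;  (* unique rows, fourth column intact *)
```

This keeps the speed of the column-sliced DeleteDuplicates, at the cost of an extra pass to build the lookup.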

Many thanks!

asked Jun 15, 2018 at 20:07
  • DeleteDuplicatesBy[data1, #[[;; 3]] &] is certainly more readable, and I would expect faster as well. Commented Jun 15, 2018 at 20:12
  • You should wait to mark something as accepted. I put up an answer that is actually a little faster than Marco's. :) Commented Jun 15, 2018 at 20:30
  • @rcollyer At larger scale, e.g. with 2,000,000 points, the DeleteDuplicatesBy version still edges out a narrow win over the faster of your two GroupBy approaches, at least on my machine :-) Commented Jun 15, 2018 at 20:36
  • @MarcoB on my machine, the switchover is at approx. 425000 points. And, Henrik's is an order of magnitude faster than ours, with the caveat of not being order preserving. If that is not needed, use his, otherwise use mine for shorter lengths, and yours for longer. Win. Commented Jun 15, 2018 at 20:48

4 Answers


Use DeleteDuplicatesBy with an appropriate function that selects only the first three columns:

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
(* Yours *)
AbsoluteTiming[
 DeleteDuplicates[moredata, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];
]
(* 5.38 s *)
(* Slightly improved, still using DeleteDuplicates *)
AbsoluteTiming[DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &];]
(* 2.44 s *)
(* DeleteDuplicatesBy *)
AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
(* 0.05 s *)

The latter takes 0.05 seconds, whereas your original formulation takes roughly 5 seconds: about a 100× speedup.

answered Jun 15, 2018 at 20:15

This is not meant entirely seriously, but in some performance-critical situations, SparseArrays can work miracles.

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
a = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; // RepeatedTiming // First
b = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
 With[{p = moredata[[All, 1 ;; 3]] + 1}, 
 SparseArray[p -> Range[Length[p]], Max /@ Transpose[p]]
 ]
 ]; // RepeatedTiming // First
Sort@a == Sort@b

0.041

0.0039

True

Note that you can save even more time if you know the maximal dimensions of the SparseArray in advance.

c = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
 With[{p = moredata[[All, 1 ;; 3]] + 1}, 
 SparseArray[p -> Range[Length[p]], {3,3,3}]
 ]
 ]; // RepeatedTiming // First

0.00335

I'd like to add that my approach exploits several assumptions, so it has some severe limitations:

  1. All entries in moredata[[All, 1 ;; 3]] are nonnegative integers, so that they can be used as sparsity pattern of a SparseArray (after adding 1).

  2. The ordering of the result does not really matter to you.
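A minimal sketch of the mechanism, assuming the default handling of repeated entries (where, to my understanding, the first rule given for a position wins; this behavior is controlled by the "TreatRepeatedEntries" system option):

```mathematica
(* position {1, 1} appears twice; under the default repeated-entry rule
   the first value (10) survives, so using Range[Length[p]] as values
   records the index of the first occurrence of each key *)
SparseArray[{{1, 1} -> 10, {1, 1} -> 20, {2, 2} -> 30}] // Normal
```

That is exactly what lets the sparsity pattern act as a de-duplicator over the first three columns.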

answered Jun 15, 2018 at 20:38
  • what's bla in Range[Length[bla]]? Commented Jun 15, 2018 at 20:42
  • Sorry. My generic variable when I am uncreative. I forgot to rename it. Fixed it. Commented Jun 15, 2018 at 20:43
  • Thanks. I'll test it now. :) Commented Jun 15, 2018 at 20:44
  • The only caveat with yours is that it is not order preserving. But, still lightning fast. Commented Jun 15, 2018 at 20:47
  • @rcollyer Yeah, there is a price for everything you know... ;) Commented Jun 15, 2018 at 20:49

Here are a couple of alternatives using other functions.

In[64]:= moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
In[86]:= AbsoluteTiming[DeleteDuplicates[moredata, 
 {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];]
Out[86]= {4.73677, Null}
(* Marco's DeleteDuplicates *)
In[88]:= AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
Out[88]= {0.044251, Null}
(* a Reap/Sow based solution *)
In[80]:= Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]]; // AbsoluteTiming
Out[80]= {0.161927, Null}
(* GroupBy using the third argument to reduce the results *)
In[82]:= Values@GroupBy[moredata, (#[[;; 3]] &), First]; // AbsoluteTiming
Out[82]= {0.041241, Null}
(* GroupBy using Part instead of First *)
In[81]:= Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
Out[81]= {0.038312, Null}
answered Jun 15, 2018 at 20:29
  • The GroupBy approach is pretty cool. I would not have thought of that! Commented Jun 15, 2018 at 20:40
  • @MarcoB it's the upgrade of this SO answer. Commented Jun 15, 2018 at 20:43

Here is a refinement of @rcollyer's answer using my GatherByList function. The function is short; the UpValue intercepts the Map[func, list] call that GatherBy makes internally and returns the precomputed list of representatives instead:

GatherByList[list_, representatives_] := Module[{func},
 func /: Map[func, _] := representatives;
 GatherBy[list, func]
]

And using GatherByList:

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
(* MarcoB *)
r1 = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; //AbsoluteTiming
(* rcollyer *)
r2 = Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
(* GatherByList *)
r3 = GatherByList[moredata, moredata[[All,;;3]]][[All, 1]]; //AbsoluteTiming
r1 === r2 === r3

{0.047484, Null}

{0.043889, Null}

{0.029294, Null}

True

Not as fast as @Henrik's answer, of course.

answered Jun 16, 2018 at 15:56