I have a matrix data1 with 4 columns and would like to remove duplicate rows by looking only at the first 3 columns, so I used the code below:
data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];
select1 = Timing@DeleteDuplicates[data1, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &]
The code works, but as the range of values and the number of rows grow, it slows down very quickly. There must be a quicker way, because if I drop the column that is not being compared before calling DeleteDuplicates, it runs more than 10 times faster:
select2 = Timing@DeleteDuplicates@data1[[All, 1 ;; 3]]
The drawback of select2 is that I lose the fourth column. Of course I could add the lost column back to select2, but the code becomes very clumsy, and I wonder whether there is some other way to get it done.
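As a rough sketch of what I mean by clumsy (hypothetical code, using SelectFirst to look the fourth column up again for every kept key):

select2 = DeleteDuplicates[data1[[All, 1 ;; 3]]];
(* re-scan data1 for each kept key to recover the fourth column; awkward and slow *)
select2full = Function[key, SelectFirst[data1, #[[1 ;; 3]] == key &]] /@ select2;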
Many thanks!
4 Answers
Use DeleteDuplicatesBy with an appropriate function that selects only the first three columns:
moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
(* Yours *)
AbsoluteTiming[
DeleteDuplicates[moredata, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];
]
(* 5.38 s *)
(* Slightly improved, still using DeleteDuplicates *)
AbsoluteTiming[DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &];]
(* 2.44 s *)
(* DeleteDuplicatesBy *)
AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
(* 0.05 s *)
The latter takes 0.05 seconds whereas your original formulation takes roughly 5 seconds, so a 100x speedup.
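As a quick sanity check (a verification worth running, not shown in the timing runs above), the two formulations should return exactly the same rows in the same order, since both keep the first occurrence of each key:

DeleteDuplicatesBy[moredata, #[[;; 3]] &] ===
 DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &]
(* True *)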
This is not meant entirely seriously, but in some performance-critical situations, SparseArrays can perform miracles.
moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
a = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; // RepeatedTiming // First
b = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
With[{p = moredata[[All, 1 ;; 3]] + 1},
SparseArray[p -> Range[Length[p]], Max /@ Transpose[p]]
]
]; // RepeatedTiming // First
Sort@a == Sort@b
0.041
0.0039
True
Note that you can save even more time if you know the maximal dimensions of the SparseArray in advance.
c = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
With[{p = moredata[[All, 1 ;; 3]] + 1},
SparseArray[p -> Range[Length[p]], {3,3,3}]
]
]; // RepeatedTiming // First
0.00335
I'd like to add that my approach exploits several assumptions, so it has some severe limitations:
- All entries in moredata[[All, 1 ;; 3]] are nonnegative integers, so that (after adding 1) they can be used as the sparsity pattern of a SparseArray; see the small illustration below.
- The ordering of the result does not really matter to you.
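A minimal illustration of the first point (with hypothetical keys, not the data from the question): the + 1 merely shifts the 0..2 entries to valid 1-based positions, so for key columns over a different known integer range one would shift and size the array accordingly.

keys = RandomInteger[{-5, 5}, {1000, 3}]; (* hypothetical key columns with entries in -5 .. 5 *)
sa = SparseArray[(keys + 6) -> Range[Length[keys]], {11, 11, 11}]; (* shift by 6 instead of 1 *)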
- What's bla in Range[Length[bla]]? – rcollyer, Jun 15, 2018 at 20:42
- Sorry. My generic variable when I am uncreative. I forgot to rename it. Fixed it. – Henrik Schumacher, Jun 15, 2018 at 20:43
- Thanks. I'll test it now. :) – rcollyer, Jun 15, 2018 at 20:44
- The only caveat with yours is that it is not order preserving. But, still lightning fast. – rcollyer, Jun 15, 2018 at 20:47
- @rcollyer Yeah, there is a price for everything you know... ;) – Henrik Schumacher, Jun 15, 2018 at 20:49
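If the original row order does matter, a sketch along the following lines should recover it: keep the stored row indices and sort them. This assumes that SparseArray uses the first rule given for a repeated position, which the Sort@a == Sort@b check above is consistent with.

keepIdx = With[{p = moredata[[All, 1 ;; 3]] + 1},
   SparseArray[p -> Range[Length[p]], {3, 3, 3}]["NonzeroValues"]
   ];
ordered = moredata[[Sort[keepIdx]]]; (* kept rows, back in their original order *)
ordered === a (* expected True under the first-rule-wins assumption *)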
Here are a couple of alternatives using other functions.
In[64]:= moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
In[86]:= AbsoluteTiming[DeleteDuplicates[moredata,
{#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];]
Out[86]= {4.73677, Null}
(* MarcoB's DeleteDuplicatesBy *)
In[88]:= AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
Out[88]= {0.044251, Null}
(* a Reap/Sow based solution *)
In[80]:= Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]]; // AbsoluteTiming
Out[80]= {0.161927, Null}
(* GroupBy using the third argument to reduce the results *)
In[82]:= Values@GroupBy[moredata, (#[[;; 3]] &), First]; // AbsoluteTiming
Out[82]= {0.041241, Null}
(* GroupBy using Part instead of First *)
In[81]:= Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
Out[81]= {0.038312, Null}
- The GroupBy approach is pretty cool. I would not have thought of that! – MarcoB, Jun 15, 2018 at 20:40
Here is a refinement of @rcollyer's answer using my GatherByList function. The function is short:
GatherByList[list_, representatives_] := Module[{func},
 (* GatherBy[list, func] maps func over list internally; this UpValue intercepts
    that Map and returns the precomputed representatives, so list is gathered
    according to them without ever evaluating func on the elements *)
 func /: Map[func, _] := representatives;
 GatherBy[list, func]
]
And using GatherByList:
moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
(* MarcoB *)
r1 = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; //AbsoluteTiming
(* rcollyer *)
r2 = Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
(* GatherByList *)
r3 = GatherByList[moredata, moredata[[All,;;3]]][[All, 1]]; //AbsoluteTiming
r1 === r2 === r3
{0.047484, Null}
{0.043889, Null}
{0.029294, Null}
True
Not as fast as @Henrik's answer, of course.
- DeleteDuplicatesBy[data1, #[[;; 3]] &] is certainly more readable, and I would expect faster as well.
- The DeleteDuplicatesBy version still edges out a narrow win over the faster of your two GroupBy approaches, at least on my machine :-)