Improve running speed for DeleteDuplicates

Question 1

How can I make this faster? I want to remove the elements that satisfy the condition below, but DeleteDuplicates with this test is far too slow.

tup1 = Tuples[{{0, 1}, {-1, 0, 1}, {-1, 0, 1}, {0, 1}, {-1, 0, 1}, {-1, 0, 1},
   {0, 1}, {-1, 0, 1}, {-1, 0, 1}, {0, 1}, {-1, 0, 1}, {-1, 0, 1}}];
tup2 = DeleteDuplicates[tup1,
   #1[[1 ;; 3]] == #2[[4 ;; 6]] && #1[[4 ;; 6]] == #2[[1 ;; 3]] &&
    #1[[7 ;; 9]] == #2[[10 ;; 12]] && #1[[10 ;; 12]] == #2[[7 ;; 9]] &];

Question 2

Here is a semi-imperative way that gives the same result as tup2 from the original question, but much faster:

tup2b = Module[{keep}
, keep[t_] :=
 ( keep[t] = False
 ; keep[t[[{4,5,6,1,2,3,10,11,12,7,8,9}]]] = False
 ; True
 )
; Select[tup1, keep]
];
tup2b === tup2
(* True *)
Length[tup2b] === Length[tup2] === 52650
(* True *)

On my machine, the calculation of tup2 takes a couple of hours whereas the approach shown here is subsecond. This approach also makes it easy to add other equivalence criteria if desired.

How It Works

The function keep is used as a predicate to Select to determine whether to keep each entry of the list. The first time each element is encountered, keep returns True. But as a side-effect it also adds a new definition to keep that records that the entry and its equivalent permutation are no longer to be kept. The new definition will return False if either entry is encountered later in the list. In this way keep effectively maintains a set of entries seen so far (along with their equivalents).

For the example list given in the question, it is not strictly necessary to record that we have seen each unpermuted entry since there are no duplicates. But in the general case that might not be so.
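The mechanism is easier to follow on a small list. The sketch below is only an illustration: it uses pairs instead of 12-element tuples and treats {a, b} as equivalent to {b, a}. The names toy, keepToy and result are made up for this example and do not appear in the answer above.

```mathematica
(* Toy data: {2, 1} is the swapped form of {1, 2}, and {1, 2} also repeats. *)
toy = {{1, 2}, {2, 1}, {3, 3}, {1, 2}, {4, 5}};

result = Module[{keepToy},
   keepToy[t_] := (
     keepToy[t] = False;           (* a later exact repeat is rejected *)
     keepToy[Reverse[t]] = False;  (* so is a later swapped copy *)
     True);                        (* the first occurrence is kept *)
   Select[toy, keepToy]];

result
(* {{1, 2}, {3, 3}, {4, 5}} *)
```

The specific definitions such as keepToy[{2, 1}] = False take precedence over the general keepToy[t_] rule, which is why the second and fourth entries are dropped. Note that {3, 3} is its own swapped form and is still kept once.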

This method scans each entry of the list once, so it runs in time roughly proportional to the length $n$ of the list. Technically, the time is more on the order of $n \ln(n)$ due to the hashing involved in saving and testing entries, but for small $n$ the difference is not that noticeable.

By contrast, and as Leonid Shifrin points out, DeleteDuplicates must in principle compare every pair of elements, so the run time is proportional to $n^2$ -- a much larger number of iterations.
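To put rough numbers on that difference for the list in the question, here is a back-of-the-envelope count. It is a sketch of the worst case only: the pair count is an upper bound on the number of tests, not a measurement of what DeleteDuplicates actually performs.

```mathematica
(* Each triple has 2*3*3 = 18 possible values and a tuple holds four triples. *)
n = 18^4
(* 104976 *)

(* Number of distinct pairs that a pairwise test might have to examine. *)
Binomial[n, 2]
(* 5509927800 *)

(* Pairs per entry, compared with a single lookup per entry for keep. *)
N[Binomial[n, 2]/n]
(* 52487.5 *)
```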

Question 3

I wonder whether Pick might be slightly more readable here.

Question 4

@AccidentalFourierTransform It is probably just a matter of taste, but I think I prefer the Select form over something like Pick[tup1, keep /@ tup1].

Question 5

Sure, it is definitely a matter of taste. My point was that Pick together with a minor modification of keep might be ever so slightly cleaner/more readable. Or using the third argument. Or, being optimistic, perhaps there is a way to avoid having to define keep altogether. But I'm really just thinking out loud so don't take me too seriously :-)

Question 6

+1. It might be worth mentioning that DeleteDuplicates with an explicit predicate uses a quadratic-complexity algorithm based on pairwise comparisons, which explains the timing difference, as well as the existence of DeleteDuplicatesBy as a separate function.

Question 7

I have added the new section How It Works. (also @LeonidShifrin)

Question 8

You can also get tup2 from tup1 using:

1. `Union`

ClearAll[fA]
fA = Union[Sort[{#, #[[{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}]]}] & /@ #][[All, 1]] &;
tup2A = fA @ tup1; // AbsoluteTiming // First

0.213222

Length @ tup2A

52650

2. `DeleteDuplicates`

ClearAll[fB]
fB = DeleteDuplicates[
 Sort[{#, #[[{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}]]}] & /@ #][[All, 1]] &;
tup2B = fB @ tup1; // AbsoluteTiming // First

0.257217

3. `GroupOrbits` + `PermutationGroup`

ClearAll[fC]
fC = GroupOrbits[PermutationGroup[{{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}}], #, 
 Permute][[All, 1]] &;
tup2C = fC @ tup1; // AbsoluteTiming // First

0.640413

4. Memoization

ClearAll[fD]
fD = Module[{f0}, 
 f0[x_] := (f0[x] = f0[x[[{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}]]] = Sequence[]; x); 
 f0 /@ #] &;
tup2D = fD @ tup1; // AbsoluteTiming // First

0.794055

5. `DeleteDuplicatesBy`

ClearAll[fE]
fE = DeleteDuplicatesBy[Sort[{#, #[[{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}]]}]&];
tup2E = fE @ tup1; // AbsoluteTiming // First

1.13389

6. `GroupBy`

ClearAll[fF]
fF = Values @ GroupBy[#, 
 Sort[{#, #[[{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}]]}] &, 
 First] &;
tup2F = fF @ tup1; // AbsoluteTiming // First

1.28655

All six results match tup2b from WReach's answer:

tup2b == tup2A == tup2B == tup2C == tup2D == tup2E == tup2F

True

In comparison, tup2b takes about a second:

tup2b = Module[{keep}, 
 keep[t_] := (keep[t] = False; 
 keep[t[[{4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9}]]] = False; 
 True); Select[tup1, keep]]; // AbsoluteTiming // First

1.06063

Question 9

A tuple is deleted if an earlier tuple equals it after the first two triples are exchanged and the last two triples are exchanged. A tuple that is unchanged by the exchange has no distinct partner and is kept. This means we can construct the undeleted tuples directly, like so:

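tup2 =
 Module[{
   pair = Tuples[Tuples[{{0, 1}, {-1, 0, 1}, {-1, 0, 1}}], 2],
   unique, symmetric
   },
  
  (* one representative of each pair of triples, up to exchange *)
  unique = DeleteDuplicates[pair, #1 === Reverse@#2 &];
  (* pairs whose two triples are equal, so the exchange changes nothing *)
  symmetric = Select[unique, # === Reverse@# &];
  
  (* If the first pair changes under exchange, its representative settles the
     choice and the second pair is free; otherwise the second pair decides. *)
  Flatten /@ 
   Join[Tuples[{Complement[unique, symmetric], pair}], 
    Tuples[{symmetric, unique}]]];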

Takes about a tenth of a second on my laptop.

UPDATE: If you want to use the existing tup1 and delete the duplicates, you can keep the ones that appear in tup2 like so:

tup3 = Intersection[tup1, tup2];

which is very fast. If for some reason you need to keep tup1 in the original order, you can do something a bit slower:

tup4 = Select[tup1, AssociationThread[tup2 -> True]];

This takes another 0.2 seconds on my laptop.

Given the way the problem is stated, it's much easier to construct tuples you know are unique than to delete duplicates after the fact.
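The number of tuples that should survive can be cross-checked by counting alone, without building any list of results. The sketch below assumes the equivalence from the question (exchange triples 1 and 2, and exchange triples 3 and 4). The names triples, nTriples, nTuples and nFixed are introduced here for illustration.

```mathematica
triples = Tuples[{{0, 1}, {-1, 0, 1}, {-1, 0, 1}}];

nTriples = Length[triples]   (* possible values of one triple *)
(* 18 *)

nTuples = nTriples^4         (* all tuples of four triples, i.e. Length[tup1] *)
(* 104976 *)

nFixed = nTriples^2          (* tuples unchanged by the exchange:
                                triple 1 == triple 2 and triple 3 == triple 4 *)
(* 324 *)

(* Every other tuple pairs up with exactly one distinct partner, so the
   number of equivalence classes is: *)
(nTuples + nFixed)/2
(* 52650 *)
```

This agrees with the length 52650 reported for tup2b, so any direct construction should produce a list of that length.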

Question 10

I don't get this. Your pair structure is different from mine. Can this method be used if I already have tup1 as above?

The accepted answer, the one beginning "Here is a semi-imperative way", was posted by WReach on 2021-01-05 17:38:15Z.
