What's the best way to represent sets that support the following two operations:
- Insert(s, i) - adds nonnegative integer i to set s
- Equal(s1, s2) - tests whether s1 and s2 are the same set
In addition, I have an upper bound on the integers added to each set. That is, I know $N$ such that 0ドル \le i < N$.
I'm also willing to accept that Equal is only approximate; i.e., Equal(s1, s2) could return true even if s1 != s2. I guess in that case I'm looking for a hash function. Does anyone know a good one?
We can assume that an integer is only inserted once into a set. $N$ can be about 3-5 million, and the number of sets I want to test for equality can be in the thousands.
5 Answers
Let $F$ be a finite field with efficient procedures for addition, multiplication, and almost-uniform sampling. Let $z$ be a field element. (I use that letter because things will be slightly more convenient if $z$ is the zero element, although efficient sampling means that even if one can't find the zero of $F$, one should still be able to find a field element.)
To initialize the data structure, sample a field element $x$. The empty set is represented by $z$, Insert(s, i) adds $x^i$ to the representation of $s$, and Equal(s1, s2) just tests whether the two representing field elements are equal.
Since addition is associative and commutative, there are no false negatives. If each integer is only inserted once into a set, then the probability of at least one false positive is at most the distance of $x$'s distribution from uniform, plus $(N \cdot \text{number of equality tests}) / \operatorname{card}(F)$.
To efficiently compute $x^i$, see this paper, and either avoid the methods that involve negative powers or make slightly stronger assumptions on $F$. I would recommend Section 5.1 on page 14.
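A minimal Python sketch of this scheme, taking $F$ to be the prime field $GF(p)$ for the Mersenne prime $p = 2^{61} - 1$ (an illustrative choice; any sufficiently large finite field with efficient arithmetic works):

```python
import random

P = (1 << 61) - 1 # Mersenne prime 2^61 - 1, so F = GF(P) and card(F) = P

def sample_x(seed=None):
 """Sample the random field element x once, at initialization."""
 return random.Random(seed).randrange(1, P)

def empty():
 return 0 # z = 0 represents the empty set

def insert(s, i, x):
 return (s + pow(x, i, P)) % P # add x^i to the representation of s

def equal(s1, s2):
 return s1 == s2
```

Because field addition is commutative, inserting the same elements in any order yields the same representation, and two different sets collide with probability roughly $N \cdot \text{tests} / P$, which is negligible here.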
- This assumes that there are no duplicates. Also, this is just a special case of the method of adding the outputs of a hash function. (Yuval Filmus, May 15, 2014 at 20:02)
- See the OP: "We can assume that an integer is only inserted once into a set." It's a quite useful special case, since it gives provable error bounds. (user12859, May 15, 2014 at 20:06)
If you are not allowed to add the same value more than once, then the insert operation can add the hash value of $i$ to some accumulator associated with $s,ドル and the equality testing can compare the accumulators.
If you are allowed to add the same value twice, you can use Bloom filters.
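A sketch of the accumulator idea, using SHA-256 as the element hash and addition modulo a large prime as the accumulator update (both illustrative choices, not prescribed by the answer):

```python
import hashlib

M = (1 << 127) - 1 # a large Mersenne prime used as the accumulator modulus

def h(i):
 """Hash an integer to a residue mod M (SHA-256 is an illustrative choice)."""
 digest = hashlib.sha256(str(i).encode()).digest()
 return int.from_bytes(digest, "big") % M

def insert(acc, i):
 return (acc + h(i)) % M # accumulators start at 0 for the empty set

def equal(acc1, acc2):
 return acc1 == acc2
```

Each set is then a single 128-bit accumulator, regardless of how many elements it contains.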
- Isn't a Bloom filter used to represent a single set, and efficiently test membership in that set? In my problem I have multiple sets and want to test if two sets are equal. (foobarbazquxx, May 15, 2014 at 14:57)
- You'll use one Bloom filter per set. (Yuval Filmus, May 15, 2014 at 15:19)
- Why would I want to use a Bloom filter for each set, when I don't care about testing membership in the set? That seems like overkill to me. (foobarbazquxx, May 15, 2014 at 16:57)
- @foobarbazquxx: If the data structure needs to support inserting the same value more than once into the same set, then the data structure must suffice for testing membership. (Just use Equal(Insert(s,i),s).) If it does not need to support that (or it is allowed to represent multisets instead of sets), then you can use less memory per set. (user12859, May 15, 2014 at 18:04)
- @foobarbazquxx Bloom filters are just a different way of using hashing, generalizing Juho's approach in their answer. With the correct parameters they sound like a reasonable approach to me. (Yuval Filmus, May 15, 2014 at 18:19)
Assuming $N$ is not too large, consider using a bitvector. For example, you can use a single integer type as a data structure, or an array of bits. Insertion can be done in constant time, and equality checking is very fast as well.
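A minimal sketch using a Python int as the bitvector (arbitrary precision, so it holds $N$ bits in roughly $N/8$ bytes per set):

```python
def insert(s, i):
 return s | (1 << i) # set bit i; idempotent, so duplicate inserts are harmless

def equal(s1, s2):
 return s1 == s2 # exact equality, no false positives

# A set starts as 0 (all bits clear).
```

Unlike the hashing approaches, this representation is exact, at the cost of memory proportional to $N$ per set.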
- This solution works even if there are duplicate elements: just use bitwise OR. If $N$ is large, you can replace the bitvector with a Bloom filter (in which indices are decided by applying a hash function). (Yuval Filmus, May 15, 2014 at 18:17)
- $N$ is on the order of 3 to 5 million. (foobarbazquxx, May 15, 2014 at 19:01)
I'll add an additional suggestion based on @Mihai's answer to this similar cstheory.se question.
If each object in the set has a unique hash, then any commutative operation on the hashes of the members of the set will produce a decent insertion-order-independent hash of the entire set. @Yuval's Bloom filters can be viewed as using the bitwise "or" operator. @Juho's approach also uses bitwise "or", with each member's hash being the "one-hot" encoding of that member.

@Mihai suggests encoding a member $i$ of the set as $a^i \bmod p$, for some preselected $a$ that is relatively prime to $p$, with addition $\bmod p$ as the commutative operator. If $p$ is larger than $N$, then the probability of a false equality between two different sets is $O(N/p)$ (according to @Mihai's answer). I'm sure that there are situations in which bitwise "xor" works as the commutative operator too.
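For concreteness, a sketch of the "xor" variant mentioned at the end, with SHA-256 as the per-element hash (an illustrative choice). Note that XOR relies on each element being inserted at most once, since a second insert of the same element cancels the first:

```python
import hashlib

def h(i):
 """Per-element hash (SHA-256 here; any well-distributed hash works)."""
 return int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big")

def insert(acc, i):
 return acc ^ h(i) # the empty set is acc == 0

def equal(acc1, acc2):
 return acc1 == acc2
```

The cancellation property is easy to see: inserting the same element twice returns the accumulator to its previous value, which is exactly why this operator is only safe under the OP's no-duplicates assumption.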
- $\operatorname{lg}_2$ is strictly increasing, so the sentence involving it can be simplified. (user12859, May 15, 2014 at 21:33)
- Won't using bitwise OR eventually converge the entire bitset to all 1s? (zetaprime, Jan 30, 2022 at 14:18)
For bitvectors, the number of items is not really important: 5 million bits fit in under 1 megabyte. What matters is the magnitude of the biggest number, so either bound that or find a mapping into smaller numbers.
For comparison, you can calculate a hash over this bitvector.
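A sketch of that suggestion: build each set as a bitvector, then compare fixed-size digests instead of the megabyte-sized vectors (SHA-256 and the byte width are illustrative choices):

```python
import hashlib

N = 5_000_000 # upper bound on the integers
NUM_BYTES = (N + 7) // 8 # bytes needed to hold the bitvector

def insert(bits, i):
 return bits | (1 << i) # bits is a Python int used as a bitvector over [0, N)

def digest(bits):
 """Fixed-size fingerprint of the whole bitvector, for cheap comparisons."""
 return hashlib.sha256(bits.to_bytes(NUM_BYTES, "little")).digest()
```

The full bitvector is still needed while a set is being built, but once a set is final, only the 32-byte digest has to be retained per set for equality testing.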
- Every number between 0 and $N$ can appear in a set, so I'm not sure what you mean by mapping into smaller numbers. If each set requires a 1 MB bitvector and I want to have several thousand sets, I'll need several GB of memory. Unfortunately, that's too expensive, so some kind of hash seems necessary. (foobarbazquxx, May 16, 2014 at 13:35)