3
$\begingroup$

What's the best way to represent sets that support the following two operations:

  1. Insert(s, i) - adds nonnegative integer i to set s
  2. Equal(s1, s2) - Tests if s1 and s2 are the same set.

In addition, I have an upper bound on the integers added to each set. That is, I know $N$ such that 0ドル \le i < N$.

I'm also willing to accept that Equal is only approximate. ie Equal(s1,s2) could be true even if s1 != s2. I guess in that case I'm looking for a hash function. Does anyone know a good hash function?

We can assume that an integer is only inserted once into a set. $N$ can be about 3-5 million, and the number of sets I want to test for equality can be in the thousands.

asked May 15, 2014 at 14:40
$\endgroup$

5 Answers 5

2
$\begingroup$

Let $F$ be a finite field with efficient procedures for addition and
multiplication and almost-uniform sampling. $\:$ Let $z$ be a field element.
(I use that letter because things will be slightly more convenient if $z$ is
the zero element, although efficient sampling means that even if one
can't find the zero of $F\hspace{-0.03 in},ドル one should still be able to find a field element.)

To initialize the data structure, sample a field element $x$.
Empty sets are represented by $z,ドル Insert(s,i) adds $x^i$ to the representation of s,
and Equal(s1,s2) just tests if the two representing field elements are equal.
Since addition is associative and commutative, there are no false negatives. $\:$ If each integer is only inserted once into a set, then the probability of there being at least one false positive is at most
the distance from $x$'s distribution to uniform $\;+\;$ $((N\cdot \text{number of equality tests})\hspace{.02 in}/($$\hspace{-0.03 in}\operatorname{card}$$(F\hspace{.03 in})) \;\;$.

To efficiently compute $x^i\hspace{-0.04 in},ドル see this paper, and either avoid the methods that involve negative
powers or make slightly stronger assumptions on $F$. $\:$ I would recommend 5.1 on page 14.

answered May 15, 2014 at 19:58
$\endgroup$
2
  • $\begingroup$ This assumes that there are no duplicates. Also, this is just a special case of the method of adding the outputs of a hash function. $\endgroup$ Commented May 15, 2014 at 20:02
  • 1
    $\begingroup$ See the OP: "We can assume that an integer is only inserted once into a set." $\:$ It's a quite useful special case, since it gives provable error bounds. $\;\;\;\;$ $\endgroup$ Commented May 15, 2014 at 20:06
1
$\begingroup$

If you are not allowed to add the same value more than once, then the insert operation can add the hash value of $i$ to some accumulator associated with $s,ドル and the equality testing can compare the accumulators.

If you are allowed to add the same value twice, you can use Bloom filters.

answered May 15, 2014 at 14:47
$\endgroup$
7
  • $\begingroup$ Isn't a bloom filter used to represent a single set, and efficiently test membership in that set? In my problem I have multiple sets and want to test if two sets are equal. $\endgroup$ Commented May 15, 2014 at 14:57
  • $\begingroup$ You'll use one Bloom filter per set. $\endgroup$ Commented May 15, 2014 at 15:19
  • $\begingroup$ Why would I want to use a Bloom filter for each set, when I don't care about testing membership in the set? That seems like overkill to me. $\endgroup$ Commented May 15, 2014 at 16:57
  • $\begingroup$ @foobarbazquxx : $\;\;\;$ If the data structure needs to support inserting the same value more than once to the same set, then the data structure must suffice for testing membership. $\:$ (Just use Equal(Insert(s,i),s).) $\:$ If the data structure does not need to support that (or the data structure is allowed to represent multisets instead of sets) then you can use less memory per set. $\;\;\;\;\;\;\;$ $\endgroup$ Commented May 15, 2014 at 18:04
  • $\begingroup$ @foobarbazquxx Bloom filters are just a different way of using hashing, generalizing Juho's approach in their answer. With the correct parameters they sound like a reasonable approach to me. $\endgroup$ Commented May 15, 2014 at 18:19
1
$\begingroup$

Assuming $N$ is not too large, consider using a bitvector. For example, you can use a single integer type as a data structure, or an array of bits. Insertion can be done in constant time, and equality checking is very fast as well.

answered May 15, 2014 at 17:07
$\endgroup$
2
  • $\begingroup$ This solution works even if there are duplicate elements. Just use bitwise OR. If $N$ is large then you can replace your bitvector with a Bloom filter (in which indices are decided by applying a hash function). $\endgroup$ Commented May 15, 2014 at 18:17
  • $\begingroup$ N is on the order of 3 to 5 million $\endgroup$ Commented May 15, 2014 at 19:01
1
$\begingroup$

I'll add an additional suggestion based on @Mihai's answer to this similar cstheory.se question.

If each object in the set has a unique hash, then any commutative operation on the hashes of the members of the set will produce a decent insertion-order independent hash of the entire set. @Yuval's Bloom filters can be viewed as using the bitwise "or" operator. @Juho's approach is also using the bitwise "or" operator and the element hash is just the "one-hot" encoding of each member of the set. @Mihai is suggesting that the encoding of a member $i$ of the set be $a^i \mod p,ドル for some preselected $a$ that is relatively prime to $p,ドル and that the commutative operator be addition $\mod p$. If $p$ is larger than $N$ then the probability of a false equality between two different sets is $O(N/p)$ (according to @Mihai's answer). I'm sure that there are situations in which bitwise "xor" works as the commutative operator too.

answered May 15, 2014 at 20:29
$\endgroup$
2
  • 1
    $\begingroup$ $\operatorname{lg}_2$ is strictly increasing, so the sentence involving it can be simplified. $\;$ $\endgroup$ Commented May 15, 2014 at 21:33
  • $\begingroup$ Won't using bitwise OR eventually converge the entire bitset to be set to 1? $\endgroup$ Commented Jan 30, 2022 at 14:18
0
$\begingroup$

for bitvectors the number of items is not really important, 5 million bits can be stored in 1 megabyte. you should look at the total length of the biggest number or find a mapping into smaller numbers.

For comparison you can calculate the hash over this bitvector.

answered May 15, 2014 at 21:39
$\endgroup$
1
  • $\begingroup$ Every number between 0 and N can appear in a set, so I'm not sure what you mean by mapping into smaller numbers. If each set requires a MB bitvector and I want to have several thousand sets I'll need several GBs of memory. Unfortunately, that's too expensive, so some kind of hash seems necessary. $\endgroup$ Commented May 16, 2014 at 13:35

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.