What's the best way to represent sets that support the following two operations:
- Insert(s, i) - adds nonnegative integer i to set s
- Equal(s1, s2) - tests whether s1 and s2 are the same set
In addition, I have an upper bound on the integers added to each set. That is, I know $N$ such that 0ドル \le i < N$.
I'm also willing to accept that Equal is only approximate; i.e., Equal(s1, s2) could return true even if s1 != s2. I guess in that case I'm looking for a hash function. Does anyone know a good one?
We can assume that an integer is only inserted once into a set. $N$ can be about 3-5 million, and the number of sets I want to test for equality can be in the thousands.
5 Answers
Let $F$ be a finite field with efficient procedures for addition, multiplication, and almost-uniform sampling. Let $z$ be a field element. (I use that letter because things will be slightly more convenient if $z$ is the zero element, although efficient sampling means that even if one can't find the zero of $F$, one should still be able to find a field element.)
To initialize the data structure, sample a field element $x$. The empty set is represented by $z$, Insert(s, i) adds $x^i$ to the representation of $s$, and Equal(s1, s2) just tests whether the two representing field elements are equal.
Since addition is associative and commutative, there are no false negatives. If each integer is only inserted once into a set, then the probability of at least one false positive is at most the distance of $x$'s distribution from uniform, plus $(N \cdot \text{number of equality tests}) / \operatorname{card}(F)$.
To efficiently compute $x^i$, see this paper, and either avoid the methods that involve negative powers or make slightly stronger assumptions on $F$. I would recommend Section 5.1 on page 14.
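A minimal Python sketch of this scheme, taking $F$ to be the prime field $GF(p)$ for the Mersenne prime $p = 2^{61} - 1$ (an illustrative choice; any sufficiently large finite field with efficient arithmetic works):

```python
import random

P = (1 << 61) - 1 # Mersenne prime 2^61 - 1, so F = GF(P) and card(F) = P

def sample_x(seed=None):
 """Sample the random field element x once, at initialization."""
 return random.Random(seed).randrange(1, P)

def empty():
 return 0 # z = 0 represents the empty set

def insert(s, i, x):
 return (s + pow(x, i, P)) % P # add x^i to the representation of s

def equal(s1, s2):
 return s1 == s2
```

Because field addition is commutative, inserting the same elements in any order yields the same representation, and two different sets collide with probability roughly $N \cdot \text{tests} / P$, which is negligible here.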
- This assumes that there are no duplicates. Also, this is just a special case of the method of adding the outputs of a hash function. (Yuval Filmus, May 15, 2014 at 20:02)
- See the OP: "We can assume that an integer is only inserted once into a set." It's a quite useful special case, since it gives provable error bounds. (user12859, May 15, 2014 at 20:06)
If you are not allowed to add the same value more than once, then the insert operation can add the hash value of $i$ to some accumulator associated with $s,ドル and the equality testing can compare the accumulators.
If you are allowed to add the same value twice, you can use Bloom filters.
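A sketch of the accumulator idea, using SHA-256 as the element hash and addition modulo a large prime as the accumulator update (both illustrative choices, not prescribed by the answer):

```python
import hashlib

M = (1 << 127) - 1 # a large Mersenne prime used as the accumulator modulus

def h(i):
 """Hash an integer to a residue mod M (SHA-256 is an illustrative choice)."""
 digest = hashlib.sha256(str(i).encode()).digest()
 return int.from_bytes(digest, "big") % M

def insert(acc, i):
 return (acc + h(i)) % M # accumulators start at 0 for the empty set

def equal(acc1, acc2):
 return acc1 == acc2
```

Each set is then a single 128-bit accumulator, regardless of how many elements it contains.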
- Isn't a Bloom filter used to represent a single set, and efficiently test membership in that set? In my problem I have multiple sets and want to test if two sets are equal. (foobarbazquxx, May 15, 2014 at 14:57)
- You'll use one Bloom filter per set. (Yuval Filmus, May 15, 2014 at 15:19)
- Why would I want to use a Bloom filter for each set, when I don't care about testing membership in the set? That seems like overkill to me. (foobarbazquxx, May 15, 2014 at 16:57)
- @foobarbazquxx: If the data structure needs to support inserting the same value more than once into the same set, then the data structure must suffice for testing membership. (Just use Equal(Insert(s,i),s).) If it does not need to support that (or it is allowed to represent multisets instead of sets), then you can use less memory per set. (user12859, May 15, 2014 at 18:04)
- @foobarbazquxx Bloom filters are just a different way of using hashing, generalizing Juho's approach in their answer. With the correct parameters they sound like a reasonable approach to me. (Yuval Filmus, May 15, 2014 at 18:19)
Assuming $N$ is not too large, consider using a bitvector. For example, you can use a single integer type as a data structure, or an array of bits. Insertion can be done in constant time, and equality checking is very fast as well.
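A minimal sketch using a Python int as the bitvector (arbitrary precision, so it holds $N$ bits in roughly $N/8$ bytes per set):

```python
def insert(s, i):
 return s | (1 << i) # set bit i; idempotent, so duplicate inserts are harmless

def equal(s1, s2):
 return s1 == s2 # exact equality, no false positives

# A set starts as 0 (all bits clear).
```

Unlike the hashing approaches, this representation is exact, at the cost of memory proportional to $N$ per set.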
- This solution works even if there are duplicate elements: just use bitwise OR. If $N$ is large, you can replace the bitvector with a Bloom filter (in which indices are decided by applying a hash function). (Yuval Filmus, May 15, 2014 at 18:17)
- $N$ is on the order of 3 to 5 million. (foobarbazquxx, May 15, 2014 at 19:01)
I'll add an additional suggestion based on @Mihai's answer to this similar cstheory.se question.
If each object in the set has a unique hash, then any commutative operation on the hashes of the members of the set will produce a decent insertion-order-independent hash of the entire set. @Yuval's Bloom filters can be viewed as using the bitwise "or" operator. @Juho's approach also uses bitwise "or", with each member's hash being the "one-hot" encoding of that member.

@Mihai suggests encoding a member $i$ of the set as $a^i \bmod p$, for some preselected $a$ that is relatively prime to $p$, with addition $\bmod p$ as the commutative operator. If $p$ is larger than $N$, then the probability of a false equality between two different sets is $O(N/p)$ (according to @Mihai's answer). I'm sure that there are situations in which bitwise "xor" works as the commutative operator too.
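For concreteness, a sketch of the "xor" variant mentioned at the end, with SHA-256 as the per-element hash (an illustrative choice). Note that XOR relies on each element being inserted at most once, since a second insert of the same element cancels the first:

```python
import hashlib

def h(i):
 """Per-element hash (SHA-256 here; any well-distributed hash works)."""
 return int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big")

def insert(acc, i):
 return acc ^ h(i) # the empty set is acc == 0

def equal(acc1, acc2):
 return acc1 == acc2
```

The cancellation property is easy to see: inserting the same element twice returns the accumulator to its previous value, which is exactly why this operator is only safe under the OP's no-duplicates assumption.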
- $\operatorname{lg}_2$ is strictly increasing, so the sentence involving it can be simplified. (user12859, May 15, 2014 at 21:33)
- Won't using bitwise OR eventually converge the entire bitset to all 1s? (zetaprime, Jan 30, 2022 at 14:18)
For bitvectors, the number of items is not really important: 5 million bits fit in under 1 megabyte. What matters is the magnitude of the biggest number, so either bound that or find a mapping into smaller numbers.
For comparison, you can calculate a hash over this bitvector.
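A sketch of that suggestion: build each set as a bitvector, then compare fixed-size digests instead of the megabyte-sized vectors (SHA-256 and the byte width are illustrative choices):

```python
import hashlib

N = 5_000_000 # upper bound on the integers
NUM_BYTES = (N + 7) // 8 # bytes needed to hold the bitvector

def insert(bits, i):
 return bits | (1 << i) # bits is a Python int used as a bitvector over [0, N)

def digest(bits):
 """Fixed-size fingerprint of the whole bitvector, for cheap comparisons."""
 return hashlib.sha256(bits.to_bytes(NUM_BYTES, "little")).digest()
```

The full bitvector is still needed while a set is being built, but once a set is final, only the 32-byte digest has to be retained per set for equality testing.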
- Every number between 0 and $N$ can appear in a set, so I'm not sure what you mean by mapping into smaller numbers. If each set requires a 1 MB bitvector and I want to have several thousand sets, I'll need several GB of memory. Unfortunately, that's too expensive, so some kind of hash seems necessary. (foobarbazquxx, May 16, 2014 at 13:35)