$$
\begin{align*}
\Pr\lbrace Y_j = 1 \lor Z_j = 1\rbrace
&= 1 - \Pr\Bigr\lbrace(Y_j = 0) \land (Z_j = 0)\Bigr\rbrace.
\end{align*}
$$
By independence:
$$
\Pr\lbrace Y_j = 1 \lor Z_j = 1\rbrace = 1 - \Pr\lbrace Y_j = 0\rbrace \Pr\lbrace Z_j = 0\rbrace,
$$
and since $h$ is uniform over $\lbrace0,1\rbrace^n,ドル each bit is 0 with probability 1ドル/2$:
$$
\Pr\lbrace Y_j = 1 \lor Z_j = 1\rbrace = 1 - 2^{-2}.
$$
Generalizing to $k$ unions of unique singletons:
$$
\Pr\lbrace X_j = 1\rbrace = 1 - 2^{-k}.
$$
Using the approximation $(1 - y) \approx e^{-y}$ for small $y,ドル with $y = 2^{-k}$:
$$
\Pr\lbrace X_j = 1\rbrace \approx e^{-1/2^k}.
$$
The probability that all bits are 1:
$$
\Pr\lbrace X = 1^n\rbrace = \prod_{j=1}^{n} \Pr\lbrace X_j = 1\rbrace = \bigl(1 - 2^{-k}\bigr)^n \approx e^{-n/2^k}.
$$
Since $e^{-n/2^k} \to 1$ as $k \to \infty,ドル the union of $k$ singletons converges in probability to the universal set 1ドル^n$. At that point, further unions change nothing.
False Negatives and False Positives
Suppose we have a set $W$ and want to test if a randomly chosen $x \in X$ is a member. Let $h$ be the hash of $x$ with $j$-th bit $h_j$. If $x$ actually is in $W,ドル all bits set to 1 in $F x$ are also 1 in $F W$ by construction. The false negative rate is 0ドル$.
If $x$ is not in $W,ドル a false positive may occur. For a false positive, every bit where $h_j = 1$ must also be 1 in $W_j$. This is logical implication:
$$
h_j = 1 \implies W_j = 1.
$$
Equivalently:
$$
\lnot ( h_j = 1 \land W_j = 0 ).
$$
The probability:
$$
\begin{align*}
\Pr\lbrace\lnot ( h_j = 1 \land W_j = 0)\rbrace
&= 1 - \Pr\lbrace h_j = 1 \rbrace \Pr\lbrace W_j = 0\rbrace.
\end{align*}
$$
With $h$ uniform, $\Pr\lbrace h_j = 1\rbrace = 1/2$. After $k$ unions, $\Pr\lbrace W_j = 0\rbrace = 2^{-k}$. Therefore:
$$
\Pr\lbrace\lnot ( h_j = 1 \land W_j = 0)\rbrace = 1 - 2^{-(k+1)}.
$$
For a false positive, this must hold for all $n$ bit positions:
$$
\begin{align*}
\varepsilon
&= \Pr\lbrace\text{$x \in W$ is a false positive}\rbrace\
&= \prod_{j=1}^{n} (1 - 2^{-(k+1)})\
&= (1 - 2^{-(k+1)})^n.
\end{align*}
$$
Asymptotic False Positive Rate
Using the approximation $(1 - y) \approx e^{-y}$ with $y = 2^{-(k+1)}$:
$$
\varepsilon_{\in} \approx e^{-n 2^{-(k+1)}},
$$
which holds even for relatively small $k$. In Big-O notation, fixing $n$:
$$
\varepsilon_{\in} = e^{\mathcal{O}(2^{-k})},
$$
meaning $\varepsilon$ approaches 1 exponentially fast as $k$ increases.
Space Complexity
Rewriting $n$ as a function of $\varepsilon$ and $k$:
$$
n = \frac{\log \varepsilon}{\log(1 - 2^{-(k+1)})}.
$$
For small $x,ドル $\log(1 - x) \approx -x,ドル so:
$$
n \approx \frac{\log \varepsilon}{2^{-(k+1)}}.
$$
Holding $\varepsilon$ constant:
$$
n = \mathcal{O}(2^k),
$$
meaning the number of bits must grow exponentially with the set size to maintain a fixed error rate. This limits the scheme to very small sets.
The false positive probability depends on both $k$ (set size) and $n$ (bits in the representation). Unsurprisingly, false positives increase with $k$ and decrease with $n,ドル but now we have a model that quantifies the relationship.
False Positive Rate Analysis for Different Byte and Kilobyte Ranges
In Figure 1, the false positive rate decreases exponentially as the byte size of the hash increases for smaller sets of elements (k=4 to k=10). In Figure 2, we observe the false positive rate behavior over larger kilobyte sizes for larger sets (k=12 to k=16). The green dashed line represents the 5% false positive rate threshold. As k increases, achieving this threshold requires significantly more space, highlighting the trade-off between set size and hash size.
Subset Relation
The subset relation predicate:
$$
\subseteq_A : 2^{X^{\ast}} \mapsto 2^{X^{\ast}} \mapsto \mathrm{bool}
$$
defined as
$$
a \subseteq_A b := \forall x \in a, x \in b.
$$
In approximate Boolean algebra $B$:
$$
\subseteq_B : \lbrace0,1\rbrace^n \times \lbrace0,1\rbrace^n \mapsto \lbrace0,1\rbrace
$$
defined as
$$
a \subseteq_B b := a | b = a,
$$
which looks identical to the set-indicator function but has different error probabilities.
Let $W$ and $X$ be bit strings for $F W$ and $F X$. The probability that $W$ is a subset of $X$ requires $W_j = 1 \implies X_j = 1$ for all $j$. Each is logical implication:
$$
W_j = 1 \implies X_j = 1,
$$
equivalently
$$
\lnot ( W_j = 1 \land X_j = 0 ).
$$
The probability:
$$
\begin{align*}
\Pr\lbrace\lnot ( W_j = 1 \land X_j = 0)\rbrace
&= 1 - \Pr\lbrace W_j = 1 \rbrace \Pr\lbrace X_j = 0\rbrace \
&= 1 - (1 - 2^{-k_1}) 2^{-k_2},
\end{align*}
$$
where $k_1$ and $k_2$ are the sizes of $W$ and $X$.
For a false positive on $W \subseteq X,ドル this must hold for all $n$ positions:
$$
\begin{align*}
\varepsilon
&= \Pr\lbrace\text{$W \subseteq X$ is a false positive}\rbrace\
&= \bigl(1 - (1 - 2^{-k_1}) 2^{-k_2}\bigr)^n.
\end{align*}
$$
Approximating:
$$
\varepsilon_{\subseteq} \approx e^{-n (1 - 2^{-k_1}) 2^{-k_2}},
$$
and applying the approximation again to 1ドル - 2^{-k_1}$:
$$
\varepsilon_{\subseteq} \approx e^{-n e^{-(k_1 + k_2 \log 2)}}.
$$
In Big-O with fixed $n$:
$$
\varepsilon_{\subseteq} = e^{\mathcal{O}(e^{-m})},
$$
where $m = k_1 + k_2 \log 2$. So $\varepsilon_{\subseteq}$ approaches 1 exponentially fast as $m$ increases. Consistent with the membership relation analysis.
Since equality can be written as $A \subseteq B \land B \subseteq A,ドル the false positive rate for equality is the product of the false positive rates for both subset directions.
Two-Level Hashing Scheme
The single-level scheme requires exponentially growing bit strings. To address this, I introduce a two-level scheme that reduces space complexity for practical use.
The construction:
- A hash function outputs $q$ bits.
- The first $w$ bits determine which bin the element goes in (2ドル^w$ bins total).
- The remaining $q-w$ bits represent the element within the bin.
- Total: $n = 2^w(q-w)$ bits.
False Positive Rate
Membership Relation
Assume $k$ elements are in the set and we test membership of $x$. No false negatives, as before. A false positive occurs when:
- With probability 2ドル^{-w},ドル $x$ maps to some particular bin.
- That bin, represented by $n-w$ bits, has an expected $k / 2^w$ elements.
- Using the earlier derivation for the per-bin false positive rate:
$$
\varepsilon(k,w,q) = (1 - 2^{-(k / 2^w + 1)})^{q-w}.
$$
Fixing $w$ and $q$:
$$
\varepsilon(k) = e^{\mathcal{O}(2^{-k})},
$$
asymptotically the same as the single-level scheme, but much more practical for reasonably large sets.
Space complexity: $n = 2^w (q-w)$ bits, or $m = 2^w (q-w) / k$ bits per element, for false positive rate $\varepsilon(k,w,q)$.
In Figure 3, we show the false positive rate for different values of $w$ and $q$.
C++ Implementation: Single-Level Hashing Scheme
Here is a C++ implementation modeling the Boolean algebra over trapdoors. I generalize trapdoors to any type $X,ドル parameterized over hash size $N$.
Each trapdoor<X,N> represents a value of type X as an $N$-byte hash using a cryptographic hash function. It models equality as a Bernoulli Boolean with a false positive rate of 2ドル^{-8N}$ and a false negative rate of 0ドル$. In Bernoulli Model notation:
$$
B_{\mathrm{X' \times X' \mapsto \mathrm{bool}}}(=(X',X'))
$$
where $X'$ is a trapdoor of type $X$ and the latent value is the equality predicate.
The trapdoor_set<X,N> class represents an approximate Boolean algebra over trapdoors. It extends trapdoor<X,N> with these operations:
-
empty_trapdoor_set<X,N>() returns the empty set (0ドル_B$).
-
universal_trapdoor_set<X,N>() returns the universal set (1ドル_B$).
-
operator+(trapdoor_set<X,N> const & x, trapdoor_set<X,N> const & y) returns the union ($\lor_B$).
-
operator~(trapdoor_set<X,N> const & x) returns the complement ($\neg_B$).
-
operator*(trapdoor_set<X,N> const & x, trapdoor_set<X,N> const & y) returns the intersection ($\land_B$).
-
empty(trapdoor_set<X,N> const & xs) returns a Bernoulli Boolean for emptiness, with false positive rate $\varepsilon = 2^{-8N}$.
-
in(trapdoor<X,N> const & x, trapdoor_set<X,N> const & xs) returns a Bernoulli Boolean for membership. See the Relational Predicates section for details.
For closure, the class has a hash member function returning the stored hash, enabling composition of operations (e.g., creating a powerset of trapdoor_set objects).
#include <array>
template <typename X, size_t N>
struct trapdoor
{
using value_type = X;
/**
* The constructor initializes the value hash to the given value hash.
* Since the hash is a cryptographic hash, the hash is one-way and so we
* cannot recover the original value from the hash.
*
* This also models the concept of an Oblivious Type, where the true value
* is latent and we permit some subset of operations on it. In this case,
* the only operation we permit is the equality and hashing operations.
*
* @param value_hash The value hash.
*/
trapdoor(std::array<char, N> value_hash) : value_hash(value_hash) {}
/**
* The default constructor initializes the value hash to zero. This value
* often denotes a special value of type X, but not necessarily.
*/
trapdoor() { value_hash.fill(0); }
std::array<char, N> value_hash;
};
/**
* The hash function for the trapdoor class. It returns the value hash.
*/
template <typename X, size_t N>
auto hash(trapdoor const & x)
{
return x.value_hash;
}
/**
* Basic equality operator. It returns a Bernoulli Boolean that represents
* a Boolean value indicating if the trapdoors are equal with a false positive rate
* of 2^{-8 N}. Different specializes of the trapdoor class may have different
* false positive rates, but this is a reasonable default value.
*/
template <typename X, size_t N>
auto operator==(trapdoor const & lhs, trapdoor const & rhs) const
{
return bernoulli<bool>{lhs.value_hash == rhs.value_hash, -8*N};
}
/**
* The `or` operation in the Boolean algebra over bit strings.
*/
template <typename X, size_t N>
auto lor(trapdoor lhs, trapdoor const & rhs)
{
for (size_t i = 0; i < N; ++i)
lhs.value_hash[i] |= rhs.value_hash[i];
return lhs;
}
/**
* The `and` operation in the Boolean algebra over bit strings.
*/
template <typename X, size_t N>
auto land(trapdoor lhs, trapdoor const & rhs)
{
for (size_t i = 0; i < N; ++i)
lhs.value_hash[i] &= rhs.value_hash[i];
return lhs;
}
/**
* The `not` operation in the Boolean algebra over bit strings.
*/
template <typename X, size_t N>
auto lnot(trapdoor x)
{
for (size_t i = 0; i < N; ++i)
lhs.value_hash[i] = ~lhs.value_hash[i];
return lhs;
}
template <typename T>
auto minimum() { return std::numeric_limits<T>::min(); }
template <typename T>
auto maximum() { return std::numeric_limits<T>::max(); }
/**
* The `minimum` operation in the Boolean algebra over trapdoors
*/
template <typename trapdoor<typename X, size_t N>>
auto minimum() { return trapdoor<X,N>(); }
/**
* The `maximum` operation in the Boolean algebra over trapdoors
*/
template <typename trapdoor<typename X, size_t N>>
auto maximum()
{
trapdoor<X,N> x;
x.value_hash.fill(std::numeric_limits<char>::min())
return x;
}
/**
* The trapdoor_set class models an approximate Boolean algebra over trapdoors.
* It is a specialization of the trapdoor class that implements additional operations
* that can take place over the domain of trapdoor and trapdoor_set.
*/
template <typename X, size_t N>
struct trapdoor_set: public trapdoor<X,N>
{
trapdoor_set(
double low_k = 0,
double high_k = std::numeric_limits<double>::infinity()),
std::array<char, N> value_hash)
: trapdoor<trapdoor_set<X,N>,N>(value_hash),
low_k(low_k),
high_k(high_k) {}
trapdoor_set() : trapdoor<trapdoor_set<X,N>,N>(), low_k(0), high_k(0) {}
trapdoor_set(trapdoor_set const &) = default;
trapdoor_set & operator=(trapdoor_set const &) = default;
double low_k;
double high_k;
};
/**
* The size (cardinality) of the latent set (the set the trapdoor_set represents).
*
* @param xs The trapdoor_set.
* @return The size range of the set.
*/
template <typename X, size_t N>
auto size(trapdoor_set<X,N> const & xs)
{
return std::make_pair(xs.low_k, xs.high_k);
}
/**
The union operation.
*/
template <typename X, size_t N>
auto operator+(
trapdoor_set<X,N> const & xs,
trapdoor_set<X,N> const & ys)
{
return trapdoor_set<X,N>(lor(xs, ys).value_hash,
std::max(xs.low_k, ys.low_k), xs.high_k + ys.high_k);
}
template <typename X, size_t N>
void insert(trapdoor<X,N> const & x, trapdoor_set<X,N> & xs)
{
for (size_t i = 0; i < N; ++i)
xs.value_hash[i] |= x.value_hash[i];
// since x could already be in xs, we do not increment the low_k
// only the high_k
xs.high_k += 1;
}
template <typename X, size_t N>
auto operator~(trapdoor_set<X,N> x)
{
for (size_t i = 0; i < N; ++i)
x.value_hash[i] = ~x.value_hash[i];
// assume |x| in [a, b]
// then |~x| has the following analysis.
// - if |x| = a, then |~x| = |U| - a, where |U| is universal set
// - if |x| = b, then |~x| = |U| - b
// so, |~x| in [|U|-b, |U|-a]
x.low_k = maximum<X>() - x.high_k;
x.high_k = maximum<X>() - x.low_k;
return x;
}
template <typename X, size_t N>
auto operator*(
trapdoor_set<X,N> const & x,
trapdoor_set<X,N> const & y)
{
return trapdoor_set<X>(x.value_hash & y.value_hash,
0, // intersection could be empty
std::min(x.high_k, y.high_k) // one could be a subset of the other
);
}
template <typename X, size_t N>
bernoulli<bool> in(trapdoor<X,N> const & x, trapdoor_set<X,N> const & xs
{
for (size_t i = 0; i < N; ++i)
if ((x.value_hash[i] & xs.value_hash[i]) != x.value_hash[i])
return bernoulli<bool>{true, 0};
// earlier, we showed that the fp rate is (1 - 2^{-k})^n
return bernoulli<bool>{false, };
}
/**
* The subseteq_B predicate, x \subseteq_B y, is defined as x | y = x.
* It has an false positive rate of eps = (1 - 2^{-(k_1 + k_2 log 2)})^n
* which we approxiate as eps ~ e^{-n e^{-(k_1 + k_2 log 2)}}.
* we store the log(eps) instead: -n e^{-(k_1 + k_2 log 2)}
*/
template <typename X>
bernoulli<bool> operator<=(
trapdoor_set<X> const & x,
trapdoor_set<X> const & y)
{
for (size_t i = 0; i < N; ++i)
if ((x.value_hash[i] & y.value_hash[i]) != x.value_hash[i])
{
return bernoulli<bool>(false, 0); // false negative rate is 0
}
// earlier, we showed that the fp rate is (1 - 2^{-(k_1 + k_2 log 2)})^n
return bernoulli<bool>{true, };
}
template <typename X>
bernoulli<bool> operator==(
trapdoor_set<X> const & x
trapdoor_set<X> const & y)
{
for (size_t i = 0; i < N; ++i)
if (x.value_hash[i] != y.value_hash[i])
return bernoulli<bool>{false, 0.5};
}
Appendix
Marginal Uniformity
Suppose we have a probability distribution over elements (unigrams) in $X^{\ast}$ such that
$$
\Pr_D\lbrace a\rbrace
$$
is the probability of $a$ being generated from some data generating process $D$. We can transform homomorphism $F$ to map each $a \in X$ to multiple hashes inversely proportional to $\Pr_D(a),ドル satisfying
$$
\Pr\lbrace h(a)\rbrace \approx \Pr\lbrace h(b)\rbrace
$$
even if $\Pr_D\lbrace a\rbrace$ and $\Pr_D\lbrace b\rbrace$ are significantly different.
When doing a membership query, we uniformly sample one of these representations so that the unigram distribution of elements in $\lbrace0,1\rbrace$ is uniform. This is a kind of marginal uniformity.
However, this approach has serious shortcomings:
Only the marginal distribution of unigrams is uniformly distributed. Correlations in the joint distributions of $X^{\ast},ドル such as bigrams, are not accounted for. We can apply the transformation to larger sequences, but space complexity grows exponentially with sequence length for a fixed false positive rate. In practice, even the unigram model may need approximation due to space limits.
When we apply Boolean operations on an untrusted system, it cannot be given distribution information about elements in $X$. This means Boolean operations like $\land$ and $\lor$ cannot be performed on the untrusted system, only relational queries like $\in$ and $\subseteq$.