next up previous contents
Next: 5.2 LinearHashTable: Linear Probing Up: 5. Hash Tables Previous: 5. Hash Tables Contents

Subsections

5.1 `ChainedHashTable`: Hashing with Chaining

A ChainedHashTable data structure uses hashing with chaining to store data as an array, $ \mathtt{t}$, of lists. An integer, $ \mathtt{n}$, keeps track of the total number of items in all lists (see Figure 5.1):

 List<T>[] t;
 int n;

Figure 5.1: An example of a ChainedHashTable with $ \ensuremath{\mathtt{n}}=14$ and $ \ensuremath{\mathtt{t.length}}=16$. In this example $ \ensuremath{\mathtt{hash(x)}}=6$

[画像:\includegraphics{figs/chainedhashtable}]

The hash value of a data item $ \mathtt{x}$, denoted $ \mathtt{hash(x)}$ is a value in the range $ \{0,\ldots,\ensuremath{\mathtt{t.length}}-1\}$. All items with hash value $ \mathtt{i}$ are stored in the list at $ \mathtt{t[i]}$. To ensure that lists don't get too long, we maintain the invariant

$\displaystyle \ensuremath{\mathtt{n}} \le \ensuremath{\mathtt{t.length}} $

so that the average number of elements stored in one of these lists is $ \ensuremath{\mathtt{n}}/\ensuremath{\mathtt{t.length}} \le 1$.

To add an element, $ \mathtt{x}$, to the hash table, we first check if the length of $ \mathtt{t}$ needs to be increased and, if so, we grow $ \mathtt{t}$. With this out of the way we hash $ \mathtt{x}$ to get an integer, $ \mathtt{i}$, in the range $ \{0,\ldots,\ensuremath{\mathtt{t.length}}-1\}$ and we append $ \mathtt{x}$ to the list $ \mathtt{t[i]}$:

 boolean add(T x) {
 if (find(x) != null) return false;
 if (n+1 > t.length) resize();
 t[hash(x)].add(x);
 n++;
 return true;
 }

Growing the table, if necessary, involves doubling the length of $ \mathtt{t}$ and reinserting all elements into the new table. This is exactly the same strategy used in the implementation of ArrayStack and the same result applies: The cost of growing is only constant when amortized over a sequence of insertions (see Lemma 2.1 on page [*]).

Besides growing, the only other work done when adding $ \mathtt{x}$ to a ChainedHashTable involves appending $ \mathtt{x}$ to the list $ \mathtt{t[hash(x)]}$. For any of the list implementations described in Chapters 2 or 3, this takes only constant time.

To remove an element $ \mathtt{x}$ from the hash table we iterate over the list $ \mathtt{t[hash(x)]}$ until we find $ \mathtt{x}$ so that we can remove it:

 T remove(T x) {
 Iterator<T> it = t[hash(x)].iterator();
 while (it.hasNext()) {
 T y = it.next();
 if (y.equals(x)) {
 it.remove();
 n--;
 return y;
 }
 }
 return null;
 }

This takes $ O(\ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{hash(x)}}})$ time, where $ \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{i}}}$ denotes the length of the list stored at $ \mathtt{t[i]}$.

Searching for the element $ \mathtt{x}$ in a hash table is similar. We perform a linear search on the list $ \mathtt{t[hash(x)]}$:

 T find(Object x) {
 for (T y : t[hash(x)])
 if (y.equals(x))
 return y;
 return null;
 }

Again, this takes time proportional to the length of the list $ \mathtt{t[hash(x)]}$.

The performance of a hash table depends critically on the choice of the hash function. A good hash function will spread the elements evenly among the $ \mathtt{t.length}$ lists, so that the expected size of the list $ \mathtt{t[hash(x)]}$ is $ O(\ensuremath{\mathtt{n}}/\ensuremath{\mathtt{t.length)}} = O(1)$. On the other hand, a bad hash function will hash all values (including $ \mathtt{x}$) to the same table location, in which case the size of the list $ \mathtt{t[hash(x)]}$ will be $ \mathtt{n}$. In the next section we describe a good hash function.

5.1.1 Multiplicative Hashing

Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic (discussed in Section 2.3) and integer division. It uses the $ \ddiv $ operator, which calculates the integral part of a quotient, while discarding the remainder. Formally, for any integers $ a\ge 0$ and $ b\ge 1$, $ a\ddiv b = \lfloor a/b\rfloor$.

In multiplicative hashing, we use a hash table of size $ 2^{\ensuremath{\mathtt{d}}}$ for some integer $ \mathtt{d}$ (called the dimension). The formula for hashing an integer $ \ensuremath{\mathtt{x}}\in\{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$ is

[画像:$\displaystyle \ensuremath{\mathtt{hash}}(\ensuremath{\mathtt{x}}) = ((\ensurema... ...htt{w}}}) \ddiv 2^{\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}} \enspace . $]

Here, $ \mathtt{z}$ is a randomly chosen odd integer in $ \{1,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo $ 2^{\ensuremath{\mathtt{w}}}$ where $ \ensuremath{\mathtt{w}}$ is the number of bits in an integer. (See Figure 5.2.) Furthermore, integer division by [画像:$ 2^{\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}}$] is equivalent to dropping the rightmost $ \ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$ bits in a binary representation (which is implemented by shifting the bits right by $ \ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$). In this way, the code that implements the above formula is simpler than the formula itself:

 int hash(Object x) {
 return (z * x.hashCode()) >>> (w-d);
 }

Figure 5.2: The operation of the multiplicative hash function with $ \ensuremath {\mathtt {w}}=32$ and $ \ensuremath {\mathtt {d}}=8$.

[画像:\begin{figure}\begin{center} \begin{tabular}{\vert lr@{}r\vert}\hline 2ドル^\ensu... ...suremath{\mathtt{00011110}}} \\ \hline \end{tabular} \end{center} \end{figure}]

The following lemma, whose proof is deferred until later in this section, shows that multiplicative hashing does a good job of avoiding collisions:

Lemma 5..1 Let $ \mathtt{x}$ and $ \mathtt{y}$ be any two values in $ \{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$ with $ \ensuremath{\mathtt{x}}\neq \ensuremath{\mathtt{y}}$. Then [画像:$ \Pr\{\ensuremath{\mathtt{hash(x)}}=\ensuremath{\mathtt{hash(y)}}\} \le 2/2^{\ensuremath{\mathtt{d}}}$].

With Lemma 5.1, the performance of $ \mathtt{remove(x)}$, and $ \mathtt{find(x)}$ are easy to analyze:

Lemma 5..2 For any data value $ \mathtt{x}$, the expected length of the list $ \mathtt{t[hash(x)]}$ is at most $ \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + 2$, where $ \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}}$ is the number of occurrences of $ \mathtt{x}$ in the hash table.

Proof. Let $ S$ be the (multi-)set of elements stored in the hash table that are not equal to $ \mathtt{x}$. For an element $ \ensuremath{\mathtt{y}}\in S$, define the indicator variable

[画像:$\displaystyle I_{\ensuremath{\mathtt{y}}} = \left\{\begin{array}{ll} 1 & \mbox... ...\ensuremath{\mathtt{hash(y)}}$} \\ 0 & \mbox{otherwise} \end{array}\right. $]

and notice that, by Lemma 5.1, [画像:$ \mathrm{E}[I_{\ensuremath{\mathtt{y}}}] \le 2/2^{\ensuremath{\mathtt{d}}}=2/\ensuremath{\mathtt{t.length}}$]. The expected length of the list $ \mathtt{t[hash(x)]}$ is given by

$\displaystyle \mathrm{E}\left[\ensuremath{\mathtt{t[hash(x)].size()}}\right]$ $\displaystyle =$ [画像:$\displaystyle \mathrm{E}\left[\ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + \sum_{\ensuremath{\mathtt{y}}\in S} I_{\ensuremath{\mathtt{y}}}\right]$]

$\displaystyle =$ [画像:$\displaystyle \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + \sum_{\ensuremath{\mathtt{y}}\in S} \mathrm{E}[I_{\ensuremath{\mathtt{y}}} ]$]

$\displaystyle \le$ [画像:$\displaystyle \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + \sum_{\ensuremath{\mathtt{y}}\in S} 2/\ensuremath{\mathtt{t.length}}$]

$\displaystyle \le$ [画像:$\displaystyle \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + \sum_{\ensuremath{\mathtt{y}}\in S} 2/\ensuremath{\mathtt{n}}$]

$\displaystyle \le$ $\displaystyle \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + (\ensuremath{... ...n}}-\ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}})2/\ensuremath{\mathtt{n}}$

$\displaystyle \le$ $\displaystyle \ensuremath{\mathtt{n}}_{\ensuremath{\mathtt{x}}} + 2 \enspace ,$

as required. $ \qedsymbol$

Now, we want to prove Lemma 5.1, but first we need a result from number theory. In the following proof, we use the notation $ (b_r,\ldots,b_0)_2$ to denote [画像:$ \sum_{i=0}^r b_i2^i$], where each $ b_i$ is a bit, either 0 or 1. In other words, $ (b_r,\ldots,b_0)_2$ is the integer whose binary representation is given by $ b_r,\ldots,b_0$. We use $ \star$ to denote a bit of unknown value.

Lemma 5..3 Let $ S$ be the set of odd integers in $ \{1,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$, Let $ q$ and $ i$ be any two elements in $ S$. Then there is exactly one value $ \ensuremath{\mathtt{z}}\in S$ such that $ \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}} = i$.

Proof. Since the number of choices for $ \ensuremath{\mathtt{z}}$ and $ i$ is the same, it is sufficient to prove that there is at most one value $ \ensuremath{\mathtt{z}}\in S$ that satisfies $ \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}} = i$.

Suppose, for the sake of contradiction, that there are two such values $ \mathtt{z}$ and $ \mathtt{z'}$, with $ \ensuremath{\mathtt{z}}>\ensuremath{\mathtt{z}}'$. Then

$\displaystyle \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}} = \ensuremath{\mathtt{z}}'q \bmod 2^{\ensuremath{\mathtt{w}}} = i $

$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q\bmod 2^{\ensuremath{\mathtt{w}}} = 0 $

But this means that

$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q = k 2^{\ensuremath{\mathtt{w}}}$ (5.1)

for some integer $ k$. Thinking in terms of binary numbers, we have

[画像:$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q = k\cdot(1,\underbrace{0,\ldots,0}_{\ensuremath{\mathtt{w}}})_2 \enspace , $]

so that the $ \mathtt{w}$ trailing bits in the binary representation of $ (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q$ are all 0's.

Furthermore $ k\neq 0$ since $ q\neq 0$ and $ \ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}'\neq 0$. Since $ q$ is odd, it has no trailing 0's in its binary representation:

$\displaystyle q = (\star,\ldots,\star,1)_2 \enspace . $

Since $ \vert\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}'\vert < 2^{\ensuremath{\mathtt{w}}}$, $ \ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}'$ has fewer than $ \mathtt{w}$ trailing 0's in its binary representation:

[画像:$\displaystyle \ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}' = (\star,\ldots,\star,1,\underbrace{0,\ldots,0}_{<\ensuremath{\mathtt{w}}})_2 \enspace . $]

Therefore, the product $ (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q$ has fewer than $ \mathtt{w}$ trailing 0's in its binary representation:

[画像:$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q = (\star,\cdots,\star,1,\underbrace{0,\ldots,0}_{<\ensuremath{\mathtt{w}}})_2 \enspace . $]

Therefore $ (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q$ cannot satisfy (5.1), yielding a contradiction and completing the proof. $ \qedsymbol$

The utility of Lemma 5.3 comes from the following observation: If $ \mathtt{z}$ is chosen uniformly at random from $ S$, then $ \mathtt{zt}$ is uniformly distributed over $ S$. In the following proof, it helps to think of the binary representation of $ \mathtt{z}$, which consists of $ \ensuremath{\mathtt{w}}-1$ random bits followed by a 1.

Proof. [Proof of Lemma 5.1] First we note that the condition $ \ensuremath{\mathtt{hash(x)}}=\ensuremath{\mathtt{hash(y)}}$ is equivalent to the statement ``the highest-order $ \mathtt{d}$ bits of $ \ensuremath{\mathtt{z}} \ensuremath{\mathtt{x}}\bmod2^{\ensuremath{\mathtt{w}}}$ and the highest-order $ \mathtt{d}$ bits of $ \ensuremath{\mathtt{z}} \ensuremath{\mathtt{y}}\bmod 2^{\ensuremath{\mathtt{w}}}$ are the same.'' A necessary condition of that statement is that the highest-order $ \mathtt{d}$ bits in the binary representation of $ \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod 2^{\ensuremath{\mathtt{w}}}$ are either all 0's or all 1's. That is,

[画像:$\displaystyle \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\math... ...rbrace{\star,\ldots,\star}_{\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}})_2$] (5.2)

when $ \ensuremath{\mathtt{zx}}\bmod 2^{\ensuremath{\mathtt{w}}} > \ensuremath{\mathtt{zy}}\bmod 2^{\ensuremath{\mathtt{w}}}$ or

[画像:$\displaystyle \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\math... ...r,\ldots,\star}_{\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}})_2 \enspace .$] (5.3)

when $ \ensuremath{\mathtt{zx}}\bmod 2^{\ensuremath{\mathtt{w}}} < \ensuremath{\mathtt{zy}}\bmod 2^{\ensuremath{\mathtt{w}}}$. Therefore, we only have to bound the probability that $ \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod 2^{\ensuremath{\mathtt{w}}}$ looks like (5.2) or (5.3).

Let $ q$ be the unique odd integer such that $ (\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod 2^{\ensuremath{\mathtt{w}}}=q2^r$ for some integer $ r\ge 0$. By Lemma 5.3, the binary representation of $ \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}}$ has $ \ensuremath{\mathtt{w}}-1$ random bits, followed by a 1:

[画像:$\displaystyle \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}} = (\und... ...{b_{\ensuremath{\mathtt{w}}-1},\ldots,b_{1}}_{\ensuremath{\mathtt{w}}-1},1)_2 $]

Therefore, the binary representation of $ \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod ... ...emath{\mathtt{w}}}=\ensuremath{\mathtt{z}}q2^r\bmod 2^{\ensuremath{\mathtt{w}}}$ has $ \ensuremath{\mathtt{w}}-r-1$ random bits, followed by a 1, followed by $ r$ 0's:

[画像:$\displaystyle \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\math... ...ldots,b_{1}}_{\ensuremath{\mathtt{w}}-r-1},1,\underbrace{0,0,\ldots,0}_{r})_2 $]

We can now finish the proof: If $ r > \ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$, then the $ \mathtt{d}$ higher order bits of $ \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod 2^{\ensuremath{\mathtt{w}}}$ contain both 0's and 1's, so the probability that $ \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod 2^{\ensuremath{\mathtt{w}}}$ looks like (5.2) or (5.3) is 0. If $ \ensuremath{\mathtt{r}}=\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$, then the probability of looking like (5.2) is 0, but the probability of looking like (5.3) is [画像:$ 1/2^{\ensuremath{\mathtt{d}}-1}=2/2^{\ensuremath{\mathtt{d}}}$] (since we must have $ b_1,\ldots,b_{d-1}=1,\ldots,1$). If $ r < \ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$ then we must have $ b_{\ensuremath{\mathtt{w}}-r-1},\ldots,b_{\ensuremath{\mathtt{w}}-r-\ensuremath{\mathtt{d}}}=0,\ldots,0$ or $ b_{\ensuremath{\mathtt{w}}-r-1},\ldots,b_{\ensuremath{\mathtt{w}}-r-\ensuremath{\mathtt{d}}}=1,\ldots,1$. The probability of each of these cases is [画像:$ 1/2^{\ensuremath{\mathtt{d}}}$] and they are mutually exclusive, so the probability of either of these cases is [画像:$ 2/2^{\ensuremath{\mathtt{d}}}$]. This completes the proof. $ \qedsymbol$

5.1.2 Summary

The following theorem summarizes the performance of the ChainedHashTable data structure:

Theorem 5..1 A ChainedHashTable implements the USet interface. Ignoring the cost of calls to $ \mathtt{grow()}$, a ChainedHashTable supports the operations $ \mathtt{add(x)}$, $ \mathtt{remove(x)}$, and $ \mathtt{find(x)}$ in $ O(1)$ expected time per operation.

Furthermore, beginning with an empty ChainedHashTable, any sequence of $ m$ $ \mathtt{add(x)}$ and $ \mathtt{remove(x)}$ operations results in a total of $ O(m)$ time spent during all calls to $ \mathtt{grow()}$.

next up previous contents
Next: 5.2 LinearHashTable: Linear Probing Up: 5. Hash Tables Previous: 5. Hash Tables Contents

opendatastructures.org

5.1 ChainedHashTable: Hashing with Chaining

5.1.1 Multiplicative Hashing

5.1.2 Summary

5.1 `ChainedHashTable`: Hashing with Chaining