
I'm having trouble understanding two things from some notes about uniform hashing. Here's the copy-pasted part of the notes:

Let us first argue, by a counting argument, why the uniformity property we require of good hash functions is computationally hard to guarantee. Recall that we are interested in hash functions which map keys in $U$ to integers in $\{0, 1, \ldots, m-1\}$. The total number of such hash functions is $m^{|U|}$, given that each of the $|U|$ keys can be mapped into any of the $m$ slots of the hash table. In order to guarantee uniform distribution of the keys and independence among them, **our hash function should be any one of those**. But, in this case, its representation would need $\Omega(\log_2 m^{|U|}) = \Omega(|U| \log_2 m)$ bits, which is really too much in terms of space occupancy and in terms of computing time (i.e. it would take at least $\Omega(\frac{|U|\log_2 m}{\log_2 |U|})$ time just to read the hash encoding).

The part I put in bold is the first thing confusing me.

Why should the function be any one of those? Shouldn't we avoid a good part of them, like the ones sending every element of the universe $U$ to the same number, thus not distributing the elements at all?

The second thing is the last "$\Omega$". Why would it take $\Omega(\frac{|U|\log_2 m}{\log_2 |U|})$ time just to read the hash encoding?

The numerator is the number of bits needed to index every hash function in the space of such functions, and the denominator is the size in bits of a key. Why does this ratio give a lower bound on the time needed to read the encoding? And what hash encoding?
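To make sure I follow the counting itself, here is how I picture it (my own toy sketch, not from the notes): a truly random hash function from $U$ into $m$ slots is nothing more than a lookup table holding one independent uniform value per key, so storing it costs $|U|\log_2 m$ bits.

```python
import math
import random

# Toy illustration (my own numbers): a truly random hash function over a
# universe U into m slots is just a lookup table with one independent
# uniform value in {0, ..., m-1} per key.
U = list(range(16)) # toy universe, |U| = 16
m = 8 # number of hash-table slots

# Sampling the table uniformly picks one of the m^|U| possible functions.
h = {x: random.randrange(m) for x in U}

# Storing this table takes |U| * log2(m) bits, matching the Omega bound.
bits_needed = len(U) * math.log2(m)
print(bits_needed) # 48.0
```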

asked Jun 24, 2020 at 16:50

1 Answer


Suppose that we sample a hash function $h$ at random from a collection $\mathcal{H}\subseteq [m]^U$ (we denote by $[m]$ the set $\{0,1,\ldots,m-1\}$). The best possible thing we can require from this choice is that $\left(h(x_1),h(x_2),\ldots,h(x_{|U|})\right)$ is jointly uniformly distributed over $[m]^{|U|}$, where $U=\{x_1,\ldots,x_{|U|}\}$ (this in turn implies independence). Fix some $h_0\in [m]^U$; then under the previous requirement we have:

$\Pr\limits_{h\sim\mathcal{H}}[h=h_0]=\Pr\limits_{h\sim\mathcal{H}}\left[h(x_1)=h_0(x_1)\land...\land h(x_{|U|})=h_0(x_{|U|})\right]=\frac{1}{m^{|U|}}$.

It follows that $h$ is uniformly distributed over all possible functions in $[m]^U$ (in any other case we would fail either uniformity or independence). This is what is meant by "$h$ should be any one of these functions".
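The argument above can be checked by brute force on a tiny instance (a sketch of my own, with $|U|=2$ and $m=2$): drawing $h$ uniformly from all $m^{|U|}=4$ functions makes every value tuple equally likely, while restricting to a smaller family, e.g. the constant functions, misses some tuples entirely.

```python
from itertools import product
from collections import Counter

# Toy check: with |U| = 2 keys and m = 2 slots there are m^|U| = 4 functions
# in [m]^U. Drawing h uniformly over all of them makes the value tuple
# (h(x1), h(x2)) uniform, i.e. each of the 4 tuples has probability 1/4.
m, keys = 2, ['x1', 'x2']
all_functions = [dict(zip(keys, vals))
 for vals in product(range(m), repeat=len(keys))]

counts = Counter(tuple(h[k] for k in keys) for h in all_functions)
print(counts) # each of the 4 tuples appears exactly once

# Restricting to a smaller family H, e.g. only the constant functions,
# breaks joint uniformity: the tuples (0, 1) and (1, 0) never occur.
constant_only = [h for h in all_functions if len(set(h.values())) == 1]
print(len(constant_only)) # 2 of the 4 functions
```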

answered Jun 24, 2020 at 19:38
 • The thing I don't get is why we are talking about the time it takes to read a function and not the time it takes to compute it. Commented Jun 25, 2020 at 15:02
 • I'm removing the second part of my answer, as it's obviously false (encoding an element of $U$ requires $\log |U|$ bits, but repeatedly asking for $h(x)$ for all $x\in U,ドル where each value requires time $T$ to compute, takes $|U|T$ time and not $T\log |U|$). Since we can generate a truly random $h$ by sampling a uniform binary string of length $|U|\log m,ドル where $h(x_i)$ is given by the $i$'th block of size $\log m,ドル which makes reading $h(x)$ easy, I'm not sure what is meant here. Perhaps you can link the full text to clarify. Commented Jun 25, 2020 at 16:24
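The bit-string encoding this comment describes can be sketched concretely (a toy illustration of my own, assuming $m$ is a power of two so each value fits in exactly $\log_2 m$ bits):

```python
import random

# Sketch of the encoding from the comment: a truly random h is a uniform
# bit string of length |U| * log2(m), and h(x_i) is read off as the i-th
# block of log2(m) bits. Toy sizes; m is assumed to be a power of two.
U_size, m = 8, 4
block = m.bit_length() - 1 # log2(m) = 2 bits per hash value

bits = ''.join(random.choice('01') for _ in range(U_size * block))

def h(i):
 # Reading h(x_i) is a simple slice into the encoding.
 return int(bits[i * block:(i + 1) * block], 2)

print([h(i) for i in range(U_size)]) # |U| values, each in {0, ..., m-1}
```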
