In 1989, Bruce McKenzie and his co-workers at the University of Canterbury, Christchurch, New Zealand, developed several methods for evaluating hash functions. By studying and measuring many hash functions, they empirically discovered odd behavioral properties in most of the commonly used ones [mckenzie90].
The performance problems exposed by this testing showed that the average cost of searching can range over several orders of magnitude using the same hash function and varying only the output range. This seemed directly attributable to the sensitivity of the result to the modulus applied at the end of the hashing operation. This property makes most functions unacceptable as library hash functions, because each must be tested for the specific range over which output is expected.
In their paper, McKenzie and his co-authors conclude that for hashing program identifiers the following linear hash function, presented here in its simplest form, is a good one.
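As a minimal Java sketch of such a linear hash (the multiplier 4 and the method shape are illustrative assumptions, not McKenzie's exact listing):

    // Sketch of a simple linear hash in the cbuhash style.  The
    // multiplier 4 is an assumed small constant; tableSize must be odd.
    static int cbuhash(String key, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = h * 4 + key.charAt(i);        // fold each character in
        return (h & 0x7fffffff) % tableSize;  // remainder by an odd size
    }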
This function is named cbuhash, for CanterBury University in Christchurch, New Zealand, where it was developed. It works by using random variations in the low-order bits of the key characters to create noise that masks the high-order bits of the characters. To finish the mixing, the resulting hash value must be divided by an odd number and the remainder used as the operational hash. This means one must always have an odd number of bins in the hash table; the function works particularly badly for hash tables whose size is an integral power of two.
The cbuhash function has the advantages that it is fast to compute, it seems to work well for an ASCII character-string domain, and it works over a broad range of table sizes. This algorithm and variations of it have been used in modern programs; for example, the Revision Control System [rcs91] uses the same algorithm but with an exclusive-or replacing the plus in the computation of the hash, as sketched below.
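A hypothetical sketch of that variation, assuming the same shape as the cbuhash sketch above (the details other than the exclusive-or are assumptions, not RCS's actual source):

    // Sketch of the RCS variation described above: exclusive-or in
    // place of addition; all other details as in the cbuhash sketch.
    static int rcsHash(String key, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = (h * 4) ^ key.charAt(i);      // XOR replaces the plus
        return (h & 0x7fffffff) % tableSize;
    }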
In 1990, Peter K. Pearson published a random-walk hash algorithm that produces integers from 0 to 255; he claimed the algorithm works well for text strings with 8-bit bytes (see [pearson90]). Pearson uses a compile-time static table, ptab, of the unsigned byte values 0-255, randomly permuted, which introduces noise into the key. The Pearson algorithm is:
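In Java this can be sketched as follows (the name pkphash follows the text; ptab is the permuted byte table):

    // Pearson's random-walk hash: each key byte steers a walk through
    // the permutation table ptab (a random permutation of 0..255).
    static int pkphash(byte[] key, int[] ptab) {
        int h = 0;
        for (byte b : key)
            h = ptab[(h ^ b) & 0xFF];  // table lookup replaces arithmetic
        return h;                      // result lies in 0..255
    }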
A defect in this hash algorithm is that for hash ranges greater than 0-255, pkphash becomes more costly; for a 64-bit range it costs roughly 7 times more to compute than for a range of 0-255. With the inner loop expanded for efficiency, the algorithm is as follows:
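A compact Java sketch of the wide version (Pearson widens the result by rerunning the walk with the first key byte perturbed; the unrolling mentioned above is omitted here for brevity):

    // 64-bit Pearson hash: eight passes, each starting the walk from a
    // different perturbation of the first key byte, yield eight output
    // bytes.  Assumes the key has at least one byte.
    static long pkphash64(byte[] key, int[] ptab) {
        long result = 0;
        for (int pass = 0; pass < 8; pass++) {
            int h = ptab[(key[0] + pass) & 0xFF];
            for (int i = 1; i < key.length; i++)
                h = ptab[(h ^ key[i]) & 0xFF];
            result = (result << 8) | h;   // one byte of output per pass
        }
        return result;
    }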
At about the same time Pearson's hash function was published, I was looking for a hash function. Reading his algorithm, I was sure a new algorithm could be designed that would compute a large hash value more cheaply. At the same time a friend suggested I look at the McKenzie article in Software Practice and Experience. Using McKenzie's measurement techniques, I found a hash algorithm that requires a random table of 256 words, where a word is defined as the maximum data size to which an XOR function can easily be applied in the programming language. Depending on the programming language and the machine, this word size can assume different values: using C on a modern PC it is 32 bits, in PL/I on an IBM 360 it is 2048 bits, and in Java on a byte-code machine it is 64 bits. If you assume an unsigned 64-bit word and strings of 8-bit bytes, then a library hash function suitable for Java might look like the following:
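A sketch consistent with the description that follows (the leftward rotation direction is an assumption, and Java's signed long stands in for the unsigned 64-bit word):

    // buzhash-style hash: rotate the hash one bit, then XOR in a random
    // table word selected by the key byte.
    static long buzhash(byte[] key, long[] rtab) {
        long h = 0;
        for (byte b : key)
            h = Long.rotateLeft(h, 1) ^ rtab[b & 0xFF];
        return h;
    }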
Here cshift is a circular shift (rotation) that can be applied in either direction. The rtab table is a compile-time constant array with the property that every bit position, read vertically down the table, contains exactly 128 zeros and 128 ones. This makes the probability of change in each hash bit position ½, giving a uniform random distribution; one way to build such a table is sketched below. The Java buzhash is designed to work for keys shorter than 65 characters (after 64 one-bit rotations a 64-bit word returns to its original alignment); this does not limit programs much, because keys longer than 64 characters are rarely encountered.
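One hypothetical way to generate such a table, enforcing the 128/128 balance column by column:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Build a 256-entry table of 64-bit words in which every bit column
    // holds exactly 128 ones and 128 zeros, as required above.
    static long[] makeRtab(long seed) {
        long[] rtab = new long[256];
        Random rng = new Random(seed);
        List<Integer> column = new ArrayList<>();
        for (int i = 0; i < 256; i++) column.add(i < 128 ? 1 : 0);
        for (int bit = 0; bit < 64; bit++) {
            Collections.shuffle(column, rng);   // random balanced column
            for (int row = 0; row < 256; row++)
                if (column.get(row) == 1) rtab[row] |= 1L << bit;
        }
        return rtab;
    }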
A hash function based on a pseudo-random number generator has been around for several years; I have not been able to trace its source to a publication. It is built on a linear congruential random number generator, but any generator that provides a satisfactory random distribution, modified to fold in the bytes of the key, should also work. Here is the code for a 64-bit version of this function; for this function it makes no difference what byte size comes from the string.
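A hedged 64-bit sketch; the multiplier shown is Knuth's MMIX LCG constant, used here purely as an illustration, since (as noted below) a good multiplier for a 64-bit word still has to be found and tested:

    // Generator-based hash: each key byte is folded into the state of a
    // linear congruential step.  The multiplier is illustrative only.
    static long lcghash(byte[] key) {
        long h = 0;
        for (byte b : key)
            h = h * 6364136223846793005L + (b & 0xFF);
        return h;
    }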
Note that this function tends to be a bit slower than pkphash and buzhash, since it requires one multiply for every byte of key. To really use this function, a good value for the multiplier must be found and the resulting hash function tested; as far as I know, this has not been done for a 64-bit word.
The academic cryptographic community has produced several secure hash functions that distribute keys randomly [pieprzyk93], but for general retrieval use, cryptographic hash functions tend to be too slow to satisfy most users.
Copyright © 1996 Robert Uzgalis,
All Rights Reserved.
Contact: Robert Uzgalis <buz@zis.com>