I've built a small library for comparing objects based on their members (related SO question). To conform to the IEqualityComparer<T> interface it needs to compute a hash function, and in my case that hash has to be derived from the hash functions of the members. So the main problem is how to compose them.
I currently use the following approach:
public int GetHashCode(T obj)
{
    VerifyHaveMembersSetup();
    if (TypeTester<T>.IsNull(obj))
    {
        return 0;
    }
    if (getHashCodeBehavior == GetHashCodeBehavior.Cache && cachedHashCode.HasValue)
    {
        return cachedHashCode.Value;
    }
    int hashCode = 0;
    for (int i = 0; i < memberSetups.Count; ++i)
    {
        int leftShift = i % 32;
        int rightShift = 32 - leftShift;
        int memberHashCode = memberSetups[i].GetMemberHashCode(obj);
        hashCode = hashCode ^ ((memberHashCode << leftShift) | (memberHashCode >> rightShift));
    }
    if (getHashCodeBehavior == GetHashCodeBehavior.ImmutableCheck
        && cachedHashCode.HasValue && cachedHashCode.Value != hashCode)
    {
        throw new InvalidOperationException("Hash code value changed");
    }
    cachedHashCode = hashCode;
    return hashCode;
}
Are there any problems with this code?
I don't like the way many people implement composite hash functions: with multiplication and addition, significant bits are shifted away and lost. That's why I use the bit-shift operators << and >> instead. Still, I'm not completely sure how good a hash function generated this way is, especially when it's computed over a small collection of memberSetups of bools, for example (since bool has only 2 hash values in .NET: 1 and 0).
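To make the concern concrete, here is a minimal standalone reproduction of the rotate-and-XOR combination over two bool members (CombineBools is a hypothetical helper, not part of the library; shift counts in C# are masked to 0..31, which matches what the original >> 32 does for the first member):

```csharp
using System;

static int CombineBools(bool a, bool b)
{
    int[] memberHashes = { a.GetHashCode(), b.GetHashCode() }; // bool hashes are 1 or 0
    int hashCode = 0;
    for (int i = 0; i < memberHashes.Length; ++i)
    {
        int leftShift = i % 32;
        int h = memberHashes[i];
        // (32 - leftShift) & 31 reproduces the original's h >> 32 for i == 0
        hashCode ^= (h << leftShift) | (h >> ((32 - leftShift) & 31));
    }
    return hashCode;
}

// All four composite hashes land in the tiny range 0..3:
Console.WriteLine(CombineBools(false, false)); // 0
Console.WriteLine(CombineBools(true, false));  // 1
Console.WriteLine(CombineBools(false, true));  // 2
Console.WriteLine(CombineBools(true, true));   // 3
```

With two bool members the combined hash never leaves 0..3, illustrating how little of the 32-bit range gets used for small collections of low-entropy members.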
- "With multiplication and addition, significant bits are shifted away and are lost." I'm not buying that argument. The very nature of hashing throws away information, but keeps it within a particular space (in our .NET world, a 32-bit signed integer). Read the pages linked to from that page (eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx, blogs.msdn.com/b/ericlippert/archive/2011/02/28/… and amazon.com/dp/0321356683/?tag=stackoverfl08-20). I think you'll find Skeet's implementation solid. – Jesse C. Slicer, Jul 11, 2013 at 16:00
- Okay, I was wrong. The 23 used in the multiplication there has a lowest bit of 1 (the number is odd), which is why it will not shift everything out completely. – Mike, Jul 11, 2013 at 18:16
2 Answers
Shifting the values before XORing them is somewhat better than the worst conceivable ways of combining the hash codes, but it's not very good. You will easily get combinations that cancel each other out, so you get hash codes with a distribution heavily skewed towards zero.
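The cancellation is easy to demonstrate: because the shift amount is i % 32, members at indices 32 positions apart are rotated identically, so equal member hashes XOR to zero. A standalone sketch of the question's combiner (RotateXor is a hypothetical name for illustration):

```csharp
using System;

static int RotateXor(int[] hashes)
{
    int hashCode = 0;
    for (int i = 0; i < hashes.Length; ++i)
    {
        int leftShift = i % 32;                  // same shift for index i and i + 32
        int h = hashes[i];
        hashCode ^= (h << leftShift) | (h >> ((32 - leftShift) & 31));
    }
    return hashCode;
}

// 33 members: indices 0 and 32 share shift 0, so their equal hashes
// cancel and the combined hash collapses to 0.
int[] hashes = new int[33];
hashes[0] = 12345;
hashes[32] = 12345;
Console.WriteLine(RotateXor(hashes)); // prints 0
```

The same cancellation happens within the first 32 members whenever two rotated contributions coincide, which is what skews the distribution towards zero.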
Multiplying values by a prime number, on the other hand, gives a pretty good distribution. Back in the day this technique was actually used to produce random numbers, partly because of how it spreads the values reasonably evenly over the number range.
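As a concrete point of comparison, here is the conventional C# prime-multiplication combiner (the seed 17 and multiplier 31 are customary choices rather than required values; this is a sketch, not the library's API):

```csharp
using System;

static int CombineHashes(params int[] memberHashes)
{
    unchecked // let the multiplication wrap around instead of throwing on overflow
    {
        int hash = 17;
        foreach (int h in memberHashes)
        {
            hash = hash * 31 + h;
        }
        return hash;
    }
}

Console.WriteLine(CombineHashes(1, 2)); // 17*31 + 1 = 528, then 528*31 + 2 = 16370
```

Because 31 is odd, repeated multiplication never zeroes out the low bits, and the additions keep mixing every member into the running value.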
A method that would give an even better distribution would be to use a more elaborate hashing algorithm like MD5 or CRC32, but the drawback is that they are a lot slower; you would lose much of the advantage of using a hash code in the first place.
All in all, multiplying by a prime number gives a very good distribution in relation to the processing time needed. It's a compromise well suited to the GetHashCode method.
- Why are prime numbers so important? Why are they better than just odd numbers? – Mike, Jul 12, 2013 at 22:53
- @Mike: I don't know the mathematics behind that, but Derrick Henry Lehmer did. Here is a start for further reading: en.wikipedia.org/wiki/Lehmer_random_number_generator – Guffa, Jul 13, 2013 at 12:54
If I'm not wrong, your hash code starts using the most significant bit only once memberHashCode << leftShift reaches it. That may very well never happen if memberSetups.Count is low and the hash codes of the objects you're using are not perfect (ints or bools, for example).
So in the worst cases your hashes are only spread a bit better than your inputs over all the integers, which is not 100% satisfying. Maybe instead of ++i in the loop, you could go with i += 32/memberSetups.Count to reach the most significant bits faster (even if that's still not perfect).
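Note that literally replacing ++i with i += 32/memberSetups.Count would skip members entirely; reading the suggestion as "advance the shift, not the loop index, in steps of 32 / count", a sketch might look like this (SpreadRotateXor is a hypothetical name, not part of the library):

```csharp
using System;

static int SpreadRotateXor(int[] memberHashes)
{
    int count = memberHashes.Length;
    // Spread the shifts over the whole 32-bit word; fall back to step 1
    // when there are more than 32 members.
    int step = (count >= 1 && count <= 32) ? 32 / count : 1;
    int hashCode = 0;
    for (int i = 0; i < count; ++i)
    {
        int leftShift = (i * step) % 32;
        int h = memberHashes[i];
        hashCode ^= (h << leftShift) | (h >> ((32 - leftShift) & 31));
    }
    return hashCode;
}

// Two bool-like members (hash 0 or 1) now land at shifts 0 and 16,
// so the second member affects bit 16 instead of bit 1.
Console.WriteLine(SpreadRotateXor(new[] { 1, 1 })); // 1 | (1 << 16) = 65537
```

This only spreads the input bits around; it does not mix them the way prime multiplication does, so it addresses the "most significant bits are unused" complaint but not the cancellation problem.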