I am creating a unique "key" object type for a tree; we will call this object type a TreeCoordinate.
A tree in this case is a fairly standard mapping of nodes from parent to children where each child holds a reference to the parent. The parent holds a list of children based on their index. A node can have zero to N number of children.
Therefore, any position in the tree can be mapped to an exact array of integers representing an ordered list of indices, which will have no collision and can be reliably used to look up the object at the given coordinate in the tree. The code of an equality check would look like this:
    TreeCoordinate lefttest = new() { Value = [0, 3, 1] };
    TreeCoordinate righttest = new() { Value = [5, 3, 1, 0] };
    bool b = Equals(lefttest, righttest); // False
    public bool Equals(TreeCoordinate left, TreeCoordinate right)
    {
        // Fast check: coordinates of different depth can never be equal.
        if (left.Value.Length != right.Value.Length) return false;
        // Compare position by position; Value is uint[].
        for (int i = 0; i < left.Value.Length; i++)
        {
            if (left.Value[i] != right.Value[i]) return false;
        }
        return true;
    }
The equality check is easy. Now I would like to get a hash code.
The issue I see is that the default .NET Core implementation of GetHashCode() is non-deterministic when called in different program executions, but for my app logic's purpose, I want a deterministic hash that is a reliable representation of the index-based coordinate value.
I thought I could just append each position in the coordinate to the previous one, i.e. [0,3,1] becomes 031, but that's not correct, as it could mean the node at [0,31] or the node at [0,3,1], since an index can have any number of digits.
So it would mean that I have to hash a separator as well.
Now it seems I need to hash an array of char and not uint, which I don't want to get into, because it means dealing with encoding systems and culture-specific chars.
Do you have any guidance on how I can create this deterministic hash of a TreeCoordinate data structure?
EDIT: I'm thinking that, given an array of uint, I can stringify the array into an A-Za-z prefix + uint. This would give me a logical separator with the ability to express a nested coordinate 52 levels deep, which I think would cover all of my use cases.
So [0,3,1] becomes A0B3C1, and [5,0,4,1,6] becomes A5B0C4D1E6. This would help enforce complete addresses (must start with A) or partial ones (must start with {A-Za-z}).
At which point I use any of the given hashing algorithms.
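A minimal sketch of the letter-prefix encoding described above. The class name `TreeCoordinateText` and the method `Encode` are hypothetical names chosen for illustration; only the A-Za-z prefix scheme itself comes from the text.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TreeCoordinateText
{
    // Level markers: A-Z followed by a-z, so at most 52 levels can be expressed.
    private const string LevelMarkers =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Hypothetical helper: encodes [0,3,1] as "A0B3C1".
    public static string Encode(uint[] indices)
    {
        if (indices.Length > LevelMarkers.Length)
            throw new ArgumentException("At most 52 levels are supported.", nameof(indices));

        var sb = new StringBuilder();
        for (int level = 0; level < indices.Length; level++)
        {
            // The letter marks the depth and doubles as the separator.
            sb.Append(LevelMarkers[level]);
            // Invariant culture keeps the digits identical on every machine.
            sb.Append(indices[level].ToString(CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }
}
```

With this sketch, `[0,3,1]` encodes to `A0B3C1` and `[0,31]` encodes to `A0B31`, so the two coordinates no longer share a text form. It does allocate a string per coordinate.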
Edit: @DocBrown is correct in his assumption I am misunderstanding. The refinement of the question based on your help: I need a deterministic encoding for business logic and a standard hash for use in dictionaries, etc.
I thought that a hashing algorithm could suffice as an encoding scheme, because there’s a clear deterministic definition of equality for the coordinate, which should have an equally deterministic mapping to an encoded version of itself.
I understand now that conversion to an int during a hash will fundamentally introduce some acceptable potential for a collision, due to the pigeonhole principle.
Another commenter discussed fitting nodes into a byte, which is the kind of discussion I hoped to foster, because this is intended to be a simple struct type with low memory overhead, and not allocating additional strings is preferable.
If we hash some sort of encoding of bytes, that's great, and it avoids hashing an array of chars ostensibly extracted from a string. I just needed some direction on how to go from the array of uint to a low-memory, lightweight struct encoding. Then I can completely satisfy equality checks by implementing GetHashCode().
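One way this could look, as a sketch only: hash the four bytes of each uint directly, so no string or char array is ever built. FNV-1a is used here as an assumed choice of a simple deterministic hash; it is not something the question prescribes. The struct layout beyond the `Value` property is likewise an assumption.

```csharp
using System;

public readonly struct TreeCoordinate : IEquatable<TreeCoordinate>
{
    public uint[] Value { get; init; }

    // Two coordinates are equal when they hold the same indices in the same order.
    public bool Equals(TreeCoordinate other) =>
        Value.AsSpan().SequenceEqual(other.Value);

    public override bool Equals(object obj) =>
        obj is TreeCoordinate other && Equals(other);

    // FNV-1a (32-bit) over the little-endian bytes of each index.
    // Same input gives the same result in every process; no allocation.
    public override int GetHashCode()
    {
        unchecked
        {
            uint hash = 2166136261u; // FNV-1a offset basis
            foreach (uint index in Value ?? Array.Empty<uint>())
            {
                for (int shift = 0; shift < 32; shift += 8)
                {
                    hash ^= (byte)(index >> shift);
                    hash *= 16777619u; // FNV prime
                }
            }
            return (int)hash;
        }
    }
}
```

Because every index contributes exactly four bytes, no separator is needed: `[0,31]` and `[0,3,1]` feed different byte sequences into the hash. Collisions between distinct coordinates remain possible, so `Equals` still has the final word.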
3 Answers
One could do worse than to use this well-known hashing approach:
The java.lang.String.hashCode() method returns a hash code ... computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
It is not cryptographically secure, but that wasn't among your requirements. It is fast, and amenable to implementation via Horner's method.
Either serialize your unsigned integers into a JSON string to be hashed, or simply fill s with them directly.
Notice that (101, 1) and (1, 101) won't collide.
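A short sketch of that formula evaluated with Horner's method, filling s with the unsigned integers directly. The class and method names are hypothetical, and wrap-around 32-bit arithmetic is an assumption made here to mirror how Java's int overflow behaves.

```csharp
public static class PolynomialHash
{
    // Computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    // using Horner's method, with 32-bit wrap-around on overflow.
    public static uint Compute(uint[] s)
    {
        uint hash = 0;
        unchecked
        {
            foreach (uint value in s)
            {
                hash = hash * 31u + value;
            }
        }
        return hash;
    }
}
```

Here `Compute(new uint[] { 101, 1 })` gives 3132 and `Compute(new uint[] { 1, 101 })` gives 132. One caveat of starting from zero: leading zero indices do not change the result, so `[3]` and `[0, 3]` both hash to 3. Mixing the array length into the hash would be one way to separate them.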
Collisions are definitely possible, and an attacker can readily provoke them.
Feel free to use mod M on the result, with as many bits in M as your use case requires to produce a suitably low collision rate.
Collisions will happen, so your code will need to have some story for coping with that. Some folks prepend a constant number to everything hashed, and then in the rare event of a collision they increment it and rehash everything in sight.
If you later decide that attackers turn out to be a problem, then a secret salt used with any number of secure hashes will still be available to you when you code up V2.
If collisions are unacceptable then you should be assigning serial numbers and persisting records in a hashmap, RDBMS, redis, or other datastore.
A hash never guarantees that any two different values have different hashes except if the number of possible values is limited. Hash codes used for fast lookup return integers so that we can quickly translate a hash code into an array index.
Maybe what you really want is to compress your coordinates into a small number of words that can be decompressed back to the same coordinates.
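As an illustration of such a reversible encoding, the sketch below writes each index as a variable-length integer (LEB128 style, 7 payload bits per byte). This particular scheme and the names `TreeCoordinateCodec`, `Encode` and `Decode` are assumptions for the example, not something proposed in the question.

```csharp
using System;
using System.Collections.Generic;

public static class TreeCoordinateCodec
{
    // Each index becomes 1 to 5 bytes: 7 payload bits per byte, with the
    // high bit set on every byte except the last byte of that index.
    public static byte[] Encode(uint[] indices)
    {
        var bytes = new List<byte>();
        foreach (uint index in indices)
        {
            uint remaining = index;
            while (remaining >= 0x80)
            {
                bytes.Add((byte)((remaining & 0x7F) | 0x80));
                remaining >>= 7;
            }
            bytes.Add((byte)remaining);
        }
        return bytes.ToArray();
    }

    // Reverses Encode: a byte without the high bit closes the current index.
    public static uint[] Decode(byte[] bytes)
    {
        var indices = new List<uint>();
        uint current = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            current |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) != 0)
            {
                shift += 7;
            }
            else
            {
                indices.Add(current);
                current = 0;
                shift = 0;
            }
        }
        return indices.ToArray();
    }
}
```

Indices below 128 take a single byte each, so `[0, 3, 1]` becomes the three bytes `00 03 01`, while `[300]` becomes `AC 02`. The stop bit plays the role of the separator, and decoding returns the original array.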
Apple and others believe that having a hash that survives restarting an app is a very bad idea, so you would need a very good reason for this. Maybe you could tell us what you actually want to achieve.
- In a sense, maybe this is what I need. I saw the source code of how the C# implementation of the int type overrides GetHashCode() and returns itself. I assumed an array of uint would similarly be able to return itself, but apparently adding a separator makes this a difficult task. Maybe there aren't enough significant digits to express a large TreeCoordinate, so I'm back to using a hashing algorithm. – NWoodsman, Mar 5, 2023 at 20:34
- Collisions are fundamental to the notion of "hashing" (modulo the special case of perfect hashes on small restricted inputs, as sometimes seen for lexers). Either occasional collisions are acceptable to your use case, or they are not. You need to write down an answer to that design consideration first, and then worry about whether one hashing approach is better than another. If you write down a negative answer, then we shouldn't even be discussing universal hashing. – J_H, Mar 5, 2023 at 20:50
- +1 Hashing is completely inappropriate for this use case. Encoding is what the OP needs. – JimmyJames, Mar 6, 2023 at 18:48
- What has this answer to do with a hash which is consistent across program executions? – Doc Brown, Mar 6, 2023 at 19:30
- @DocBrown Pretty sure this is an XY problem. – JimmyJames, Mar 6, 2023 at 20:49
If you need consistent hash codes across program executions, I'd recommend using xxHash (if not, use the MS HashCode.Combine()).
xxHash is pretty fast, deterministic and ported to lots of programming languages. For C# there is a variety of nuget packages out there to choose from.
I tried to answer the question as asked, but from the information given, I am not certain that a hash function is the best solution to your problem.
- I took the liberty of adding a reference. Please double-check that I got it right. – Doc Brown, Mar 6, 2023 at 19:36
GetHashCode() implementation, like most hash functions, is fully deterministic: the same input always leads to the same output. Hence I don't understand what you are after.