Compute hash that's consistent across program executions

Question 1

I am creating a unique "key" object type for a tree; we will call this object type a TreeCoordinate.

A tree in this case is a fairly standard mapping of nodes from parent to children where each child holds a reference to the parent. The parent holds a list of children based on their index. A node can have zero to N number of children.

Therefore, any position in the tree can be mapped to an exact array of integers representing an ordered list of indices, which will have no collision and can be reliably used to look up the object at the given coordinate in the tree. The code of an equality check would look like this:

TreeCoordinate lefttest = new(){Value = [0,3,1]};
TreeCoordinate righttest = new(){Value = [5,3,1,0]};
bool b = Equals(lefttest, righttest); // False

public bool Equals(TreeCoordinate left, TreeCoordinate right)
{
    if (left.Value.Length != right.Value.Length) return false; // fast check
    for (int i = 0; i < left.Value.Length; i++) // Value is uint[]
    {
        // compare the elements at the same position, not the element used as an index
        if (left.Value[i] != right.Value[i]) return false;
    }
    return true;
}

The equality check is easy. Now I would like to get a hash code.

The issue I see is that the default .NET Core implementation of GetHashCode() is non-deterministic across program executions, but for my app logic's purpose, I want a deterministic hash that is a reliable representation of the index-based coordinate value.

I thought I could just append each position in the coordinate to the previous, i.e. [0,3,1] becomes 031, but that's not correct, as that could mean the node at [0,31] or the node at [0,3,1], since an index can be an integer of any length.

So it would mean that I have to hash a separator as well.

Now it seems I need to hash an array of char rather than uint, and I don't want to get into dealing with encoding systems and culture-specific characters.

Do you have any guidance on how I can create this deterministic hash of a TreeCoordinate data structure?

EDIT: I'm thinking that, given an array of uint, I can stringify the array into an A-Za-z prefix + uint per level. This would give me a logical separator with the ability to express a nested coordinate 52 levels deep, which I think would cover all of my use cases.

So [0,3,1] becomes A0B3C1, and [5,0,4,1,6] becomes A5B0C4D1E6. This would help enforce complete addresses (must start with A) or partial ones (must start with any of A-Za-z).

At which point I use any of the given hashing algorithms.
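To make the scheme above concrete, here is a minimal sketch of the letter-prefixed encoding. The class name TreeCoordinateText and the method name Encode are hypothetical, chosen only for illustration. The index is formatted with the invariant culture so the output does not depend on regional settings.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class TreeCoordinateText
{
    // 52 depth markers: A-Z followed by a-z.
    private const string Letters =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Encodes [0, 3, 1] as "A0B3C1": one depth letter, then the index in decimal.
    public static string Encode(uint[] value)
    {
        if (value.Length > Letters.Length)
            throw new ArgumentException("At most 52 levels are supported.", nameof(value));

        var sb = new StringBuilder();
        for (int depth = 0; depth < value.Length; depth++)
        {
            sb.Append(Letters[depth]);
            // Invariant culture keeps the digits independent of regional settings.
            sb.Append(value[depth].ToString(CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }
}
```

With this sketch, [0,31] encodes to A0B31 and [0,3,1] to A0B3C1, so the ambiguity of plain concatenation goes away, at the cost of allocating a string per coordinate.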

Edit: @DocBrown is correct in his assumption I am misunderstanding. The refinement of the question based on your help: I need a deterministic encoding for business logic and a standard hash for use in dictionaries, etc.

I thought that a hashing algorithm could suffice as an encoding scheme, because there’s a clear deterministic definition of equality for the coordinate, which should have an equally deterministic mapping to an encoded version of itself.

I understand now that conversion to an int during a hash will fundamentally introduce some acceptable potential for a collision due to the pigeonhole principle.

Another commenter discussed fitting nodes into a byte, which is the kind of discussion I hoped to foster, because this is intended to be a simple struct type with low memory overhead, and not allocating additional strings is preferable.

If we hash some sort of encoding of bytes, that’s great, and it avoids hashing an array of chars ostensibly extracted from a string. I just needed some direction on how to go from the array of uint to a low-memory, lightweight struct encoding. Then I can completely satisfy equality checks by implementing GetHashCode().

Question 2

"which will have no collision" <- generally, you expect hash functions to have some collisions. They must if the size of the hash is less than the size of all inputs due to the pigeon-hole principle.

Question 3

The default GetHashCode() implementation, as most hash functions, is fully deterministic - the same input always leads to the same output. Hence I don't understand what you are after.

Question 4

@DocBrown To protect against DoS via malicious input causing hash function collisions, many language runtimes now include a per program invocation random seed in their hash implementations. However, this doesn't seem relevant here unless the OP is intending to persist the hash codes beyond the lifetime of a program execution; the hash function is deterministic within one program.

Question 5

Most hashing functions work the same way. They take each number, mix it in some twisted way into an accumulator, and continue to do that for all numbers. A string is just a list of numbers pretending to be characters. It should be trivial to take an algorithm that hashes a string and turn it into an algorithm that hashes a list of numbers.
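As an illustration of that accumulator idea, here is a minimal sketch that applies 32-bit FNV-1a to a uint[]. FNV-1a is my choice for the example (the comment does not name an algorithm), and the class name Fnv1a is hypothetical. Each index is fed as four bytes, least significant first, so the fixed width per element makes a separator unnecessary, and the result does not depend on any per-process seed.

```csharp
// Hypothetical helper, for illustration only.
public static class Fnv1a
{
    private const uint OffsetBasis = 2166136261; // standard 32-bit FNV offset basis
    private const uint Prime = 16777619;         // standard 32-bit FNV prime

    public static uint Hash(uint[] values)
    {
        uint hash = OffsetBasis;
        foreach (uint v in values)
        {
            // Feed the four bytes of each index, least significant byte first.
            for (int shift = 0; shift < 32; shift += 8)
            {
                hash ^= (v >> shift) & 0xFF;    // mix one byte into the accumulator
                hash = unchecked(hash * Prime); // wrap-around multiplication
            }
        }
        return hash;
    }
}
```

The same input array produces the same value in every run, but distinct coordinates can still collide, as with any 32-bit hash.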

Question 6

When I see that an answer which just makes a wild guess at what you meant already got 3 upvotes, I think this question needs some clarification. Would you mind explaining what kind of use case you have that requires a hash which behaves deterministically across multiple program executions? Thanks.

J_H · Answer 1 · 2023-03-05 01:20:47Z

One could do worse than to use this well-known hashing approach:

The java.lang.String.hashCode() method returns a hash code ... computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

It is not cryptographically secure, but that wasn't among your requirements. It is fast, and amenable to implementation via Horner's method.

Either serialize your unsigned integers into a JSON string to be hashed, or simply fill s with them directly. Notice that (101, 1) and (1, 101) won't collide. Collisions are definitely possible, and an attacker can readily provoke them.
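A minimal sketch of the "fill s with them directly" option, evaluating the polynomial above with Horner's method over a uint[]. The class name PolynomialHash is hypothetical; arithmetic wraps around at 32 bits, as it does in Java.

```csharp
// Hypothetical helper, for illustration only.
public static class PolynomialHash
{
    // Computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], modulo 2^32.
    public static int Hash(uint[] s)
    {
        uint h = 0;
        foreach (uint x in s)
        {
            h = unchecked(31 * h + x); // Horner step: multiply the accumulator, add the next term
        }
        return unchecked((int)h);
    }
}
```

For example, (101, 1) hashes to 101*31 + 1 = 3132 and (1, 101) to 1*31 + 101 = 132. Note that with a starting value of 0, leading zeros contribute nothing, so [0,3] and [3] both hash to 3; mixing the array length into the accumulator is one way to separate such cases.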

Feel free to use mod M on the result, with as many bits in M as your use case requires to produce a suitably low collision rate.

Collisions will happen, so your code will need to have some story for coping with that. Some folks prepend a constant number to everything hashed, and then in the rare event of a collision they increment it and rehash everything in sight.

If you later decide that attackers turn out to be a problem, then a secret salt used with any number of secure hashes will still be available to you when you code up V2.

If collisions are unacceptable then you should be assigning serial numbers and persisting records in a hashmap, RDBMS, redis, or other datastore.

gnasher729 · Answer 2 · 2023-03-05 20:22:57Z

A hash never guarantees that any two different values have different hashes except if the number of possible values is limited. Hash codes used for fast lookup return integers so that we can quickly translate a hash code into an array index.

Maybe what you really want is to compress your coordinates into a small number of words that can be decompressed back to the same coordinates.
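One way such a reversible compression could look is sketched below, assuming a variable-length (LEB128-style) byte encoding: each index uses seven payload bits per byte, and the high bit marks "more bytes follow". This is only one possible scheme, and the names TreeCoordinateCodec, Encode and Decode are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class TreeCoordinateCodec
{
    // Indices below 128 take one byte each; larger ones take up to five bytes.
    public static byte[] Encode(uint[] value)
    {
        var bytes = new List<byte>();
        foreach (uint index in value)
        {
            uint v = index;
            while (v >= 0x80)
            {
                bytes.Add((byte)((v & 0x7F) | 0x80)); // low 7 bits plus continuation flag
                v >>= 7;
            }
            bytes.Add((byte)v); // final byte of this index, flag clear
        }
        return bytes.ToArray();
    }

    // Reverses Encode. Assumes well-formed input produced by Encode.
    public static uint[] Decode(byte[] bytes)
    {
        var result = new List<uint>();
        uint current = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            current |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
            {
                result.Add(current); // flag clear: this index is complete
                current = 0;
                shift = 0;
            }
            else
            {
                shift += 7;
            }
        }
        if (shift != 0) throw new FormatException("Input ends in the middle of an index.");
        return result.ToArray();
    }
}
```

Because the encoding is reversible, two coordinates are equal exactly when their byte arrays are equal, so the bytes can serve as the deterministic key, and any hash of them is only needed for dictionary lookups.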

Apple and others believe that having a hash that survives restarting an app is a very bad idea, so you would need a very good reason for this. Maybe you could tell us what you actually want to achieve.

In a sense, maybe this is what I need. I saw the source code of how the C# implementation of the int type overrides GetHashCode() and returns itself. I assumed an array of uint would similarly be able to return itself, but apparently adding a separator makes this a difficult task. Maybe there aren't enough significant digits to express a large TreeCoordinate, so I'm back to using a hashing algorithm.

– NWoodsman, commented Mar 5, 2023 at 20:34

Collisions are fundamental to the notion of "hashing" (modulo the special case of perfect hashes on small restricted inputs, as sometimes seen for lexers). Either occasional collisions are acceptable to your use case, or they are not. You need to write down an answer to that design consideration first, and then worry about whether one hashing approach is better than another. If you write down a negative answer, then we shouldn't even be discussing universal hashing.

– J_H, commented Mar 5, 2023 at 20:50

+1 Hashing is completely inappropriate for this use case. Encoding is what the OP needs.

– JimmyJames, commented Mar 6, 2023 at 18:48

What has this answer to do with a hash which is consistent across program executions?

– Doc Brown, commented Mar 6, 2023 at 19:30

@DocBrown Pretty sure this is an XY problem.

– JimmyJames, commented Mar 6, 2023 at 20:49

DasKrümelmonster · Answer 3 · 2023-03-05 16:16:45Z

If you need consistent hash codes across program executions, I'd recommend using xxHash. (If not, use Microsoft's HashCode.Combine().)

xxHash is pretty fast, deterministic and ported to lots of programming languages. For C# there is a variety of nuget packages out there to choose from.
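For the in-process case, a minimal sketch using System.HashCode might look like the following. Since the coordinate has a variable number of elements, it uses the instance methods Add and ToHashCode rather than the fixed-arity HashCode.Combine. The class and method names are hypothetical. System.HashCode is seeded randomly per process, so these values should not be persisted or compared across runs.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class InProcessHash
{
    // Suitable for GetHashCode() and dictionary keys within a single process.
    public static int Of(uint[] value)
    {
        var hc = new HashCode();
        hc.Add(value.Length); // include the length so [0,3] and [3] start from different states
        foreach (uint v in value)
        {
            hc.Add(v);
        }
        return hc.ToHashCode();
    }
}
```

Two arrays with equal contents give the same result for the lifetime of the process, which is all that GetHashCode() requires.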

I tried to answer the question as asked; from the information given, I am not certain that a hash function is the best solution to your problem.

I took the liberty of adding a reference. Please double-check that I got it right.
