Generating Signatures of Tree Structures using an Order-Independent Hash

Question 1

I'm working with a file format which stores data as a tree. Each node in the tree is known as a Tag, and every tag has a name and a value. The order of tags is not guaranteed, but the general structure is. For example, in the diagram below, the CompoundTag "John Smith" and "Albert Einstein" have the same structure even though their child tags are in a different order:

Formatted as: [TagType] TagName : TagValue
[ListTag] Employees :
 [CompoundTag] :
 [StringTag] FirstName : John
 [StringTag] LastName : Smith
 [IntTag] EmployeeId : 123
 [CompoundTag] :
 [StringTag] FirstName : Albert
 [IntTag] EmployeeId : 456
 [StringTag] LastName : Einstein

I'm creating signatures of certain tags so that I don't have to parse the whole tree structure, as I'm only interested in specific tags in the tree. To do that, I've created an order-independent hash function which hashes the name of each tag and the type. The function then combines those hashes into one signature which I use to identify other tags with the same structure.

The function hashes tag names using the standard String.hashCode() method, then adds the tag Id (signifying the type) to the upper bits of the hash. It then combines the hash with two previous values which keep track of hashes combined using two commutative operations: addition and exclusive or. I keep track of two combined hashes because I found that addition and xor on their own are easily susceptible to collision. Finally, I combine the two hashes as a long and return that as the tag signature.

I have an optional parameter to include the parent tag in the signature. The parent tag is hashed the same way as child tags are, but I also add a small constant to the parent tag hash to make it slightly different from child tags. I add the constant in-case a child tag has the same name and type as the parent, which would generate the same signature in some scenarios.

The code targets Java 7, though I'm considering switching to Java 8 in the near future. The function itself is as follows:

/* 
 * Generate an order-independent signature of a given tag.
 * If parentTag is specified, it'll include the parentTag 
 * in the signature.
 * 
 * @param rootTag is the tag currently being hashed
 * @param parentTag is an optional parent to the rootTag
 * @return the signature for the given tag
 */
public static long calculateTagSignature(Tag tag, Tag parentTag) {
 LinkedList<Tag> stack = new LinkedList<>();
 stack.add(tag);
 // If the tag is a List/Compound tag, add immediate child tags to stack
 // Doesn't add tags deeper than 1 level down
 switch (tag.getType()) {
 case COMPOUND:
 // ArrayList of tags
 for (Tag child : ((CompoundTag) tag).getValue()) {
 stack.add(child);
 }
 break;
 case LIST:
 // Array of a single tag type
 for (Tag child : ((ListTag) tag).getValue()) {
 stack.add(child);
 }
 }
 // Keep track of two hashes using two commutative methods
 long sigA = 0; // stores hashes combined using addition
 long sigB = 0; // stores hashes combined using xor
 long hash; // temporary variable which stores computed hashes
 // If parentTag is specified, include it in the signature
 if (parentTag != null) {
 // Hash the name of the tag
 // Parent tag should be hashed different from child tags
 hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;
 hash += (long) (parentTag.getType().id << 24); // Add tag type to upper bits
 // Combine the hash with the two previous hashes
 sigA += hash; // ADD
 sigB ^= hash; // XOR
 }
 // Iterate over all child tags
 while (!stack.isEmpty()) {
 Tag current = stack.remove();
 // Hash the name of the tag
 hash = ((long) current.getName().hashCode()) & 0xffffffffL;
 hash += (long) (current.getType().id << 24); // Include tag type in the upper bits
 // Combine the hash with the two previous hashes
 sigA += hash; // ADD
 sigB ^= hash | (hash << 32); // XOR 
 }
 // Print out the two input signatures for testing
 //System.out.format("Add Hash: %X, XOR Hash: %X%n", sigA, sigB);
 // Combine the two hashes to generate a full signature based on sigA and sigB
 // Magic constant is just a series of randomly generated bits
 long signature = sigB + 0xD1D89955F542E43AL + (sigA << 28) + (sigA >>> 2);
 return signature;
}

I'm wondering if this is one of the simplest ways to generate an order-independent hash of these objects with few collisions, or if there is a dramatically better approach.

Question 2

Nitpick: there's a typo in your data. It's Einstein, not Einstien.

Question 3

Is it possible to add everything in a Set, and rely on the hashing of that instead of trying to compute it manually? The Set would also be order independent.

Question 4

@HeshamAttia The hashCode function for Tag currently computes the hash using the name, type, and value, which is necessary for other areas of the program. Thus if I were to use a Set to compute the signature, the computed signature would include the tag values as well. I could add a second hash function to the Tag class which hashes only the name and type and override the Set's hashCode function to call it that function, but that'd essentially amount to computing the hash manually anyway.

Question 5

Overall, this is very good code. Maybe there is another hashing strategy to have, but I'll focus on the current code in this answer. There is one point that can be greatly improved here: the comments. There are certainly a lot of them, but...

sigA += hash; // ADD
sigB ^= hash; // XOR

or

// Iterate over all child tags
while (!stack.isEmpty()) {

are not really useful. The method is actually quite complicated, with magic numbers, additions and bit-shifting left and right, which lots of people struggle with alone. Instead of those small one-line comments that consider just a single line of code, I would suggest having a long comment on top of it explaining what that particular algorithm, as a whole, is doing and why it is done this way. It could even be part of the Javadoc.

There are two cases: either the person reading those comments (like the ones above) already know what the line is doing, and the comment is not really useful, or they don't, and the comment doesn't really explain why that line of code is there, what its purpose is, and ends up being not really useful as well.

Still, some of those are useful, like

// Parent tag should be hashed different from child tags
hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;

It does more than saying it hashes something. It explains why there is a constant added to the hash: to make it different from hashes of child tags. It could be even better by explaining why that particular constant was chosen; otherwise the reader is thinking: "Was that chosen randomly? Or is it one of those magical constants that make everything work?"

A couple of other notes:

That part of the code looks like it could use some OOP:
```
switch (tag.getType()) {
 case COMPOUND:
 // ArrayList of tags
 for (Tag child : ((CompoundTag) tag).getValue()) {
 stack.add(child);
 }
 break;
 case LIST:
 // Array of a single tag type
 for (Tag child : ((ListTag) tag).getValue()) {
 stack.add(child);
 }
}
```
Perhaps it doesn't make sense with regard to how the rest of the code is structured, but you could consider adding a getValue() abstract method to Tag, perhaps defaulting to return an empty list or without defaults. The idea is that it could make the code be refactored to
```
for (Tag child : tag.getValue()) {
 stack.add(child);
}
```
without any switch statement: some tags would return an empty list, and compound or list tags would return their children. If Tag cannot have a getValue() method because it doesn't make sense for all tags to have a value, perhaps another possibility would be to make a specialized subclass/subinterface of Tag for which having a value would make sense, and let the method take that instead. Of course, it could be the case that it would be too much for that lone piece of code.
I wouldn't let a temporary variable have a broader scope than it should have because it could be reused somewhere else.
```
long hash; // temporary variable which stores computed hashes
```
Generally speaking, variables should have the tighter scope possible. In this case, I would just declare two hash variables inside the two for loops. I don't think it would really hurt performance, and it prevents that variable from leaking to other parts of the code.

Question 6

Thanks for the suggestions. I can change the ListTag to return an ArrayList of tags so that I don't need two for-loops. The switch statement is still necessary as Tag is defined as public abstract class Tag<T> where T is used as the return type for getValue(), and changing that would cause more headaches elsewhere. As for comments: since this is part of a personal project and not part of a production, I'm not too worried as long as I can read it; though I should practice writing clear and concise commenting regardless, so I'll take your tips into consideration.

Question 7

@jocopa3 I didn't realize Tag was a generic class. Then, in the code, it is always used as a raw-type so you should consider using a proper type.

Tunaki Tunaki 9,3011 gold badge31 silver badges46 bronze badges · Accepted Answer · 2017-01-31 23:27:18Z

Overall, this is very good code. Maybe there is another hashing strategy to have, but I'll focus on the current code in this answer. There is one point that can be greatly improved here: the comments. There are certainly a lot of them, but...

sigA += hash; // ADD
sigB ^= hash; // XOR

or

// Iterate over all child tags
while (!stack.isEmpty()) {

are not really useful. The method is actually quite complicated, with magic numbers, additions and bit-shifting left and right, which lots of people struggle with alone. Instead of those small one-line comments that consider just a single line of code, I would suggest having a long comment on top of it explaining what that particular algorithm, as a whole, is doing and why it is done this way. It could even be part of the Javadoc.

There are two cases: either the person reading those comments (like the ones above) already know what the line is doing, and the comment is not really useful, or they don't, and the comment doesn't really explain why that line of code is there, what its purpose is, and ends up being not really useful as well.

Still, some of those are useful, like

// Parent tag should be hashed different from child tags
hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;

It does more than saying it hashes something. It explains why there is a constant added to the hash: to make it different from hashes of child tags. It could be even better by explaining why that particular constant was chosen; otherwise the reader is thinking: "Was that chosen randomly? Or is it one of those magical constants that make everything work?"

A couple of other notes:

That part of the code looks like it could use some OOP:
```
switch (tag.getType()) {
 case COMPOUND:
 // ArrayList of tags
 for (Tag child : ((CompoundTag) tag).getValue()) {
 stack.add(child);
 }
 break;
 case LIST:
 // Array of a single tag type
 for (Tag child : ((ListTag) tag).getValue()) {
 stack.add(child);
 }
}
```
Perhaps it doesn't make sense with regard to how the rest of the code is structured, but you could consider adding a getValue() abstract method to Tag, perhaps defaulting to return an empty list or without defaults. The idea is that it could make the code be refactored to
```
for (Tag child : tag.getValue()) {
 stack.add(child);
}
```
without any switch statement: some tags would return an empty list, and compound or list tags would return their children. If Tag cannot have a getValue() method because it doesn't make sense for all tags to have a value, perhaps another possibility would be to make a specialized subclass/subinterface of Tag for which having a value would make sense, and let the method take that instead. Of course, it could be the case that it would be too much for that lone piece of code.
I wouldn't let a temporary variable have a broader scope than it should have because it could be reused somewhere else.
```
long hash; // temporary variable which stores computed hashes
```
Generally speaking, variables should have the tighter scope possible. In this case, I would just declare two hash variables inside the two for loops. I don't think it would really hurt performance, and it prevents that variable from leaking to other parts of the code.

Thanks for the suggestions. I can change the ListTag to return an ArrayList of tags so that I don't need two for-loops. The switch statement is still necessary as Tag is defined as public abstract class Tag<T> where T is used as the return type for getValue(), and changing that would cause more headaches elsewhere. As for comments: since this is part of a personal project and not part of a production, I'm not too worried as long as I can read it; though I should practice writing clear and concise commenting regardless, so I'll take your tips into consideration.
@jocopa3 I didn't realize Tag was a generic class. Then, in the code, it is always used as a raw-type so you should consider using a proper type.

Stack Exchange Network

Generating Signatures of Tree Structures using an Order-Independent Hash

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Generating Signatures of Tree Structures using an Order-Independent Hash

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions