I'm working with a file format which stores data as a tree. Each node in the tree is known as a Tag
, and every tag has a name
and a value
. The order of tags is not guaranteed, but the general structure is. For example, in the diagram below, the CompoundTag
"John Smith" and "Albert Einstein" have the same structure even though their child tags are in a different order:
Formatted as: [TagType] TagName : TagValue
[ListTag] Employees :
[CompoundTag] :
[StringTag] FirstName : John
[StringTag] LastName : Smith
[IntTag] EmployeeId : 123
[CompoundTag] :
[StringTag] FirstName : Albert
[IntTag] EmployeeId : 456
[StringTag] LastName : Einstein
I'm creating signatures of certain tags so that I don't have to parse the whole tree structure, as I'm only interested in specific tags in the tree. To do that, I've created an order-independent hash function which hashes the name of each tag and the type. The function then combines those hashes into one signature which I use to identify other tags with the same structure.
The function hashes tag names using the standard String.hashCode()
method, then adds the tag Id (signifying the type) to the upper bits of the hash. It then combines the hash with two previous values which keep track of hashes combined using two commutative operations: addition and exclusive or. I keep track of two combined hashes because I found that addition and xor on their own are easily susceptible to collision. Finally, I combine the two hashes as a long and return that as the tag signature.
I have an optional parameter to include the parent tag in the signature. The parent tag is hashed the same way as child tags are, but I also add a small constant to the parent tag hash to make it slightly different from child tags. I add the constant in-case a child tag has the same name and type as the parent, which would generate the same signature in some scenarios.
The code targets Java 7, though I'm considering switching to Java 8 in the near future. The function itself is as follows:
/*
* Generate an order-independent signature of a given tag.
* If parentTag is specified, it'll include the parentTag
* in the signature.
*
* @param rootTag is the tag currently being hashed
* @param parentTag is an optional parent to the rootTag
* @return the signature for the given tag
*/
public static long calculateTagSignature(Tag tag, Tag parentTag) {
LinkedList<Tag> stack = new LinkedList<>();
stack.add(tag);
// If the tag is a List/Compound tag, add immediate child tags to stack
// Doesn't add tags deeper than 1 level down
switch (tag.getType()) {
case COMPOUND:
// ArrayList of tags
for (Tag child : ((CompoundTag) tag).getValue()) {
stack.add(child);
}
break;
case LIST:
// Array of a single tag type
for (Tag child : ((ListTag) tag).getValue()) {
stack.add(child);
}
}
// Keep track of two hashes using two commutative methods
long sigA = 0; // stores hashes combined using addition
long sigB = 0; // stores hashes combined using xor
long hash; // temporary variable which stores computed hashes
// If parentTag is specified, include it in the signature
if (parentTag != null) {
// Hash the name of the tag
// Parent tag should be hashed different from child tags
hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;
hash += (long) (parentTag.getType().id << 24); // Add tag type to upper bits
// Combine the hash with the two previous hashes
sigA += hash; // ADD
sigB ^= hash; // XOR
}
// Iterate over all child tags
while (!stack.isEmpty()) {
Tag current = stack.remove();
// Hash the name of the tag
hash = ((long) current.getName().hashCode()) & 0xffffffffL;
hash += (long) (current.getType().id << 24); // Include tag type in the upper bits
// Combine the hash with the two previous hashes
sigA += hash; // ADD
sigB ^= hash | (hash << 32); // XOR
}
// Print out the two input signatures for testing
//System.out.format("Add Hash: %X, XOR Hash: %X%n", sigA, sigB);
// Combine the two hashes to generate a full signature based on sigA and sigB
// Magic constant is just a series of randomly generated bits
long signature = sigB + 0xD1D89955F542E43AL + (sigA << 28) + (sigA >>> 2);
return signature;
}
I'm wondering if this is one of the simplest ways to generate an order-independent hash of these objects with few collisions, or if there is a dramatically better approach.
1 Answer 1
Overall, this is very good code. Maybe there is another hashing strategy to have, but I'll focus on the current code in this answer. There is one point that can be greatly improved here: the comments. There are certainly a lot of them, but...
sigA += hash; // ADD sigB ^= hash; // XOR
or
// Iterate over all child tags while (!stack.isEmpty()) {
are not really useful. The method is actually quite complicated, with magic numbers, additions and bit-shifting left and right, which lots of people struggle with alone. Instead of those small one-line comments that consider just a single line of code, I would suggest having a long comment on top of it explaining what that particular algorithm, as a whole, is doing and why it is done this way. It could even be part of the Javadoc.
There are two cases: either the person reading those comments (like the ones above) already know what the line is doing, and the comment is not really useful, or they don't, and the comment doesn't really explain why that line of code is there, what its purpose is, and ends up being not really useful as well.
Still, some of those are useful, like
// Parent tag should be hashed different from child tags hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;
It does more than saying it hashes something. It explains why there is a constant added to the hash: to make it different from hashes of child tags. It could be even better by explaining why that particular constant was chosen; otherwise the reader is thinking: "Was that chosen randomly? Or is it one of those magical constants that make everything work?"
A couple of other notes:
That part of the code looks like it could use some OOP:
switch (tag.getType()) { case COMPOUND: // ArrayList of tags for (Tag child : ((CompoundTag) tag).getValue()) { stack.add(child); } break; case LIST: // Array of a single tag type for (Tag child : ((ListTag) tag).getValue()) { stack.add(child); } }
Perhaps it doesn't make sense with regard to how the rest of the code is structured, but you could consider adding a
getValue()
abstract method toTag
, perhaps defaulting to return an empty list or without defaults. The idea is that it could make the code be refactored tofor (Tag child : tag.getValue()) { stack.add(child); }
without any
switch
statement: some tags would return an empty list, and compound or list tags would return their children. IfTag
cannot have agetValue()
method because it doesn't make sense for all tags to have a value, perhaps another possibility would be to make a specialized subclass/subinterface ofTag
for which having a value would make sense, and let the method take that instead. Of course, it could be the case that it would be too much for that lone piece of code.I wouldn't let a temporary variable have a broader scope than it should have because it could be reused somewhere else.
long hash; // temporary variable which stores computed hashes
Generally speaking, variables should have the tighter scope possible. In this case, I would just declare two
hash
variables inside the twofor
loops. I don't think it would really hurt performance, and it prevents that variable from leaking to other parts of the code.
-
\$\begingroup\$ Thanks for the suggestions. I can change the
ListTag
to return anArrayList
of tags so that I don't need two for-loops. The switch statement is still necessary asTag
is defined aspublic abstract class Tag<T>
whereT
is used as the return type forgetValue()
, and changing that would cause more headaches elsewhere. As for comments: since this is part of a personal project and not part of a production, I'm not too worried as long as I can read it; though I should practice writing clear and concise commenting regardless, so I'll take your tips into consideration. \$\endgroup\$jocopa3– jocopa32017年02月01日 04:33:12 +00:00Commented Feb 1, 2017 at 4:33 -
1\$\begingroup\$ @jocopa3 I didn't realize
Tag
was a generic class. Then, in the code, it is always used as a raw-type so you should consider using a proper type. \$\endgroup\$Tunaki– Tunaki2017年02月01日 08:59:01 +00:00Commented Feb 1, 2017 at 8:59
hashCode
function forTag
currently computes the hash using the name, type, and value, which is necessary for other areas of the program. Thus if I were to use aSet
to compute the signature, the computed signature would include the tag values as well. I could add a second hash function to theTag
class which hashes only the name and type and override the Set'shashCode
function to call it that function, but that'd essentially amount to computing the hash manually anyway. \$\endgroup\$