I'm implementing a variation of the SuperFastHash in VBA for use in Excel (32-bit version, so no LongLong available) to hash strings.
To get around the limitations of signed 32-bit Long values, I'm doing the addition and bit-shifting using Double types, and then converting from Double to Long in a way that truncates it at 31 bits (the maximum positive value -- don't want to deal with two's complement and signs).
I'm getting answers and avoiding overflows so far, but I have a suspicion I'm making some mistakes in translation, since most implementations use all 32 bits of a uint and also deal with individual bytes from an array rather than 16-bit values coming from AscW().
Any suggestions of improvement, especially of the string-handling, bit-shifting, or the final avalanche?
Public Function shr(ByVal Value As Long, ByVal Shift As Byte) As Long
shr = Value
If Shift > 0 Then shr = shr \ (2 ^ Shift)
End Function
Public Function shl(ByVal Value As Long, ByVal Shift As Byte) As Long
If Shift > 0 Then
shl = LimitDouble(CDbl(Value) * (2& ^ Shift))
Else
shl = Value
End If
End Function
Public Function LimitDouble(ByVal d As Double) As Long
'' Prevent overflow by lopping off anything beyond 31 bits
Const MaxNumber As Double = 2 ^ 31
LimitDouble = CLng(d - (Fix(d / MaxNumber) * MaxNumber))
End Function
Public Function SuperFastHash(ByVal dataToHash As String) As Long
Dim dataLength As Long
dataLength = Len(dataToHash)
If (dataLength = 0) Then
SuperFastHash = 0
Exit Function
End If
Dim hash As Long
hash = dataLength
Dim remainingBytes As Integer
remainingBytes = dataLength Mod 2
Dim numberOfLoops As Integer
numberOfLoops = dataLength \ 2
Dim currentIndex As Integer
currentIndex = 0
Dim tmp As Double
Do While (numberOfLoops > 0)
hash = LimitDouble(CDbl(hash) + AscW(Mid$(dataToHash, currentIndex + 1, 1)))
tmp = shl(AscW(Mid$(dataToHash, currentIndex + 2, 1)), 11) Xor hash
hash = shl(hash, 16) Xor tmp
hash = LimitDouble(CDbl(hash) + shr(hash, 11))
currentIndex = currentIndex + 2
numberOfLoops = numberOfLoops - 1
Loop
If remainingBytes = 1 Then
hash = LimitDouble(CDbl(hash) + AscW(Mid$(dataToHash, currentIndex + 1, 1)))
hash = hash Xor shl(hash, 10)
hash = LimitDouble(CDbl(hash) + shr(hash, 1))
End If
'' Final avalanche
hash = hash Xor shl(hash, 3)
hash = LimitDouble(CDbl(hash) + shr(hash, 5))
hash = hash Xor shl(hash, 4)
hash = LimitDouble(CDbl(hash) + shr(hash, 17))
hash = hash Xor shl(hash, 25)
hash = LimitDouble(CDbl(hash) + shr(hash, 6))
SuperFastHash = hash
End Function
3 Answers 3
This is quite possibly the most clever VBA code I have ever seen.
There's not much room for improvement here, except a couple naming nitpicks:
- If that's what they stand for, functions
shr
andshl
could afford to be calledShiftRight
andShiftLeft
. - I don't see a reason for
shr
andshl
to not follow the PascalCasing naming convention. - Given the almost-flawlessly consistent camelCasing convention for locals & parameters, I don't see a reason for
Value
andShift
parameters to not stick to it. (except perhaps VBA/VB6's stubornness with casing?)
As for string handling, as per this source AscW
is faster than Asc
so again, good call:
AscW(Mid$(..)) considerations
Don't copy too many characters with Mid$. AscW only examines the first character anyway. AscW(Mid$(s, x, 1)) is the best call. Note that if you call AscW(Mid$(s, x)) without the third parameter, Mid$ executes slowly when s is a long string.
Example of potentially slow code:
For x = 1 To Len(s) If AscW(Mid$(s, x)) = ... Then ... Next
The above is better written as:
For x = 1 To Len(s) If AscW(Mid$(s, x, 1)) = ... Then ... Next
Nothing major, just some stylistic comments....
The Do While loop can be rewritten more clearly as a
For i = 0 to dataLength \ 2
the variable
remainingBytes
is unnecessary, and if used should be a boolean not an integer.tmp
can be declared inside the loop as it is never used outside it.
-
\$\begingroup\$ I find a better name for a Boolean would be
hasRemainingBytes
-remainingBytes
does sound like an integer to me, although I agree with your point. Also in VBA declaring variables inside a loop affects readability negatively imho - VBA doesn't scope variables at that level. \$\endgroup\$Mathieu Guindon– Mathieu Guindon2013年11月17日 20:52:35 +00:00Commented Nov 17, 2013 at 20:52 -
1\$\begingroup\$ @retailcoder: I'd prefer to just get rid of
remaingBytes
, but if not changing the name as well would be good. As fortmp
and scope, I think VBA gives you a choice of evils, and putting it where it should be if there was a good choice is the best you can do in this case. Fortunately, the value truly is temporary in this case, assigned and then read just once per iteration. Declaring it outside of the loop gives the impression that it will be used outside of the loop, when the truth is it could be, but isn't, no matter where it is declared. \$\endgroup\$jmoreno– jmoreno2013年11月18日 05:56:57 +00:00Commented Nov 18, 2013 at 5:56
in your SuperFastHash
function you are Dimming and setting variables and then using them in a fashion that looks messy to me even though it is in a logical order.
the set up should be separate from the actual logic of the code to make it more clear and readable.
you should Dim all the variables at the beginning of the function and then follow with the logic so that it the flow is clear and understandable
it should look like this:
Public Function SuperFastHash(ByVal dataToHash As String) As Long
Dim dataLength As Long
Dim hash As Long
Dim remainingBytes As Integer
Dim numberOfLoops As Integer
Dim currentIndex As Integer
Dim tmp As Double
dataLength = Len(dataToHash)
If (dataLength = 0) Then
SuperFastHash = 0
Exit Function
End If
hash = dataLength
remainingBytes = dataLength Mod 2
numberOfLoops = dataLength \ 2
currentIndex = 0
this looks much clearer about what is going on and is a good coding practice
(as I was writing this I thought this was going to be the only answer, but then retailcoder posted an answer)
-
\$\begingroup\$ +1 for finding something to say on that one - although I didn't mention this point because I felt (with the camelCasing of locals) the programmer came from a C or C# background, where the norm is more to declare as close as possible to usage. But I agree in short VBA/VB6 functions it does look cleaner that way. \$\endgroup\$Mathieu Guindon– Mathieu Guindon2013年11月16日 07:23:40 +00:00Commented Nov 16, 2013 at 7:23
-
\$\begingroup\$ I mostly Code in C# and I disagree with declare as close as possible to usage if it isn't super huge then I declare at the beginning, and never staggered like that. \$\endgroup\$Malachi– Malachi2013年11月16日 07:31:54 +00:00Commented Nov 16, 2013 at 7:31
-
1\$\begingroup\$ I cut my teeth in BASIC, followed by QB, VBA, etc., but over the last 10 years or so I spend most of my time in C#, where variables are more traditionally declared before first use. I struggle with being consistent here between C#, VBA, JS, T-SQL, and other languages I switch among during the day. Advantage for top-declaration is clarity of what variables are needed. Advantage of JIT declaration is better control over when a variable should be useable. I think you're right in this case -- the function is short enough that top declaration would be cleaner. \$\endgroup\$richardtallent– richardtallent2017年02月18日 01:54:24 +00:00Commented Feb 18, 2017 at 1:54
Decimal
data type? From excel helpDecimal variables are stored as 96-bit (12-byte) signed integers scaled by a variable power of 10
.Decimal
can store 32bit or 64 bit unsigned integers \$\endgroup\$