This Scala code snippet is supposed to encode a SHA1 hash in base 62.
Can you find any issues? I'm asking because I might not be able to change the algorithm later, for example to fix bugs.
I'd like to be able to also implement it in JavaScript in the future.
def mdSha1() = java.security.MessageDigest.getInstance("SHA-1") // not thread safe

def hashSha1Base62DontPad(text: String): String = {
  val bytes = mdSha1().digest(text.getBytes("UTF-8"))
  val bigint = new java.math.BigInteger(1, bytes)
  val result = encodeInBase62(bigint)
  result
}

private val Base62Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

def encodeInBase62(number: BigInt): String = {
  // Base 62 is > 5 but < 6 bits per char.
  var result = new StringBuilder(capacity = number.bitCount / 5 + 1)
  var left = number
  do {
    val remainder = (left % 62).toInt
    left /= 62
    val char = Base62Alphabet.charAt(remainder)
    result += char
  } while (left > 0)
  result.toString
}
1 Answer
It is 'standard' to have 0-9 at the beginning of the 'alphabet' for numbers.
I would have suggested that you use the native functionality in BigInteger to convert the value to a String in any given radix, but unfortunately it does not support a radix above 36. Still, you should follow that standard and start with 0-9 instead of ending with it.
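To illustrate the radix limit mentioned above, here is a minimal sketch. The object name RadixLimitSketch and the sample value are made up for illustration; the calls themselves are standard java.math.BigInteger and java.lang.Character members.

```scala
import java.math.BigInteger

object RadixLimitSketch {
  // Arbitrary sample value, chosen only for illustration.
  val value = new BigInteger("1000000")

  // The largest radix BigInteger.toString accepts (digits 0-9 plus letters a-z).
  val maxRadix: Int = Character.MAX_RADIX

  // Radix 36 is within range and works as expected.
  val base36: String = value.toString(36)

  // Radix 62 is out of range: no exception is thrown, the call
  // silently falls back to radix 10 instead.
  val base62Attempt: String = value.toString(62)
}
```

The silent fallback to decimal is the reason a hand-written loop is needed for base 62.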
I would also suggest two things:
- you should convert the alphabet to an array immediately:
    private val Base62Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray()
- you should not have the magic number '62' in your code, but should base it off the array size:
    private val Radix = Base62Alphabet.length
your code then becomes:
    val remainder = (left % Radix).toInt
    left /= Radix
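Putting these suggestions together, the encoder might look roughly like the sketch below. A few things here are assumptions rather than part of the original code: the wrapping object Base62Sketch, the require check for negative input, the plain while loop in place of do/while, and the final reverse. The question's loop emits the least significant digit first, so the reverse changes the output order and should only be adopted if that is what you want.

```scala
object Base62Sketch {
  // Digits first, as recommended; an array gives direct indexed lookup.
  private val Base62Alphabet: Array[Char] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray

  // Derived from the alphabet instead of hard-coding 62.
  private val Radix: Int = Base62Alphabet.length

  def encodeInBase62(number: BigInt): String = {
    // Assumption: only non-negative values are meaningful here.
    require(number >= 0, "number must be non-negative")
    val result = new StringBuilder
    var left = number
    while (left > 0) {
      val remainder = (left % Radix).toInt
      left /= Radix
      result += Base62Alphabet(remainder)
    }
    // Zero produces no digits in the loop, so emit the zero digit explicitly.
    if (result.isEmpty) result += Base62Alphabet(0)
    // Digits were appended least significant first; reverse them
    // so the most significant digit comes first (an assumption, see above).
    result.toString.reverse
  }
}
```

For example, Base62Sketch.encodeInBase62(BigInt(62)) yields "10" with this alphabet and digit order.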
Finally, I don't like that you have variable-length results from the conversion. You should ensure that all hashes of data with the same length have the same length of output, left-padding with 0 as needed. It is possible for something to hash to 0x0000000000 (hex), which will give you a number.bitCount of 0.
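One way to get fixed-length output is sketched below. The names PaddingSketch, maxDigits and leftPad are hypothetical helpers, not part of the original code. The sketch assumes the most significant digit comes first and that '0' is the zero digit, which holds for the digits-first alphabet; with the question's original alphabet the zero digit would be 'A'.

```scala
object PaddingSketch {
  // Number of digits needed to write the largest value that fits in
  // byteCount bytes, in the given radix. Computed with exact integer
  // arithmetic to avoid floating-point rounding.
  def maxDigits(byteCount: Int, radix: Int): Int = {
    var max = (BigInt(1) << (8 * byteCount)) - 1
    var digits = 0
    while (max > 0) {
      max /= radix
      digits += 1
    }
    digits
  }

  // Left-pad an already encoded string to a fixed width.
  // Strings that are already wide enough are returned unchanged.
  def leftPad(encoded: String, width: Int, padChar: Char = '0'): String =
    padChar.toString * math.max(0, width - encoded.length) + encoded
}
```

A SHA-1 digest is 20 bytes, so PaddingSketch.maxDigits(20, 62) gives 27, and every padded base 62 hash would then be 27 characters long.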
- Thanks! Concerning the alphabet, Base64 uses "ABCD...abcd...0123+/" rather than "0123...ABC...abc+/" according to Wikipedia. I was thinking "Base64 but without the two last characters". – KajMagnus, Jan 21, 2014 at 0:45
- @KajMagnus you should be aware that if you were to use Base64 instead, it will be faster, and the bit-manipulations are much easier (and there are some easy-to-use libraries for it already). – rolfl, Jan 21, 2014 at 0:49
- On my computer, SHA1 + Base62 is 2.6 times slower than SHA1 + Base64. And running SHA1 + Base64 takes 5 microseconds. I'll probably generate a URL safe base 64 string instead, and strip any '-'; that actually works fine in my particular use case, and is faster and simpler. – KajMagnus, Jan 21, 2014 at 2:20
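The approach described in the last comment could look something like the sketch below. It is not the commenter's actual code: the object and method names are hypothetical, and it assumes Java 8 or later for java.util.Base64. Stripping '-' makes the output length vary slightly and discards a little information, which is the trade-off the comment accepts.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.Base64

object Sha1Base64Sketch {
  // SHA-1 of the UTF-8 bytes, encoded with the URL-safe Base64 alphabet
  // ('-' and '_' instead of '+' and '/'), without trailing '=' padding.
  def hashSha1Base64UrlSafe(text: String): String = {
    // A fresh MessageDigest per call, since instances are not thread safe.
    val digest = MessageDigest.getInstance("SHA-1")
    val bytes = digest.digest(text.getBytes(StandardCharsets.UTF_8))
    Base64.getUrlEncoder.withoutPadding.encodeToString(bytes)
  }

  // Same as above, but with every '-' removed, as the comment proposes.
  def hashSha1Base64StripDash(text: String): String =
    hashSha1Base64UrlSafe(text).replace("-", "")
}
```

Before stripping, the 20-byte digest always encodes to 27 characters.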