As I was reading the Git Internals chapter of the Pro Git book, I decided to see what objects I would need to create to make the short hash say something funny. To do this, I wrote the following code:
    import hashlib
    import string
    from itertools import product

    messages = ['badcafe', 'deaddad', 'badbeef', 'feedbac']
    possible_letters = string.ascii_lowercase + string.ascii_uppercase + ' '

    def getHash(data):
        return hashlib.sha1(("blob " + str(len(data)) + "0円" + data).encode('utf-8')).hexdigest()

    def populate(n):
        for data in product(possible_letters, repeat=n):
            s = getHash(''.join(data))[0:7]
            if s in messages:
                print(s, data)
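For reference, the hash this produces matches what `git hash-object` reports for the same data, which is how I checked the header format (a quick sanity check, assuming `git` is available on the PATH):

    import subprocess

    def gitHash(data):
        # Ask git itself for the blob hash of the same content (sanity check only).
        out = subprocess.run(['git', 'hash-object', '--stdin'],
                             input=data.encode('utf-8'),
                             stdout=subprocess.PIPE, check=True)
        return out.stdout.decode().strip()

    assert getHash('hello') == gitHash('hello')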
Unfortunately, this is too slow for n > 4. For smaller n, it only finds badcafe. Is there any way to speed this up?
1 Answer
That's a brute-force search of hashes for a particular prefix, so its complexity is \$O(2^n)\,ドル where \$n\$ is the number of bits in the prefix. In your case that's \2ドル^{28}\$.
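To spell that out: each hex digit of the prefix carries 4 bits, so a 7-character prefix is \7ドル \times 4 = 28\$ bits and there are \16ドル^7 = 2^{28} \approx 2.7 \times 10^8\$ possible prefixes. With four target strings you would expect to try roughly \2ドル^{28} / 4 = 2^{26} \approx 6.7 \times 10^7\$ candidates before the first hit.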
You can't make it faster in terms of complexity, or the hash would be broken, but you can improve the constant factor.
Using cProfile I see that out of 30 seconds of runtime the worst offenders are:
     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    8042221   10.763    0.000   24.243    0.000 tmp.py:10(getHash)
          5    5.138    1.028   30.701    6.140 tmp.py:13(populate)
    8042222    3.805    0.000    3.805    0.000 {_hashlib.openssl_sha1}
    8042221    5.143    0.000    5.143    0.000 {method 'encode' of 'str' objects}
    8042221    3.876    0.000    3.876    0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
    8042221    1.320    0.000    1.320    0.000 {method 'join' of 'str' objects}
The actual hashing is done in `hexdigest` and OpenSSL. Those pretty much cannot be reduced, so at best it could be made about 3 times as fast.
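If you want to reproduce the profile, something along these lines works (the exact range of `n` values here is my guess):

    import cProfile

    # Profile the search for several string lengths and sort by own time.
    cProfile.run("for n in range(1, 5): populate(n)", sort="tottime")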
In Python 2 you can remove the `encode` invocation; that alone saves a few seconds. A `set` is faster for `in` testing, so that's also an easy fix (the one-line change is shown after the snippet below). Take advantage of `hashlib.sha1`'s `copy` and `update` by noticing that the length stays constant, and you are at 2x the speed:
    def getHash(data, h):
        h = h.copy()  # "fork" the hash
        h.update(data)
        return h.hexdigest()

    def populate(n):
        h = hashlib.sha1("blob " + str(n) + "0円")  # shared part of the hash
        for data in product(possible_letters, repeat=n):
            s = getHash(''.join(data), h)[0:7]
            if s in messages:
                print(s, data)
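The `set` change mentioned above is just a one-line replacement of the `messages` list from the question:

    messages = {'badcafe', 'deaddad', 'badbeef', 'feedbac'}  # set gives O(1) membership tests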
On Python 3 you can encode the letters into bytes and avoid `encode` inside the loop, for similar performance.
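For example, a minimal sketch of that Python 3 variant (it reuses `getHash` from the snippet above; pre-encoding the letters is the only change I'm illustrating):

    # Pre-encode everything once so the hot loop only touches bytes.
    possible_letters = [c.encode('ascii') for c in string.ascii_letters + ' ']

    def populate(n):
        h = hashlib.sha1(("blob " + str(n) + "0円").encode('ascii'))  # shared header
        for data in product(possible_letters, repeat=n):
            s = getHash(b''.join(data), h)[0:7]
            if s in messages:
                print(s, b''.join(data).decode('ascii'))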
You might still make it about 1.5x as fast if you replaced the `product` -> `join` algorithm with something faster (`bytearray(n)` and depth-first search?), but that's about it.
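In case it helps, here is one way that `bytearray`-plus-DFS idea could look. This is an untested Python 3 sketch with names of my own choosing; the point is that candidates sharing a prefix reuse the hash state of that prefix instead of re-hashing it every time:

    import hashlib
    import string

    messages = {'badcafe', 'deaddad', 'badbeef', 'feedbac'}
    letters = (string.ascii_letters + ' ').encode('ascii')

    def search(h, buf, depth):
        # h already contains the git header plus buf[:depth].
        if depth == len(buf):
            s = h.hexdigest()[:7]
            if s in messages:
                print(s, buf.decode('ascii'))
            return
        for b in letters:                      # iterating over bytes yields ints
            buf[depth] = b
            child = h.copy()                   # fork the shared prefix state
            child.update(buf[depth:depth + 1])
            search(child, buf, depth + 1)

    def populate(n):
        header = hashlib.sha1(("blob " + str(n) + "0円").encode('ascii'))
        search(header, bytearray(n), 0)

Whether this actually beats the version above depends on how much the per-candidate `join` and full-length `update` cost compared to the extra `copy` calls, so treat the ~1.5x figure as an upper bound to verify by profiling.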